Importing datasets into OpenGWAS
As a Contributor
This role allows you to contribute datasets to OpenGWAS. You will create metadata, upload files for QC, check the QC report and submit the dataset for approval. You will need to contact us and request to be added as a Contributor.
| Actions \ Tools | Web portal | R/GwasDataImport |
|---|---|---|
| List all your draft datasets and check progress | Yes | Not supported yet |
| Create metadata | Yes | Yes |
| Edit metadata | Yes | Yes |
| Upload file for QC | Not supported | Yes |
| Delete QC files but keep metadata | Yes | Yes |
| Delete QC files and metadata | Yes | Yes |
| Check QC report | Yes | Not supported yet |
| Submit for approval | Yes | Not supported yet |
| List all your approved (released) datasets | Yes | Not supported yet |
Prerequisites
You will need:
- OpenGWAS Token (JWT) from https://api.opengwas.io/profile/
- Web portal - you should have been granted access to it and received its URL
- R/GwasDataImport installed and your OpenGWAS Token set up in your R environment. It’s the same token you would use in R/ieugwasr, R/TwoSampleMR etc.
- Metadata of each dataset
- Summary stats files of each dataset - see below
- Optionally, the OpenGWAS ID of each dataset - see below
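For the token, one common approach (following the ieugwasr documentation) is to store it in your `~/.Renviron` file so every R session picks it up automatically:

```
# In ~/.Renviron - replace <token> with the JWT copied from
# https://api.opengwas.io/profile/, then restart R
OPENGWAS_JWT=<token>
```

You can then confirm the token is being picked up with `ieugwasr::get_opengwas_jwt()` or `ieugwasr::user()`.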
Summary stats file format
Each dataset will be a .txt or .txt.gz file. The content looks like this:
ID ALT REF BETA SE PVALUE AF N ZVALUE INFO CHROM POS
rs10399878 G A 0.0118 0.016 0.4608 0.9569 124787 NA NA 1 1239953
rs1039063 G T 0.0026 0.0036 0.4702 0.55 236102 NA NA 1 2281978
rs1039100 G A 0.0033 0.0047 0.4826 NA 221290 NA NA 1 2286947
rs10158583 A G 0.0099 0.0059 0.09446 0.075 321197 NA NA 1 3144068
rs10157420 C T -0.0038 0.0075 0.6124 0.05 234171 NA NA 1 3146497
The header row is not required, since you will specify the column mapping in step 3; the pipeline ignores any header row, so it's okay to leave it as-is. The order of the columns doesn't matter either (e.g. you can have chromosome as the first column), again because the mapping is specified explicitly later.
These columns are required for each dataset:
- chromosome
- position
- beta
- se
- effect allele
- other allele
- pval
Note: you need to remove any leading 0 from the chromosome values, e.g. 07 -> 7.
Note: the pipeline will remove a row entirely if it has an NA/Inf value in any of the required columns.
These columns are optional:
- rsid
- effect allele frequency
- other allele frequency
- number of cases
- number of controls (or total sample size if continuous trait)
- imputation z-score
- imputation info score
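If your files need preprocessing to satisfy the rules above (no leading zeros in chromosome values, no NA/Inf in required columns), the cleaning can be sketched in base R. This is a hypothetical pre-cleaning step, not part of the pipeline; the file name and column names follow the example header shown earlier, so adjust them to your own data:

```r
# Read the summary stats file (read.table handles .gz natively);
# column names here assume the example header above
d <- read.table("bmi_test.txt.gz", header = TRUE, stringsAsFactors = FALSE)

# Remove any leading 0 from chromosome values, e.g. "07" -> "7"
d$CHROM <- sub("^0+", "", as.character(d$CHROM))

# Keep only rows with no NA/Inf in the required numeric columns
num_req <- c("POS", "BETA", "SE", "PVALUE")
num_ok <- rowSums(sapply(d[num_req], function(v) !is.finite(v))) == 0

# ...and no NA in the required text columns
txt_ok <- !is.na(d$CHROM) & !is.na(d$ALT) & !is.na(d$REF)

d <- d[num_ok & txt_ok, ]
write.table(d, gzfile("bmi_test_clean.txt.gz"),
            row.names = FALSE, quote = FALSE)
```

The pipeline applies its own checks regardless, so this only reduces the number of rows silently dropped later.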
OpenGWAS ID
An ID on OpenGWAS consists of three parts:
category-study-dataset. E.g.
- ebi-a-GCST90091033
- ieu-b-2
- finn-b-K11_FIBROCHIRLIV
The first two parts combined are also known as a “batch”,
e.g. ieu-b is a batch.
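As a quick illustration (not part of the pipeline), the batch and dataset parts of an ID can be separated at the last hyphen:

```r
id <- "finn-b-K11_FIBROCHIRLIV"
batch <- sub("-[^-]*$", "", id)   # "finn-b"
dataset <- sub("^.*-", "", id)    # "K11_FIBROCHIRLIV"
```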
Auto-assign ID
If you are uploading a few datasets, you can usually let the system auto-assign IDs. The batch will be fixed to ieu-b and the sequence (dataset) number will be generated, so you may get ieu-b-6001, ieu-b-6002, etc. To go this route, just leave the OpenGWAS ID (igd_id) field blank.
Specify ID
If you are uploading hundreds or even thousands of datasets, and/or if the datasets are from a study or consortium, please contact us and we may set up a new batch for you if necessary. In this case you will need to specify the full ID for each dataset when creating metadata. Assuming you were advised to use the met-e batch, you should specify e.g. met-e-LDL_C (recommended) or met-e-1 (less meaningful) for a dataset. The system will not work out the sequence number, so you cannot just give met-e and let the system generate the last part. If you need numerical IDs, you should generate them yourself.
See a list of current batches at https://opengwas.io/datasets/
Setup
You can track your contributions via the web portal, but you will need R/GwasDataImport anyway to upload the files.
library(GwasDataImport)
# GwasDataImport depends on ieugwasr. Make sure you have switched to the "public" API base URL
ieugwasr::select_api()
# Make sure you are authenticated and have got the "contributor" or "admin" role
ieugwasr::user()

Upload a single dataset
For each dataset you will need to go through steps 1 to 4. See also: Upload in bulk.
1. Create metadata (two methods)
Either way, you will need to create a new Dataset instance in R. No parameters are required at this stage.
x <- Dataset$new()

1a. Using the web portal
Create metadata (and specify ID when necessary) > Check OpenGWAS ID > Set the metadata flag
The portal provides a webform with dropdowns and tooltips for each field. This is handy when you are new to this process and/or only have a few datasets to upload.
Click the Add new metadata button and provide metadata
as instructed. You can hover on the input boxes to see the tips.
OpenGWAS ID: Leave blank for an auto-assigned ID. To specify an ID, give it in full e.g.
met-e-LDL_C.
When you click “Submit”, a new dataset will be added, and you can see
the OpenGWAS ID - assume it’s ieu-b-9999.
Locally the R package doesn’t know anything about the dataset you just created on the portal yet, so we need to specify the ID and skip its test on metadata upload.
# Let the instance know the ID it belongs to (replace with the auto-assigned or specified ID)
x$igd_id <- 'ieu-b-9999'
# Skip the test on metadata upload as we have added the metadata via the web portal
x$metadata_uploaded <- TRUE

1b. Using R/GwasDataImport alone
Specify ID when necessary > Provide and upload metadata > Check OpenGWAS ID
As an alternative to 1a, you can opt to use R alone. This is handy if you have an array of candidate datasets, since it's easier to upload metadata for multiple datasets programmatically. We recommend trying the web portal in 1a first before you go down this more advanced route.
You may want to specify the OpenGWAS ID for this dataset.
OpenGWAS ID: You don't need to do this if you would like an auto-assigned ID (which starts with ieu-b). If you would like to specify an ID, give it in full, like:
# Only run this if you would like to specify the OpenGWAS ID for this dataset
x$igd_id <- 'met-e-LDL_C'
Then populate the metadata fields and upload.
# See the list of options, descriptions and whether they are required or not
# For reference only - don't need to run this every time
x$view_metadata_options()
# Specify the metadata for this dataset
x$collect_metadata(list(
# Required fields
trait="TEST - DO NOT USE",
build="HG19/GRCh37",
group_name="public",
category="Risk factor",
subcategory="Anthropometric",
population="Mixed",
sex="Males and Females",
author="Mendel GJ",
# Optional fields (to name a few)
ontology="NA",
sample_size=339224,
year=2022,
unit="SD"
))
# Upload the metadata to the system, and look out for the OpenGWAS ID returned
x$api_metadata_upload()

Look out for the OpenGWAS ID assigned by the system (if applicable) after the last command. At the same time a new record will show up on the web portal under Step 1.
2. Modify the metadata (only when necessary)
You can modify the metadata either via the web portal (recommended) or through R/GwasDataImport, regardless of how you created the metadata (i.e. metadata created via the R package can be modified on the web portal, or vice versa, as they are connected to the same database).
Note that metadata can only be modified when there is no QC pipeline associated with the dataset, because the metadata is hardcoded into the QC report when the report is generated at the final step. Specifically, metadata can be modified only when:
- (a) the file has not been uploaded for QC, or
- (b) the QC pipeline has already finished and you chose to "delete uploaded files" (but keep the metadata), which effectively reverts the state to (a)
3. Format the file and upload for QC
Always check that the OpenGWAS ID held locally is accurate. Then let the instance know the full path to the file.
# Is the ID correct?
x$igd_id
# Specify path to the file
x$filename <- '/Users/ab12345/Desktop/bmi_test/bmi_test.txt.gz'

Specify the column mapping (1-indexed; check the docs for parameter names):
x$determine_columns(list(
chr_col=11,
pos_col=12,
ea_col=2,
oa_col=3,
beta_col=4,
se_col=5,
pval_col=6,
snp_col=1,
eaf_col=7,
ncontrol_col=8
))

Use the output to double-check the mapping. If necessary, run x$determine_columns(...) again to overwrite.
Format the dataset and then upload (both may take a while):
# Reorganise columns, check for invalid values, liftover, pack the dataset...
x$format_dataset()
# Upload the file
x$api_gwasdata_upload()

You will see a "Dataset has been added to the pipeline" message if the upload was successful.
And finally don’t forget to clean up the working directory for this dataset instance:
x$delete_wd()

4. Check QC pipeline state and report, and submit for approval
On the web portal you can click the 2. QC tab of the dataset popup and check pipeline state.
For each dataset, you should review the QC report when it’s available and decide whether to submit the dataset for approval or not. You will have the following options:
- Submit the dataset for approval
- Go to the 3. Approval & Release tab and submit
- Re-upload the file for QC
- Go to the 1. Metadata tab and “Delete all files”
- Discard the dataset
- Go to the 1. Metadata tab and “Delete all files and metadata(!)”
You may also use the checkboxes on the main screen to select datasets and submit for approval in bulk.
Upload in bulk
If you have multiple datasets you may want to write an R snippet to semi-automate this process.
Do the setup only once; then for each dataset go through steps 1 (x <- Dataset$new()), 1b, 2 and 3. Finally visit the portal in step 4 and use the checkboxes to submit in bulk.
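A minimal sketch of such a loop, assuming auto-assigned IDs, a fixed column mapping shared by all files, and a hypothetical data frame meta holding one row of metadata per file (adjust the fields, paths and mapping to your own data):

```r
library(GwasDataImport)

# 'files' and 'meta' are yours to construct; both are illustrative here
files <- list.files("~/sumstats", pattern = "\\.txt\\.gz$", full.names = TRUE)

for (i in seq_along(files)) {
  x <- Dataset$new()                       # step 1
  x$collect_metadata(list(                 # step 1b - fill from your metadata table
    trait = meta$trait[i],
    build = "HG19/GRCh37",
    group_name = "public",
    category = "Risk factor",
    subcategory = "Anthropometric",
    population = meta$population[i],
    sex = "Males and Females",
    author = meta$author[i]
  ))
  x$api_metadata_upload()

  x$filename <- files[i]                   # step 3
  x$determine_columns(list(chr_col = 11, pos_col = 12, ea_col = 2, oa_col = 3,
                           beta_col = 4, se_col = 5, pval_col = 6))
  x$format_dataset()
  x$api_gwasdata_upload()
  x$delete_wd()
}
```

Wrapping the body in tryCatch() is worth considering so one malformed file does not abort the whole run.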
What’s next
We will review and approve/reject each dataset. Approved datasets will go into the release pipeline which may take another few hours. Usually within 24 hours after the approval, you or anyone who has access may query the dataset via packages, e.g.:
ieugwasr::tophits("ieu-b-9999")
And if it’s uploaded under the public group, it will be
listed on https://opengwas.io/datasets/ as well.
Thank you for your contribution!