Importing datasets into OpenGWAS [NEW WORKFLOW]
As a Contributor
This role allows you to contribute datasets to OpenGWAS. You will create metadata, upload files for QC, check the QC report and submit datasets for approval. You will need to contact us and request to be added as a Contributor.
| Actions \ Tools | Web portal | R/GwasDataImport |
|---|---|---|
| List all your draft datasets and check progress | Yes | Not supported |
| Create metadata / Edit metadata | Yes | Yes |
| Upload file for QC | Not supported | Yes |
| Delete QC files but keep metadata / Delete QC files and metadata | Yes | Yes |
| Check QC report | Yes | Yes |
| Submit for approval | Yes | Not supported |
| List all your approved (released) datasets | Yes | Not supported |
Prerequisites
You will need:
- OpenGWAS JWT (token) from https://api.opengwas.io/profile/
- Web portal - you should have been granted access to it and received its URL
- R/GwasDataImport installed and your OpenGWAS JWT (token) set up in your R environment. It’s the same token you would use in R/ieugwasr, R/TwoSampleMR etc. - see this on how to set up a token
- Metadata of each dataset
- Summary stats files of each dataset - see below
- Optionally, the OpenGWAS ID of each dataset - see below
Summary stats file format
Each dataset will be a .txt or .txt.gz file. The content looks like this:

```
ID ALT REF BETA SE PVALUE AF N ZVALUE INFO CHROM POS
rs10399878 G A 0.0118 0.016 0.4608 0.9569 124787 NA NA 1 1239953
rs1039063 G T 0.0026 0.0036 0.4702 0.55 236102 NA NA 1 2281978
rs1039100 G A 0.0033 0.0047 0.4826 NA 221290 NA NA 1 2286947
rs10158583 A G 0.0099 0.0059 0.09446 0.075 321197 NA NA 1 3144068
rs10157420 C T -0.0038 0.0075 0.6124 0.05 234171 NA NA 1 3146497
```
A header row is not required - you will specify the column mappings in a later step. Any header row will be ignored by the pipeline, so it’s okay to leave it as-is.
These columns are required for each dataset:
- chromosome
- position
- beta
- se
- effect allele
- other allele
- pval
Note: you need to remove any leading zeros from the chromosome values, e.g. 07 -> 7.
Note: the pipeline will remove a row entirely if it has an NA/Inf value in any of the required columns.
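If your file has zero-padded chromosome values, they can be stripped in R before upload. This is a minimal sketch, assuming a whitespace-delimited file laid out like the example above; the file paths are hypothetical:

```r
# Minimal sketch: strip leading zeros from chromosome values before upload.
# File paths are hypothetical; column 11 is the chromosome in the example layout above.
d <- read.table("~/bmi_test.txt", header=FALSE, stringsAsFactors=FALSE)
d[[11]] <- sub("^0+", "", as.character(d[[11]]))  # "07" -> "7"
write.table(d, "~/bmi_test_fixed.txt", quote=FALSE, row.names=FALSE, col.names=FALSE)
```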
These columns are optional:
- rsid
- effect allele frequency
- other allele frequency
- number of cases
- number of controls (or total sample size if continuous trait)
- imputation z-score
- imputation info score
OpenGWAS ID
An ID on OpenGWAS consists of three parts:
category-study-dataset. E.g.
- ebi-a-GCST90091033
- ieu-b-2
- finn-b-K11_FIBROCHIRLIV
The first two parts combined are also known as a “batch”,
e.g. ieu-b is a batch.
Auto-assign ID
If you are uploading only a few datasets, you can usually let the system auto-assign IDs. The batch will be fixed to ieu-b and the sequence (dataset) number will be generated automatically, so you may get ieu-b-6001, ieu-b-6002, etc. To go down this route, just leave the OpenGWAS ID (igd_id) field blank.
Specify ID
If you are uploading hundreds or even thousands of datasets, and/or if the datasets are from a study or consortium, please contact us and we may set up a new batch for you if necessary. In this case you will need to specify the full ID for each dataset. Assuming you were given the met-e batch, you should specify e.g. met-e-LDL_C (recommended) or met-e-1 (less meaningful) for each dataset. The system will not work out the sequence for you, so you cannot just give met-e and let the system generate the last part. If you insist on numerical IDs, you should generate them yourself.
See a list of current batches at https://opengwas.io/datasets/
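If you do go with numerical IDs, generating them yourself is a one-liner in R (using the hypothetical met-e batch from above as an example):

```r
# Generate sequential numerical IDs for a hypothetical met-e batch
ids <- sprintf("met-e-%d", 1:100)
head(ids, 3)  # "met-e-1" "met-e-2" "met-e-3"
```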
Setup
You can track your contributions via the web portal, but you will need R/GwasDataImport anyway to upload the files.
```r
library(GwasDataImport)

# GwasDataImport depends on ieugwasr. Make sure you have switched to the "public" API base URL
ieugwasr::select_api()

# Make sure you are authenticated and have the "contributor" or "admin" role
ieugwasr::user()
```

Upload a single dataset
For each dataset you will need to go through steps 1 to 4. See also: Upload in bulk.
1a. Select the file and create metadata (with the web portal and then R/GwasDataImport)
Provide (upload) metadata (and specify ID when necessary) > Check OpenGWAS ID > Specify file location > Set the metadata flag
The portal provides a webform with dropdowns and tooltips for each field. This is handy when you are new to this process and/or only have a few datasets to upload.
Click the Add new metadata button and provide metadata as instructed. You can hover over an input box to see the tips.
OpenGWAS ID: Leave blank for an auto-assigned ID. If you would like
to specify an ID, give it in full e.g. met-e-LDL_C.
When you click “Submit”, a new dataset will be added, and you can see the OpenGWAS ID.
In R, specify the path to the file that will be uploaded for QC
(assume it’s ~/bmi_test.txt.gz) and OpenGWAS ID, and then
mark the metadata as uploaded. Locally the R package doesn’t know
anything about the dataset you just created on the portal yet, so we
need to specify the ID and skip its test on metadata upload.
```r
# Let the package know the location of the file and the ID under which it should be uploaded
x <- Dataset$new(filename="~/bmi_test.txt.gz", igd_id='ieu-b-9999')

# Skip the test on metadata upload as we have already added the metadata via the web portal
x$metadata_uploaded <- TRUE
```

1b. Select the file and create metadata (with R/GwasDataImport alone)
Specify file location (and specify ID when necessary) > Provide (upload) metadata > Check OpenGWAS ID
As an alternative to 1a, you can use R alone, which is handy if you have an array of candidate datasets, since it’s easier to upload metadata for multiple datasets programmatically. We recommend trying the web portal in 1a first before you go down this more advanced route.
Assume the full path to the file of this dataset is
~/bmi_test.txt.gz.
```r
# OpenGWAS ID, auto-assigned
x <- Dataset$new(filename="~/bmi_test.txt.gz")

# - OR -

# OpenGWAS ID, specified
x <- Dataset$new(filename="~/bmi_test.txt.gz", igd_id='ieu-b-9999')
```

Then populate the other fields and upload the metadata.
```r
# See the list of options, descriptions and whether they are required.
# For info only - no need to run this every time
x$view_metadata_options()

# Specify the metadata for this dataset
x$collect_metadata(list(
  # Required fields
  trait="TEST - DO NOT USE 2",
  build="HG19/GRCh37",
  group_name="public",
  category="Risk factor",
  subcategory="Anthropometric",
  population="Mixed",
  sex="Males and Females",
  author="Mendel GJ",
  # Optional fields (to name a few)
  ontology="NA",
  sample_size=339224,
  year=2022,
  unit="SD"
))

# Upload the metadata to the system
x$api_metadata_upload()
```

The OpenGWAS ID will be returned by the last command. Look out for the ID assigned by the system (if applicable). At the same time, a new record will show up on the web portal under Step 1.
2. Modify the metadata (only when necessary)
You can modify the metadata either via the web portal (recommended) or through R/GwasDataImport, regardless of how you created the metadata (i.e. metadata created via the R package can be modified on the web portal, or vice versa, as they are connected to the same database).
Note that the metadata can only be modified when no QC pipeline is associated with the dataset, because the metadata is hardcoded into the QC report when the report is generated at the last step. Specifically, metadata can only be modified when:
- (a) the file has not been uploaded for QC, or
- (b) the QC pipeline has already finished and you decided to “delete uploaded files” (but keep the metadata), which effectively reverts the state to (a)
3. Format the file and upload for QC (R/GwasDataImport)
Always check that the stored OpenGWAS ID and file path are accurate:

```r
x$igd_id
x$filename
```

Specify the column mapping (1-indexed; see the package documentation for parameter names):
```r
x$determine_columns(list(
  chr_col=11,
  pos_col=12,
  ea_col=2,
  oa_col=3,
  beta_col=4,
  se_col=5,
  pval_col=6,
  snp_col=1,
  eaf_col=7,
  ncontrol_col=8
))
```

Use the output to double-check the mapping. If necessary, run `x$determine_columns(...)` again with the corrected mapping.
Format the dataset and then upload it (both may take a while):

```r
x$format_dataset()
x$api_gwasdata_upload()
```

You will see the “Dataset has been added to the pipeline” message if the upload was successful.

And finally, don’t forget to clean up the working directory:

```r
x$delete_wd()
```

4. Check QC pipeline state and report, and submit for approval
On the web portal you can click the 2. QC tab of the dataset popup and check pipeline state.
For each dataset, you should review the QC report when it’s available and decide whether to submit the dataset for approval or not. You will have the following options:
- Submit the dataset for approval
  - Go to the 3. Approval & Release tab and submit
- Re-upload the file for QC
  - Go to the 1. Metadata tab and click “Delete all files”
- Discard the dataset
  - Go to the 1. Metadata tab and click “Delete all files and metadata(!)”
You may also use the checkboxes on the main screen to select datasets and submit for approval in bulk.
Upload in bulk
If you have multiple datasets you may want to write an R snippet to semi-automate this process.
Do the setup only once, then for each dataset go through steps 1b, 2 and 3. Finally, visit the portal as in step 4 and use the checkboxes to submit in bulk.
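The bulk workflow can be sketched as a loop over a manifest of datasets. This is an illustrative sketch only: the `manifest` data frame and its columns are hypothetical, and every call inside the loop is one of the GwasDataImport calls shown in steps 1b and 3.

```r
library(GwasDataImport)

# Hypothetical manifest of datasets to upload; the columns here are illustrative
manifest <- data.frame(
  filename = c("~/bmi_test.txt.gz", "~/whr_test.txt.gz"),
  trait    = c("Body mass index", "Waist-hip ratio"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(manifest))) {
  # Step 1b: create metadata (auto-assigned ID) and upload it
  x <- Dataset$new(filename = manifest$filename[i])
  x$collect_metadata(list(
    trait = manifest$trait[i],
    build = "HG19/GRCh37",
    group_name = "public",
    category = "Risk factor",
    subcategory = "Anthropometric",
    population = "Mixed",
    sex = "Males and Females",
    author = "Mendel GJ"
  ))
  x$api_metadata_upload()

  # Step 3: assumes every file shares the column layout from the example above
  x$determine_columns(list(chr_col = 11, pos_col = 12, ea_col = 2, oa_col = 3,
                           beta_col = 4, se_col = 5, pval_col = 6, snp_col = 1))
  x$format_dataset()
  x$api_gwasdata_upload()
  x$delete_wd()
}
```

Then visit the portal and use the checkboxes to submit the whole batch for approval.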
What’s next
We will review and approve/reject each dataset. Approved datasets will go into the release pipeline, which may take another few hours. Usually within 24 hours, you or anyone who has access may query the dataset via the packages, e.g.:

```r
ieugwasr::tophits("ieu-b-9999")
```
And if it’s uploaded under the public group, it will be
listed on https://opengwas.io/datasets/ as well.