Importing datasets into OpenGWAS [NEW WORKFLOW]
import_pipeline_new.Rmd
As a Contributor
This role enables you to contribute datasets to OpenGWAS: create metadata, upload the file for QC, check the QC report and submit the dataset for approval.
You may contact Admin (details TBC) and request to be added as a Contributor. You will be granted access to https://api.opengwas.io/contribution.
You also need R/GwasDataImport installed and your OpenGWAS JWT (token) set up in your R environment.
Prerequisites
Each dataset will be a .txt or .txt.gz file. The content looks like this:
ID ALT REF BETA SE PVALUE AF N ZVALUE INFO CHROM POS
rs10399878 G A 0.0118 0.016 0.4608 0.9569 124787 NA NA 1 1239953
rs1039063 G T 0.0026 0.0036 0.4702 0.55 236102 NA NA 1 2281978
rs1039100 G A 0.0033 0.0047 0.4826 NA 221290 NA NA 1 2286947
rs10158583 A G 0.0099 0.0059 0.09446 0.075 321197 NA NA 1 3144068
rs10157420 C T -0.0038 0.0075 0.6124 0.05 234171 NA NA 1 3146497
The header row is not required, since you will specify the column mappings in a later step. If a header row is present it will be ignored by the pipeline, so it’s okay to leave it as-is.
These columns are required for each dataset:
- chromosome
- position
- beta
- se
- effect allele
- other allele
- pval
Note: you need to remove any leading 0 from chromosome values, e.g. 07 -> 7.
Note: the pipeline will remove a row entirely if it has an NA/Inf value in any required column.
(A file-preparation sketch covering both notes follows the optional-column list below.)
These columns are optional:
- rsid
- effect allele frequency
- other allele frequency
- number of cases
- number of controls (or total sample size if continuous trait)
- imputation z-score
- imputation info score
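For the two preparation notes above, here is a minimal base-R sketch. It assumes the column names of the example file (CHROM, BETA, etc.); the raw input path is hypothetical, so adapt names and paths to your data:
# Read a whitespace-delimited GWAS file (gzipped files are handled via a connection)
d <- read.table(gzfile("~/bmi_test_raw.txt.gz"), header=TRUE, stringsAsFactors=FALSE)
# Strip leading zeros from chromosome values, e.g. "07" -> "7"
d$CHROM <- sub("^0+", "", as.character(d$CHROM))
# Drop rows with NA/Inf in any required column (the pipeline would remove them anyway)
required <- c("CHROM", "POS", "ALT", "REF", "BETA", "SE", "PVALUE")
numeric_req <- c("POS", "BETA", "SE", "PVALUE")
ok <- complete.cases(d[required]) & apply(is.finite(as.matrix(d[numeric_req])), 1, all)
d <- d[ok, ]
# Write back out, gzipped, tab-separated and unquoted
write.table(d, gzfile("~/bmi_test.txt.gz"), sep="\t", quote=FALSE, row.names=FALSE)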
Setup
You can track your contributions via the web interface, but you will need R/GwasDataImport anyway to upload the files.
library(GwasDataImport)
# Make sure you have switched to the "public" API base URL
ieugwasr::select_api()
# Make sure you are authenticated and have the "contributor" or "admin" role
ieugwasr::user()
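If ieugwasr::user() complains about authentication, the token usually needs to be set first. A minimal sketch, assuming a recent ieugwasr that reads the JWT from the OPENGWAS_JWT environment variable:
# Set the OpenGWAS JWT for this session (for persistence, add the line
# OPENGWAS_JWT=<your token> to ~/.Renviron instead)
Sys.setenv(OPENGWAS_JWT="<your token>")
# Confirm the token is picked up
ieugwasr::get_opengwas_jwt()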
Metadata and dataset
For each dataset you will need to go through steps 1 to 4. See also: Import in bulk.
1. Create metadata and obtain GWAS ID (two methods)
For each dataset to be uploaded, you can either use the web interface or the R/GwasDataImport package to create the metadata.
1a. Use the web interface
https://api.opengwas.io/contribution/ provides a webform with dropdowns and tooltips for each field. This is handy when you are new to this process and/or only have a few datasets to upload.
The GWAS ID will be shown on the web interface for each dataset. Assume it’s ieu-b-9999.
In R, specify the path (assume it’s ~/bmi_test.txt.gz) and the GWAS ID, then mark the metadata as uploaded.
x <- Dataset$new(filename="~/bmi_test.txt.gz", igd_id='ieu-b-9999')
# The metadata was already created via the web interface, so mark it as uploaded
x$metadata_uploaded <- TRUE
1b. Use R/GwasDataImport
Alternatively, you can use R if you have a batch of candidate datasets, since it’s easier to upload metadata for multiple datasets programmatically. We recommend trying the web interface in 1a first before going down this more advanced route.
Assume the full path to the file is ~/bmi_test.txt.gz.
# Don't provide the GWAS ID
x <- Dataset$new(filename="~/bmi_test.txt.gz")
x$collect_metadata(list(
trait="TEST - DO NOT USE 2",
group_name="public",
build="HG19/GRCh37",
category="Risk factor",
subcategory="Anthropometric",
ontology="NA",
population="Mixed",
sex="Males and Females",
sample_size=339224,
author="Mendel GJ",
year=2022,
unit="SD"
))
x$api_metadata_upload()
The GWAS ID will be returned by the last command. Assume it’s ieu-b-9999. At the same time a new record will show up on https://api.opengwas.io/contribution/.
2. Modify the metadata (only when necessary)
You can modify the metadata either via the web interface (recommended) or through R/GwasDataImport, regardless of how you created the metadata (i.e. metadata created via the R package can be modified on the web interface, or vice versa).
Note that the metadata can only be modified when there is no QC pipeline associated with it, because at the last step of the pipeline the metadata is hardcoded into the QC report. Metadata can therefore only be modified when:
- (a) the file has not been uploaded for QC, or
- (b) the QC pipeline has already finished and you chose to “delete uploaded files and QC products but keep the metadata”, which effectively reverts the state to (a)
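If you prefer R, one plausible route is to re-collect the metadata and upload it again under the existing GWAS ID. Whether re-uploading overwrites the existing record is an assumption here, so check the GwasDataImport documentation before relying on it:
# A minimal sketch -- ASSUMES re-uploading under the same GWAS ID
# overwrites the existing metadata record; verify against the package docs
x <- Dataset$new(filename="~/bmi_test.txt.gz", igd_id='ieu-b-9999')
x$collect_metadata(list(
 trait="TEST - DO NOT USE 2",
 group_name="public",
 build="HG19/GRCh37",
 category="Risk factor",
 subcategory="Anthropometric",
 ontology="NA",
 population="Mixed",
 sex="Males and Females",
 sample_size=339224,
 author="Mendel GJ",
 year=2023,  # e.g. correcting the year
 unit="SD"
))
x$api_metadata_upload()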
3. Format the file and upload for QC (R/GwasDataImport)
Always check that the GWAS ID and path information stored are accurate.
x$igd_id    # should match the GWAS ID, e.g. "ieu-b-9999"
x$filename  # should be the full path to the file
Specify the column mapping (1-indexed; see the package documentation for the full list of parameter names). The mapping below follows the example file shown earlier:
x$determine_columns(list(
chr_col=11,     # CHROM
pos_col=12,     # POS
ea_col=2,       # ALT (effect allele)
oa_col=3,       # REF (other allele)
beta_col=4,     # BETA
se_col=5,       # SE
pval_col=6,     # PVALUE
snp_col=1,      # ID (rsid)
eaf_col=7,      # AF (effect allele frequency)
ncontrol_col=8  # N (total sample size for this continuous trait)
))
Use the output to double-check the mapping. If necessary, run x$determine_columns(...) again with the corrected mapping.
Format the dataset and then upload (both may take a while):
x$format_dataset()
x$api_gwasdata_upload()
You will see the “Dataset has been added to the pipeline” message if the upload was successful.
And finally don’t forget to clean up the working directory:
x$delete_wd()
4. Check QC pipeline state and report, and submit for approval
On https://api.opengwas.io/contribution you can click the 2. QC tab in the dataset popup and check the pipeline state.
For each dataset, you should review the QC report when it’s available and decide whether to submit the dataset for approval or not. You will have the following options:
- Submit the dataset for approval
  - Go to the 3. Approval & Release tab and submit
- Re-upload the file for QC
  - Go to the 1. Metadata tab and “delete uploaded files and QC products”
- Discard the dataset
  - Go to the 1. Metadata tab and “delete everything”
You may also use the checkboxes on the main screen to select datasets and submit for approval in bulk.
Import in bulk
If you have multiple datasets you may want to write an R snippet to semi-automate the process, as sketched below.
Do the setup only once, then for each dataset go through steps 1b, 2 and 3. Finally, visit the portal as in step 4 and use the checkboxes to submit in bulk.
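A minimal sketch of such a loop follows; the file names and metadata values are purely illustrative, and it assumes all files share the column layout of the example above:
# Hypothetical table of files and per-dataset metadata
datasets <- data.frame(
 filename = c("~/bmi_test.txt.gz", "~/height_test.txt.gz"),
 trait = c("TEST - DO NOT USE 2", "TEST - DO NOT USE 3"),
 sample_size = c(339224, 253288),
 stringsAsFactors = FALSE
)
for (i in seq_len(nrow(datasets))) {
 # Step 1b: create metadata and obtain the GWAS ID
 x <- Dataset$new(filename=datasets$filename[i])
 x$collect_metadata(list(
  trait=datasets$trait[i],
  group_name="public",
  build="HG19/GRCh37",
  category="Risk factor",
  subcategory="Anthropometric",
  ontology="NA",
  population="Mixed",
  sex="Males and Females",
  sample_size=datasets$sample_size[i],
  author="Mendel GJ",
  year=2022,
  unit="SD"
 ))
 x$api_metadata_upload()
 # Step 3: map columns, format and upload for QC
 x$determine_columns(list(
  chr_col=11, pos_col=12, ea_col=2, oa_col=3, beta_col=4,
  se_col=5, pval_col=6, snp_col=1, eaf_col=7, ncontrol_col=8
 ))
 x$format_dataset()
 x$api_gwasdata_upload()
 x$delete_wd()
}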
What’s next
Admins will review and approve/reject each dataset. Approved datasets will go into the release pipeline, which may take 0.5 to 4 hours. After that you may query the dataset via the packages, e.g.:
ieugwasr::tophits("ieu-b-9999")
Note: https://gwas.mrcieu.ac.uk/ currently has a synchronisation issue, so new datasets may not be displayed there, but you can always query new datasets directly via the packages, as above.
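Other ieugwasr query functions work the same way once the dataset is released, for example:
# Metadata for the new dataset
ieugwasr::gwasinfo("ieu-b-9999")
# Look up specific variants (rsids taken from the example file above)
ieugwasr::associations(c("rs10399878", "rs1039063"), "ieu-b-9999")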