To use the IEU GWAS database for MR analysis, see the TwoSampleMR R package.
Here we’ll demonstrate how to achieve the same data extractions using the GWAS VCF files.
We’ll use the example of LDL cholesterol ieu-a-300 and coronary heart disease ieu-a-7.
Load libraries:
suppressPackageStartupMessages(suppressWarnings({ library(TwoSampleMR) library(gwasglue) library(gwasvcf) library(ieugwasr) library(dplyr) })) #> API: public: http://gwas-api.mrcieu.ac.uk/
This is a simple procedure for MR using the TwoSampleMR package:
# Extract the instruments for LDL expd1 <- TwoSampleMR::extract_instruments("ieu-a-300") # Extract those SNP effects for CHD outd1 <- TwoSampleMR::extract_outcome_data(expd1$SNP, "ieu-a-7", proxies=FALSE) #> Extracting data for 81 SNP(s) from 1 GWAS(s) # Harmonise the exposure and outcome data dat1 <- TwoSampleMR::harmonise_data(expd1, outd1) #> Harmonising LDL cholesterol || id:ieu-a-300 (ieu-a-300) and Coronary heart disease || id:ieu-a-7 (ieu-a-7) #> Removing the following SNPs for being palindromic with intermediate allele frequencies: #> rs2954029, rs964184 # Perform MR TwoSampleMR::mr(dat1) #> Analysing 'ieu-a-300' on 'ieu-a-7' #> id.exposure id.outcome outcome #> 1 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> 2 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> 3 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> 4 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> 5 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> exposure method nsnp b #> 1 LDL cholesterol || id:ieu-a-300 MR Egger 78 0.5067936 #> 2 LDL cholesterol || id:ieu-a-300 Weighted median 78 0.4014675 #> 3 LDL cholesterol || id:ieu-a-300 Inverse variance weighted 78 0.4141687 #> 4 LDL cholesterol || id:ieu-a-300 Simple mode 78 0.4774092 #> 5 LDL cholesterol || id:ieu-a-300 Weighted mode 78 0.5328338 #> se pval #> 1 0.07837328 8.725962e-09 #> 2 0.04287663 7.727079e-21 #> 3 0.05096262 4.403664e-16 #> 4 0.08115000 9.847698e-08 #> 5 0.07304508 2.290517e-10
Note that this extraction process can be simplified with:
dat1 <- make_dat("ieu-a-300", "ieu-a-7")
Let’s do the same with the vcf files (and the indexes). Download from here:
wget https://gwas.mrcieu.ac.uk/files/ieu-a-300/ieu-a-300.vcf.gz
wget https://gwas.mrcieu.ac.uk/files/ieu-a-300/ieu-a-300.vcf.gz.tbi
wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz
wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz.tbi
First get the tophits for LDL cholesterol
gwasvcf::set_bcftools() #> Loading required namespace: genetics.binaRies #> Path not provided, using binaries in the explodecomputer/genetics.binaRies package expd2 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", chrompos=paste0(expd1$chr.exposure, ":", expd1$pos.exposure))
Convert to TwoSampleMR format:
expd2 <- gwasglue::gwasvcf_to_TwoSampleMR(expd2, type="exposure")
Extract those SNPs from the outcome vcf file and convert to TwoSampleMR format
outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", chrompos = paste0(expd1$chr.exposure, ":", expd1$pos.exposure)) outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")
Proceed with harmonising and performing MR
dat2 <- TwoSampleMR::harmonise_data(expd2, outd2) #> Harmonising IEU-a-300 (elEq4R) and IEU-a-7 (wfSXNb) #> Removing the following SNPs for being palindromic with intermediate allele frequencies: #> rs2954029, rs3184504, rs964184 TwoSampleMR::mr(dat2) #> Analysing 'elEq4R' on 'wfSXNb' #> id.exposure id.outcome outcome exposure method nsnp #> 1 elEq4R wfSXNb IEU-a-7 IEU-a-300 MR Egger 77 #> 2 elEq4R wfSXNb IEU-a-7 IEU-a-300 Weighted median 77 #> 3 elEq4R wfSXNb IEU-a-7 IEU-a-300 Inverse variance weighted 77 #> 4 elEq4R wfSXNb IEU-a-7 IEU-a-300 Simple mode 77 #> 5 elEq4R wfSXNb IEU-a-7 IEU-a-300 Weighted mode 77 #> b se pval #> 1 0.4731258 0.07572869 2.299636e-08 #> 2 0.3976896 0.04767446 7.318419e-17 #> 3 0.4142950 0.04880527 2.088973e-17 #> 4 0.4476460 0.08088536 4.263685e-07 #> 5 0.5299800 0.07522516 7.215281e-10
If we want to extract top hits based on a threshold and clump locally. First download the LD reference dataset:
Now extract the top hits based on a p-value threshold
gwasvcf::set_bcftools() #> Path not provided, using binaries in the explodecomputer/genetics.binaRies package expd3 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", pval=5e-8) expd3 #> class: CollapsedVCF #> dim: 3051 1 #> rowRanges(vcf): #> GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER #> info(vcf): #> DataFrame with 3 columns: AF, AC, AN #> info(header(vcf)): #> Number Type Description #> AF A Float Allele Frequency #> AC A Integer Allele count in genotypes #> AN 1 Integer Total number of alleles in called genotypes #> geno(vcf): #> List of length 9: ES, SE, LP, AF, SS, EZ, SI, NC, ID #> geno(header(vcf)): #> Number Type Description #> ES A Float Effect size estimate relative to the alternative allele #> SE A Float Standard error of effect size estimate #> LP A Float -log10 p-value for effect estimate #> AF A Float Alternate allele frequency in the association study #> SS A Float Sample size used to estimate genetic effect #> EZ A Float Z-score provided if it was used to derive the EFFECT and... #> SI A Float Accuracy score of summary data imputation #> NC A Float Number of cases used to estimate genetic effect #> ID 1 String Study variant identifier
Convert to TwoSampleMR format:
expd3 <- gwasglue::gwasvcf_to_TwoSampleMR(expd3, type="exposure")
Get a list of SNPs to retain after clumping and subset the data
retain_snps <- expd3 %>% dplyr::select(rsid=SNP, pval=pval.exposure) %>% ieugwasr::ld_clump(., plink_bin=genetics.binaRies::get_plink_binary(), bfile="data_maf0.01_rs_ref") %>% {.$rsid} #> Clumping vbZ6sJ, 3051 variants, using EUR population reference #> Removing 2973 of 3051 variants due to LD with other variants or absence from LD reference panel expd3 <- subset(expd3, SNP %in% retain_snps)
This only works if you extract on rsids at the moment, and is quite slow. But here it is:
gwasvcf::set_plink() #> Path not provided, using binaries in the explodecomputer/genetics.binaRies package outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", rsid = expd3$SNP, proxies="yes", bfile="data_maf0.01_rs_ref") #> Initial search... #> Extracted 77 out of 78 rsids #> Searching for proxies for 1 rsids #> Determining searchspace... #> Proxy lookup... #> Finding proxies... #> Taking input= as a system command ('gunzip -c /var/folders/8t/4mzsfqzd5773rsc11w4q5fhw0000gp/T//RtmpmG4h9W/file172101101fa74.targets.ld.gz') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message. #> Found 426 proxies #> Extrating proxies... #> Identified proxies for 1 of 1 rsids #> Aligning... outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")
A number of other MR methods are available. The current framework for using them is to:
Examples of formats that you can convert to within TwoSampleMR:
All data in OpenGWAS is stored on bc4 in the form of GWAS VCF files. You can create harmonised datasets easily on bc4 with these files.
Determine the locations of the GWAS VCF files and a number of other reference datasets and binaries:
Now simply run:
dat <- make_TwoSampleMR_dat("ieu-a-300", "ieu-a-7")
This can be run in parallel for large combinations of exposures and outcomes, e.g.:
dat <- make_TwoSampleMR_dat( id1=c("ieu-a-300", "ieu-a-302", "ieu-a-299"), id2=c("ieu-a-7", "ieu-a-2"), nthreads=6 )
This will lookup all instruments in the exposures (id1) in both exposures and outcomes, and harmonise all exposure-outcome pairs, parallelised across 6 threads.
Note: please make sure to run these analyses in batch mode by submitting to the slurm scheduler, not interactively on the head nodes!