To use the IEU GWAS database for MR analysis, see the TwoSampleMR R package.

Here we’ll demonstrate how to achieve the same data extractions using the GWAS VCF files.

We’ll use the example of LDL cholesterol ieu-a-300 and coronary heart disease ieu-a-7.

Load libraries:

#> API: public:

Using TwoSampleMR

This is a simple procedure for MR using the TwoSampleMR package:

# Extract the instruments for LDL
expd1 <- TwoSampleMR::extract_instruments("ieu-a-300")

# Extract those SNP effects for CHD
outd1 <- TwoSampleMR::extract_outcome_data(expd1$SNP, "ieu-a-7", proxies=FALSE)
#> Extracting data for 81 SNP(s) from 1 GWAS(s)

# Harmonise the exposure and outcome data
dat1 <- TwoSampleMR::harmonise_data(expd1, outd1)
#> Harmonising LDL cholesterol || id:ieu-a-300 (ieu-a-300) and Coronary heart disease || id:ieu-a-7 (ieu-a-7)
#> Removing the following SNPs for being palindromic with intermediate allele frequencies:
#> rs2954029, rs964184

# Perform MR
#> Analysing 'ieu-a-300' on 'ieu-a-7'
#>   id.exposure id.outcome                              outcome
#> 1   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#> 2   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#> 3   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#> 4   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#> 5   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#>                          exposure                    method nsnp         b
#> 1 LDL cholesterol || id:ieu-a-300                  MR Egger   78 0.5067936
#> 2 LDL cholesterol || id:ieu-a-300           Weighted median   78 0.4014675
#> 3 LDL cholesterol || id:ieu-a-300 Inverse variance weighted   78 0.4141687
#> 4 LDL cholesterol || id:ieu-a-300               Simple mode   78 0.4774092
#> 5 LDL cholesterol || id:ieu-a-300             Weighted mode   78 0.5328338
#>           se         pval
#> 1 0.07837328 8.725962e-09
#> 2 0.04287663 7.727079e-21
#> 3 0.05096262 4.403664e-16
#> 4 0.08115000 9.847698e-08
#> 5 0.07304508 2.290517e-10

Note that this extraction process can be simplified with:

dat1 <- make_dat("ieu-a-300", "ieu-a-7")

Using GWAS VCF files

Let’s do the same with the vcf files (and the indexes). Download from here:


First get the tophits for LDL cholesterol

#> Loading required namespace: genetics.binaRies
#> Path not provided, using binaries in the explodecomputer/genetics.binaRies package
expd2 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", chrompos=paste0(expd1$chr.exposure, ":", expd1$pos.exposure))

Convert to TwoSampleMR format:

expd2 <- gwasglue::gwasvcf_to_TwoSampleMR(expd2, type="exposure")

Extract those SNPs from the outcome vcf file and convert to TwoSampleMR format

outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", chrompos = paste0(expd1$chr.exposure, ":", expd1$pos.exposure))
outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")

Proceed with harmonising and performing MR

dat2 <- TwoSampleMR::harmonise_data(expd2, outd2)
#> Harmonising IEU-a-300 (elEq4R) and IEU-a-7 (wfSXNb)
#> Removing the following SNPs for being palindromic with intermediate allele frequencies:
#> rs2954029, rs3184504, rs964184
#> Analysing 'elEq4R' on 'wfSXNb'
#>   id.exposure id.outcome outcome  exposure                    method nsnp
#> 1      elEq4R     wfSXNb IEU-a-7 IEU-a-300                  MR Egger   77
#> 2      elEq4R     wfSXNb IEU-a-7 IEU-a-300           Weighted median   77
#> 3      elEq4R     wfSXNb IEU-a-7 IEU-a-300 Inverse variance weighted   77
#> 4      elEq4R     wfSXNb IEU-a-7 IEU-a-300               Simple mode   77
#> 5      elEq4R     wfSXNb IEU-a-7 IEU-a-300             Weighted mode   77
#>           b         se         pval
#> 1 0.4731258 0.07572869 2.299636e-08
#> 2 0.3976896 0.04767446 7.318419e-17
#> 3 0.4142950 0.04880527 2.088973e-17
#> 4 0.4476460 0.08088536 4.263685e-07
#> 5 0.5299800 0.07522516 7.215281e-10

Other options

Clumping vcf files

If we want to extract top hits based on a threshold and clump locally. First download the LD reference dataset:

tar xzvf data_maf0.01_rs_ref.tgz

Now extract the top hits based on a p-value threshold

#> Path not provided, using binaries in the explodecomputer/genetics.binaRies package
expd3 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", pval=5e-8)
#> class: CollapsedVCF 
#> dim: 3051 1 
#> rowRanges(vcf):
#>   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
#> info(vcf):
#>   DataFrame with 3 columns: AF, AC, AN
#> info(header(vcf)):
#>       Number Type    Description                                
#>    AF A      Float   Allele Frequency                           
#>    AC A      Integer Allele count in genotypes                  
#>    AN 1      Integer Total number of alleles in called genotypes
#> geno(vcf):
#>   List of length 9: ES, SE, LP, AF, SS, EZ, SI, NC, ID
#> geno(header(vcf)):
#>       Number Type   Description                                                
#>    ES A      Float  Effect size estimate relative to the alternative allele    
#>    SE A      Float  Standard error of effect size estimate                     
#>    LP A      Float  -log10 p-value for effect estimate                         
#>    AF A      Float  Alternate allele frequency in the association study        
#>    SS A      Float  Sample size used to estimate genetic effect                
#>    EZ A      Float  Z-score provided if it was used to derive the EFFECT and...
#>    SI A      Float  Accuracy score of summary data imputation                  
#>    NC A      Float  Number of cases used to estimate genetic effect            
#>    ID 1      String Study variant identifier

Convert to TwoSampleMR format:

expd3 <- gwasglue::gwasvcf_to_TwoSampleMR(expd3, type="exposure")

Get a list of SNPs to retain after clumping and subset the data

retain_snps <- expd3 %>% dplyr::select(rsid=SNP, pval=pval.exposure) %>%
    ieugwasr::ld_clump(., plink_bin=genetics.binaRies::get_plink_binary(), bfile="data_maf0.01_rs_ref") %>%
#> Clumping vbZ6sJ, 3051 variants, using EUR population reference
#> Removing 2973 of 3051 variants due to LD with other variants or absence from LD reference panel
expd3 <- subset(expd3, SNP %in% retain_snps)

Extracting outcome data with LD proxies

This only works if you extract on rsids at the moment, and is quite slow. But here it is:

#> Path not provided, using binaries in the explodecomputer/genetics.binaRies package
outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", rsid = expd3$SNP, proxies="yes", bfile="data_maf0.01_rs_ref")
#> Initial search...
#> Extracted 77 out of 78 rsids
#> Searching for proxies for 1 rsids
#> Determining searchspace...
#> Proxy lookup...
#> Finding proxies...
#> Taking input= as a system command ('gunzip -c /var/folders/8t/4mzsfqzd5773rsc11w4q5fhw0000gp/T//RtmpmG4h9W/file172101101fa74.targets.ld.gz') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
#> Found 426 proxies
#> Extrating proxies...
#> Identified proxies for 1 of 1 rsids
#> Aligning...
outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")

Further MR methods

A number of other MR methods are available. The current framework for using them is to:

  1. Convert your data to TwoSampleMR format
  2. Use the TwoSampleMR package to convert to other formats

Examples of formats that you can convert to within TwoSampleMR:

Bluecrystal4 users

All data in OpenGWAS is stored on bc4 in the form of GWAS VCF files. You can create harmonised datasets easily on bc4 with these files.

Determine the locations of the GWAS VCF files and a number of other reference datasets and binaries:

Now simply run:

dat <- make_TwoSampleMR_dat("ieu-a-300", "ieu-a-7")

This can be run in parallel for large combinations of exposures and outcomes, e.g.:

dat <- make_TwoSampleMR_dat(
    id1=c("ieu-a-300", "ieu-a-302", "ieu-a-299"), 
    id2=c("ieu-a-7", "ieu-a-2"),

This will lookup all instruments in the exposures (id1) in both exposures and outcomes, and harmonise all exposure-outcome pairs, parallelised across 6 threads.

Note: please make sure to run these analyses in batch mode by submitting to the slurm scheduler, not interactively on the head nodes!