To use the IEU GWAS database for MR analysis, see the TwoSampleMR R package.

Here we’ll demonstrate how to achieve the same data extractions using the GWAS VCF files.

We’ll use the example of LDL cholesterol ieu-a-300 and coronary heart disease ieu-a-7.

Load libraries:

suppressPackageStartupMessages(suppressWarnings({
    library(TwoSampleMR)
    library(gwasglue)
    library(gwasvcf)
    library(ieugwasr)
    library(dplyr)
}))
#> API: public: http://gwas-api.mrcieu.ac.uk/

Using TwoSampleMR

This is a simple procedure for MR using the TwoSampleMR package:

# Extract the instruments for LDL
expd1 <- TwoSampleMR::extract_instruments("ieu-a-300")

# Extract those SNP effects for CHD
outd1 <- TwoSampleMR::extract_outcome_data(expd1$SNP, "ieu-a-7", proxies=FALSE)
#> Extracting data for 81 SNP(s) from 1 GWAS(s)

# Harmonise the exposure and outcome data
dat1 <- TwoSampleMR::harmonise_data(expd1, outd1)
#> Harmonising LDL cholesterol || id:ieu-a-300 (ieu-a-300) and Coronary heart disease || id:ieu-a-7 (ieu-a-7)
#> Removing the following SNPs for being palindromic with intermediate allele frequencies:
#> rs2954029, rs964184

# Perform MR
TwoSampleMR::mr(dat1)
#> Analysing 'ieu-a-300' on 'ieu-a-7'
#>   id.exposure id.outcome                              outcome
#> 1   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#> 2   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#> 3   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#> 4   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#> 5   ieu-a-300    ieu-a-7 Coronary heart disease || id:ieu-a-7
#>                          exposure                    method nsnp         b
#> 1 LDL cholesterol || id:ieu-a-300                  MR Egger   78 0.5067936
#> 2 LDL cholesterol || id:ieu-a-300           Weighted median   78 0.4014675
#> 3 LDL cholesterol || id:ieu-a-300 Inverse variance weighted   78 0.4141687
#> 4 LDL cholesterol || id:ieu-a-300               Simple mode   78 0.4774092
#> 5 LDL cholesterol || id:ieu-a-300             Weighted mode   78 0.5328338
#>           se         pval
#> 1 0.07837328 8.725962e-09
#> 2 0.04469608 2.654913e-19
#> 3 0.05096262 4.403664e-16
#> 4 0.07891687 4.905257e-08
#> 5 0.07742194 1.388095e-09

Using GWAS VCF files

Let’s do the same with the vcf files (and the indexes). Download from here:

wget https://gwas.mrcieu.ac.uk/files/ieu-a-300/ieu-a-300.vcf.gz
wget https://gwas.mrcieu.ac.uk/files/ieu-a-300/ieu-a-300.vcf.gz.tbi
wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz
wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz.tbi

First get the tophits for LDL cholesterol

gwasvcf::set_bcftools()
#> Loading required namespace: genetics.binaRies
#> Path not provided, using binaries in the explodecomputer/genetics.binaRies package
expd2 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", chrompos=paste0(expd1$chr.exposure, ":", expd1$pos.exposure))

Convert to TwoSampleMR format:

expd2 <- gwasglue::gwasvcf_to_TwoSampleMR(expd2, type="exposure")

Extract those SNPs from the outcome vcf file and convert to TwoSampleMR format

outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", chrompos = paste0(expd1$chr.exposure, ":", expd1$pos.exposure))
outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")

Proceed with harmonising and performing MR

dat2 <- TwoSampleMR::harmonise_data(expd2, outd2)
#> Harmonising IEU-a-300 (W6h4L6) and IEU-a-7 (JRY7cy)
#> Removing the following SNPs for being palindromic with intermediate allele frequencies:
#> rs2954029, rs3184504, rs964184
TwoSampleMR::mr(dat2)
#> Analysing 'W6h4L6' on 'JRY7cy'
#>   id.exposure id.outcome outcome  exposure                    method nsnp
#> 1      W6h4L6     JRY7cy IEU-a-7 IEU-a-300                  MR Egger   77
#> 2      W6h4L6     JRY7cy IEU-a-7 IEU-a-300           Weighted median   77
#> 3      W6h4L6     JRY7cy IEU-a-7 IEU-a-300 Inverse variance weighted   77
#> 4      W6h4L6     JRY7cy IEU-a-7 IEU-a-300               Simple mode   77
#> 5      W6h4L6     JRY7cy IEU-a-7 IEU-a-300             Weighted mode   77
#>           b         se         pval
#> 1 0.4731258 0.07572869 2.299636e-08
#> 2 0.3976896 0.04430251 2.790535e-19
#> 3 0.4142950 0.04880527 2.088973e-17
#> 4 0.4476460 0.07713274 1.414285e-07
#> 5 0.5299800 0.07387678 4.124467e-10

Other options

Clumping vcf files

If we want to extract top hits based on a threshold and clump locally. First download the LD reference dataset:

wget http://fileserve.mrcieu.ac.uk/ld/data_maf0.01_rs_ref.tgz
tar xzvf data_maf0.01_rs_ref.tgz

Now extract the top hits based on a p-value threshold

gwasvcf::set_bcftools()
#> Path not provided, using binaries in the explodecomputer/genetics.binaRies package
expd3 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", pval=5e-8)
expd3
#> class: CollapsedVCF 
#> dim: 3051 1 
#> rowRanges(vcf):
#>   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
#> info(vcf):
#>   DataFrame with 3 columns: AF, AC, AN
#> info(header(vcf)):
#>       Number Type    Description                                
#>    AF A      Float   Allele Frequency                           
#>    AC A      Integer Allele count in genotypes                  
#>    AN 1      Integer Total number of alleles in called genotypes
#> geno(vcf):
#>   SimpleList of length 9: ES, SE, LP, AF, SS, EZ, SI, NC, ID
#> geno(header(vcf)):
#>       Number Type   Description                                                
#>    ES A      Float  Effect size estimate relative to the alternative allele    
#>    SE A      Float  Standard error of effect size estimate                     
#>    LP A      Float  -log10 p-value for effect estimate                         
#>    AF A      Float  Alternate allele frequency in the association study        
#>    SS A      Float  Sample size used to estimate genetic effect                
#>    EZ A      Float  Z-score provided if it was used to derive the EFFECT and...
#>    SI A      Float  Accuracy score of summary data imputation                  
#>    NC A      Float  Number of cases used to estimate genetic effect            
#>    ID 1      String Study variant identifier

Convert to TwoSampleMR format:

expd3 <- gwasglue::gwasvcf_to_TwoSampleMR(expd3, type="exposure")

Get a list of SNPs to retain after clumping and subset the data

retain_snps <- expd3 %>% dplyr::select(rsid=SNP, pval=pval.exposure) %>%
    ieugwasr::ld_clump(., plink_bin=genetics.binaRies::get_plink_binary(), bfile="data_maf0.01_rs_ref") %>%
    {.$rsid}
#> Clumping Ea4pxd, 3051 variants, using EUR population reference
#> Removing 2973 of 3051 variants due to LD with other variants or absence from LD reference panel
expd3 <- subset(expd3, SNP %in% retain_snps)

Extracting outcome data with LD proxies

This only works if you extract on rsids at the moment, and is quite slow. But here it is:

gwasvcf::set_plink()
#> Path not provided, using binaries in the explodecomputer/genetics.binaRies package
outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", rsid = expd3$SNP, proxies="yes", bfile="data_maf0.01_rs_ref")
#> Initial search...
#> Extracted 77 out of 78 rsids
#> Searching for proxies for 1 rsids
#> Determining searchspace...
#> Proxy lookup...
#> Finding proxies...
#> Taking input= as a system command ('gunzip -c /var/folders/8t/4mzsfqzd5773rsc11w4q5fhw0000gp/T//Rtmpu5CMcy/file16545566b9207.targets.ld.gz') and a variable has been used in the expression passed to `input=`. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
#> Found 426 proxies
#> Extrating proxies...
#> Identified proxies for 1 of 1 rsids
#> Aligning...
outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")

Further MR methods

A number of other MR methods are available. The current framework for using them is to:

  1. Convert your data to TwoSampleMR format
  2. Use the TwoSampleMR package to convert to other formats

Examples of formats that you can convert to within TwoSampleMR: