To use the IEU GWAS database for MR analysis, see the TwoSampleMR R package.

Here we’ll demonstrate how to achieve the same data extractions using the GWAS VCF files.

We’ll use the example of LDL cholesterol ieu-a-300 and coronary heart disease ieu-a-7.

suppressPackageStartupMessages(suppressWarnings({
library(TwoSampleMR)
library(gwasglue)
library(gwasvcf)
library(ieugwasr)
library(dplyr)
}))
#> API: public: http://gwas-api.mrcieu.ac.uk/

## Using TwoSampleMR

This is a simple procedure for MR using the TwoSampleMR package:

# Extract the instruments for LDL
expd1 <- TwoSampleMR::extract_instruments("ieu-a-300")

# Extract those SNP effects for CHD
outd1 <- TwoSampleMR::extract_outcome_data(expd1$SNP, "ieu-a-7", proxies=FALSE) #> Extracting data for 81 SNP(s) from 1 GWAS(s) # Harmonise the exposure and outcome data dat1 <- TwoSampleMR::harmonise_data(expd1, outd1) #> Harmonising LDL cholesterol || id:ieu-a-300 (ieu-a-300) and Coronary heart disease || id:ieu-a-7 (ieu-a-7) #> Removing the following SNPs for being palindromic with intermediate allele frequencies: #> rs2954029, rs964184 # Perform MR TwoSampleMR::mr(dat1) #> Analysing 'ieu-a-300' on 'ieu-a-7' #> id.exposure id.outcome outcome #> 1 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> 2 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> 3 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> 4 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> 5 ieu-a-300 ieu-a-7 Coronary heart disease || id:ieu-a-7 #> exposure method nsnp b #> 1 LDL cholesterol || id:ieu-a-300 MR Egger 78 0.5067936 #> 2 LDL cholesterol || id:ieu-a-300 Weighted median 78 0.4014675 #> 3 LDL cholesterol || id:ieu-a-300 Inverse variance weighted 78 0.4141687 #> 4 LDL cholesterol || id:ieu-a-300 Simple mode 78 0.4774092 #> 5 LDL cholesterol || id:ieu-a-300 Weighted mode 78 0.5328338 #> se pval #> 1 0.07837328 8.725962e-09 #> 2 0.04469608 2.654913e-19 #> 3 0.05096262 4.403664e-16 #> 4 0.07891687 4.905257e-08 #> 5 0.07742194 1.388095e-09 ## Using GWAS VCF files Let’s do the same with the vcf files (and the indexes). Download from here: wget https://gwas.mrcieu.ac.uk/files/ieu-a-300/ieu-a-300.vcf.gz wget https://gwas.mrcieu.ac.uk/files/ieu-a-300/ieu-a-300.vcf.gz.tbi wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz wget https://gwas.mrcieu.ac.uk/files/ieu-a-7/ieu-a-7.vcf.gz.tbi First get the tophits for LDL cholesterol gwasvcf::set_bcftools() #> Loading required namespace: genetics.binaRies #> Path not provided, using binaries in the explodecomputer/genetics.binaRies package expd2 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", chrompos=paste0(expd1$chr.exposure, ":", expd1$pos.exposure)) Convert to TwoSampleMR format: expd2 <- gwasglue::gwasvcf_to_TwoSampleMR(expd2, type="exposure") Extract those SNPs from the outcome vcf file and convert to TwoSampleMR format outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", chrompos = paste0(expd1$chr.exposure, ":", expd1$pos.exposure)) outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome") Proceed with harmonising and performing MR dat2 <- TwoSampleMR::harmonise_data(expd2, outd2) #> Harmonising IEU-a-300 (W6h4L6) and IEU-a-7 (JRY7cy) #> Removing the following SNPs for being palindromic with intermediate allele frequencies: #> rs2954029, rs3184504, rs964184 TwoSampleMR::mr(dat2) #> Analysing 'W6h4L6' on 'JRY7cy' #> id.exposure id.outcome outcome exposure method nsnp #> 1 W6h4L6 JRY7cy IEU-a-7 IEU-a-300 MR Egger 77 #> 2 W6h4L6 JRY7cy IEU-a-7 IEU-a-300 Weighted median 77 #> 3 W6h4L6 JRY7cy IEU-a-7 IEU-a-300 Inverse variance weighted 77 #> 4 W6h4L6 JRY7cy IEU-a-7 IEU-a-300 Simple mode 77 #> 5 W6h4L6 JRY7cy IEU-a-7 IEU-a-300 Weighted mode 77 #> b se pval #> 1 0.4731258 0.07572869 2.299636e-08 #> 2 0.3976896 0.04430251 2.790535e-19 #> 3 0.4142950 0.04880527 2.088973e-17 #> 4 0.4476460 0.07713274 1.414285e-07 #> 5 0.5299800 0.07387678 4.124467e-10 ## Other options ### Clumping vcf files If we want to extract top hits based on a threshold and clump locally. First download the LD reference dataset: wget http://fileserve.mrcieu.ac.uk/ld/data_maf0.01_rs_ref.tgz tar xzvf data_maf0.01_rs_ref.tgz Now extract the top hits based on a p-value threshold gwasvcf::set_bcftools() #> Path not provided, using binaries in the explodecomputer/genetics.binaRies package expd3 <- gwasvcf::query_gwas("ieu-a-300.vcf.gz", pval=5e-8) expd3 #> class: CollapsedVCF #> dim: 3051 1 #> rowRanges(vcf): #> GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER #> info(vcf): #> DataFrame with 3 columns: AF, AC, AN #> info(header(vcf)): #> Number Type Description #> AF A Float Allele Frequency #> AC A Integer Allele count in genotypes #> AN 1 Integer Total number of alleles in called genotypes #> geno(vcf): #> SimpleList of length 9: ES, SE, LP, AF, SS, EZ, SI, NC, ID #> geno(header(vcf)): #> Number Type Description #> ES A Float Effect size estimate relative to the alternative allele #> SE A Float Standard error of effect size estimate #> LP A Float -log10 p-value for effect estimate #> AF A Float Alternate allele frequency in the association study #> SS A Float Sample size used to estimate genetic effect #> EZ A Float Z-score provided if it was used to derive the EFFECT and... #> SI A Float Accuracy score of summary data imputation #> NC A Float Number of cases used to estimate genetic effect #> ID 1 String Study variant identifier Convert to TwoSampleMR format: expd3 <- gwasglue::gwasvcf_to_TwoSampleMR(expd3, type="exposure") Get a list of SNPs to retain after clumping and subset the data retain_snps <- expd3 %>% dplyr::select(rsid=SNP, pval=pval.exposure) %>% ieugwasr::ld_clump(., plink_bin=genetics.binaRies::get_plink_binary(), bfile="data_maf0.01_rs_ref") %>% {.$rsid}
#> Clumping Ea4pxd, 3051 variants, using EUR population reference
#> Removing 2973 of 3051 variants due to LD with other variants or absence from LD reference panel
expd3 <- subset(expd3, SNP %in% retain_snps)

### Extracting outcome data with LD proxies

This only works if you extract on rsids at the moment, and is quite slow. But here it is:

gwasvcf::set_plink()
#> Path not provided, using binaries in the explodecomputer/genetics.binaRies package
outd2 <- gwasvcf::query_gwas("ieu-a-7.vcf.gz", rsid = expd3\$SNP, proxies="yes", bfile="data_maf0.01_rs_ref")
#> Initial search...
#> Extracted 77 out of 78 rsids
#> Searching for proxies for 1 rsids
#> Determining searchspace...
#> Proxy lookup...
#> Finding proxies...
#> Taking input= as a system command ('gunzip -c /var/folders/8t/4mzsfqzd5773rsc11w4q5fhw0000gp/T//Rtmpu5CMcy/file16545566b9207.targets.ld.gz') and a variable has been used in the expression passed to input=. Please use fread(cmd=...). There is a security concern if you are creating an app, and the app could have a malicious user, and the app is not running in a secure environment; e.g. the app is running as root. Please read item 5 in the NEWS file for v1.11.6 for more information and for the option to suppress this message.
#> Found 426 proxies
#> Extrating proxies...
#> Identified proxies for 1 of 1 rsids
#> Aligning...
outd2 <- gwasglue::gwasvcf_to_TwoSampleMR(outd2, "outcome")

## Further MR methods

A number of other MR methods are available. The current framework for using them is to:

1. Convert your data to TwoSampleMR format
2. Use the TwoSampleMR package to convert to other formats

Examples of formats that you can convert to within TwoSampleMR: