Outcome data
Gibran Hemani
outcome.Rmd
library(TwoSampleMR)
Once instruments for the exposure trait have been specified, those variants need to be extracted from the outcome trait.
Available studies in IEU GWAS database
The IEU GWAS database (IGD) contains complete GWAS summary statistics from a large number of studies. You can browse them here:
To obtain details about the available GWASs programmatically do the following:
ao <- available_outcomes()
#> API: public: http://gwas-api.mrcieu.ac.uk/
head(ao)
#> # A tibble: 6 × 22
#> id trait ncase group_name year author consortium sex pmid population
#> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr> <int> <chr>
#> 1 ieu-b-… Schi… 1234 public 2022 Trube… PGC Male… 3.54e7 Hispanic …
#> 2 ieu-b-… Schi… 52017 public 2022 Trube… PGC Male… 3.54e7 European
#> 3 ieu-b-… Schi… 12305 public 2022 Trube… PGC Male… 3.54e7 East Asian
#> 4 ieu-b-… Schi… 64322 public 2022 Trube… PGC Male… 3.54e7 Mixed
#> 5 ieu-b-… Schi… 76755 public 2022 Trube… PGC Male… 3.54e7 Mixed
#> 6 ieu-b-… Schi… 5998 public 2022 Trube… PGC Male… 3.54e7 African A…
#> # ℹ 12 more variables: unit <chr>, sample_size <int>, build <chr>,
#> # ncontrol <int>, category <chr>, subcategory <chr>, ontology <chr>,
#> # note <chr>, mr <int>, nsnp <int>, priority <int>, sd <dbl>
For information about authentication see https://mrcieu.github.io/ieugwasr/articles/guide.html#authentication.
The available_outcomes
function returns a table of all
the available studies in the database. Each study has a unique ID.
e.g.
Extracting particular SNPs from particular studies
If we want to perform MR of BMI against coronary heart disease, we need to identify the SNPs that influence the BMI, and then extract those SNPs from a GWAS on coronary heart disease.
Let’s get the Locke et al 2014 instruments for BMI as an example:
bmi_exp_dat <- extract_instruments(outcomes = 'ieu-a-2')
head(bmi_exp_dat)
#> pval.exposure samplesize.exposure chr.exposure se.exposure beta.exposure
#> 1 2.18198e-08 339152 1 0.0030 -0.0168
#> 2 4.56773e-11 339065 1 0.0031 0.0201
#> 3 5.05941e-14 313621 1 0.0087 0.0659
#> 4 5.45205e-10 338768 1 0.0029 0.0181
#> 5 1.88018e-28 338123 1 0.0030 0.0331
#> 6 2.28718e-40 339078 1 0.0037 0.0497
#> pos.exposure id.exposure SNP effect_allele.exposure
#> 1 47684677 ieu-a-2 rs977747 G
#> 2 78048331 ieu-a-2 rs17381664 C
#> 3 110082886 ieu-a-2 rs7550711 T
#> 4 201784287 ieu-a-2 rs2820292 C
#> 5 72837239 ieu-a-2 rs7531118 C
#> 6 177889480 ieu-a-2 rs543874 G
#> other_allele.exposure eaf.exposure exposure
#> 1 T 0.5333 Body mass index || id:ieu-a-2
#> 2 T 0.4250 Body mass index || id:ieu-a-2
#> 3 C 0.0339 Body mass index || id:ieu-a-2
#> 4 A 0.5083 Body mass index || id:ieu-a-2
#> 5 T 0.6083 Body mass index || id:ieu-a-2
#> 6 A 0.2667 Body mass index || id:ieu-a-2
#> mr_keep.exposure pval_origin.exposure data_source.exposure
#> 1 TRUE reported igd
#> 2 TRUE reported igd
#> 3 TRUE reported igd
#> 4 TRUE reported igd
#> 5 TRUE reported igd
#> 6 TRUE reported igd
We now need to find a suitable GWAS for coronary heart disease. We can search the available studies:
ao[grepl("heart disease", ao$trait), ]
#> # A tibble: 31 × 22
#> id trait ncase group_name year author consortium sex pmid population
#> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr> <int> <chr>
#> 1 finn… Majo… 21012 public 2021 NA NA Male… NA European
#> 2 ukb-… Diag… 1195 public 2018 Ben E… MRC-IEU Male… NA European
#> 3 ukb-… I25 … 302 public 2020 Pan-U… NA Male… NA African A…
#> 4 ukb-… Diag… 9330 public 2018 Ben E… MRC-IEU Male… NA European
#> 5 ieu-… Coro… 60801 public 2015 Nikpay CARDIoGRA… Male… 2.63e7 Mixed
#> 6 finn… Seco… 428 public 2021 NA NA Male… NA European
#> 7 ukb-… Diag… 5771 public 2018 Ben E… MRC-IEU Male… NA European
#> 8 ukb-… Diag… 8755 public 2017 Neale Neale Lab Male… NA European
#> 9 ukb-… I25 … 1205 public 2020 Pan-U… NA Male… NA South Asi…
#> 10 finn… Othe… 62081 public 2021 NA NA Male… NA European
#> # ℹ 21 more rows
#> # ℹ 12 more variables: unit <chr>, sample_size <int>, build <chr>,
#> # ncontrol <int>, category <chr>, subcategory <chr>, ontology <chr>,
#> # note <chr>, mr <int>, nsnp <int>, priority <int>, sd <dbl>
The most recent CARDIOGRAM GWAS is ID number ieu-a-7
. We
can extract the BMI SNPs from this GWAS as follows:
chd_out_dat <- extract_outcome_data(
snps = bmi_exp_dat$SNP,
outcomes = 'ieu-a-7'
)
#> Extracting data for 79 SNP(s) from 1 GWAS(s)
The extract_outcome_data()
function is flexible. The
snps
argument only requires an array of rsIDs, and the
outcomes
argument can be a vector of outcomes, e.g.
chd_out_dat <- extract_outcome_data(c("rs234", "rs17097147"), c('ieu-a-2', 'ieu-a-7'))
will extract the two SNPs from each of the outcomes
ieu-a-2
and ieu-a-7
.
LD proxies
By default if a particular requested SNP is not present in the outcome GWAS then a SNP (proxy) that is in LD with the requested SNP (target) will be searched for instead. LD proxies are defined using 1000 genomes European sample data. The effect of the proxy SNP on the outcome is returned, along with the proxy SNP, the effect allele of the proxy SNP, and the corresponding allele (in phase) for the target SNP.
The parameters for handling LD proxies are as follows:
-
proxies
= TRUE or FALSE (TRUE by default) -
rsq
= numeric value of minimum rsq to find a proxy. Default is 0.8, minimum is 0.6 -
palindromes
= Allow palindromic SNPs? Default is 1 (yes) -
maf_threshold
= If palindromes allowed then what is the maximum minor allele frequency of palindromes allowed? Default is 0.3.
Using local GWAS summary data
If you have GWAS summary data that is not present in IEU GWAS database, this can still be used to perform analysis.
Supposing there is a GWAS summary file called “gwas_summary.csv” with e.g. 2 million rows and it looks like this:
rsid,effect,SE,a1,a2,a1_freq,p-value,Units,Gene,n
rs10767664,0.19,0.030612245,A,T,0.78,5.00E-26,kg/m2,BDNF,225238
rs13078807,0.1,0.020408163,G,A,0.2,4.00E-11,kg/m2,CADM2,221431
rs1514175,0.07,0.020408163,A,G,0.43,8.00E-14,kg/m2,TNNI3K,207641
rs1558902,0.39,0.020408163,A,T,0.42,5.00E-120,kg/m2,FTO,222476
...
...
To extract the exposure SNPs from this data, we would use the following command:
outcome_dat <- read_outcome_data(
snps = bmi_exp_dat$SNP,
filename = "gwas_summary.csv",
sep = ",",
snp_col = "rsid",
beta_col = "effect",
se_col = "SE",
effect_allele_col = "a1",
other_allele_col = "a2",
eaf_col = "a1_freq",
pval_col = "p-value",
units_col = "Units",
gene_col = "Gene",
samplesize_col = "n"
)
This returns an outcome data frame with only the SNPs that were requested (if those SNPs were present in the “gwas_summary.csv” file).
Outcome data format
The extract_outcome_data
function returns a table of SNP
effects for the requested SNPs on the requested outcomes. The format of
the data is similar to the exposure data format, except the main columns
are as follows:
SNP
beta.outcome
se.outcome
samplesize.outcome
ncase.outcome
ncontrol.outcome
pval.outcome
eaf.outcome
effect_allele.outcom
other_allele.outcome
units.outcome
outcome
consortium.outcome
year.outcome
pmid.outcome
id.outcome
originalname.outcome
proxy.outcome
target_snp.outcome
proxy_snp.outcome
target_a1.outcome
target_a2.outcome
proxy_a1.outcome
proxy_a2.outcome
mr_keep.outcome
data_source.outcome
More advanced use of local data
We have developed a summary data format called “GWAS VCF”, which is designed to store GWAS results in a strict and performant way. It is possible to use this format with the TwoSampleMR package. Going down this avenue also allows you to use LD proxy functionality using your own LD reference files (or ones that we provide). For more details, see this package that explains the format and how to query it in R:
https://github.com/mrcieu/gwasvcf
and this package for how to connect the data to other packages including TwoSampleMR