Major changes to the IEU GWAS resources for 2020Source:
This document details changes and new features specifically relating to the TwoSampleMR R package and the GWAS database behind it.
We have made a new system for naming datasets, and all datasets are
organised into data batches. Either new datasets are uploaded one at a
time in which case they are added to the
ieu-a data batch,
or there is a bulk upload in which case a new batch is created. For
ukb-a is a bulk upload of the first round of the
Neale lab UKBiobank GWAS, and
ukb-b is the IEU GWAS
analysis of the UKBiobank data. In most cases, a dataset is then
numbered arbitrarily within the batch. For example, the Locke et al 2014
BMI analysis was previously known as
2, and it is now known
There is backward compatibility built into the R packages that access the data, so if you use an ‘old’ ID, it will automatically translate that to the new one. But it will give you a warning, and we urge you to update your scripts to reflect this change.
Previously you would automatically be asked to authenticate any query to the database, through google. Now, we are making authentication voluntary - something that you do at the start of a session only if you need access to specific private datasets on the database. For the vast majority of use cases this is not required.
Another change is that the R package that managed the authentication has updated, and the file tokens generated are slightly different. For full information on how to deal with this, see here: https://mrcieu.github.io/ieugwasr/articles/guide.html#authentication
We conducted a large GWAS analysis using a pipeline that systematically analysed every PHESANT phenotype in UK Biobank. There were previously ~20k traits with complete GWAS data, but a majority of these were binary traits based on very few numbers of cases. We have now filtered out unreliable datasets, there are 2514 traits remaining, with any binary traits removed that had fewer than 1000 cases. Another issue is the combination of small numbers of cases and allele frequency - here minor allele count (MAC) for a particular association could be very small which would lead to high false positives when using Bolt-LMM. The remaining traits have been filtered to only retain associations where the MAC > 90.
Document detailing this investigation here: https://htmlpreview.github.io/?https://raw.githubusercontent.com/MRCIEU/ukbb-gwas-analysis/master/docs/ldsc_clumped_analysis.html?token=AAOV6TBQXEXEPT7SUXXLWMC6DWP3O
Previously the data were QC’d to remove malformed results and then deposited as we found them. We are now also pre-harmonising all the data. This means that all alleles are coded on the forward strand, and the non-effect allele is always aligned to the human genome reference sequence B37 (so the effect allele is the non-reference allele). This does mean that sometimes variants have been removed if they did not map to the human genome, and for most datasets the effect allele has been switched for approximately half of all sites. When an effect allele changes we do of course switch the sign of the effect size, so it should not impact any MR results.
We have updated the LD reference panel to be harmonised against human genome build 37, and as a consequence a few variants have been lost from the version that was previously used.
Previously we were pre-clumping the tophits and storing them in the MRInstruments R package, and there was often a delay in updating the MRInstruments R package after new datasets were uploaded to the database. We have moved away from this model. Everything dataset is pre-clumped, but that is stored in the database. If you request default clumping values when extracting the tophits of a dataset, it will still be fast but it is retrieving the data from the server, and not from the MRInstruments package. You can continue to use the MRInstruments package for GWAS hits from e.g. GTEx or the EBI GWAS catalog.
All rs IDs have been mapped to dbSNP build 144. Therefore, some rs IDs may have changed, but there is stronger alignment across all datasets.
We have a new home for the GWAS summary data: https://gwas.mrcieu.ac.uk/.
All variants have been mapped to chromosome and position
(hg19/build37). You can query based on chromosome position coordinates.
This means either a list of
<chr:pos> values, or a
Previously we were excluding these, but they are now retained. Be warned that if you extract a variant that has multiple alleles then you may get more than one row for that variant.
Automated download from the EBI repository, and an automated upload system and batch data processing system means that more data can be added faster to keep the database current.
Previously if a query to the database failed it didn’t give a reason, hopefully there is more clarity regarding what is happening now. You can also check the status of the server here: https://gwas-api.mrcieu.ac.uk/
We are trying to make it as flexible as possible to access the data. The TwoSampleMR R package was previously the only programmatic way to access the data, now we have the following options:
- ieugwasr R package: All the TwoSampleMR functions that access the data are done by calling this package now. It is a simple wrapper around the API that controls access to the database.
- ieugwaspy python package: Similar functionality to ieugwasr (Under construction).
- API: You can use the API directly, e.g. for building your own services or applications.
It is now possible to perform clumping, or create LD matrices, using your own local LD reference dataset. You can download the one that we have been using here: https://github.com/mrcieu/gwasglue#reference-datasets, or create your own plink format dataset e.g. with larger samples or for different ancestries. See the LD clumping functions in the ieugwasr package for more details.
Previously the data was only accessible through the database. Now the data can be downloaded in “GWAS VCF” format from here https://gwas.mrcieu.ac.uk/. (IEU members can access all the data on RDSF or bluecrystal4 directly). This means that if you want to perform very large or numerous operations, you can do it on HPC or locally in a more performant manner by using the data files directly. Please see the gwasvcf R package on how to work with these data.
Either the data in the database, or the GWAS VCF files, can be queried and the results translated into the formats for a bunch of different R packages for MR, colocalisation, fine mapping, etc. Have a look at the gwasglue R package, to see what is available and how to do this. It’s still under construction, but feel free to try it, make suggestions, and contribute code.
- The IEU GWAS database: https://gwas.mrcieu.ac.uk
- API to the IEU GWAS database: https://gwas-api.mrcieu.ac.uk
- ieugwasr package, for R access to the API: https://mrcieu.github.io/ieugwasr/
- ieugwaspy package, for python access to the API: https://github.com/MRCIEU/ieugwaspy/ (Under construction)
- gwasvcf package, R interface to GWAS VCF files: https://mrcieu.github.io/gwasvcf/
- pygwasvcf package, python interface to GWAS VCF files: https://github.com/mrcieu/pygwasvcf (Under construction)
- gwasglue package, linking GWAS data to various analytical methods: https://mrcieu.github.io/gwasglue/ (Functional, under construction)
- gwas2vcf online tool, allowing users to create GWAS VCF files from summary data: https://github.com/MRCIEU/gwas2vcf
We have setup a github issues page here: https://github.com/MRCIEU/opengwas-requests/issues
Please visit here to make a log of new data requests, or to contribute new data.
To install the new version of TwoSampleMR, perform as normal:
To update the package just run the
We recommend using this new version going forwards but for a limited time we are enabling backwards compatibility, in case you are in the middle of analysis or need to reproduce old analysis. In order to use the legacy version of the package and the database, install using: