This document details changes and new features specifically relating to the TwoSampleMR R package and the GWAS database behind it.

What has changed

Dataset IDs

We have introduced a new system for naming datasets, and all datasets are now organised into data batches. New datasets are either uploaded one at a time, in which case they are added to the ieu-a data batch, or uploaded in bulk, in which case a new batch is created. For example, ukb-a is a bulk upload of the first round of the Neale lab UK Biobank GWAS, and ukb-b is the IEU GWAS analysis of the UK Biobank data. In most cases, a dataset is then numbered arbitrarily within its batch. For example, the Locke et al 2014 BMI analysis was previously known as 2, and it is now known as ieu-a-2.

There is backward compatibility built into the R packages that access the data, so if you use an ‘old’ ID, it will automatically translate that to the new one. But it will give you a warning, and we urge you to update your scripts to reflect this change.
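The translation of an old ID can be pictured with a small sketch. This is only an illustration in Python, not the code used by the R packages: the function name to_new_id is hypothetical, and it assumes that a purely numeric legacy ID belongs to the ieu-a batch, which is all that the BMI example above tells us.

```python
import re
import warnings


def to_new_id(dataset_id, default_batch="ieu-a"):
    """Translate a legacy numeric dataset ID into its batch-prefixed form.

    Assumption: purely numeric IDs are legacy IDs from the ieu-a batch.
    IDs that already carry a batch prefix are returned unchanged.
    """
    dataset_id = str(dataset_id).strip()
    if re.fullmatch(r"\d+", dataset_id):
        new_id = f"{default_batch}-{dataset_id}"
        # Mirror the behaviour described above: translate, but warn.
        warnings.warn(
            f"Legacy ID '{dataset_id}' translated to '{new_id}'; "
            "please update your scripts."
        )
        return new_id
    return dataset_id
```

Calling to_new_id("2") gives "ieu-a-2" along with a warning, while an ID that is already in the new form passes through untouched.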

Authentication

Previously you would automatically be asked to authenticate any query to the database, through Google. Now, we are making authentication voluntary: it is something that you do at the start of a session only if you need access to specific private datasets on the database. For the vast majority of use cases this is not required.

Another change is that the R package that manages the authentication has been updated, and the token files it generates are slightly different. For full information on how to deal with this, see here: https://mrcieu.github.io/ieugwasr/articles/guide.html#authentication

UK Biobank data has been curated

We conducted a large GWAS analysis using a pipeline that systematically analysed every PHESANT phenotype in UK Biobank. There were previously ~20k traits with complete GWAS data, but a majority of these were binary traits based on very small numbers of cases. We have now filtered out unreliable datasets, including any binary traits with fewer than 1000 cases, and 2514 traits remain. Another issue is the combination of small numbers of cases and low allele frequency: here the minor allele count (MAC) for a particular association could be very small, which would lead to a high false positive rate when using BOLT-LMM. The remaining traits have been filtered to retain only associations where MAC > 90.
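The two filters can be sketched as follows. This is a rough illustration rather than the pipeline itself. The formula used for MAC (2 x sample size x minor allele frequency, with the number of cases as the sample size for binary traits) is an assumption, since the exact calculation is not stated here, and all function names are hypothetical.

```python
def minor_allele_count(allele_freq, n_samples):
    """Expected minor allele count.

    Assumption: MAC = 2 * N * MAF, where N would be the number of
    cases for a binary trait.
    """
    maf = min(allele_freq, 1.0 - allele_freq)
    return 2 * n_samples * maf


def keep_trait(is_binary, n_cases, min_cases=1000):
    """Drop binary traits that have fewer than min_cases cases."""
    return (not is_binary) or n_cases >= min_cases


def keep_association(allele_freq, n_samples, min_mac=90):
    """Retain an association only if its MAC is greater than min_mac."""
    return minor_allele_count(allele_freq, n_samples) > min_mac
```

Under these assumptions, a variant with allele frequency 0.01 in 5000 cases has an expected MAC of about 100 and is retained, whereas the same variant in 4000 cases (MAC of about 80) is removed.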

A document detailing this investigation is available here: https://htmlpreview.github.io/?https://raw.githubusercontent.com/MRCIEU/ukbb-gwas-analysis/master/docs/ldsc_clumped_analysis.html?token=AAOV6TBQXEXEPT7SUXXLWMC6DWP3O

All data is now harmonised

Previously the data were QC’d to remove malformed results and then deposited as we found them. We are now also pre-harmonising all the data. This means that all alleles are coded on the forward strand, and the non-effect allele is always aligned to the human genome reference sequence B37 (so the effect allele is the non-reference allele). This does mean that sometimes variants have been removed if they did not map to the human genome, and for most datasets the effect allele has been switched for approximately half of all sites. When an effect allele changes we do of course switch the sign of the effect size, so it should not impact any MR results.
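To make the allele switching concrete, here is a minimal sketch of harmonising a single association against a reference allele. It is not the harmonisation code used for the database. The record layout (ea, nea, beta, eaf) and the function names are hypothetical, and the sketch ignores the strand ambiguity of palindromic variants (A/T and C/G), which a real pipeline has to handle separately.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip_strand(allele):
    """Return the reverse complement of an allele string."""
    return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))


def harmonise(record, ref_allele):
    """Align a record so that the non-effect allele is the reference allele.

    record: dict with keys 'ea', 'nea', 'beta' and 'eaf' (hypothetical layout).
    Returns a new dict, or None if the variant cannot be matched.
    """
    ea, nea = record["ea"].upper(), record["nea"].upper()
    beta, eaf = record["beta"], record["eaf"]
    ref = ref_allele.upper()

    # If neither allele matches the reference, try the opposite strand.
    if ref not in (ea, nea):
        ea, nea = flip_strand(ea), flip_strand(nea)
        if ref not in (ea, nea):
            return None  # does not map to the reference: variant is dropped

    # If the effect allele is the reference allele, swap the alleles and
    # switch the sign of the effect size (and mirror the allele frequency).
    if ea == ref:
        ea, nea = nea, ea
        beta = -beta
        eaf = 1.0 - eaf

    return {**record, "ea": ea, "nea": nea, "beta": beta, "eaf": eaf}
```

For example, a record with effect allele A, non-effect allele G and beta 0.1 at a site whose reference allele is A comes back with effect allele G and beta -0.1, which describes the same association.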

LD reference panel is now harmonised

We have updated the LD reference panel to be harmonised against human genome build 37, and as a consequence a few variants have been lost from the version that was previously used.

Instrument lists are up-to-date

Previously we were pre-clumping the tophits and storing them in the MRInstruments R package, and there was often a delay in updating the MRInstruments R package after new datasets were uploaded to the database. We have moved away from this model. Every dataset is still pre-clumped, but the results are stored in the database. If you request default clumping values when extracting the tophits of a dataset, it will still be fast, but the data are retrieved from the server and not from the MRInstruments package. You can continue to use the MRInstruments package for GWAS hits from e.g. GTEx or the EBI GWAS catalog.

dbSNP rs IDs

All rs IDs have been mapped to dbSNP build 144. Therefore, some rs IDs may have changed, but there is stronger alignment across all datasets.

Everything is faster

We are using Elasticsearch and Neo4j on Oracle Cloud Infrastructure to serve the data. It’s much faster. Interestingly, it actually gets faster when more people are using it because the cache gets ‘warmed up’ by more requests.

What is new

Browse available datasets online

We have a new home for the GWAS summary data: https://gwas.mrcieu.ac.uk/.

Chromosome and position

All variants have been mapped to chromosome and position (hg19/build37). You can query based on chromosome position coordinates. This means either a list of <chr:pos> values, or a list of <chr:pos1-pos2> ranges.
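The two query forms are easy to tell apart programmatically. The following sketch parses either form into a chromosome and a start and end position. It is only illustrative: the function name parse_chrpos and the set of accepted chromosome labels (1-2 digit numbers, X, Y, MT) are assumptions, not a description of what the server accepts.

```python
import re

# <chr:pos> or <chr:pos1-pos2>; the accepted chromosome labels are an assumption.
_QUERY = re.compile(r"^([0-9]{1,2}|X|Y|MT):(\d+)(?:-(\d+))?$", re.IGNORECASE)


def parse_chrpos(query):
    """Parse '<chr:pos>' or '<chr:pos1-pos2>' into (chrom, start, end).

    A single position is returned as a range whose start equals its end.
    """
    match = _QUERY.match(query.strip())
    if match is None:
        raise ValueError(f"Not a chr:pos or chr:pos1-pos2 query: {query!r}")
    chrom = match.group(1).upper()
    start = int(match.group(2))
    end = start if match.group(3) is None else int(match.group(3))
    if end < start:
        raise ValueError(f"Range end is before range start: {query!r}")
    return chrom, start, end
```

With this, "7:105561135" parses to ("7", 105561135, 105561135) and "7:105561135-105563135" parses to ("7", 105561135, 105563135); anything else, such as an rs ID, raises a ValueError.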

INDELs are retained

Previously we were excluding these, but they are now retained.

Multi-allelic variants are retained

Previously we were excluding these, but they are now retained. Be warned that if you extract a variant that has multiple alleles then you may get more than one row for that variant.
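If your downstream code assumes one row per variant, it is worth checking for this explicitly. A minimal sketch, assuming the results are held as a list of dicts with a hypothetical "rsid" field:

```python
from collections import defaultdict


def group_by_variant(rows, key="rsid"):
    """Group result rows by variant identifier (field name is an assumption)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


def variants_with_multiple_rows(rows, key="rsid"):
    """Return the sorted identifiers that appear in more than one row."""
    grouped = group_by_variant(rows, key)
    return sorted(rsid for rsid, hits in grouped.items() if len(hits) > 1)
```

Any identifier returned by variants_with_multiple_rows needs a decision about which row to use before it is passed to an analysis that expects a single effect estimate per variant.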

More data

Automated download from the EBI repository, together with an automated upload system and a batch data processing system, means that more data can be added faster to keep the database current.

Error messages are more informative

Previously, if a query to the database failed, it didn’t give a reason. Hopefully it is now clearer what is happening. You can also check the status of the server here: https://gwas-api.mrcieu.ac.uk/

Easier programmatic access to the database

We are trying to make it as flexible as possible to access the data. The TwoSampleMR R package was previously the only programmatic way to access the data; now we have the following options:

  • ieugwasr R package: All the TwoSampleMR functions that access the data now do so by calling this package. It is a simple wrapper around the API that controls access to the database.
  • ieugwaspy Python package: Similar functionality to ieugwasr (under construction).
  • API: You can use the API directly, e.g. for building your own services or applications.

Local LD operations

It is now possible to perform clumping, or create LD matrices, using your own local LD reference dataset. You can download the one that we have been using here: https://github.com/mrcieu/gwasglue#reference-datasets, or create your own plink format dataset e.g. with larger samples or for different ancestries. See the LD clumping functions in the ieugwasr package for more details.
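For readers unfamiliar with what clumping does, the idea can be sketched as a greedy procedure: take variants in order of increasing p-value and keep each one only if it is not in LD with a variant already kept. The sketch below is a conceptual illustration in Python and is not how ieugwasr or PLINK is implemented. The function name, the record fields, the r2 lookup callback and the threshold values are all assumptions chosen for the example; in practice the LD values come from the reference dataset.

```python
def clump(variants, r2, p_threshold=5e-8, r2_threshold=0.001, window_kb=10000):
    """Greedy LD clumping sketch.

    variants: list of dicts with 'rsid', 'chr', 'pos' and 'p' (hypothetical layout).
    r2: function (rsid_a, rsid_b) -> LD r-squared from a reference panel.
    Threshold values are placeholders for illustration only.
    """
    # Consider only variants passing the significance threshold, strongest first.
    candidates = sorted(
        (v for v in variants if v["p"] <= p_threshold), key=lambda v: v["p"]
    )
    kept = []
    for variant in candidates:
        in_ld_with_kept = any(
            variant["chr"] == index["chr"]
            and abs(variant["pos"] - index["pos"]) <= window_kb * 1000
            and r2(variant["rsid"], index["rsid"]) > r2_threshold
            for index in kept
        )
        if not in_ld_with_kept:
            kept.append(variant)
    return kept
```

The result is a set of approximately independent variants, with the most strongly associated variant retained from each correlated group.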

Access the data directly

Previously the data were only accessible through the database. Now the data can be downloaded in “GWAS VCF” format from https://gwas.mrcieu.ac.uk/ (IEU members can access all the data on RDSF or bluecrystal4 directly). This means that if you want to perform very large or numerous operations, you can do so on HPC or locally, with better performance, by using the data files directly. Please see the gwasvcf R package for how to work with these data.

Connect the data to different analytical tools

Either the data in the database, or the GWAS VCF files, can be queried and the results translated into the formats for a bunch of different R packages for MR, colocalisation, fine mapping, etc. Have a look at the gwasglue R package, to see what is available and how to do this. It’s still under construction, but feel free to try it, make suggestions, and contribute code.

How to request new data

We have set up a GitHub issues page here: https://github.com/MRCIEU/igd-data-requests/issues

Please visit this page to log new data requests or to contribute new data.