Complete GWAS summary datasets are now abundant. The IEU GWAS database hosts a large repository of curated, harmonised and quality-controlled datasets. These can be queried directly through the API, through the ieugwasr R package, or through the ieugwaspy Python package. For faster querying in an HPC environment, however, it is advantageous to access the data files directly rather than through cloud systems.

We developed a format for storing and harmonising GWAS summary data, known as the GWAS VCF format. All data in the IEU GWAS database are available for download in this format. This R package provides fast and convenient functions for querying and creating GWAS summary data in GWAS VCF format. The package includes:

  • a wrapper around the Bioconductor VariantAnnotation package, providing functions tailored to GWAS VCF for reading, querying, creating and writing GWAS VCF files
  • LD-related functions that use a reference panel to extract proxies, create LD matrices and perform LD clumping
  • functions for harmonising a dataset against the reference genome and creating GWAS VCF files.
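
To make the file layout concrete, here is a minimal sketch in Python of how one GWAS VCF data row can be unpacked and filtered by region. It is independent of this package and does not reproduce its API. It assumes the field keys used by the GWAS VCF specification (ES for effect size, SE for standard error, LP for -log10 p-value, AF for allele frequency, SS for sample size), and the example row is invented for illustration, not taken from a real study. The function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Invented, illustrative row (NOT real data). Columns are tab-separated:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT <study column>
EXAMPLE_LINE = (
    "1\t1097291\trs0000001\tT\tC\t.\tPASS\tAF=0.25\t"
    "ES:SE:LP:AF:SS\t0.0123:0.0045:2.5:0.25:120000"
)


@dataclass(frozen=True)
class Association:
    """One variant-trait association read from a GWAS VCF row."""
    chrom: str
    pos: int
    rsid: str
    ref: str   # non-effect allele
    alt: str   # effect allele
    fields: Dict[str, str]  # FORMAT key -> raw string value

    @property
    def pval(self) -> float:
        # LP is assumed to hold -log10(p), so invert it to recover p.
        lp = float(self.fields["LP"])
        return 10.0 ** (-lp)


def parse_gwas_vcf_line(line: str) -> Association:
    """Split a single data row and pair FORMAT keys with the study values."""
    cols = line.rstrip("\n").split("\t")
    keys = cols[8].split(":")
    values = cols[9].split(":")
    return Association(
        chrom=cols[0],
        pos=int(cols[1]),
        rsid=cols[2],
        ref=cols[3],
        alt=cols[4],
        fields=dict(zip(keys, values)),
    )


def in_region(assoc: Association, chrom: str, start: int, end: int) -> bool:
    """True if the variant lies in the closed interval chrom:start-end."""
    return assoc.chrom == chrom and start <= assoc.pos <= end


if __name__ == "__main__":
    assoc = parse_gwas_vcf_line(EXAMPLE_LINE)
    print(assoc.rsid, assoc.fields["ES"], assoc.fields["SE"], assoc.pval)
    print(in_region(assoc, "1", 1097000, 1098000))
```

In practice, region queries on bgzipped and tabix-indexed files should go through an indexed reader rather than a line-by-line scan; the sketch only shows what each row carries.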

See also the gwasglue R package for methods that connect the VCF data to Mendelian randomization, colocalisation, fine mapping and related analyses.




See vignettes here:


If you use GWAS VCF files, please cite the studies whose data you use, along with the following paper:

The variant call format provides efficient and robust storage of GWAS summary statistics. Matthew Lyon, Shea J Andrews, Ben Elsworth, Tom R Gaunt, Gibran Hemani, Edoardo Marcora. bioRxiv 2020.05.29.115824; doi:

Reference datasets

Example GWAS VCF (GIANT 2010 BMI):

1000 genomes reference panels for LD for each superpopulation (used by default in OpenGWAS):

1000 genomes European reference panel for LD (legacy):

1000 genomes vcf harmonised against human genome reference:


Example data

data.vcf.gz and its tabix index data.vcf.gz.tbi contain the first few rows of the Speliotes 2010 BMI GWAS.

The eur.bed/bim/fam files cover the same range as data.vcf.gz, from here