Install

Quick start

Use web interface http://vcf.mrcieu.ac.uk

Run locally

Either run directly on a UNIX host or using Docker containerisation (recommended)

Download

git clone https://github.com/MRCIEU/gwas2vcf.git
cd gwas2vcf

Native

Requires Python v3.8

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
pip install git+https://github.com/bioinformed/vgraph@v1.4.0#egg=vgraph
python main.py -h

Docker

Pull image from DockerHub OR build image from source

# pull image from DockerHub
docker pull mrcieu/gwas2vcf

### OR ###

# build docker image from source
docker build -t gwas2vcf .

Run

docker run \
-v /path/to/fasta:/path/to/fasta \
--name gwas2vcf \
-it mrcieu/gwas2vcf:latest \
python main.py -h

Usage

usage: main.py [-h] [-v] [--out OUT] [--data DATA] --ref REF [--dbsnp DBSNP] --json JSON [--id ID] [--cohort_controls COHORT_CONTROLS]
               [--cohort_cases COHORT_CASES] [--csi] [--log {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--alias ALIAS]

Map GWAS summary statistics to VCF/BCF

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program version number and exit
  --out OUT             Path to output VCF/BCF. If not present then must be specified as 'out' in json file
  --data DATA           Path to GWAS summary stats. If not present then must be specified as 'data' in json file
  --ref REF             Path to reference FASTA
  --dbsnp DBSNP         Path to reference dbSNP VCF
  --json JSON           Path to parameters JSON
  --id ID               Study identifier. If not present then must be specified as 'id' in json file
  --cohort_controls COHORT_CONTROLS
                        Total study number of controls (if case/control) or total sample size if continuous. Overwrites value if present in json
                        file.
  --cohort_cases COHORT_CASES
                        Total study number of cases. Overwrites value if present in json file.
  --csi                 Default is to index tbi but use this flag to index csi
  --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level
  --alias ALIAS         Optional chromosome alias file

Additional parameters are passed through a JSON parameters file using --json <param.json>, see param.py for full details and below example. Note that field columns start at 0.

Running the tests

Unit tests:

cd gwas2vcf
python -m pytest -v test

Reference files

FASTA

# GRCh36/hg18/b36
wget http://fileserve.mrcieu.ac.uk/ref/2.8/b36/human_b36_both.fasta.gz; gzip -d human_b36_both.fasta.gz
wget http://fileserve.mrcieu.ac.uk/ref/2.8/b36/human_b36_both.fasta.fai
wget http://fileserve.mrcieu.ac.uk/ref/2.8/b36/human_b36_both.dict

# GRCh37/hg19/b37
wget http://fileserve.mrcieu.ac.uk/ref/2.8/b37/human_g1k_v37.fasta.gz; gzip -d human_g1k_v37.fasta.gz
wget http://fileserve.mrcieu.ac.uk/ref/2.8/b37/human_g1k_v37.fasta.fai
wget http://fileserve.mrcieu.ac.uk/ref/2.8/b37/human_g1k_v37.dict

# GRCh38/hg38/b38
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta.fai
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dict

dbSNP

# GRCh37/hg19/b37
wget http://fileserve.mrcieu.ac.uk/dbsnp/dbsnp.v153.b37.vcf.gz .
wget http://fileserve.mrcieu.ac.uk/dbsnp/dbsnp.v153.b37.vcf.gz.tbi .

# GRCh38/hg38/b38
wget http://fileserve.mrcieu.ac.uk/dbsnp/dbsnp.v153.hg38.vcf.gz .
wget http://fileserve.mrcieu.ac.uk/dbsnp/dbsnp.v153.hg38.vcf.gz.tbi .

Newer dbSNP builds can be obtained from the NCBI FTP but the VCF files have non-standard chromosome names which can be updated accordingly (thanks @darked89)

# download latest dbSNP VCF for hg38
wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz

# define chromosome name mapping
echo -n "NC_000001.11    1
NC_000002.12    2
NC_000003.12    3
NC_000004.12    4
NC_000005.10    5
NC_000006.12    6
NC_000007.14    7
NC_000008.11    8
NC_000009.12    9
NC_000010.11    10
NC_000011.10    11
NC_000012.12    12
NC_000013.11    13
NC_000014.9     14
NC_000015.10    15
NC_000016.10    16
NC_000017.11    17
NC_000018.10    18
NC_000019.10    19
NC_000020.11    20
NC_000021.9     21
NC_000022.11    22
NC_000023.11    X
NC_000024.10    Y
NC_012920.1     MT
" > hg38_rename_chrom_names.tsv

# update chromosome names
bcftools annotate \
--rename-chrs hg38_rename_chrom_names.tsv \
--output-type z \
--output dbSNP_clean.vcf.gz GCF_000001405.39.gz

# index modified file
bcftools index dbSNP_clean.vcf.gz