Here, we use an LD reference panel to identify SNPs that are in LD with the top signals from a GWAS. The algorithm sequentially chooses the top SNP, removes all SNPs in LD above some threshold within some window, then goes on to the next top hit and repeats the pruning process, until no more SNPs are left above the specified p-value threshold.
The data to be clumped can be retrieved either from the VCF files or data from the OpenGWAS database. Once the data has been retrieved, clumping can be performed either using the clumping routines in the cloud via the API, or locally using local LD reference data. The latter is recommended, it allows you to run things in parallel, to use larger LD reference panels, and avoids killing the servers!
Extract top hits for LDL cholesterol (ieu-a-300
)
dat <- ieugwasr::tophits("ieu-a-300")
This is very quick because it extracts the pre-clumped top hits for this dataset. If you specify a different threshold from the default it will be a bit slower e.g.
dat <- ieugwasr::tophits("ieu-a-300", pval=5e-7)
Performing clumping using local data is possible. For example, extract the data without clumping:
dat <- ieugwasr::tophits("ieu-a-300", clump=FALSE)
Obtain the plink
binary for your operating system. For convenience you can use the genetics.binaRies R package which has a few different widely used utilities bundled within it:
plink_bin <- genetics.binaRies::get_plink_binary()
Obtain some LD reference data. See the homepage of this site for options for downloading LD reference data. Here we’ll be using the same LD reference data as that used by the API by default, which is Europeans from the 1000 genomes reference panel
ldref <- "/path/to/EUR"
Perform clumping
clumped <- ld_clump(dat, bfile=ldref, plink_bin=plink_bin)
There is a single function that can be used to perform clumping on the VCF files. It will either run locally or run using the API, depending on the arguments you supply.
Running locally:
# Set path to file vcffile <- "/path/to/ieu-a-300.vcf.gz" # Set path to bcftools gwasvcf::set_bcftools() # Perform clumping clumped <- clump_gwasvcf(vcffile, bfile=ldref, plink_bin=plink_bin)
Running remotely:
clumped <- clump_gwasvcf(vcffile, pop="EUR")
Here, the pop
argument is passed to the API specifying which super-population to use for the LD reference.
Here, the objective is to extract a slice of the data with relevant fields, and its corresponding LD matrix. Then the data can be applied to a few different packages quite easily
finemapr
package - which simplifies analysis using FINEMAP, PAINTOR and CAVIARsusieR
package - new approach called “Sum of Single Effects”The required data format can be generated from the OpenGWAS database or from VCF files
One of the tophits for LDL cholesterol is rs10903129
, which is located at 1:25768937
on hg19. Determining the region to finemap around this variant is simplified by knowing the natural LD break points in a the European population, which is where the LDL GWAS was performed (ieu-a-300
). Berisa and Pickrell 2016 provide a useful dataset of natural breakpoints, which has been incorporated into this package.
Identify LD region to perform finemapping:
region <- map_variants_to_regions(chrpos="1:25768937", pop="EUR")
Extract data from that region into a format for finemapping
dat <- ieugwasr_to_finemapr(region$region, "ieu-a-300")
This returns a nested list. The top level of results is an item for every GWAS dataset requested. Each item contains a dataframe of rsids and their z-scores within the region, and an LD matrix for those variants, and the sample size for each of the variants.
Perform finemapping e.g. using CAVIAR (see https://github.com/variani/finemapr for more info:
options(finemapr_caviar = "/path/to/caviar") library(dplyr) o <- finemapr::run_caviar(dat[[1]]$z, dat[[1]]$ld, args = "-c 3")
Perform finemapping e.g using susieR (see https://stephenslab.github.io/susieR/ for more info):
fitted_rss <- susieR::susie_rss( dat[[1]]$z$zscore, dat[[1]]$ld, L=10, estimate_residual_variance=TRUE, estimate_prior_variance=TRUE, check_R=FALSE, z_ld_weight=1/500 ) summary(fitted_rss) susieR::susie_plot(fitted_rss, y="PIP")
Let’s perform a similar analysis for the VCF files
# extract data from vcf dat <- gwasvcf_to_finemapr(region = region$region, vcf=vcffile, bfile=ldref, plink_bin=plink_bin) # Perform finemapping fitted_rss <- susieR::susie_rss( dat[[1]]$z$zscore, dat[[1]]$ld, L=10, estimate_residual_variance=TRUE, estimate_prior_variance=TRUE, check_R=FALSE, z_ld_weight=1/500 ) summary(fitted_rss) susieR::susie_plot(fitted_rss, y="PIP")