This article is provided as a brief introductory guide to working with the EpiGraphDB platform through epigraphdb R package. Here we will demonstrate a few basic operations that can be carried out using the platform, but for more advanced methods please refer to the API endpoint documentation.

library("dplyr")
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library("epigraphdb")
#> 
#>     EpiGraphDB v1.0 (API: https://api.epigraphdb.org)
#> 

Part 1: Using EpiGraphDB to obtain biological mappings

With EpiGraphDB you can map genetic variants to genes, genes to proteins, proteins to pathways, pathways to diseases and so on, as shown in the network diagram here.

In this part, we want to demonstrate how to do basic mappings between biological entities. We are going to map genes to proteins (i.e. their UniProtID), proteins to pathways that they are found in (using Reactome data), and then extract information on the specific pathways identified.

But first, let’s talk about the basic querying syntax. query_epigraphdb is the main querying function in the package; it is used to communicate with EpiGraphDB by specifying API endpoints.

query_epigraphdb(
  route = endpoint, # supply the route / endpoint
  params = list(... = ...), # supply the query parameters
  mode = "table", # How the results are shown in R
  method = "GET" # HTTP method, "GET", "POST", etc.
)
  • endpoint and method constitute the ‘method’ provided by the EpiGraphDB API web service of where you want to query the data from, and there is a dedicated one for each case. e.g. POST /mappings/gene-to-protein is used to query gene-to-proteins mappings. See the full list of currently available endpoints here in the documentation or the Swagger interface of the API.

  • params is where you supply the list of things you want to query about (can be a list of genes, GWAS trait, MR outcome of interest), subject to what each endpoint accepts and requires.

  • mode is the format in which the results are returned. Use “table” for basic cases, for more advanced stuff you may need to use “raw” - more on this later.


In this guide, we are going to use the all-purpose query_epigraphdb function in all basic examples. However, many of the most common queries have been wrapped in specific functions within epigraphdb R package for the ease of use. Those are very helpful, but to help the understanding of the core principles behind using EpiGraphDB, here we present the ways to run queries in a less abstracted way.

Mapping genes to proteins

In this first section, we will take an arbitrary list of genes and query the EpiGraphDB API to find the proteins that they map to.

# Let's test this on a few genes
genes <- c("TP53", "BRCA1", "TNF")

# Select the endpoint "POST /mappings/gene-to-protein"
endpoint <- "/mappings/gene-to-protein"
method <- "POST"

# Build a query
proteins <- query_epigraphdb(
  route = endpoint,
  params = list(gene_name_list = genes),
  mode = "table",
  method = method
)

# Check the output
print(proteins)
#> # A tibble: 3 × 3
#>   gene.name gene.ensembl_id protein.uniprot_id
#>   <chr>     <chr>           <chr>             
#> 1 TP53      ENSG00000141510 P04637            
#> 2 BRCA1     ENSG00000012048 P38398            
#> 3 TNF       ENSG00000232810 P01375

In the above data frame, we see the output from querying EpiGraphDB for the proteins that have been mapped to the genes TP53, BRCA1, and TNF. Our query returned the UniProt and Ensembl IDs for those genes.

Mapping proteins to pathways

Next, to demonstrate the mapping of proteins to pathways, we are going to take one protein from the previous example, P04637, and query EpiGraphDB for all pathways it is known to be involved in.

# Let's see what pathways the protein product of TP53 gene is involved in

# NOTE: Argument `uniprot_id_list` requires a list of UniProt IDs in
#       POST /mappings/gene-to-protein.
#       In this case for the R package, we need to wrap `proteins_uniprot_ids`
#       with an `I()` function (AsIs) to prevent auto-unpacking by `httr`
proteins_uniprot_ids <- c("P04637") %>% I()

endpoint <- "/protein/in-pathway"

pathway_df <- query_epigraphdb(
  route = endpoint,
  params = list(uniprot_id_list = proteins_uniprot_ids),
  mode = "table",
  method = "POST"
)

# Check out how many pathways this protein is found in
print(pathway_df)
#> # A tibble: 1 × 3
#>   uniprot_id pathway_count pathway_reactome_id
#>   <chr>              <int> <list>             
#> 1 P04637                84 <chr [84]>

# Get pathways names (Reactome IDs)
print(pathway_df$pathway_reactome_id[[1]])
#>  [1] "R-HSA-983231"  "R-HSA-9006925" "R-HSA-8953897" "R-HSA-8943724"
#>  [5] "R-HSA-8941855" "R-HSA-8878159" "R-HSA-8853884" "R-HSA-8852276"
#>  [9] "R-HSA-74160"   "R-HSA-73894"   "R-HSA-73857"   "R-HSA-69895"  
#> [13] "R-HSA-69620"   "R-HSA-69615"   "R-HSA-69580"   "R-HSA-69563"  
#> [17] "R-HSA-69560"   "R-HSA-69541"   "R-HSA-69481"   "R-HSA-69473"  
#> [21] "R-HSA-69278"   "R-HSA-69275"   "R-HSA-6811555" "R-HSA-6807070"
#> [25] "R-HSA-6806003" "R-HSA-6804760" "R-HSA-6804759" "R-HSA-6804758"
#> [29] "R-HSA-6804757" "R-HSA-6804756" "R-HSA-6804754" "R-HSA-6804116"
#> [33] "R-HSA-6804115" "R-HSA-6804114" "R-HSA-6803211" "R-HSA-6803207"
#> [37] "R-HSA-6803205" "R-HSA-6803204" "R-HSA-6796648" "R-HSA-6791312"
#> [41] "R-HSA-6785807" "R-HSA-597592"  "R-HSA-5693606" "R-HSA-5693565"
#> [45] "R-HSA-5693532" "R-HSA-5689896" "R-HSA-5689880" "R-HSA-5688426"
#> [49] "R-HSA-5633008" "R-HSA-5633007" "R-HSA-5628897" "R-HSA-5357801"
#> [53] "R-HSA-453274"  "R-HSA-449147"  "R-HSA-392499"  "R-HSA-391251" 
#> [57] "R-HSA-390471"  "R-HSA-390466"  "R-HSA-3700989" "R-HSA-349425" 
#> [61] "R-HSA-3232118" "R-HSA-3108232" "R-HSA-2990846" "R-HSA-2559586"
#> [65] "R-HSA-2559585" "R-HSA-2559584" "R-HSA-2559583" "R-HSA-2559580"
#> [69] "R-HSA-2262752" "R-HSA-212436"  "R-HSA-1912422" "R-HSA-1912408"
#> [73] "R-HSA-168256"  "R-HSA-1640170" "R-HSA-162582"  "R-HSA-157118" 
#> [77] "R-HSA-139915"  "R-HSA-1280215" "R-HSA-1257604" "R-HSA-114452" 
#> [81] "R-HSA-111448"  "R-HSA-109606"  "R-HSA-109582"  "R-HSA-109581"

P04637 is involved in five pathways. Next, let’s get pathway info for one of them.

Get pathway info

# Let's see what exactly this pathway is
reactome_id <- "R-HSA-6804754"

endpoint <- "/meta/nodes/Pathway/search"

pathway_info <- query_epigraphdb(
  route = endpoint,
  params = list(id = reactome_id),
  mode = "table"
)

# Pathway description
print(pathway_info)
#> # A tibble: 1 × 6
#>   node._name                    node.name node._source node.id node._id node.url
#>   <chr>                         <chr>     <list>       <chr>   <chr>    <chr>   
#> 1 Regulation of TP53 Expression Regulati… <chr [1]>    R-HSA-… R-HSA-6… https:/…

If you are interested in this type of analysis, check out case studies 1 and 2 for further details on pathways analysis, PPI, mapping drugs to targets etc.


Running the above queries using the dedicated wrapped functions:

mappings_gene_to_protein(genes)
#> # A tibble: 3 × 3
#>   gene.name gene.ensembl_id protein.uniprot_id
#>   <chr>     <chr>           <chr>             
#> 1 TP53      ENSG00000141510 P04637            
#> 2 BRCA1     ENSG00000012048 P38398            
#> 3 TNF       ENSG00000232810 P01375
protein_in_pathway(proteins_uniprot_ids)
#> # A tibble: 1 × 3
#>   uniprot_id pathway_count pathway_reactome_id
#>   <chr>              <int> <list>             
#> 1 P04637                84 <chr [84]>

Part 2: Epidemiological relationships analysis

In this part we will demonstrate queries that may be relevant in epidemiology research.

Look up GWAS studies

First, we want to check what GWAS are available within EpiGraphDB for our trait of interest, e.g. Body mass index. Doing this query is equivalent doing a look-up using EpiGraphDB Web UI. The search functionality is fuzzy search and case insensitive, i.e. ‘body mass index’ or ‘Body Mass Index’ will give you the same set of results.

# Let's see what Body mass index GWAS are available in EpiGraphDB
trait <- "body mass index"

endpoint <- "/meta/nodes/Gwas/search"

results <- query_epigraphdb(
  route = endpoint,
  params = list(name = trait),
  mode = "table"
)

# show selected columns in the results
results %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )
#> # A tibble: 10 × 5
#>    node.trait      node.id          node.sample_size node.year node.author
#>    <chr>           <chr>            <chr>            <chr>     <chr>      
#>  1 Body mass index ieu-a-1089       120286.0         2016.0    Wood       
#>  2 Body mass index ieu-a-974        171977.0         2015.0    Locke AE   
#>  3 Body mass index ieu-a-95         73137.0          2013.0    Randall JC 
#>  4 Body mass index ebi-a-GCST004904 158284.0         2017.0    Akiyama M  
#>  5 Body mass index ebi-a-GCST006368 315347.0         2018.0    Hoffmann TJ
#>  6 Body mass index bbj-a-2          85894.0          2019.0    Ishigaki K 
#>  7 Body mass index ieu-a-835        322154.0         2015.0    Locke AE   
#>  8 Body mass index ieu-a-2          339224.0         2015.0    Locke AE   
#>  9 Body mass index ieu-a-785        152893.0         2015.0    Locke AE   
#> 10 Body mass index bbj-a-1          158284.0         2019.0    Ishigaki K

Explore Mendelian randomization studies

In these examples, we show how to extract pre-computed MR results for the specified exposure, outcome, or both, traits of interest.

Specify exposure trait

# Look up all MR analyses where a trait was used as exposure
# and find all outcome traits with causal evidence from it.

trait1 <- "Body mass index"
# NB: here, trait name has to specific (not fuzzy):
# use the exact trait name wording as in GWAS `node.trait` (previous example)

endpoint <- "/mr"

mr_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    exposure_trait = trait1,
    pval_threshold = 5e-08
  ),
  mode = "table"
)
print(mr_df)
#> # A tibble: 2,282 × 10
#>    exposure.id  exposure.trait  outcome.id outcome.trait    mr.b   mr.se mr.pval
#>    <chr>        <chr>           <chr>      <chr>           <dbl>   <dbl>   <dbl>
#>  1 ieu-a-974    Body mass index ebi-a-GCS… Fibrinogen lev… 0.193 0.00224       0
#>  2 ebi-a-GCST0… Body mass index ukb-b-2303 Body mass inde… 0.595 0.0153        0
#>  3 ieu-a-785    Body mass index ieu-a-85   Extreme body m… 1.72  0.00163       0
#>  4 ieu-a-835    Body mass index ukb-b-180… Leg fat mass (… 0.608 0.0139        0
#>  5 ieu-a-835    Body mass index ukb-b-128… Arm fat percen… 0.526 0.0130        0
#>  6 ieu-a-835    Body mass index ieu-a-93   Overweight      1.67  0.0341        0
#>  7 ieu-a-835    Body mass index ieu-a-61   Waist circumfe… 0.824 0.0194        0
#>  8 ieu-a-835    Body mass index ieu-a-60   Waist circumfe… 0.820 0.0195        0
#>  9 ebi-a-GCST0… Body mass index ukb-b-201… Arm fat percen… 0.533 0.0104        0
#> 10 ieu-a-2      Body mass index ukb-b-4650 Comparative bo… 0.439 0.00989       0
#> # … with 2,272 more rows, and 3 more variables: mr.method <chr>,
#> #   mr.selection <chr>, mr.moescore <dbl>

The returned data frame includes all MR analysis with exposure.trait being “Body mass index”. However, there are several GWAS with this names. If you are interested in a specific GWAS, you will need to filter by exposure.id.

# Show how many MR analyses were done for each Body mass index GWAS
mr_df %>% count(exposure.id)
#> # A tibble: 8 × 2
#>   exposure.id          n
#>   <chr>            <int>
#> 1 ebi-a-GCST004904   143
#> 2 ebi-a-GCST006368   478
#> 3 ieu-a-2            408
#> 4 ieu-a-785          223
#> 5 ieu-a-835          346
#> 6 ieu-a-94           187
#> 7 ieu-a-95           213
#> 8 ieu-a-974          284

Specify outcome trait

Next, we can check all available MR analyses for an outcome trait of interest.

# Look up all MR for a specified outcome trait
# and find all exposure traits with causal evidence on it.

trait2 <- "Waist circumference"

endpoint <- "/mr"

mr_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    outcome_trait = trait2,
    pval_threshold = 5e-08
  ),
  mode = "table"
)
print(mr_df)
#> # A tibble: 2,607 × 10
#>    exposure.id exposure.trait  outcome.id outcome.trait     mr.b   mr.se mr.pval
#>    <chr>       <chr>           <chr>      <chr>            <dbl>   <dbl>   <dbl>
#>  1 ukb-b-15957 Types of trans… ieu-a-104  Waist circum…  0.591   7.59e-3       0
#>  2 ukb-b-13423 Breastfed as a… ieu-a-104  Waist circum…  0.601   1.26e-2       0
#>  3 ubm-a-130   IDP T1 FAST RO… ieu-a-104  Waist circum…  0.0899  1.25e-3       0
#>  4 prot-b-38   interleukin 1 … ieu-a-104  Waist circum… -0.00884 2.30e-4       0
#>  5 met-c-903   Phospholipids … ieu-a-104  Waist circum… -0.0428  2.97e-4       0
#>  6 ukb-b-19060 Hearing aid us… ieu-a-66   Waist circum…  0.845   1.29e-2       0
#>  7 ukb-a-344   Difficulty not… ieu-a-66   Waist circum…  0.102   4.80e-4       0
#>  8 ubm-a-11    IDP T1 FIRST l… ieu-a-66   Waist circum…  0.0468  2.60e-4       0
#>  9 prot-a-2356 Lysosomal Pro-… ieu-a-66   Waist circum…  0.0157  6.30e-5       0
#> 10 prot-a-1332 Hemojuvelin     ieu-a-66   Waist circum… -0.0196  2.56e-4       0
#> # … with 2,597 more rows, and 3 more variables: mr.method <chr>,
#> #   mr.selection <chr>, mr.moescore <dbl>

Specify both exposure and outcome traits

Finally, we can look up MR causal inference results for a pair of exposure and outcome.

# Look up a specific pair of exposure+outcome
trait1 <- "Body mass index"
trait2 <- "Coronary heart disease"

endpoint <- "/mr"

mr_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    exposure_trait = trait1,
    outcome_trait = trait2
  ),
  mode = "table"
)

print(mr_df)
#> # A tibble: 18 × 10
#>    exposure.id  exposure.trait  outcome.id  outcome.trait   mr.b  mr.se  mr.pval
#>    <chr>        <chr>           <chr>       <chr>          <dbl>  <dbl>    <dbl>
#>  1 ieu-a-2      Body mass index ieu-a-7     Coronary hear… 0.464 0.0415 5.46e-29
#>  2 ebi-a-GCST0… Body mass index ieu-a-7     Coronary hear… 0.457 0.0410 3.33e-20
#>  3 ieu-a-974    Body mass index ieu-a-7     Coronary hear… 0.389 0.0493 3.42e-15
#>  4 ieu-a-835    Body mass index ieu-a-7     Coronary hear… 0.417 0.0492 1.00e-11
#>  5 ieu-a-974    Body mass index ieu-a-9     Coronary hear… 0.320 0.0536 2.32e- 9
#>  6 ieu-a-2      Body mass index ieu-a-9     Coronary hear… 0.358 0.0535 5.91e- 9
#>  7 ieu-a-835    Body mass index ieu-a-9     Coronary hear… 0.397 0.0604 1.79e- 8
#>  8 ebi-a-GCST0… Body mass index ieu-a-9     Coronary hear… 0.341 0.0590 1.24e- 7
#>  9 ieu-a-95     Body mass index ieu-a-9     Coronary hear… 0.371 0.0708 1.62e- 7
#> 10 ebi-a-GCST0… Body mass index ieu-a-6     Coronary hear… 0.493 0.0986 5.88e- 7
#> 11 ieu-a-785    Body mass index ieu-a-9     Coronary hear… 0.395 0.0609 1.07e- 6
#> 12 ebi-a-GCST0… Body mass index ebi-a-GCST… Coronary hear… 0.309 0.0648 1.81e- 6
#> 13 ebi-a-GCST0… Body mass index ieu-a-8     Coronary hear… 0.309 0.0648 1.81e- 6
#> 14 ebi-a-GCST0… Body mass index ieu-a-7     Coronary hear… 0.275 0.0514 2.76e- 6
#> 15 ieu-a-95     Body mass index ieu-a-7     Coronary hear… 0.455 0.0971 2.82e- 6
#> 16 ieu-a-2      Body mass index ieu-a-8     Coronary hear… 0.317 0.0686 3.93e- 6
#> 17 ieu-a-2      Body mass index ebi-a-GCST… Coronary hear… 0.312 0.0688 5.86e- 6
#> 18 ieu-a-974    Body mass index ieu-a-8     Coronary hear… 0.328 0.0731 6.97e- 6
#> # … with 3 more variables: mr.method <chr>, mr.selection <chr>,
#> #   mr.moescore <dbl>

To query EpiGraphDB directly by GWAS ID, you will need to use the advanced functionality. See the end of this article.


Running the above MR query using a dedicated wrapped function:

mr(
  exposure_trait = trait1,
  outcome_trait = trait2
)
#> # A tibble: 18 × 10
#>    exposure.id  exposure.trait  outcome.id  outcome.trait   mr.b  mr.se  mr.pval
#>    <chr>        <chr>           <chr>       <chr>          <dbl>  <dbl>    <dbl>
#>  1 ieu-a-2      Body mass index ieu-a-7     Coronary hear… 0.464 0.0415 5.46e-29
#>  2 ebi-a-GCST0… Body mass index ieu-a-7     Coronary hear… 0.457 0.0410 3.33e-20
#>  3 ieu-a-974    Body mass index ieu-a-7     Coronary hear… 0.389 0.0493 3.42e-15
#>  4 ieu-a-835    Body mass index ieu-a-7     Coronary hear… 0.417 0.0492 1.00e-11
#>  5 ieu-a-974    Body mass index ieu-a-9     Coronary hear… 0.320 0.0536 2.32e- 9
#>  6 ieu-a-2      Body mass index ieu-a-9     Coronary hear… 0.358 0.0535 5.91e- 9
#>  7 ieu-a-835    Body mass index ieu-a-9     Coronary hear… 0.397 0.0604 1.79e- 8
#>  8 ebi-a-GCST0… Body mass index ieu-a-9     Coronary hear… 0.341 0.0590 1.24e- 7
#>  9 ieu-a-95     Body mass index ieu-a-9     Coronary hear… 0.371 0.0708 1.62e- 7
#> 10 ebi-a-GCST0… Body mass index ieu-a-6     Coronary hear… 0.493 0.0986 5.88e- 7
#> 11 ieu-a-785    Body mass index ieu-a-9     Coronary hear… 0.395 0.0609 1.07e- 6
#> 12 ebi-a-GCST0… Body mass index ebi-a-GCST… Coronary hear… 0.309 0.0648 1.81e- 6
#> 13 ebi-a-GCST0… Body mass index ieu-a-8     Coronary hear… 0.309 0.0648 1.81e- 6
#> 14 ebi-a-GCST0… Body mass index ieu-a-7     Coronary hear… 0.275 0.0514 2.76e- 6
#> 15 ieu-a-95     Body mass index ieu-a-7     Coronary hear… 0.455 0.0971 2.82e- 6
#> 16 ieu-a-2      Body mass index ieu-a-8     Coronary hear… 0.317 0.0686 3.93e- 6
#> 17 ieu-a-2      Body mass index ebi-a-GCST… Coronary hear… 0.312 0.0688 5.86e- 6
#> 18 ieu-a-974    Body mass index ieu-a-8     Coronary hear… 0.328 0.0731 6.97e- 6
#> # … with 3 more variables: mr.method <chr>, mr.selection <chr>,
#> #   mr.moescore <dbl>

Part 3. Looking for literature evidence

Accessing information in the literature is a ubiquitous task in research, be it for novel hypothesis generation or as part of evidence triangulation. EpiGraphDB facilitates fast processing of this information by allowing access to a host of literature-mined relationships that have been structured into semantic triples. These take the general form (subject, predicate, object) and have been generated using contemporary natural language processing techniques applied to a massive amount of published biomedical research papers.

In the following section, we will query the API for the relationship between a given gene and a disease outcome.

# Identity all publications where a gene is mentioned with relation to a disease

gene <- "IL23R"
trait <- "Inflammatory bowel disease"

endpoint <- "/gene/literature"

lit_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    gene_name = gene,
    object_name = trait
  ),
  mode = "table"
)

# Review the found evidence in the literature
print(lit_df)
#> # A tibble: 5 × 6
#>   pubmed_id  gene.name lt.id    lt.name                     lt.type st.predicate
#>   <list>     <chr>     <chr>    <chr>                       <list>  <chr>       
#> 1 <chr [1]>  IL23R     C0021390 Inflammatory Bowel Diseases <chr [… PREDISPOSES 
#> 2 <chr [2]>  IL23R     C0021390 Inflammatory Bowel Diseases <chr [… NEG_ASSOCIA…
#> 3 <chr [1]>  IL23R     C0021390 Inflammatory Bowel Diseases <chr [… CAUSES      
#> 4 <chr [21]> IL23R     C0021390 Inflammatory Bowel Diseases <chr [… ASSOCIATED_…
#> 5 <chr [1]>  IL23R     C0021390 Inflammatory Bowel Diseases <chr [… AFFECTS

The data frame above shows that IL23R has been mentioned in 25 publications (pubmed_id column) in relation to Inflammatory bowel disease, in four predicates.

# Get a list of all PubMed IDs
lit_df %>%
  pull(pubmed_id) %>%
  unlist() %>%
  unique()
#>  [1] "23131344" "21155887" "17484863" "31728561" "18383521" "18383363"
#>  [7] "25159710" "18341487" "18047540" "19575361" "19496308" "18698678"
#> [13] "18088064" "19175939" "19817673" "29248579" "19747142" "20393462"
#> [19] "20067801" "18368064" "21846945" "18164077" "24280935" "27852544"

If you are interested in literature mining analysis, and also matching MR results to literature evidence, please refer to more specific examples in case studies 3 and 2.

EpiGraphDB stores data as nodes (data types) and edges (relationships between nodes). The available nodes can be viewed through the meta/nodes endpoint. Let’s list the available nodes:

endpoint <- "/meta/nodes/list"
meta_node_fields <- query_epigraphdb(
  route = endpoint, params = NULL, mode = "raw"
)
meta_node_fields %>% unlist()
#>  [1] "Disease"        "Drug"           "Efo"            "Gene"          
#>  [5] "Gwas"           "Literature"     "LiteratureTerm" "Pathway"       
#>  [9] "Protein"        "Tissue"         "Variant"

The list of nodes is also available in the documentation, along with the node properties, which are useful for running native Cypher queries. In this section, we want to show how to perform node search for a term of interest. Any node can be searched by specifying it in this query: GET /meta/nodes/{meta_node}/search.

To demonstrate this, we are going to search for ‘breast cancer’ in various nodes:

name <- "breast cancer"

endpoint <- "/meta/nodes/Gwas/search"
results <- query_epigraphdb(
  route = endpoint,
  params = list(name = name),
  mode = "table"
)
results %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )
#> # A tibble: 10 × 5
#>    node.trait                  node.id    node.sample_size node.year node.author
#>    <chr>                       <chr>      <chr>            <chr>     <chr>      
#>  1 Breast cancer               ebi-a-GCS… 89677.0          2015.0    Michailido…
#>  2 Breast cancer               ebi-a-GCS… 139274.0         2017.0    Michailido…
#>  3 Breast cancer (Combined On… ieu-a-1126 228951.0         2017.0    Michailido…
#>  4 Breast cancer (GWAS)        ieu-a-1168 33832.0          2015.0    Michailido…
#>  5 Breast cancer (GWAS)        ieu-a-1131 32498.0          2017.0    Michailido…
#>  6 Breast cancer (Oncoarray)   ieu-a-1129 106776.0         2017.0    Michailido…
#>  7 Breast cancer (Survival)    ieu-a-1165 37954.0          2015.0    Guo Q      
#>  8 Breast cancer (iCOGS)       ieu-a-1162 89677.0          2015.0    Michailido…
#>  9 Breast cancer (iCOGS)       ieu-a-1130 89677.0          2017.0    Michailido…
#> 10 Breast cancer anti-estroge… prot-a-234 3301.0           2018.0    Sun BB

endpoint <- "/meta/nodes/Disease/search"
results <- query_epigraphdb(
  route = endpoint,
  params = list(name = name),
  mode = "table"
)
results %>%
  select(node.label, node.id, node.definition)
#> # A tibble: 7 × 3
#>   node.label                                   node.id     node.definition      
#>   <chr>                                        <chr>       <chr>                
#> 1 Her2-receptor negative breast cancer         http://pur… NA                   
#> 2 breast cancer                                http://pur… A primary or metasta…
#> 3 estrogen-receptor negative breast cancer     http://pur… A subtype of breast …
#> 4 estrogen-receptor positive breast cancer     http://pur… A subtype of breast …
#> 5 progesterone-receptor negative breast cancer http://pur… NA                   
#> 6 progesterone-receptor positive breast cancer http://pur… NA                   
#> 7 sporadic breast cancer                       http://pur… A carcinoma that ari…

endpoint <- "/meta/nodes/Efo/search"
results <- query_epigraphdb(
  route = endpoint,
  params = list(name = name),
  mode = "table"
)
results
#> # A tibble: 10 × 6
#>    node._name      node._source node.id     node._id    node.type node.value    
#>    <chr>           <list>       <chr>       <chr>       <chr>     <chr>         
#>  1 BRCAX breast c… <chr [1]>    http://www… http://www… typed-li… BRCAX breast …
#>  2 Her2-receptor … <chr [1]>    http://pur… http://pur… typed-li… Her2-receptor…
#>  3 Hereditary bre… <chr [1]>    http://www… http://www… typed-li… Hereditary br…
#>  4 age at breast … <chr [1]>    http://www… http://www… typed-li… age at breast…
#>  5 breast cancer   <chr [1]>    http://pur… http://pur… typed-li… breast cancer 
#>  6 breast cancer … <chr [1]>    http://www… http://www… typed-li… breast cancer…
#>  7 breast cancer … <chr [1]>    http://www… http://www… typed-li… breast cancer…
#>  8 breast cancer … <chr [1]>    http://www… http://www… typed-li… breast cancer…
#>  9 estrogen-recep… <chr [1]>    http://www… http://www… typed-li… estrogen-rece…
#> 10 estrogen-recep… <chr [1]>    http://www… http://www… typed-li… estrogen-rece…

The queries above can be further expanded using Cypher as will be shown in the next section. To get more information about meta endpoint please see the guidelines on meta functionalities.

Advanced examples

The functionalities of epigraphdb R package and the REST API of EpiGraphDB are currently limited to a certain number of API endpoints available via the query_epigraphdb function, which are simply a small and limited subset of what a graph database offers. If you would like to further customise your query, EpiGraphDB API supports using Neo4j Cypher to directly query the graph database.

To get you started, we want to show that the majority of API endpoint queries are simple wrappers around Cypher queries which directly request data from the graph database. For example, the simple GWAS query we’ve done in Part 2 using “table” mode, can be executed using “raw” mode to expose the exact Cypher query that was run against the database:

# Running a GWAS query from Part 2
# Ask for "raw" format (as a list)
trait <- "body mass index"
endpoint <- "/meta/nodes/Gwas/search"
response <- query_epigraphdb(
  route = endpoint,
  params = list(name = trait),
  mode = "raw"
)

# display the Cypher query
response$metadata$query
#> [1] "MATCH (node: Gwas) WHERE node.trait =~ \"(?i).*body mass index.*\" RETURN node LIMIT 10;"

This is what a native Cypher query looks like:

query <- "
    MATCH (node: Gwas)
    WHERE node.trait =~ \"(?i).*body mass index.*\"
    RETURN node LIMIT 10;
"

This is how you supply a Cypher query to query_epigraphdb function:

# use POST /cypher

endpoint <- "/cypher"
method <- "POST"
params <- list(query = query)

results <- query_epigraphdb(
  route = endpoint,
  params = params,
  method = method,
  mode = "table"
)

# The result should be identical to the example in Part 2
results %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )
#> # A tibble: 10 × 5
#>    node.trait      node.id          node.sample_size node.year node.author
#>    <chr>           <chr>            <chr>            <chr>     <chr>      
#>  1 Body mass index ieu-a-1089       120286.0         2016.0    Wood       
#>  2 Body mass index ieu-a-974        171977.0         2015.0    Locke AE   
#>  3 Body mass index ieu-a-95         73137.0          2013.0    Randall JC 
#>  4 Body mass index ebi-a-GCST004904 158284.0         2017.0    Akiyama M  
#>  5 Body mass index ebi-a-GCST006368 315347.0         2018.0    Hoffmann TJ
#>  6 Body mass index bbj-a-2          85894.0          2019.0    Ishigaki K 
#>  7 Body mass index ieu-a-835        322154.0         2015.0    Locke AE   
#>  8 Body mass index ieu-a-2          339224.0         2015.0    Locke AE   
#>  9 Body mass index ieu-a-785        152893.0         2015.0    Locke AE   
#> 10 Body mass index bbj-a-1          158284.0         2019.0    Ishigaki K

Next step: let’s modify Cypher query

# Let's return Body mass index GWAS that were done by Locke AE

query <- "
    MATCH (node: Gwas)
    WHERE node.trait =~ \"(?i).*body mass index.*\"
    AND node.author = \"Locke AE\"
    RETURN node;
"

endpoint <- "/cypher"
method <- "POST"
params <- list(query = query)

results_subset <- query_epigraphdb(
  route = endpoint,
  params = params,
  method = method,
  mode = "table"
)

results_subset %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )
#> # A tibble: 4 × 5
#>   node.trait      node.id   node.sample_size node.year node.author
#>   <chr>           <chr>     <chr>            <chr>     <chr>      
#> 1 Body mass index ieu-a-974 171977.0         2015.0    Locke AE   
#> 2 Body mass index ieu-a-835 322154.0         2015.0    Locke AE   
#> 3 Body mass index ieu-a-2   339224.0         2015.0    Locke AE   
#> 4 Body mass index ieu-a-785 152893.0         2015.0    Locke AE

NOTE: Be mindful of the data type of each node property. Please refer to data dictionary to explore data types before writing native Cypher queries.


Now, let’s try making queries by specifying a GWAS ID.

# Extract info only for GWAS 'ieu-a-2'
query <- "
    MATCH (node: Gwas)
    WHERE node.id = \"ieu-a-2\"
    RETURN node;
"
endpoint <- "/cypher"
params <- list(query = query)
results_subset <- query_epigraphdb(
  route = endpoint,
  params = params,
  method = "POST",
  mode = "table"
)

results_subset %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )
#> # A tibble: 1 × 5
#>   node.trait      node.id node.sample_size node.year node.author
#>   <chr>           <chr>   <chr>            <chr>     <chr>      
#> 1 Body mass index ieu-a-2 339224.0         2015.0    Locke AE
# Return MR results only for exposure trait 'ieu-a-2' (body mass index)

# first let's check the MR Cypher query that we run in "table" mode in Part 2
trait1 <- "Body mass index"
endpoint <- "/mr"
mr_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    exposure_trait = trait1,
    pval_threshold = 5e-08
  ),
  mode = "raw"
)
mr_df$metadata$query
#> [1] "MATCH (exposure:Gwas)-[mr:MR_EVE_MR]->(outcome:Gwas) WHERE exposure.trait = \"Body mass index\" AND mr.pval < 5e-08 RETURN exposure {.id, .trait}, outcome {.id, .trait}, mr {.b, .se, .pval, .method, .selection, .moescore} ORDER BY mr.pval ;"

# modify the query to only return 'ieu-a-2' GWAS results
query <- "
  MATCH (exposure:Gwas)-[mr:MR_EVE_MR]->(outcome:Gwas)
  WHERE exposure.id = \"ieu-a-2\"
  AND mr.pval < 5e-08
  RETURN exposure {.id, .trait}, outcome {.id, .trait}, mr {.b, .se, .pval, .method, .selection, .moescore}
  ORDER BY mr.pval ;
"

endpoint <- "/cypher"
params <- list(query = query)
results_subset <- query_epigraphdb(
  route = endpoint,
  params = params,
  method = "POST",
  mode = "table"
)
results_subset
#> # A tibble: 408 × 10
#>    exposure.trait  exposure.id outcome.trait  outcome.id  mr.b   mr.se mr.method
#>    <chr>           <chr>       <chr>          <chr>      <dbl>   <dbl> <chr>    
#>  1 Body mass index ieu-a-2     Comparative b… ukb-b-4650 0.439 0.00989 FE IVW   
#>  2 Body mass index ieu-a-2     Body mass ind… ukb-b-2303 0.674 0.0178  FE IVW   
#>  3 Body mass index ieu-a-2     Basal metabol… ukb-b-164… 0.449 0.0118  FE IVW   
#>  4 Body mass index ieu-a-2     Arm fat perce… ukb-a-282  0.528 0.0125  FE IVW   
#>  5 Body mass index ieu-a-2     Waist circumf… ukb-b-9405 0.645 0.0126  FE IVW   
#>  6 Body mass index ieu-a-2     Arm predicted… ukb-b-9093 0.398 0.00841 FE IVW   
#>  7 Body mass index ieu-a-2     Body mass ind… ieu-a-94   1.01  0.0269  FE IVW   
#>  8 Body mass index ieu-a-2     Hip circumfer… ieu-a-48   0.828 0.0149  FE IVW   
#>  9 Body mass index ieu-a-2     Waist circumf… ieu-a-65   0.732 0.0200  FE IVW   
#> 10 Body mass index ieu-a-2     Hip circumfer… ukb-b-155… 0.652 0.0180  FE IVW   
#> # … with 398 more rows, and 3 more variables: mr.selection <chr>,
#> #   mr.pval <dbl>, mr.moescore <dbl>

# check exposures in the results
results_subset %>% count(exposure.id)
#> # A tibble: 1 × 2
#>   exposure.id     n
#>   <chr>       <int>
#> 1 ieu-a-2       408


Great! You can now use the basic functionality of the R package and make simple Cypher queries to the API. Next, we recommend to work through the case studies and check out the Web UI examples and the EpiGraphDB Gallery.