About GenomicKB

Genomic Knowledgebase (GenomicKB) is a database that lets researchers explore and investigate the human genome, epigenome, transcriptome, and 4D nucleome with simple and efficient queries.

The backend of GenomicKB is a knowledge graph which consolidates genomic datasets and annotations from over 30 consortia and portals, and includes 347 million genomic entities, 1.36 billion relations, and 3.9 billion entity and relation properties.

The frontend of GenomicKB is a user-friendly interface which allows users to query the knowledge graph with customized graph patterns and specific constraints on entities and relations.

Compared with traditional tabular-structured data stored in separate data portals, GenomicKB emphasizes the relations among genomic entities, intuitively connects isolated data matrices, and supports efficient queries for scientific discoveries.

GenomicKB queries do not need to start with a genomic region or a specific genomic entity. Instead, GenomicKB supports customized pattern queries such as “find two genes that are both related to signal transduction, are located on the same chromosome, and form a ligand-receptor pair”.

As a result, GenomicKB transforms multi-modal data analysis into intuitive queries, and enables large-scale cross-modality pattern searching and learning in a highly-integrated knowledge graph.
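
The pattern query quoted above can be illustrated with a minimal sketch over a toy in-memory graph. This is not GenomicKB's implementation or schema: the node identifiers, the property names (`chromosome`, `function`), and the `ligand_receptor` relation are all assumptions made for illustration, and the functional annotation is simplified to a node property.

```python
# Toy in-memory graph. Every identifier, property name, and the
# "ligand_receptor" relation below is hypothetical, for illustration only.
NODES = {
    "gene_a": {"type": "gene", "chromosome": "chr1", "function": "signal transduction"},
    "gene_b": {"type": "gene", "chromosome": "chr1", "function": "signal transduction"},
    "gene_c": {"type": "gene", "chromosome": "chr2", "function": "signal transduction"},
    "gene_d": {"type": "gene", "chromosome": "chr1", "function": "DNA repair"},
}
EDGES = [
    ("gene_a", "ligand_receptor", "gene_b"),  # same chromosome, both signaling
    ("gene_a", "ligand_receptor", "gene_c"),  # different chromosomes
    ("gene_a", "ligand_receptor", "gene_d"),  # gene_d is not a signaling gene
]


def find_gene_pairs(nodes, edges, function="signal transduction"):
    """Return (source, target) gene pairs that satisfy all three constraints."""
    matches = []
    for source, relation, target in edges:
        if relation != "ligand_receptor":
            continue
        a, b = nodes[source], nodes[target]
        both_genes = a["type"] == "gene" and b["type"] == "gene"
        both_functional = a["function"] == function and b["function"] == function
        same_chromosome = a["chromosome"] == b["chromosome"]
        if both_genes and both_functional and same_chromosome:
            matches.append((source, target))
    return matches


print(find_gene_pairs(NODES, EDGES))
```

Note that the search is driven by the pattern and its constraints rather than by a starting genomic region, which is the point the example query makes.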


FAQ

Q: Are there any requirements for user queries?

A: Users can draw customized sub-graph queries and add node property constraints. A query may contain only one connected component. For example, the four-node query [A-->B, C-->D] is not allowed, because its two components must be submitted as separate queries.
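
The single-component requirement can be checked before submitting a query. The following is a minimal sketch, assuming the query is given as a list of node names and a list of directed (source, target) edges; the function name is hypothetical and is not part of GenomicKB.

```python
from collections import defaultdict


def is_single_component(nodes, edges):
    """Return True if the query graph is connected when edge direction is ignored."""
    nodes = list(nodes)
    if not nodes:
        return False
    # Build an undirected adjacency map from the directed query edges.
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    # Depth-first traversal from an arbitrary node.
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen >= set(nodes)


# The disallowed example from the answer above: A-->B and C-->D.
print(is_single_component(["A", "B", "C", "D"], [("A", "B"), ("C", "D")]))
# Adding B-->C joins the two components into one valid query.
print(is_single_component(["A", "B", "C", "D"], [("A", "B"), ("B", "C"), ("C", "D")]))
```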


Q: How can I know where to find a node type (e.g., gene) in the “add node” window?

A: The hierarchy of node types and relation types is shown in the tables below.


Q: How can I find out more about a node/relation type? For example, which other nodes does a gene usually connect to?

A: The “Example nodes and edges” section lists example nodes and relations for all node types.


Q: The database I am interested in is not covered by GenomicKB, is it possible to import it?

A: We do not currently support user data imports. However, because knowledge graphs adapt easily to updates of nodes, relations, and entire data sources, we are open to adding new data from well-established sources. You can suggest new data sources to our team by e-mail, and we are happy to include more!


Q: How many nodes/relations are displayed on the result page?

A: It depends on the complexity of the query. The Neo4j backend tries to find sub-graphs that match the user query. At most twenty sub-graphs are visualized for a one-node query, ten for a two-node query, and five for queries with more nodes. However, identified sub-graphs may overlap, so ten result sub-graphs for a two-node query do not necessarily mean twenty distinct nodes.
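
The display limits and the effect of overlapping sub-graphs can be sketched as follows. The function names and the example node names are hypothetical; only the limits (20, 10, 5) come from the answer above.

```python
def max_displayed_subgraphs(query_node_count):
    """Upper bound on visualized sub-graphs for a query with the given node count."""
    if query_node_count < 1:
        raise ValueError("a query needs at least one node")
    if query_node_count == 1:
        return 20
    if query_node_count == 2:
        return 10
    return 5


def count_distinct_nodes(subgraphs):
    """Count unique nodes across result sub-graphs, which may share nodes."""
    return len({node for subgraph in subgraphs for node in subgraph})


# Ten two-node results that all share the same (hypothetical) gene node.
results = [("gene_x", f"variant_{i}") for i in range(10)]
print(max_displayed_subgraphs(2))     # 10
print(count_distinct_nodes(results))  # 11, not 20
```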


Q: How can I get the complete result?

A: By clicking “export”, users can download the complete result of the user query. “Export summary” returns the count of all node/relation types, and “export all” returns the complete result graph in Microsoft Excel format.


Q: Why does “export all” take a long time?

A: Exporting complete results requires searching the entire knowledge graph (more than 300 million nodes and 1 billion relations), so a longer wait is normal. If the wait exceeds 10 minutes, try narrowing down the results by adding node constraints (e.g., chromosomes).



Coverage of GenomicKB


Table 1. The summary of entities in GenomicKB

| Entity Type | Entity Sub-type | Data Source | Number of Entities |
|---|---|---|---|
| Coding Elements | Genes | GENCODE | 61186 |
| | Transcripts | GENCODE | 236816 |
| | Exons | GENCODE | 643060 |
| | Proteins | GENCODE | 106140 |
| Non-coding Elements | Enhancers | ENCODE Candidate cis-Regulatory Elements (CCRE) | 809429 |
| | | ENdb | 249 |
| | | EnhancerAtlas | 2895013 |
| | Insulators | ENCODE Candidate cis-Regulatory Elements (CCRE) | 56766 |
| | Promoters | The Eukaryotic Promoter Database (EPD) | 21071 |
| | | ENCODE Candidate cis-Regulatory Elements (CCRE) | 34803 |
| | Super-enhancers | dbSuper | 38030 |
| | non_coding_RNA | RNAcentral | 474310 |
| Genomic Variants | SNPs | Genotype-Tissue Expression (GTEx) | 4295337 |
| | | GWAS Catalog | 167191 |
| | | dbSNP | 18117451 |
| | insertion/deletion | Genotype-Tissue Expression (GTEx) | 337120 |
| | | dbSNP | 676693 |
| | Indel | dbSNP | 2313181 |
| | MNP | dbSNP | 139 |
| | Structural variants | Database of Genomic Variants (DGV) | 808608 |
| | | NCBI dbVar | 67718 |
| 3D structures | Topologically associating domains (TADs) | 4D Nucleome (4DN) | 44643 |
| | 4DN compartment | 4D Nucleome (4DN) | 7879 |
| | Chromatin loops | 4D Nucleome (4DN) | 37892 |
| | Frequently interacting regions (FIREs) | FIRE studies | 20960 |
| Epigenomic features | Transcriptional factor binding profile | ENCODE | 219830128 |
| | Transcriptional factor binding motifs | MotifMap | 3996453 |
| | DNase-hypersensitivity sites | ENCODE | 21858996 |
| | Histone binding profiles | ENCODE | 44947427 |
| | Replication Timing | 4D Nucleome (4DN) | 354962 |
| | ChromHMM State | ChromHMM | 4143552 |
| Ontologies | Tissue and cell lines | Cell ontology (CL) | 2493 27783 |
| | | Uber-anatomy ontology (UBERON) | 15398 |
| | | BRENDA tissue ontology (BTO) | 6520 |
| | Experimental factors | Experimental factor ontology (EFO) | 11299 28472 |
| | Gene ontologies | Gene ontologies (GO) | 50635 |
| | Transcriptional factors | Human transcriptional factors | 2765 |


Table 2. The summary of relationships in GenomicKB frontend

| Relationship type | Relationship subtype | From | To |
|---|---|---|---|
| Positional | Overlap (overlap) | All entities that have a location property | All entities that have a location property |
| | Locate in (locate_in) | All entities that have a location property | All entities that have a location property |
| | Upstream | All entities that have a location property | All entities that have a location property |
| | Downstream | All entities that have a location property | All entities that have a location property |
| Expression | Express into (express_into) | Genes | Transcriptional factors |
| | Express in (express_in) | Genes | Tissue and cell lines |
| | Transcribe (transcribe_into) | Genes | Transcripts |
| | Translate (translate_into) | Transcripts | Proteins |
| | Include | Transcripts | Exons |
| Regulatory | Regulate | Genes | Genes |
| | | Enhancers | Genes |
| | Expression QTLs (correlate_with) | Variants | Genes |
| | | Variants | Ontologies |
| Annotation | Belong to | Genes | Ontologies |
| | | Non coding RNA | Ontologies |
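
The four positional relationship subtypes in Table 2 can be illustrated with a small interval comparison. This is only a sketch under stated assumptions, not GenomicKB's definition: coordinates are assumed to be inclusive, strand is ignored, "upstream" is taken to mean "entirely at lower coordinates", and "locate_in" is treated as a special case of "overlap". The `Region` type and function name are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int  # inclusive (assumption)
    end: int    # inclusive (assumption)


def positional_relations(a, b):
    """Return the set of positional relations that hold from region a to region b."""
    if a.chromosome != b.chromosome:
        return set()  # no positional relation across chromosomes
    relations = set()
    if a.start <= b.end and b.start <= a.end:
        relations.add("overlap")
        if b.start <= a.start and a.end <= b.end:
            relations.add("locate_in")  # a lies entirely inside b
    elif a.end < b.start:
        relations.add("upstream")       # assumption: lower coordinates, strand ignored
    else:
        relations.add("downstream")
    return relations


print(positional_relations(Region("chr1", 100, 200), Region("chr1", 150, 300)))
print(positional_relations(Region("chr1", 150, 180), Region("chr1", 100, 200)))
print(positional_relations(Region("chr1", 10, 20), Region("chr1", 100, 200)))
```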

According to our benchmarking results and suggestions from Neo4j, we summarize the following strategies to increase query efficiency.

  1. Splitting queries.
    For example, if users would like to query “which variants are in gene p53, and which other genes does gene p53 regulate”, splitting it into two queries does not change the results but increases efficiency.
  2. Adding restrictions.
    If applicable, restrictions such as chromosome=“chr1” narrow down the search range and increase efficiency.
  3. Reducing the number of positional relations between two nodes at different resolutions.
    For example, the two-node relation “a CTCF binding site overlaps a chromatin loop” is translated into “a CTCF binding site is located on a 200-bp genomic region, the 200-bp genomic region is in a 5,000-bp genomic region, and a chromatin loop anchors at the 5,000-bp genomic region”, which includes 4 nodes and 3 edges.
    Therefore, more cross-resolution positional relations increase the number of nodes in the query graph and slow down the query.
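
The cost of a cross-resolution positional relation (strategy 3) can be made concrete with a minimal sketch. The bin node names and the relation names `locate_on` and `anchor_at` are hypothetical placeholders; only the 4-node, 3-edge shape of the expanded query comes from the example above.

```python
def expand_overlap(entity, loop):
    """Expand one 'entity overlaps loop' relation into its multi-resolution form."""
    fine_bin = "region_200bp"     # hypothetical placeholder for the 200-bp region node
    coarse_bin = "region_5000bp"  # hypothetical placeholder for the 5,000-bp region node
    nodes = [entity, fine_bin, coarse_bin, loop]
    edges = [
        (entity, "locate_on", fine_bin),      # relation names are assumptions
        (fine_bin, "locate_in", coarse_bin),
        (loop, "anchor_at", coarse_bin),
    ]
    return nodes, edges


def expanded_size(node_count, edge_count, cross_resolution_relations):
    """Each expanded relation adds 2 region nodes and turns 1 edge into 3."""
    return (node_count + 2 * cross_resolution_relations,
            edge_count + 2 * cross_resolution_relations)


nodes, edges = expand_overlap("CTCF_binding_site", "chromatin_loop")
print(len(nodes), len(edges))  # 4 3
print(expanded_size(2, 1, 1))  # (4, 3)
```

Under this assumption, every additional cross-resolution relation grows the query graph by two nodes and two edges, which is why reducing such relations helps.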

Example nodes and edges

  • Coding elements

    • Gene

      A basic unit of heredity and a sequence of nucleotides in DNA that encodes the synthesis of a gene product, either RNA or protein.

      Possible relationships:

      • enhancer/gene -[regulate]-> gene
      • gene -[transcribe_into]-> transcript
      • gene -[belong_to]-> gene_ontology
      • sequence_variant -[correlate_with]-> gene (eQTL)
      • gene -[express_in]-> tissue_or_cell_line
      • gene -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Transcript

      The product of gene transcription. Due to alternative splicing, one gene might correspond to multiple transcripts. A transcript also comprises a combination of introns and exons.

      Possible relationships:

      • gene -[transcribe_into]-> transcript
      • transcript -[translate_into]-> protein
      • transcript -[include]-> exon
      • transcript -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Exon

      In GenomicKB, a transcript is represented as a set of exons (from Ensembl).

      Possible relationships:

      • transcript -[include]-> exon
    • Protein

      The product of RNA translation.

      Possible relationships:

      • transcript -[translate_into]-> protein
  • Non coding elements

    • Enhancer

      An enhancer is a short (50–1500 bp) region of DNA that can be bound by proteins (activators) to increase the likelihood that transcription of a particular gene will occur. Enhancers are cis-acting. They can be located up to 1 Mbp (1,000,000 bp) away from the gene, upstream or downstream from the start site.

      Possible relationships:

      • enhancer/gene -[regulate]-> gene
      • enhancer -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Promoter

      A sequence of DNA to which proteins bind to initiate transcription of a single RNA transcript from the DNA downstream of the promoter. Promoters are located near the transcription start sites of genes, upstream on the DNA.

      Possible relationships:

      • promoter -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Insulator

      A type of cis-regulatory element known as a long-range regulatory element. Insulators function either as an enhancer-blocker or a barrier, or both.

      Possible relationships:

      • insulator -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Super Enhancer

      The term 'super-enhancer' has been used to describe groups of putative enhancers in close genomic proximity with unusually high levels of Mediator binding, as measured by chromatin immunoprecipitation and sequencing (ChIP-seq).

      Possible relationships:

      • super_enhancer -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Non Coding RNA

      An RNA molecule that is not translated into a protein. Abundant and functionally important types of non-coding RNAs include transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs), as well as small RNAs such as microRNAs, siRNAs, piRNAs, snoRNAs, snRNAs, exRNAs, scaRNAs and the long ncRNAs such as Xist and HOTAIR.

      Possible relationships:

      • non_coding_RNA -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
      • non_coding_RNA -[belong_to]-> gene_ontology
  • Variants

    • Sequence Variant

      Short genomic variants (several base pairs).

      Possible relationships:

      • sequence_variant -[correlate_with]-> gene (eQTL)
      • sequence_variant -[correlate_with]-> experimental_factor (GWAS)
      • sequence_variant -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Structural Variant

      Long genomic variants (several thousand base pairs).

      Possible relationships:

      • structural_variant -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
  • Epigenomic features

    • TF Binding Site

      TF binding sites from ChIP-seq experiments.

      Possible relationships:

      • TF_binding_site -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Histone Binding Site

      Histone binding sites from ChIP-seq experiments.

      Possible relationships:

      • Histone_binding_site -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • DNase hypersensitivity site

      Open chromatin regions from DNase-seq experiments.

      Possible relationships:

      • DNase_hypersensitivity_site -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • TF Binding Motif

      DNA motifs that a specific TF binds to.

      Possible relationships:

      • TF_binding_motif -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Replication Timing

      Replication timing refers to the order in which segments of DNA along the length of a chromosome are duplicated. We include early/late two-stage timing in GenomicKB.

      Possible relationships:

      • replication_timing -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • ChromHMM State

      ChromHMM is software for learning and characterizing chromatin states. The 15-state results are included in GenomicKB.

      Possible relationships:

      • ChromHMM_state -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
  • 3D structure

    • TAD

      Topologically associating domain. A self-interacting genomic region, meaning that DNA sequences within a TAD physically interact with each other more frequently than with sequences outside the TAD.

      Possible relationships:

      • TAD -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • Loop

      In a DNA looping event, chromatin forms physical loops, bringing DNA regions into close contact. Thus, even regions that are far apart along the linear chromosome can be brought together in three-dimensional space.

      Because loops have two anchor locations, the positional relationships “locate_in”, “upstream”, and “downstream” might be ambiguous. Therefore, only “overlap” is supported in GenomicKB (i.e., an entity overlaps with loop anchors).

      Possible relationships:

      • loop -[overlap]- (any entities with coordinates)
    • FIRE Region

      Frequently interacting regions (FIREs) were identified by studying a compendium of Hi-C datasets across 14 human primary tissues and 7 cell types. (ref: https://www.sciencedirect.com/science/article/pii/S2211124716314814)

      Possible relationships:

      • FIRE_region -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
    • AB Compartment

      Researchers noticed that the whole genome could be split into two spatial compartments, labelled A and B, where regions in compartment A tend to interact preferentially with other A compartment-associated regions rather than with B compartment-associated ones. Similarly, regions in compartment B tend to associate with other B compartment-associated regions.

      Possible relationships:

      • AB_compartment -[overlap/locate_in/downstream/upstream]- (any entities with coordinates)
  • Ontology

    • Tissue or Cell Line

      Possible relationships:

      • gene -[express_in]-> tissue_or_cell_line
      • tissue_or_cell_line -[subclass_of]-> tissue_or_cell_line
    • Gene Ontology

      Gene function annotation.

      Possible relationships:

      • gene_ontology -[subclass_of]-> gene_ontology
      • gene/non_coding_RNA -[belong_to]-> gene_ontology
    • Experimental Factor

      Experimental factors in GenomicKB include tissue or cell line names and diseases.

      Possible relationships:

      • experimental_factor -[subclass_of]-> experimental_factor
      • sequence_variant -[correlate_with]-> experimental_factor (GWAS)