Novartis/NODE

NODE: Networked Orchestration of Distal Elements

Lifecycle: experimental | License: GPL-3

NODE is an R package for identifying transcription factor (TF) target genes by integrating TF binding sites, 3D chromatin interactions (Hi-C), and gene annotations. It builds a genomic graph to find and classify regulatory paths from distal elements to gene promoters.


🚀 Getting Started

Installation

# install.packages("remotes")
remotes::install_github("Novartis/NODE")

📖 Core Concepts

NODE classifies TF-to-gene links by the shortest path in the genomic graph.

| Path Type | Description | Graph Structure |
|---|---|---|
| 👑 direct | TF peak overlaps a gene promoter. | TF -> Transcript -> Gene |
| 🔗 hic | TF peak connects to a promoter via one Hi-C loop. | TF -> Region A -> Region B -> Transcript -> Gene |
| 🧬 hic_hopped | TF peak connects to a promoter via two linked Hi-C loops. | TF -> Region A -> Region B -> Region C -> Transcript -> Gene |
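
To make the three path types concrete, here is a minimal sketch using igraph on a toy graph. The node names and the `classify_path()` helper are made up for illustration, and classifying purely by edge count is an assumption drawn from the graph structures in the table, not NODE's actual implementation.

```r
library(igraph)

# Toy edge list mirroring the three graph structures above (all node names are hypothetical)
edges <- data.frame(
  from = c("tf_direct", "tx1",
           "tf_hic", "regA", "regB", "tx2",
           "tf_hop", "regC", "regD", "regE", "tx3"),
  to   = c("tx1", "gene1",
           "regA", "regB", "tx2", "gene2",
           "regC", "regD", "regE", "tx3", "gene3")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Hypothetical helper: label a TF-to-gene link by the number of edges on its shortest path
classify_path <- function(g, tf, gene) {
  n_edges <- distances(g, v = tf, to = gene, mode = "out")[1, 1]
  if (n_edges == 2) {
    "direct"        # TF -> Transcript -> Gene
  } else if (n_edges == 4) {
    "hic"           # one Hi-C loop (Region A -> Region B)
  } else if (n_edges == 5) {
    "hic_hopped"    # two linked Hi-C loops
  } else {
    NA_character_   # unreachable or longer than the supported path types
  }
}

classify_path(g, "tf_direct", "gene1")
classify_path(g, "tf_hic", "gene2")
classify_path(g, "tf_hop", "gene3")
```

Using the shortest path means a TF peak that reaches a gene both directly and through a loop would be labeled by the shorter, direct route.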

✨ Features

  • Integrated Network Analysis: Builds a single igraph object from multiple genomic data types.
  • Path-Based Classification: Categorizes links as direct, hic, or hic_hopped.
  • Input Validation: Pre-flight checks validate all inputs before execution.
  • Optional Cis-Regulatory Filtering: Focus analysis on regions overlapping with known regulatory elements.
  • Reproducible by Design: Creates a self-contained project with comprehensive metadata.
  • HPC-Ready: Scales to large analyses with batchtools and SGE support.

🛠️ Configuration & Inputs

Input Data Requirements

  • TF Peaks: A named GRanges object of TF ChIP-seq peaks.
  • Gene List: A character vector of target gene IDs (e.g., ENSEMBL).
  • Hi-C Loops: A tab-delimited Hi-C loop file in BEDPE-like format.
  • Chromosome Sizes: A tab-delimited file with chromosome names and lengths.
  • Annotations: A TxDb and an OrgDb annotation object.
  • Configuration: Parameters for project, cluster, and analysis settings.

Hi-C File Format

NODE expects a tab-delimited file (plain or gzipped) with at least six columns describing the two interacting anchors. A header line is optional, and commented headers are accepted.

Required Columns:

# Columns 1-6 are standardized internally
chrom1  start1  end1    chrom2  start2  end2
chr1    10000   20000   chr1    50000   60000

Additional columns from tools like Juicer HiCCUPS are allowed but ignored.
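
To check a loop file before a run, something like the following base R sketch can help. `read_loops()` is a hypothetical helper, not a NODE function, and the rules it applies (lines starting with `#` are comments, a non-numeric second field marks a header row) are assumptions about what "commented headers" and "with or without a header" mean in practice.

```r
# Hypothetical helper, not part of NODE: read the first six columns of a BEDPE-like file
read_loops <- function(path) {
  con <- gzfile(path, "rt")   # gzfile() reads plain and gzipped files alike
  on.exit(close(con))
  lines <- readLines(con)
  lines <- lines[nzchar(lines) & !grepl("^#", lines)]   # drop blank lines and commented headers
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # Assumption: a first row whose start1 field is not an integer is a header row
  if (length(fields) > 0 && !grepl("^[0-9]+$", fields[[1]][2])) {
    fields <- fields[-1]
  }
  stopifnot(all(lengths(fields) >= 6))
  # Keep columns 1-6 only; extra columns (e.g. scores) are ignored
  loops <- as.data.frame(do.call(rbind, lapply(fields, `[`, 1:6)),
                         stringsAsFactors = FALSE)
  names(loops) <- c("chrom1", "start1", "end1", "chrom2", "start2", "end2")
  for (col in c("start1", "end1", "start2", "end2")) {
    loops[[col]] <- as.integer(loops[[col]])
  }
  loops
}

# Usage on a small temporary file with a commented header and one extra column
tmp <- tempfile(fileext = ".bedpe")
writeLines(c("#chrom1\tstart1\tend1\tchrom2\tstart2\tend2\tscore",
             "chr1\t10000\t20000\tchr1\t50000\t60000\t12.5"), tmp)
loops <- read_loops(tmp)
str(loops)
```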

HPC and batchtools Configuration

For large analyses on an HPC cluster, point the batchtools directories to a shared scratch space for better performance.

  • batchtools_registry_dir: Stores temporary job files.
  • batchtools_template_dir: Stores the SGE template file.

node_results <- run_NODE(
    # ... other parameters ...
    batchtools_registry_dir = "/path/to/cluster/scratch/registry",
    batchtools_template_dir = "/path/to/cluster/scratch/templates",
    # ... other parameters ...
)

📂 Project Structure & Outputs

NODE creates a self-contained project directory for each run, ensuring results are organized and reproducible.

my_tf_analysis/
├── 📁 data_prepped/  (Intermediate data files)
├── 📝 metadata/      (Run parameters and session info)
└── 📊 results/       (Final result tables)
    ├── NODE_results.rds
    └── NODE_results.tsv

The output table includes TF peaks, gene information, path classifications, and the genomic elements forming the link.
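
Results from an earlier run can be reloaded from the project directory without rerunning the analysis. The sketch below assumes the layout shown above; `read_node_results()` is a hypothetical convenience wrapper, not a function exported by NODE.

```r
# Hypothetical helper, not part of NODE: load the TSV result table from a project directory
read_node_results <- function(project_dir) {
  tsv <- file.path(project_dir, "results", "NODE_results.tsv")
  if (!file.exists(tsv)) {
    stop("No results table found at ", tsv)
  }
  read.delim(tsv, stringsAsFactors = FALSE)
}

# Example call, assuming a finished run in ./my_tf_analysis:
# node_results <- read_node_results("./my_tf_analysis")
```

The `.rds` file in the same folder can be loaded with `readRDS()` when the original R object is preferred over the flat table.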


Minimal Example

library(NODE)
library(GenomicRanges)
library(TxDb.Hsapiens.GENCODE.v46.hg38)
library(org.Hs.eg.db)  # human OrgDb package from Bioconductor

# 1. Define inputs
my_tf_peaks <- GRanges(
    seqnames = c("chr1", "chr1"),
    ranges = IRanges::IRanges(start = c(100000, 250000), end = c(100500, 250500))
)
names(my_tf_peaks) <- c("peak_1", "peak_2")

my_gene_list <- c("ENSG00000123456", "ENSG00000654321")
hic_file <- "/path/to/your/loops.bedpe.gz"
chrom_file <- "/path/to/your/hg38.chrom.sizes"

# 2. Run NODE
node_results <- run_NODE(
    project_dir = "./my_tf_analysis",
    tfPeaks = my_tf_peaks,
    geneList = my_gene_list,
    hic_bedpe_file = hic_file,
    chrom_sizes_file = chrom_file,
    txdb = TxDb.Hsapiens.GENCODE.v46.hg38::TxDb.Hsapiens.GENCODE.v46.hg38,
    orgdb = org.Hs.eg.db::org.Hs.eg.db,
    hic_bins = 10000,
    orgdb_keytype = "ENSEMBLTRANS",
    orgdb_columns = c("ENSEMBL", "SYMBOL"),
    ensembl_column = "ENSEMBL",
    ucsc_genome_build = "hg38",
    batchtools_cores = 2,
    batchtools_mem_gb = 8,
    batchtools_chunks = 100
)

# 3. Explore results
head(node_results)

🆘 Troubleshooting

Common Issues & Solutions

  • tfPeaks must be named: Ensure your GRanges object has unique names.
    names(my_tf_peaks) <- paste0("peak_", seq_along(my_tf_peaks))
  • Chromosome names do not match: Use consistent chromosome naming (e.g., chr1, chr2) across all input files and annotation objects.
  • No paths found: This can happen if:
    • Genomic regions in your input files do not physically overlap.
    • hic_bins does not match the resolution of your Hi-C data.
    • Genome builds are mismatched between files.
    • Optional cis-regulatory filters are too restrictive.
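
Two of the causes above, mismatched chromosome names and a wrong hic_bins value, can be checked quickly before a run. The following base R sketch is a hypothetical diagnostic, not part of NODE. It assumes the loops and chromosome sizes are already loaded as data frames with the column names shown, and that each Hi-C anchor spans exactly one bin, which may not hold for every loop caller.

```r
# Hypothetical diagnostic, not part of NODE
check_loops <- function(loops, chrom_sizes, hic_bins) {
  loop_chroms <- unique(c(loops$chrom1, loops$chrom2))
  # BEDPE coordinates are 0-based and half-open, so width = end - start
  widths <- c(loops$end1 - loops$start1, loops$end2 - loops$start2)
  list(
    unknown_chroms = setdiff(loop_chroms, chrom_sizes$chrom),  # names absent from the sizes file
    bins_match     = all(widths == hic_bins)                   # TRUE if every anchor is one bin wide
  )
}

loops <- data.frame(
  chrom1 = "chr1", start1 = 10000, end1 = 20000,
  chrom2 = "chr1", start2 = 50000, end2 = 60000
)

# Sizes file that uses "1" instead of "chr1" (length is a placeholder value)
sizes_no_prefix <- data.frame(chrom = "1", size = 1e6)

report <- check_loops(loops, sizes_no_prefix, hic_bins = 5000)
report$unknown_chroms   # "chr1" is reported as unknown
report$bins_match       # FALSE: anchors are 10000 bp wide, not 5000
```

An empty `unknown_chroms` and `bins_match` of `TRUE` rule out these two causes; the remaining ones (genome build, filter settings) need to be checked against the input sources.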

📜 License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
