NODE is an R package for identifying transcription factor (TF) target genes by integrating TF binding sites, 3D chromatin interactions (Hi-C), and gene annotations. It builds a genomic graph to find and classify regulatory paths from distal elements to gene promoters.
# install.packages("remotes")
remotes::install_github("Novartis/NODE")

NODE classifies TF-to-gene links by the shortest path in the genomic graph.
| Path Type | Description | Graph Structure |
|---|---|---|
| 👑 `direct` | TF peak overlaps a gene promoter. | TF -> Transcript -> Gene |
| 🔗 `hic` | TF peak connects to a promoter via one Hi-C loop. | TF -> Region A -> Region B -> Transcript -> Gene |
| 🧬 `hic_hopped` | TF peak connects to a promoter via two linked Hi-C loops. | TF -> Region A -> Region B -> Region C -> Transcript -> Gene |
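The graph structures in the table suggest that each path type corresponds to a distinct shortest-path length from a TF peak to a gene. The following is a minimal sketch of that idea using `igraph`, not NODE's internal implementation: the toy edge list, the node names, and the `classify_link()` helper are all hypothetical, and the mapping from edge count to path type is an assumption read off the table.

```r
library(igraph)

# Toy directed graph with one link of each type (illustrative node names).
edges <- data.frame(
  from = c("peak_1", "tx1",
           "peak_2", "regionA", "regionB", "tx2",
           "peak_3", "regionC", "regionD", "regionE", "tx3"),
  to   = c("tx1", "gene1",
           "regionA", "regionB", "tx2", "gene2",
           "regionC", "regionD", "regionE", "tx3", "gene3"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = TRUE)

# Hypothetical helper: classify a peak-gene pair by the number of edges
# on the shortest path (assumed: 2 = direct, 4 = hic, 5 = hic_hopped).
classify_link <- function(graph, peak, gene) {
  n_edges <- distances(graph, v = peak, to = gene, mode = "out")[1, 1]
  if (is.infinite(n_edges)) {
    return(NA_character_)  # no path between this peak and gene
  }
  switch(as.character(n_edges),
         "2" = "direct",
         "4" = "hic",
         "5" = "hic_hopped",
         NA_character_)
}

classify_link(g, "peak_1", "gene1")
classify_link(g, "peak_2", "gene2")
classify_link(g, "peak_3", "gene3")
classify_link(g, "peak_1", "gene3")
```

In this toy graph each Hi-C loop is a single directed edge between two regions. Real loops are symmetric, so a fuller model would add edges in both directions or use an undirected graph.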
- Integrated Network Analysis: Builds a single `igraph` object from multiple genomic data types.
- Path-Based Classification: Categorizes links as `direct`, `hic`, or `hic_hopped`.
- Robust Input Validation: Pre-flight checks ensure all inputs are correct before execution.
- Optional Cis-Regulatory Filtering: Focus analysis on regions overlapping with known regulatory elements.
- Reproducible by Design: Creates a self-contained project with comprehensive metadata.
- HPC-Ready: Scales to large analyses with `batchtools` and SGE support.
Input Data Requirements
- TF Peaks: A named `GRanges` object of TF ChIP-seq peaks.
- Gene List: A `character` vector of target gene IDs (e.g., ENSEMBL).
- Hi-C Loops: A tab-delimited Hi-C loop file in BEDPE-like format.
- Chromosome Sizes: A tab-delimited file with chromosome names and lengths.
- Annotations: A `TxDb` and an `OrgDb` annotation object.
- Configuration: Parameters for project, cluster, and analysis settings.
Hi-C File Format
NODE expects a tab-delimited file (plain or gzipped) with at least six columns representing the interacting anchors. It robustly handles files with or without a header (including commented headers).
Required Columns:
# Columns 1-6 are standardized internally
chrom1 start1 end1 chrom2 start2 end2
chr1 10000 20000 chr1 50000 60000
Additional columns from tools like Juicer HiCCUPS are allowed but ignored.
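To make the format concrete, here is a rough base-R sketch of one way to reduce such a file to the six anchor columns. It is not NODE's own parser: `read_loops()` is a hypothetical helper, and it assumes a data row can be recognized by an integer in the second column.

```r
# Hypothetical helper: read a plain or gzipped BEDPE-like file and keep
# only the six anchor columns, whether or not a header line is present.
read_loops <- function(path) {
  con <- gzfile(path, open = "rt")  # gzfile() also reads uncompressed files
  on.exit(close(con))
  lines <- readLines(con)
  lines <- lines[nzchar(lines)]
  fields <- strsplit(lines, "\t", fixed = TRUE)

  # Keep rows with at least six fields and an integer start1; this drops
  # plain headers, "#"-commented headers, and malformed rows.
  is_data <- vapply(
    fields,
    function(f) length(f) >= 6 && grepl("^[0-9]+$", f[2]),
    logical(1)
  )
  fields <- fields[is_data]

  col <- function(i) vapply(fields, `[`, character(1), i)
  data.frame(
    chrom1 = col(1), start1 = as.integer(col(2)), end1 = as.integer(col(3)),
    chrom2 = col(4), start2 = as.integer(col(5)), end2 = as.integer(col(6)),
    stringsAsFactors = FALSE
  )
}

# Usage: a file with a commented header and one extra (ignored) column.
tmp <- tempfile(fileext = ".bedpe")
writeLines(c("#chrom1\tstart1\tend1\tchrom2\tstart2\tend2\tscore",
             "chr1\t10000\t20000\tchr1\t50000\t60000\t12.5"), tmp)
loops <- read_loops(tmp)
loops
```

Because the sketch drops non-conforming rows silently, it is worth comparing `nrow(loops)` against the number of lines you expect.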
HPC and `batchtools` Configuration
For large analyses on an HPC cluster, point the batchtools directories to a shared scratch space for better performance.
- `batchtools_registry_dir`: Stores temporary job files.
- `batchtools_template_dir`: Stores the SGE template file.
node_results <- run_NODE(
  # ... other parameters ...
  batchtools_registry_dir = "/path/to/cluster/scratch/registry",
  batchtools_template_dir = "/path/to/cluster/scratch/templates",
  # ... other parameters ...
)

NODE creates a self-contained project directory for each run, ensuring results are organized and reproducible.
my_tf_analysis/
├── 📁 data_prepped/ (Intermediate data files)
├── 📝 metadata/ (Run parameters and session info)
└── 📊 results/ (Final result tables)
    ├── NODE_results.rds
    └── NODE_results.tsv
The output table includes TF peaks, gene information, path classifications, and the genomic elements forming the link.
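Since the results are written to disk, a finished run can be reloaded without calling `run_NODE()` again. The sketch below assumes only the directory layout shown above; `load_node_results()` is a hypothetical convenience function, not part of the package.

```r
# Hypothetical helper: load results from a NODE project directory,
# preferring the RDS file and falling back to the TSV.
load_node_results <- function(project_dir) {
  results_dir <- file.path(project_dir, "results")
  rds_path <- file.path(results_dir, "NODE_results.rds")
  tsv_path <- file.path(results_dir, "NODE_results.tsv")

  if (file.exists(rds_path)) {
    return(readRDS(rds_path))
  }
  if (file.exists(tsv_path)) {
    return(read.delim(tsv_path, stringsAsFactors = FALSE))
  }
  stop("No NODE results found under ", results_dir)
}

# Usage, once a run has completed:
# previous_results <- load_node_results("./my_tf_analysis")
```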
library(NODE)
library(GenomicRanges)
library(TxDb.Hsapiens.GENCODE.v46.hg38)
library(org.Hs.eg.db)
# 1. Define inputs
my_tf_peaks <- GRanges(
  seqnames = c("chr1", "chr1"),
  ranges = IRanges::IRanges(start = c(100000, 250000), end = c(100500, 250500))
)
names(my_tf_peaks) <- c("peak_1", "peak_2")
my_gene_list <- c("ENSG00000123456", "ENSG00000654321")
hic_file <- "/path/to/your/loops.bedpe.gz"
chrom_file <- "/path/to/your/hg38.chrom.sizes"
# 2. Run NODE
node_results <- run_NODE(
  project_dir = "./my_tf_analysis",
  tfPeaks = my_tf_peaks,
  geneList = my_gene_list,
  hic_bedpe_file = hic_file,
  chrom_sizes_file = chrom_file,
  txdb = TxDb.Hsapiens.GENCODE.v46.hg38::TxDb.Hsapiens.GENCODE.v46.hg38,
  orgdb = org.Hs.eg.db::org.Hs.eg.db,
  hic_bins = 10000,
  orgdb_keytype = "ENSEMBLTRANS",
  orgdb_columns = c("ENSEMBL", "SYMBOL"),
  ensembl_column = "ENSEMBL",
  ucsc_genome_build = "hg38",
  batchtools_cores = 2,
  batchtools_mem_gb = 8,
  batchtools_chunks = 100
)
# 3. Explore results
head(node_results)

Common Issues & Solutions
- `tfPeaks` must be named: Ensure your `GRanges` object has unique names, e.g. `names(my_tf_peaks) <- paste0("peak_", seq_along(my_tf_peaks))`.
- Chromosome names do not match: Use consistent chromosome naming (e.g., `chr1`, `chr2`) across all input files and annotation objects.
- No paths found: This can happen if:
  - Genomic regions in your input files do not physically overlap.
  - `hic_bins` does not match the resolution of your Hi-C data.
  - Genome builds are mismatched between files.
  - Optional cis-regulatory filters are too restrictive.
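Two of these causes, a resolution mismatch and inconsistent chromosome naming, can be checked quickly before a run. The base-R sketch below is not part of NODE: `anchor_widths()` and `mismatched_chroms()` are hypothetical helpers, and it assumes that loop anchor width (`end - start`) is what `hic_bins` should equal.

```r
# Hypothetical helper: tabulate anchor widths in a six-column loop table.
# A single width equal to hic_bins suggests the resolution matches.
anchor_widths <- function(loops) {
  widths <- c(loops$end1 - loops$start1, loops$end2 - loops$start2)
  sort(unique(widths))
}

# Hypothetical helper: chromosome names used by the peaks but absent
# from the loop file (e.g. "1" versus "chr1").
mismatched_chroms <- function(peak_chroms, loop_chroms) {
  setdiff(unique(peak_chroms), unique(loop_chroms))
}

# Usage with the example row from the Hi-C file format section.
example_loops <- data.frame(
  chrom1 = "chr1", start1 = 10000L, end1 = 20000L,
  chrom2 = "chr1", start2 = 50000L, end2 = 60000L,
  stringsAsFactors = FALSE
)
hic_bins <- 10000
all(anchor_widths(example_loops) == hic_bins)
mismatched_chroms(c("1", "chr2"), c("chr1", "chr2"))
```

For a `GRanges` object of peaks, the chromosome names can be obtained with `as.character(GenomicRanges::seqnames(my_tf_peaks))` and passed as `peak_chroms`.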
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
