BioRxiv pic: Mixing features of transcription factors and genes enables accurate prediction of gene regulation relationships for unknown transcription factors

Published on 5 May 2025 at 11:19

Mixing features of transcription factors and genes enables accurate prediction of gene regulation relationships for unknown transcription factors

Risa Okubo, Takashi Morikura, Yusuke Hiki, Yuta Tokuoka, Tetsuya J. Kobayashi, Takahiro G. Yamada & Akira Funahashi

This is a very cool and timely approach to a long-standing challenge in regulatory genomics: predicting gene regulatory relationships for unknown or understudied transcription factors (TFs), a task where most existing models fall short.

What makes GReNIMJA, the model introduced in the paper, so innovative is its ability to generalize beyond known TFs. Unlike previous models that depend on known binding motifs or on supervised training with specific TFs, GReNIMJA can predict TF–gene regulatory interactions even for a TF that has never been experimentally profiled. It does this with a two-dimensional LSTM (2D LSTM) architecture that jointly processes the amino acid sequences of TFs and the nucleotide sequences of genes, effectively learning a language of regulatory compatibility.
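
To make "jointly processes" a little more concrete, here is a minimal sketch of a 2D recurrence over a TF-by-gene grid. It is not the authors' implementation: the function names, the scalar features, and the fixed weights are all hypothetical, and a plain tanh cell stands in for the gated LSTM cell to keep the example short.

```python
import math

def mix_2d(tf_feats, gene_feats, w_in=0.5, w_up=0.3, w_left=0.3):
    """Simplified 2D recurrence over a TF-by-gene grid (assumes non-empty inputs).

    Cell (i, j) combines TF feature i with gene feature j, plus the hidden
    states of the cell above (i-1, j) and the cell to the left (i, j-1).
    A real 2D LSTM uses gated memory cells and learned weight matrices;
    the scalar weights here are hypothetical placeholders.
    """
    rows, cols = len(tf_feats), len(gene_feats)
    h = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            x = tf_feats[i] * gene_feats[j]  # toy joint input for this position pair
            up = h[i - 1][j] if i > 0 else 0.0
            left = h[i][j - 1] if j > 0 else 0.0
            h[i][j] = math.tanh(w_in * x + w_up * up + w_left * left)
    # The last cell has received information from every (TF, gene) position pair.
    return h[-1][-1]

def predict_regulation(tf_feats, gene_feats):
    """Squash the final hidden state into a score between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-mix_2d(tf_feats, gene_feats)))
```

The point of the grid is that every TF position can interact with every gene position, which a model that encodes the two sequences separately would only capture at the very end.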

💡 Key strength

By learning similarities in sequence and regulatory context, GReNIMJA places unknown TFs in proximity to functionally similar known TFs. This allows surprisingly accurate predictions: 84.4% accuracy for known TFs and 68.5% for previously unseen ones.
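
As a rough illustration of what "proximity" means here, the sketch below finds the known TF whose feature vector is closest to that of an unseen TF. Cosine similarity is an assumed choice of measure, and the TF names and three-dimensional vectors are made up for the example; none of this comes from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_known_tf(query_vec, known):
    """Return (name, similarity) for the known TF most similar to query_vec."""
    return max(
        ((name, cosine(query_vec, vec)) for name, vec in known.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical embeddings; real learned features would have many more dimensions.
known_tfs = {
    "TF_A": [0.9, 0.1, 0.0],
    "TF_B": [0.0, 0.8, 0.6],
}
unseen_tf = [0.8, 0.2, 0.1]

name, sim = nearest_known_tf(unseen_tf, known_tfs)
print(name, round(sim, 3))
```

If an unseen TF lands next to a well-characterized one in this space, the model can borrow what it learned about the neighbor's targets.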

When paired with tools like DeepTFactor, which can identify novel TFs purely from protein sequence, GReNIMJA’s capabilities could be expanded to infer regulatory networks for completely novel transcription factors — potentially even in unexplored species or synthetic biology systems.

🧠 How It Works

GReNIMJA predicts gene regulation by mixing transcription factor and gene features using a 2D LSTM, as the schematic shows:

Figure: Schematic overview of the GReNIMJA architecture. TF amino acid sequences and gene nucleotide sequences are embedded using word2vec, passed through convolution and pooling layers, then mixed using a 2D LSTM module to predict regulatory relationships.
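
The front end of that pipeline (tokenize, embed, convolve, pool) can be sketched in a few lines. This is an illustration under assumptions: splitting sequences into overlapping k-mers is a common way to produce "words" for word2vec but is assumed here, and `toy_embed` is a hypothetical scalar stand-in for a trained word2vec lookup, which would return a vector per k-mer.

```python
def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenization)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def toy_embed(word):
    """Hypothetical stand-in for a word2vec lookup: fraction of G/C in the k-mer."""
    return sum(base in "GC" for base in word) / len(word)

def conv1d(values, kernel):
    """Valid 1D convolution (cross-correlation form) over a list of numbers."""
    n = len(kernel)
    return [
        sum(values[i + j] * kernel[j] for j in range(n))
        for i in range(len(values) - n + 1)
    ]

def max_pool(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is kept."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

def encode(seq, kernel=(1.0, -1.0)):
    """Tokenize, embed, convolve, and pool one nucleotide sequence."""
    embedded = [toy_embed(w) for w in kmers(seq)]
    return max_pool(conv1d(embedded, list(kernel)))
```

In the architecture described above, the TF and gene sequences would each pass through an encoder of this kind before the 2D LSTM mixes the two feature sequences.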

⚠️ Limitations

Despite its promise, GReNIMJA has some important limitations:

No biological context: It doesn’t consider 3D genome structure, chromatin accessibility (e.g., ATAC-seq), or TF expression, limiting its ability to infer cell-type–specific GRNs.
Promoter-only view: It focuses solely on the 1 kb promoter region upstream of each gene, ignoring distal enhancers and other regulatory elements.
No TF cooperativity: Co-factors and protein–protein interactions, which modulate binding and regulation, are not modeled.
No distinction of closely related TFs: The model clusters TFs based on sequence similarity and can struggle to distinguish homologous TFs that bind similar motifs but regulate different target genes in vivo.
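
To show how narrow the promoter-only view is, here is a small sketch of extracting a window upstream of a transcription start site (TSS). The function name, the 0-based coordinate convention, and the strand handling are assumptions for illustration; the post does not say how the paper defines its window beyond 1 kb upstream.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def upstream_promoter(chrom_seq, tss, strand, length=1000):
    """Return up to `length` bases immediately upstream of a TSS.

    `tss` is a 0-based index into `chrom_seq` (assumed convention).
    For '+' genes, upstream means lower coordinates. For '-' genes it means
    higher coordinates, and the window is reverse-complemented so that it
    reads 5'->3' toward the TSS. Windows are truncated at chromosome ends.
    """
    if strand == "+":
        return chrom_seq[max(0, tss - length):tss]
    if strand == "-":
        return reverse_complement(chrom_seq[tss + 1:tss + 1 + length])
    raise ValueError("strand must be '+' or '-'")
```

Anything outside this window, such as an enhancer tens of kilobases away, never reaches the model, which is why the promoter-only design cannot capture distal regulation.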

🔭 Final Thoughts

Despite these constraints, GReNIMJA is an exciting advance: it's robust, generalizable, and highly scalable. Its ability to make useful predictions for TFs with little or no data is particularly valuable. With future improvements such as integration of ATAC-seq, TF expression data, or enhancer annotations, GReNIMJA could evolve into a powerful platform for constructing context-aware gene regulatory networks. That would bring us closer to decoding the complex logic of gene control across diverse biological systems.
