Deep Learning Architectures from A Genomic Perspective: History

The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing.

  • deep learning
  • genomics
  • large language model
  • computer vision
  • multi-modal machine learning

1. Introduction

Various deep learning algorithms have distinct advantages for particular types of problems in genomic applications (see a comprehensive list in Table 1). For example, CNNs, famous for capturing features in image classification tasks, have been widely adopted to automatically learn local and global characterizations of genomic data. RNNs, which succeed in speech recognition problems, are skillful at handling sequential data and have therefore mostly been used for DNA sequences. Autoencoders are popular both for pre-training models and for denoising or pre-processing the input data. LLMs are known for their emergent capability to handle extremely long-range interactions in sequences. When designing deep learning models, researchers can take advantage of these merits to efficiently extract reliable features and reasonably model the underlying biological process. For example, with sufficient labeled data, traditional CNNs and RNNs can serve as solid baselines; when robust representations are needed for various downstream tasks, VAEs are a good starting point; and when the ability to cope with long input sequences is required, LLMs come into play.

2. Convolutional Neural Networks

Convolutional neural networks (CNNs) are among the most successful deep learning models for image processing owing to their outstanding capacity to analyze spatial information. Early applications of CNNs in genomics relied on the fundamental building blocks of computer-vision CNNs [79] to extract features. Zeng et al. [34] described the adaptation of CNNs from computer vision to genomics as treating a window of genomic sequence as an image.
The highlight of CNNs is their ability to perform adaptive feature extraction automatically during training. For instance, CNNs can be applied to discover meaningful recurring patterns with small variances, such as genomic sequence motifs. This makes CNNs well suited to motif identification and, in turn, binding classification [35].
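To make the analogy concrete, the sketch below shows one common way to encode a DNA window as a 4×L matrix, with one channel per base, so that a CNN can scan it like a single-channel image; the base-to-channel ordering and the all-zero handling of ambiguous bases are illustrative conventions, not choices taken from the cited works.

```python
import numpy as np

# Map each base to a channel; the A/C/G/T ordering here is an arbitrary convention.
BASE_TO_ROW = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(window: str) -> np.ndarray:
    """Encode a DNA window as a 4 x L matrix (one channel per base).

    Ambiguous bases such as 'N' are left as all-zero columns, which is one
    common convention; a uniform 0.25 fill is another.
    """
    encoding = np.zeros((4, len(window)), dtype=np.float32)
    for position, base in enumerate(window.upper()):
        row = BASE_TO_ROW.get(base)
        if row is not None:
            encoding[row, position] = 1.0
    return encoding

# Example: a 12-bp window becomes a 4 x 12 binary matrix that a 1D CNN can scan.
print(one_hot_encode("ACGTNACGTACG").shape)  # (4, 12)
```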
CNNs have recently taken the lead among current algorithms for several sequence-based problems. Alipanahi et al. [31] (DeepBind) and Zeng et al. [34] successfully applied CNNs to model the sequence specificity of protein binding. Zhou and Troyanskaya [32] (DeepSEA) developed a conventional three-layer CNN model to predict the effects of noncoding variants from genomic sequence alone. Kelley et al. [36] (Basset) adopted a similar architecture to study the functional activities of DNA sequences.
Although multiple researchers have demonstrated the superiority of CNNs over other existing methods, an inappropriately designed architecture can still perform worse than conventional models. For example, Zeng et al. [34] conducted a comprehensive analysis of CNNs of various architectures on the tasks of motif discovery and motif occupancy in genomic sequences, and they showed that although increasing the number of convolutional kernels generally improves performance, increasing the number of convolutional layers or choosing an inappropriate pooling method may leave performance unchanged or even degrade it. The remaining challenge is therefore for researchers to understand CNNs well enough to match an architecture to each particular task. To achieve this, researchers need an in-depth understanding of CNN architectures and must also take the biological background into consideration. Zeng et al. [34] developed a parameterized convolutional neural network to conduct a systematic exploration of CNNs on two classification tasks, motif discovery and motif occupancy. They performed a hyper-parameter search using Mri (https://github.com/Mri-monitoring/Mri-docs/blob/master/mriapp.rst (accessed on 14 September 2023)), mainly examining the performance of nine CNN variants, and concluded that CNNs do not need to be deep for motif discovery tasks as long as the structure is appropriately designed. When applying CNNs in genomics, simply changing the network depth does not account for much improvement in model performance. This is because deep learning models are usually over-parameterized, meaning the neural network has more parameters than are actually required to complete the task [80]. In this direction, Xuan et al. [44] designed a dual CNN with attention mechanisms to extract deeper and more complex feature representations of lncRNAs (long noncoding RNA genes), while Kelley et al. [43,45] took a different path, using dilated convolution instead of classical convolution to share information across long distances without adding depth indefinitely.
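As a minimal sketch of the kind of shallow CNN discussed above, the following PyTorch model classifies one-hot encoded DNA windows for a binary motif-occupancy task; the kernel count, kernel width, and sequence length are placeholder hyper-parameters and do not correspond to the configurations evaluated by Zeng et al. [34].

```python
import torch
import torch.nn as nn

class ShallowMotifCNN(nn.Module):
    """A shallow CNN for binary motif-occupancy classification.

    Input: one-hot DNA of shape (batch, 4, L). The single convolutional layer
    plays the role of a learned motif scanner; the sizes below are illustrative
    hyper-parameters, not those reported in the cited studies.
    """
    def __init__(self, num_kernels: int = 64, kernel_width: int = 19):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=4, out_channels=num_kernels, kernel_size=kernel_width)
        self.relu = nn.ReLU()
        # Global max pooling keeps only the best match of each "motif" along the sequence.
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.classifier = nn.Linear(num_kernels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pool(self.relu(self.conv(x))).squeeze(-1)  # (batch, num_kernels)
        return self.classifier(h)  # raw logit; apply a sigmoid for a probability

model = ShallowMotifCNN()
dummy = torch.zeros(8, 4, 200)  # batch of 8 one-hot encoded 200-bp windows
print(model(dummy).shape)  # torch.Size([8, 1])
```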

3. Recurrent Neural Networks

Recurrent neural networks (RNNs) raised a surge of interest owing to their impressive performance on sequential prediction problems such as language translation, summarization, and speech recognition. RNNs outperform CNNs and other early deep neural networks (DNNs) on sequential data thanks to their capability of processing long ordered sequences and memorizing long-range information through recurrent loops. Specifically, an RNN scans the input sequence sequentially and feeds both the previous hidden state and the current input segment into the model, so that the final output implicitly integrates both current and past information in the sequence. Schuster and Paliwal [81] later proposed the bidirectional RNN (BRNN) for use cases where both past and future contexts of the input matter.
The cyclic structure makes a seemingly shallow RNN effectively very deep when unrolled in time over a long prediction horizon. To resolve the resulting vanishing gradient problem, Hochreiter and Schmidhuber [20] substituted LSTM units for the hidden units in RNNs to truncate gradient propagation. Cho et al. [82] later introduced Gated Recurrent Units (GRUs) with a similar motivation.
Genomic data are typically sequential and are often regarded as a biological language, so recurrent models are applicable in many scenarios. For example, Cao et al. [50] (ProLanGO) built an LSTM-based neural machine translation model, which converts protein function prediction into a language translation problem by interpreting protein sequences as the language of Gene Ontology terms. Boža et al. [52] developed DeepNano for base calling; Quang and Xie [49] proposed DanQ to quantify the function of noncoding DNA; Sønderby et al. [48] devised a convolutional LSTM to predict protein subcellular localization from protein sequences; Busia et al. [73] applied the idea of sequence-to-sequence learning to protein secondary structure prediction conditioned on previously predicted labels; and Wang et al. [83] used a bidirectional LSTM (Bi-LSTM) in their prPred-DRLF predictor for plant resistance protein detection, demonstrating effective crossovers between natural language processing (NLP) and genomics [84]. Furthermore, sequence-to-sequence learning for genomics is boosted by attention mechanisms: Singh et al. [53] introduced an attention-based approach in which a hierarchy of multiple LSTM modules encodes input signals and models how various chromatin marks cooperate; similarly, Shen et al. [85] used an LSTM as a feature extractor and attention modules as importance scoring functions to identify regions of an RNA sequence that bind to proteins.
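As a minimal sketch of a recurrent model for biological sequences, the following Bi-LSTM classifies tokenized sequences (nucleotides or amino acids mapped to integer ids); the vocabulary size, embedding width, hidden size, and mean-pooling readout are illustrative assumptions rather than the architecture of any predictor cited above.

```python
import torch
import torch.nn as nn

class BiLSTMSequenceClassifier(nn.Module):
    """A minimal bidirectional LSTM classifier over tokenized biological sequences.

    Tokens could be nucleotides or amino acids mapped to integer ids; the sizes
    below are placeholders, not parameters taken from any of the cited predictors.
    """
    def __init__(self, vocab_size: int = 25, embed_dim: int = 32,
                 hidden_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embed(token_ids)      # (batch, length, embed_dim)
        outputs, _ = self.bilstm(embedded)    # (batch, length, 2 * hidden_dim)
        # Mean-pool over the sequence so past and future context both contribute.
        pooled = outputs.mean(dim=1)
        return self.classifier(pooled)        # logits over classes

model = BiLSTMSequenceClassifier()
dummy = torch.randint(1, 25, (4, 120))  # 4 sequences of 120 tokens
print(model(dummy).shape)  # torch.Size([4, 2])
```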

4. Autoencoders

Autoencoders, conventionally used as pre-processing tools to initialize network weights, have been extended to stacked autoencoders (SAEs; [86]), denoising autoencoders (DAs; [87]), contractive autoencoders (CAEs; [88]), etc. They have since proved successful in feature extraction because they can learn a compact representation of the input through the encode–decode procedure. For example, Gupta et al. [89] applied stacked denoising autoencoders (SDAs) to gene clustering tasks, extracting features by forcing the learned representation to be robust to partial corruption of the raw input. More examples can be found in Section 4.1.1. Autoencoders are also used for dimension reduction in gene expression, e.g., [90,91,92]. When applying autoencoders, one should be aware that better reconstruction accuracy does not necessarily lead to model improvement [93].
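A minimal sketch of a denoising autoencoder in the spirit described above, assuming expression-like input vectors; the layer widths, masking-noise corruption, and mean-squared-error reconstruction loss are illustrative choices, not the setup of Gupta et al. [89].

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """A minimal denoising autoencoder for expression-like vectors.

    The input is partially corrupted (here by randomly zeroing entries) and the
    network is trained to reconstruct the clean vector; the layer widths and the
    corruption rate are illustrative, not values from the cited studies.
    """
    def __init__(self, input_dim: int = 2000, hidden_dim: int = 128, corruption: float = 0.2):
        super().__init__()
        self.corruption = corruption
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Masking noise: drop a random subset of input features.
            mask = (torch.rand_like(x) > self.corruption).float()
            x = x * mask
        return self.decoder(self.encoder(x))

model = DenoisingAutoencoder()
clean = torch.randn(16, 2000)                        # e.g., 16 expression profiles
loss = nn.functional.mse_loss(model(clean), clean)   # reconstruct the uncorrupted input
```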
Variational autoencoders (VAEs), though named “autoencoders”, were developed rather as an approximate-inference method for modeling latent variables. Building on the autoencoder structure, Kingma and Welling [94] introduced stochasticity in the encoded units and added a penalty term encouraging the latent variables to produce a valid decoding. VAEs target problems in which each datum has a corresponding latent representation and are thus useful for genomic data, which exhibit complex interdependencies. Rampasek and Goldenberg [93] presented a two-step VAE-based model for drug response prediction, which first predicts the post-treatment state from the pre-treatment state in an unsupervised manner and then extends this to the final semi-supervised prediction. The model was based on data from Genomics of Drug Sensitivity in Cancer (GDSC; [95]) and the Cancer Cell Line Encyclopedia (CCLE; [96]). VAEs have also been used in many other genomic applications, including cancer gene expression prediction [54,97], single-cell feature extraction for unmasking tumor heterogeneity [56], metagenomic binning [57], DNA methylome dataset construction [55], etc.
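The sketch below illustrates the two ingredients that distinguish a VAE from a plain autoencoder: a stochastic latent layer sampled via the reparameterization trick and a KL penalty that regularizes the latent code. The dimensions and the Gaussian reconstruction loss are placeholder assumptions, not the models used in the cited applications.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """A minimal variational autoencoder illustrating the stochastic latent layer.

    The encoder outputs a mean and log-variance per latent dimension; the
    reparameterization trick samples z = mu + sigma * eps so gradients flow
    through the sampling step. All sizes below are placeholders.
    """
    def __init__(self, input_dim: int = 2000, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term plus the KL penalty that keeps the latent code decodable.
    recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```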

5. Emergent Deep Architectures

As deep learning continues to show success in genomics, researchers now expect more of it than simply outperforming statistical or classical machine learning methods. To this end, the vast majority of current work approaches genomic problems with architectures that go beyond the classic deep models or with hybrid models. Here, we review some examples of recently proposed deep architectures that skillfully modify or combine classical deep learning models.

5.1. Beyond Classic Models

Most of these emergent architectures are natural designs modified from classic deep learning models. Researchers have begun to leverage more genomic intuition to fit each particular problem with a more advanced and suitable model.
Motivated by the fact that protein folding is a progressive refinement [98] rather than an instantaneous process, Lena et al. [99] designed DST-NNs for residue–residue contact prediction. The model consists of a 3D stack of neural networks whose topological structures (input, hidden, and output layer sizes) are identical at each level. Each level of this stacked network can be regarded as a distinct contact predictor and can be trained in a supervised manner to refine the predictions of the previous level, hence addressing the typical problem of vanishing gradients in deep architectures. The spatial features in this deep spatiotemporal architecture refer to the original model inputs, while the temporal features are gradually altered as they progress to the upper layers. Angermueller et al. [100] (DeepCpG) took advantage of two CNN sub-models and a fusion module to predict DNA methylation states. The two CNN sub-models take different inputs and thus serve disparate purposes: the CpG module accounts for correlations between CpG sites within and across cells, while the DNA module detects informative sequence patterns (motifs). The fusion module then integrates the higher-level features derived from the two low-level modules to make predictions. Instead of subtle modifications or combinations, some works focused on depth, trying to improve model performance by designing even deeper architectures. Wang et al. [101] developed an ultra-deep neural network consisting of two deep residual networks to predict protein contacts from a sequence of amino acids. Each of the two residual nets has its particular function. A series of 1D convolutional transformations extracts sequential features (e.g., sequence profile, predicted secondary structure, and solvent accessibility). The 1D output is converted to a 2D matrix by an operation similar to an outer product and merged with pairwise features (e.g., pairwise contact, co-evolution information, and distance potential). These are then fed together into the second residual network, which consists of a series of 2D convolutional transformations. The combination of these two disparate residual nets creates a novel approach that can integrate sequential and pairwise features in one model.
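To illustrate the 1D-to-2D conversion step mentioned above, the sketch below lifts per-residue features to a pairwise feature map by concatenating the feature vectors of every residue pair, one common variant of an outer-product-like operation; it is not necessarily the exact operation used in the cited model.

```python
import torch

def outer_concatenation(seq_features: torch.Tensor) -> torch.Tensor:
    """Lift per-position features (batch, L, d) to pairwise features (batch, L, L, 2d).

    For every residue pair (i, j), the feature vectors of i and j are concatenated,
    giving a 2D map that 2D convolutions can refine together with pairwise inputs
    such as co-evolution signals.
    """
    batch, length, dim = seq_features.shape
    rows = seq_features.unsqueeze(2).expand(batch, length, length, dim)  # features of i
    cols = seq_features.unsqueeze(1).expand(batch, length, length, dim)  # features of j
    return torch.cat([rows, cols], dim=-1)

pairwise = outer_concatenation(torch.randn(1, 100, 16))
print(pairwise.shape)  # torch.Size([1, 100, 100, 32])
```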

5.2. Hybrid Architectures

The fact that each type of DNN has its own strengths inspires researchers to develop hybrid architectures that can exploit the potential of multiple deep learning architectures. DanQ [49] is a hybrid convolutional and recurrent DNN for predicting the function of noncoding DNA directly from sequence alone. A DNA sequence is input as a one-hot representation of the four bases to a simple convolutional neural network whose purpose is to scan for motif sites. Motivated by the fact that motifs can be determined to some extent by the spatial arrangements and frequencies of combinations of DNA sequences [49], the putative motifs learned by the CNN are then fed into a Bi-LSTM. Similar convolutional-recurrent designs were further discussed by Lanchantin et al. [58] (Deep GDashboard). They demonstrated how to understand three deep architectures (convolutional, recurrent, and convolutional-recurrent networks) and verified the validity of the features generated automatically by the models through visualization techniques. They argued that a CNN–RNN architecture outperforms a CNN or RNN alone based on their experimental results on a transcription factor binding site (TFBS) classification task, and the feature visualization achieved by Deep GDashboard indicated that the CNN–RNN architecture can model both motifs and the dependencies among them. Sønderby et al. [48] added a convolutional layer between the raw data and the LSTM input to address the problem of protein sorting, or subcellular localization. In total, three types of models are proposed and compared in the paper: a vanilla LSTM, an LSTM with an attention model applied to a hidden layer, and an ensemble of ten vanilla LSTMs. They achieved higher accuracy than previous benchmark models in predicting the subcellular location of proteins from protein sequences without involving any human-engineered features. Almagro Armenteros et al. [60] proposed a hybrid integration of an RNN, a Bi-LSTM, an attention mechanism, and a fully connected layer for protein subcellular localization prediction, with each of the four modules designed for a specific purpose. Such hybrid models are increasingly favored by recent research, e.g., [59].
Hybrid architectures allow flexible network design by selecting specific components with proven success in representing different types of information in genomic sequences. For example, in both [49,62], CNN layers are included to generate representations of local patterns such as regulatory motifs in DNA sequences, while RNN and attention modules encode information about long-range dependencies. Although hybrid architectures built on existing successful models have been shown to improve performance over single architectures, there is still no systematic principle or algorithm for designing, let alone optimizing, network architectures for deep learning models in genomics.
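A minimal sketch of such a hybrid design, in the spirit of DanQ: a convolutional motif scanner followed by a Bi-LSTM that models dependencies among the detected motifs. The kernel count, pooling width, hidden size, and number of output targets are placeholders, not the published DanQ configuration.

```python
import torch
import torch.nn as nn

class HybridCNNBiLSTM(nn.Module):
    """A DanQ-style hybrid: a convolutional motif scanner followed by a Bi-LSTM.

    The convolution captures local motif-like patterns, max pooling thins the
    signal, and the bidirectional LSTM models dependencies among the detected
    motifs. All sizes below are illustrative placeholders.
    """
    def __init__(self, num_kernels: int = 128, kernel_width: int = 15,
                 hidden_dim: int = 64, num_targets: int = 10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, num_kernels, kernel_width),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=4),
        )
        self.bilstm = nn.LSTM(num_kernels, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, num_targets)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)                    # (batch, num_kernels, reduced_length)
        h = h.transpose(1, 2)               # LSTM expects (batch, time, features)
        out, _ = self.bilstm(h)
        return self.head(out.mean(dim=1))   # one logit per functional target

model = HybridCNNBiLSTM()
print(model(torch.zeros(2, 4, 1000)).shape)  # torch.Size([2, 10])
```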

6. Transformer-Based Large Language Models

As mentioned, many prior deep learning works applied CNNs and RNNs to genomics tasks. However, these two architectures have several intrinsic limitations. (1) CNNs may fail to capture a global understanding of a long DNA sequence due to their limited receptive fields. (2) RNNs can have difficulty capturing useful long-term dependencies because of vanishing gradients and suffer from low efficiency due to their non-parallel sequence processing. (3) Both architectures need extensive high-quality labeled data for training. These limitations hinder them from coping with harder genomics problems, since such tasks usually require the model to (1) understand long-range interactions, (2) process very long sequences efficiently, and (3) perform well even with scarce training labels.
Transformer-based [21] language models such as BERT [102] and the GPT family [22,23,24] are therefore a natural fit to overcome these limitations. Their built-in attention mechanism, with its larger receptive field, learns better representations that can be generalized to data-scarce tasks. Ref. [103] found that a pre-trained large DNA language model can make accurate zero-shot predictions of noncoding variant effects. Similarly, according to [104], these language model architectures generate robust contextualized embeddings on top of nucleotide sequences and achieve accurate molecular phenotype prediction even in low-data settings.
Instead of processing input tokens one by one sequentially as RNNs do, transformers process all input tokens in parallel, which is far more efficient. However, simply increasing the input context window indefinitely is infeasible, since computation time and memory in the attention layers scale quadratically with context length. Several improvements have been made from different perspectives: Nguyen et al. [70] use the Hyena architecture [105], which scales sub-quadratically in context length, while Zhou et al. [68] replace the k-mer tokenization used by Ji et al. [63] with Byte Pair Encoding (BPE) to achieve a 3× efficiency improvement.
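For concreteness, the sketch below shows overlapping k-mer tokenization of a DNA sequence, the scheme used by k-mer-based DNA language models; k = 6 is a common but illustrative choice. Because overlapping k-mers produce roughly one token per base, the token count grows with sequence length, which is one motivation for learned subword vocabularies such as BPE.

```python
def kmer_tokenize(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers.

    Each position contributes one token, so a sequence of length L yields
    L - k + 1 tokens; k = 6 is a common but illustrative choice.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmer_tokenize("ACGTAGCA", k=6))  # ['ACGTAG', 'CGTAGC', 'GTAGCA']
```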
To deal with extremely long-range interactions in DNA sequences, the Enformer model [66] employs transformer modules that enlarge the receptive field to five times that of previous CNN-based approaches [43,45,106], and it is capable of detecting sequence elements 100 kb away. Moreover, the recent success of ChatGPT [107] and GPT-4 [108] further illustrated the emergent capabilities of large language models (LLMs) in dealing with such long sequences. A typical transformer-based genomics foundation model can only take 512 to 4k tokens as input context, which is less than 0.001% of the human genome. Nguyen et al. [70] proposed an LLM-based genomic model that expands the input context length to 1 million tokens at the single-nucleotide level, up to a 500× increase over previous dense attention-based models.
Even with all these efficiency improvements, the training and serving costs of LLMs remain a significant challenge [109], especially for the long input contexts required by genomics problems. Furthermore, due to privacy concerns and legal regulations, generating and collecting large-scale, high-quality genomics data usually requires complex procedures, which can slow down the iteration of model development.

This entry is adapted from the peer-reviewed paper 10.3390/ijms242115858
