1. Introduction
Cancer is a genetic disease that involves perturbation of gene regulatory networks (GRNs) caused by various mechanisms, such as copy number alteration, abnormal methylation status, abnormal protein configuration, and post-transcriptional dysregulation [1][2][3][4]. Although driver gene mutation information is crucial for the estimation of the genetic etiology of cancer, it is becoming increasingly evident that many genes are involved in cancer pathophysiology, which appears to disrupt GRNs [5]. In this context, the identification of information regarding gene regulation in cancer tissues is expected to provide invaluable information for the development of anticancer agents or cancer management strategies.
Previously, efforts to identify GRNs focused on only a small number of genes [6]. Currently, with the advent of high-throughput technologies, such as microarrays and next-generation sequencing (NGS), the expression of tens of thousands of genes can be examined simultaneously [7][8]. A microarray is constructed on a slide on which probes are fixed; these probes hybridize with complementary DNAs representing the whole set of messenger RNAs (mRNAs) from one type of cell. The mRNAs obtained from sample tissues are tagged with fluorescent dyes so that a scanner can read variations in the fluorescence signal as differences in mRNA abundance. After the signals are scanned, gene expression values are summarized into a matrix of expression profiles in which rows and columns indicate genes and samples, respectively. Instead of relying on hybridization, NGS platforms determine the amount of mRNA through direct sequencing.
After the raw expression signals are generated, adjustment steps are required to correct experimental bias, a process known as normalization. Most omics technologies need such normalization steps to avoid artifacts that are not relevant to the underlying biology. The normalization of omics data is beyond the scope of this review, and excellent reviews regarding this issue are available [9][10]. Once normalization has removed the bias arising from the experiment or sample preparation, an estimation process for the identification of GRNs can be applied. Here, the concept and implications of the GRN need to be clarified. In many studies, GRN refers to the regulatory relationships between genes that are estimated from experimental data. However, the term regulation is somewhat ambiguous, because one gene can affect another gene (or genes) in various ways. For example, the protein encoded by one gene can act as a transcription factor (TF) for another gene [11], and a microRNA gene can affect the expression of other genes through the microRNA it encodes [12]. A more sophisticated mechanism underlies the epigenomic control of gene expression via changes in methylation and various histone modifications [13]. Most GRNs, however, are inferred from transcriptomics data alone. Therefore, it is always possible that a GRN inferred from transcriptomics data reflects statistical relationships between genes rather than biological relationships that can be explained. This possibility should be considered in the interpretation of GRNs estimated from gene expression data: even if a positive relationship between genes appears in such a GRN, the biological mechanism underlying the relationship may not be explainable with current knowledge. For validation or confirmation of the regulatory relationships, other omics data containing epigenomic or protein expression information, or independent experimental procedures such as gene perturbation, are required.
2. Evaluation of Inference Methods for Estimation of GRNs and Their Results
2.1. Comparative Analysis Using the Simulation Method
GRNs can include tens of thousands of genes, and a comparative analysis of GRN inference methods needs a systematic approach if the comparison is to be unbiased. Probably the best practical solution is the use of simulation data in which the true regulatory relationships are already known. In fact, most studies that develop novel methods use simulation data to compare the newly developed methods with preexisting ones. Whereas ad hoc simulation models have been applied independently across studies, there are also computational tools that generate simulation data using statistical or mathematical models of gene expression. These tools fall into two categories. In the first category, differential equation models are used, and the data are generated based on predefined parameters [14][15][16][17][18]. GeneNetWeaver and Stochastic Gene Networks Simulator are tools that use differential equation models [14][16], and NetBenchmark provides several different simulation tools for benchmark analysis [15]. In the second category, specific statistical models are used to generate the simulation data. For example, the GeneNet package implements a GGM-based function with which users can generate data using their own parameters, including the number of genes, the number of samples, and the regulatory relationships between specific genes [19]. In many studies, simulation data were generated using a GGM or BN model, and several parameters were combined to generate simulation data with different characteristics. SynTReN simulates gene expression data from different types of network topology models [20].
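The idea of benchmarking against simulated data with known regulatory relationships can be sketched in a few lines. The example below is a minimal illustration only and does not reproduce any of the tools cited above: it assumes a linear-Gaussian network (a simple BN-like model), and the function names `simulate_expression`, `pearson`, and `rank_pairs`, the edge weights, and the sample size are all hypothetical choices made for this sketch.

```python
import random


def simulate_expression(edges, n_genes, n_samples, noise_sd=1.0, seed=0):
    """Simulate a genes x samples matrix from a linear-Gaussian network.

    `edges` maps (regulator, target) -> weight. Regulators must have a
    smaller index than their targets so genes can be generated in order.
    """
    rng = random.Random(seed)
    parents = {g: [] for g in range(n_genes)}
    for (reg, tgt), weight in edges.items():
        if reg >= tgt:
            raise ValueError("regulator index must be smaller than target index")
        parents[tgt].append((reg, weight))

    matrix = [[0.0] * n_samples for _ in range(n_genes)]
    for s in range(n_samples):
        for g in range(n_genes):
            # Each gene is a weighted sum of its regulators plus Gaussian noise.
            signal = sum(w * matrix[p][s] for p, w in parents[g])
            matrix[g][s] = signal + rng.gauss(0.0, noise_sd)
    return matrix


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def rank_pairs(matrix):
    """Score every gene pair by absolute correlation, strongest first."""
    n = len(matrix)
    scores = [((i, j), abs(pearson(matrix[i], matrix[j])))
              for i in range(n) for j in range(i + 1, n)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical ground truth: a chain 0 -> 1 -> 2 and a separate edge 3 -> 4.
true_edges = {(0, 1): 0.9, (1, 2): -0.8, (3, 4): 0.9}
expr = simulate_expression(true_edges, n_genes=6, n_samples=1000)
ranked = rank_pairs(expr)

for pair, score in ranked[:5]:
    label = "true edge" if pair in true_edges else "not a direct edge"
    print(pair, round(score, 3), label)
```

Because the true edges are known, any inference method can be scored on the same data. In this toy setting the indirectly connected pair (0, 2) is also expected to show a sizeable correlation, which illustrates why a pairwise measure alone cannot separate direct from indirect relationships.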
2.2. Application of Prior Biological Knowledge Obtained from External Databases
For faster and more efficient validation of results from GRN analysis, evidence from databases can be applied; one of the most frequently used databases is the GO database. GO is a nomenclature system that provides information about the functions of genes [21]. A GO term is mapped to multiple genes that are associated with the same biological processes, molecular functions, or cellular locations. GO is frequently used for the validation of co-expression modules that are defined using expression similarities between genes. If the genes in the module have similar functions, then certain GO terms linked to the genes are overrepresented in comparison to what would be attributable to random chance alone. The test of significant overrepresentation of GO terms is called an enrichment test, which uses chi-square or hypergeometric distributions for testing the significance of the enrichment of GO terms in the module. The enrichment test can be applied to the validation of a GRN, especially when a gene is linked to multiple genes or to a group of genes interconnected by the regulatory relationship in the estimated GRN. As it is assumed that genes involved in similar biological processes tend to be linked to each other in a GRN [22], the application of an enrichment test can be an efficient strategy for the computational evaluation of a GRN.
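The hypergeometric form of the enrichment test described above can be written out directly. The sketch below is a minimal illustration, not the procedure of any specific enrichment tool: the function name `hypergeom_enrichment_p` and the gene counts in the example are hypothetical, and a real analysis would also correct for testing many GO terms at once.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_module, n_overlap):
    """One-sided enrichment p-value, P(X >= n_overlap).

    n_universe  : number of genes considered in total
    n_annotated : genes in the universe annotated with the GO term
    n_module    : genes in the module (or GRN neighbourhood) being tested
    n_overlap   : module genes that carry the GO term
    """
    upper = min(n_annotated, n_module)
    total = comb(n_universe, n_module)
    # Sum the hypergeometric probabilities of observing n_overlap or more hits.
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 20 genes in total, 5 carry the term,
# and 3 of the 4 genes in the module carry it.
p_value = hypergeom_enrichment_p(n_universe=20, n_annotated=5,
                                 n_module=4, n_overlap=3)
print(p_value)  # 155/4845, about 0.032
```

A small p-value indicates that the GO term occurs in the module more often than expected if module membership were unrelated to the annotation.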
Information about intergene regulations from pathway databases, such as KEGG [23], WikiPathways [24], Small Molecule Pathway Database [25], PathBank [26], Pathway Commons [27], Reactome [28], GeneMANIA [29], and MSigDB [30], provides another platform for the evaluation of estimated GRNs. These databases collect information through manual curation of publication data and/or data analysis of open functional genomics data, and they contain regulatory relationships between genes. This information is applied to the validation of GRNs as gold-standard relationships. In cancer, however, it is possible that aberrant genetic circuit information is not included in the pathway databases. Fortunately, information related to cancer-specific pathways is available in KEGG, Reactome, WikiPathways, and MSigDB. The possible absence of cancer-specific relationships should be considered when the estimated GRN is evaluated using knowledge obtained from external biological databases.
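Once a set of gold-standard relationships has been extracted from such a database, the comparison with an estimated GRN reduces to set arithmetic. The sketch below is one simple way to do this under stated assumptions: edges are treated as undirected, the helper names `normalize_edges` and `evaluate_grn` are hypothetical, and the edge lists are toy examples written for illustration rather than records taken from any of the databases named above.

```python
def normalize_edges(edges):
    """Treat edges as undirected and drop self-loops and duplicates."""
    return {frozenset(edge) for edge in edges if edge[0] != edge[1]}


def evaluate_grn(predicted, gold):
    """Return precision, recall, and F1 of predicted edges against a gold set."""
    pred = normalize_edges(predicted)
    ref = normalize_edges(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy edge lists for illustration only (not database content).
predicted_edges = [
    ("TP53", "MDM2"),
    ("MDM2", "TP53"),   # same undirected edge, counted once
    ("MYC", "MAX"),
    ("EGFR", "KRAS"),
]
gold_edges = [
    ("TP53", "MDM2"),
    ("MYC", "MAX"),
    ("TP53", "CDKN1A"),
    ("E2F1", "RB1"),
]

metrics = evaluate_grn(predicted_edges, gold_edges)
print(metrics)
```

A predicted edge that is missing from the gold standard counts as a false positive here, so low precision in a cancer dataset may reflect incomplete database coverage of aberrant regulation rather than an inference error.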
Other databases used in the validation of GRNs provide biological knowledge about TFs and their binding sites and about the epigenomic control of gene expression. TRANSFAC is one of the most popular databases for information regarding TFs and their binding sites [31]. As mentioned earlier, the database is frequently applied for the supervised estimation of GRNs. The JASPAR database has a similar functionality to TRANSFAC, but its data content is limited to six taxonomic groups [32]. Other databases such as hTFtarget [33], TRRUST [34], and TF2DNA [35] provide similar biological knowledge for humans and/or mice. In addition to the TF databases, integrated databases combining heterogeneous information about gene regulation are available. For example, the European Bioinformatics Institute maintains a database of information regarding gene regulation as part of the Ensembl database system [36]. RegNetwork includes TF–microRNA relationships [37], and the Gene Transcription Regulation Database integrates processed epigenomic data from DNase I hypersensitive site sequencing and chromatin immunoprecipitation sequencing (ChIP-seq) [38]. The dbCoRC database provides information about super-enhancers that are predicted using H3K27ac ChIP-seq data [39].
2.3. Experimental Technologies for the Validation of GRNs
As the human genome contains tens of thousands of genes, experimental validation of the regulatory relationships between all genes is hard to accomplish. Even if fewer genes are selected, it is still difficult to validate the regulatory relationships between several genes simultaneously. For experimental validation, genes that show the most significant results or high connectivity (e.g., measured using CC, PCC, or any other metrics) are therefore selected and analyzed in experiments, typically gene perturbation experiments [40]. The basic assumption of such validation is that genes whose expression changes after the perturbation of a certain gene have a regulatory relationship with that gene.
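The assumption behind perturbation-based validation can be expressed as a simple readout: compare the expression of each putative target between control and perturbed samples and keep the genes whose expression shifts. The sketch below is a simplified illustration under stated assumptions: the function name `responding_genes`, the fold-change threshold, the pseudocount, and the expression values are all hypothetical, and a real analysis would add a statistical test over replicates.

```python
from math import log2
from statistics import mean


def responding_genes(control, perturbed, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes that respond to a perturbation.

    `control` and `perturbed` map gene names to lists of replicate
    expression values on a linear (non-log) scale.
    """
    responders = {}
    for gene, control_values in control.items():
        if gene not in perturbed:
            continue
        # The pseudocount avoids division by zero for unexpressed genes.
        ratio = (mean(perturbed[gene]) + pseudocount) / (mean(control_values) + pseudocount)
        log2fc = log2(ratio)
        if abs(log2fc) >= min_abs_log2fc:
            responders[gene] = log2fc
    return responders


# Hypothetical measurements for three putative targets of a perturbed gene.
control_expr = {
    "target_a": [100, 120, 110],
    "target_b": [50, 55, 45],
    "target_c": [200, 210, 190],
}
perturbed_expr = {
    "target_a": [30, 25, 35],      # decreases after perturbation
    "target_b": [52, 48, 50],      # essentially unchanged
    "target_c": [800, 850, 750],   # increases after perturbation
}

hits = responding_genes(control_expr, perturbed_expr)
print(hits)
```

Genes returned by this filter are candidates for a regulatory relationship with the perturbed gene, although an expression change by itself does not show whether the effect is direct or mediated by intermediate genes.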
Overexpressing one gene and observing the other genes that are estimated to have regulatory relationships with it is a widely used method for the validation of regulatory relationships. DNA transfection followed by polymerase chain reaction experiments that measure changes in gene expression allows researchers to decide whether a gene is indeed related to the expression changes of the genes it is thought to regulate. Using siRNA is another method for the perturbation of gene expression [41]. The length of siRNA ranges from 21 to 23 nucleotides, and it is easy to use compared with the previously mentioned transfection method. If an siRNA acts on one gene, then the expression of other genes that have regulatory relationships with that gene tends to change. Recently, clustered regularly interspaced short palindromic repeats (CRISPR) technology has been applied to gene perturbation experiments. CRISPR technology disrupts or represses the expression of a target gene, which is recognized through complementary base-pairing with a single guide RNA. Similar to siRNA experiments, regulatory relationships between genes can be identified by gene perturbation using CRISPR technology [42].
3. Conclusions
The most widely used pairwise measures are CC and MI, and their results are interpreted with GO in many cases. Because significant co-expression does not guarantee a regulatory relationship, such results should be validated with other information. The GGM, BN, and MRF are frequently applied as multivariate measures. Although these models rest on different concepts, they all take multiple variables into account when determining the relationship between two variables, which makes inference of regulatory relationships possible. The supervised approach applies linear regression or classification models, in which biological knowledge is integrated either through weights in the regression model or as input to a classification model that decides whether genes have a regulatory relationship.
Considering the methodologies used in the publications, several issues should be considered in the GRN analysis of cancer transcriptomics data. First, although several GRN inference methods have been developed, few are designed for identifying cancer-specific GRNs. Most of the methods are applicable to transcriptomics data regardless of whether the data are generated from cancerous or non-cancerous samples. As the rewiring of GRNs is a characteristic feature of cancer, the deviation from normal GRNs should be estimated in studies on GRNs that use cancer transcriptomics data. Second, because canonical knowledge of gene regulatory interactions may not be valid in cancer, cancer-specific regulatory evidence needs to be collected in databases, together with a testing procedure for deciding whether an estimated GRN is valid given the information in those databases. Third, even cancers of the same type or stage can be heterogeneous, which indicates the need for methods that infer heterogeneous GRNs from cancer transcriptomics data. Finally, more research on integrative methods for multiomics data is required for accurate estimation of GRNs in cancer.