Version	Summary	Created by	Modification	Content Size	Created at	Operation
1		Ioannis Michalopoulos	--	5250	2022-08-18 11:38:24	\|
2	Reference format revised.	Lindsay Dong	+ 566 word(s)	5816	2022-08-22 03:38:57	\| \|
3	add a new keyword	Lindsay Dong	+ 1 word(s)	5817	2022-08-22 03:40:20	\|

Zogopoulos, V.L.; Saxami, G.; Malatras, A.; Papadopoulos, K.; Tsotra, I.; Iconomidou, V.A.; Michalopoulos, I. Approaches in Gene Coexpression Analysis in Eukaryotes. Encyclopedia. Available online: https://encyclopedia.pub/entry/26276 (accessed on 07 February 2026).

Zogopoulos, V.L., Saxami, G., Malatras, A., Papadopoulos, K., Tsotra, I., Iconomidou, V.A., & Michalopoulos, I. (2022, August 18). Approaches in Gene Coexpression Analysis in Eukaryotes. In Encyclopedia. https://encyclopedia.pub/entry/26276

Approaches in Gene Coexpression Analysis in Eukaryotes

This entry is adapted from the peer-reviewed paper 10.3390/biology11071019

Gene coexpression analysis refers to the en masse discovery of coexpressed genes from a large variety of transcriptomic experiments. Gene coexpression networks (GCNs), the type of biological network used to study gene coexpression, consist of an undirected graph depicting genes and their coexpression relationships. Coexpressed genes are clustered into smaller subnetworks, the predominant biological roles of which can be determined through enrichment analysis. By studying well-annotated gene partners, new roles can be attributed to genes of unknown function, or their participation in common metabolic pathways can be inferred, through a guilt-by-association approach.

gene coexpression networks transcriptomics eukaryotes

1. Introduction

The development of high-throughput technologies ^[1] aided the discovery of biological networks which provide insights into the understanding of system properties ^[2]^[3]^[4]. An earlier classification ^[5] divided biological networks into four groups:

Protein–protein interaction (PPI) networks ^[6] describe the associations, either through physical contact or common pathway participation, between two or more proteins;
Gene regulatory networks (GRNs) ^[7] depict the causal interactions between regulators and their target genes;
Signal transduction networks ^[8] contain information on the interactions between biochemical signalling molecules and cell receptors;
Metabolic and biochemical networks ^[9] display all metabolic reactions and molecules involved in biological pathways.

Due to the recent accumulation of large amounts of transcriptomic data through microarray and RNA-Seq technologies, an additional group of biological networks has emerged ^[10]^[11]: Gene coexpression networks (GCNs) ^[12] allow the study of the coexpression patterns of multiple genes in different biological conditions.

Gene coexpression networks depict the degree of similarity between the expression profiles of all genes, in a particular set of biological samples that may derive from different tissues, developmental stages, or environmental conditions, to reach conclusions far beyond the scopes of the individual studies the samples have come from. The underlying basis of gene coexpression analysis is that coexpressed genes tend to participate in similar biological processes ^[13]^[14]. Furthermore, expression levels of correlated genes may be controlled by similar regulatory mechanisms. As such, GCNs can replicate known functional roles and regulatory interactions between genes. The construction of GCNs can additionally function as a prediction method, identifying novel functional interactions between genes, as well as assigning new roles to existing genes or genes of yet unknown function ^[15]^[16].

Many methods have been developed for the construction of a gene coexpression network ^[12]^[17]. However, most of the methodologies include the following steps:

Collection and integration of expression data
Processing and filtering of gene expression data and construction of expression matrices ^[12]^[18]
Selection of coexpression measure and construction of similarity matrices ^[15]^[17]
Selection of significance threshold and network construction ^[18]^[19].
Identification of modules using clustering techniques ^[17].

2. Collection and Processing of Transcriptomic Data and Construction of Gene Expression Matrices

The two main transcriptomic technologies used to obtain expression data for coexpression analysis are microarrays ^[20] and RNA-Seq ^[21]. The samples used for a coexpression analysis can be procured from public databases, produced through in-house experiments by research groups, or a combination of both. Using publicly available experiments is usually preferred, as many public transcriptomic data repositories provide an abundance of expression profiling studies. The most popular ones include Gene Expression Omnibus (GEO) ^[22], ArrayExpress ^[23], and Expression Atlas ^[24], which contain both microarray and RNA-Seq data, as well as Sequence Read Archive (SRA) ^[25], Genotype-Tissue Expression (GTEx) ^[26], The Cancer Genome Atlas (TCGA) ^[27] and European Nucleotide Archive (ENA) ^[28], which are RNA-Seq specific.

The source data must originate from the same organism and the same transcriptomic platform for the coexpression results to be comparable. Subsequently, there are two major approaches to coexpression analysis, depending on the experimental conditions of the primary sample data sets used ^[3]:

(A): ‘Condition independent’ approach uses a set of samples of a multitude of different conditions and source tissues. This method is suitable for studying the global coexpression landscape of an organism and demonstrates gene relationships regardless of experimental conditions ^[12].
(B): ‘Condition dependent’ ^[12]^[29] approach uses a set of samples that derive from a specific tissue or a specific experimental condition. In this case, the coexpression analysis aims to discover the gene coexpression profile under the selected condition.

The biological question at hand defines which one of the two approaches should be adopted. Since all aforementioned transcriptomic data repositories describe in detail each of their available samples and can be queried using integrated advanced search functions, samples of the same species from the same platform can be easily retrieved. This sample filtering strategy can be expanded to identify samples of a specific tissue or condition.

Another important consideration is the total number of samples used for the coexpression analysis. Although using a small number of samples results in stronger gene correlations, it also increases the chance of spurious correlations ^[3]. Consequently, a minimum of 20 samples is recommended for a coexpression analysis ^[30].

2.1. Microarray Data Analysis

There are several microarray manufacturers, such as Affymetrix ^[31], Agilent ^[32], Illumina ^[33], etc. Among them, Affymetrix GeneChip is the most popular platform to quantify gene expression. For each Affymetrix microarray hybridisation, a CEL file that contains the intensity values per probe is produced. Those primary files are then pre-processed with the assistance of a Chip Description File (CDF) which describes probe locations and probe set groupings on the chip, to calculate the expression values per probe set. These values are combined with an annotation file that contains gene-probe set correspondences, to obtain the gene expression values (Figure 1). Microarray pre-processing algorithms, usually referred to as normalisation algorithms, include the following steps:

background correction
normalisation
probe summarisation
log₂ transformation (optional)

/media/item_content/202208/6302d70969cd1biology-11-01019-g001.png

Figure 1. Pre-processing procedure for transcriptomic data. Primary microarray data are procured in a CEL format which is transformed to gene expression values by using a normalisation algorithm which is guided by a Chip Description File (CDF). In RNA-Seq primary data pre-processing, the FASTQ-formatted sequence read data are trimmed, then aligned to a reference genome. Gene counts are produced with the help of a General Feature Format (GFF) file. GFF file may also be used during alignment. Expression values are produced through normalisation. Both technologies eventually converge to the production of the same output, an expression matrix which contains the expressions of each gene in all samples.

The most popular normalisation methods that lead to one expression value per probe set are MAS5 ^[34], RMA ^[35], GCRMA ^[36], PLIER ^[37] and SCAN ^[38]. The oldest of these algorithms, MAS5, is the only one that does not log-transform the expression values. The SCAN and MAS5 algorithms normalise each microarray sample independently of the others of the same series and are therefore preferred when combining microarray samples from different series or laboratories. Other pre-processing algorithms, such as RMA or GCRMA, derive information from all samples together during normalisation and may thus introduce erroneous calculations, known as correlation artifacts.

2.2. RNA-Seq Data Analysis

Since its introduction, RNA-Seq has increasingly become the method of choice for measuring gene expression accurately. The RNA-Seq technology that studies the aggregated mRNA of cell populations or tissue parts is also referred to as bulk RNA-Seq. RNA-Seq is based on next-generation sequencing (NGS), where the length of the reads does not exceed 700 bp ^[39], and third-generation sequencing, where the read length can be more than 150,000 bp ^[40]. Next-generation sequencing technologies include Illumina ^[41], 454 Life Science ^[42], etc., while third-generation sequencers include PacBio ^[43], Nanopore ^[44], etc. The raw data produced by RNA-Seq experiments are FASTQ ^[45] files, containing the sequence reads, as well as a quality value for each base. The pre-processing of RNA-Seq data ^[46] consists of:

quality control and trimming of sequence reads
mapping reads to a reference genome or transcriptome
producing gene read counts
normalisation

The first step of the pre-processing pipeline includes the quality assessment of the sequence reads and subsequent trimming of the adapter sequences and low-quality reads ^[47]. Software for quality control includes FastQC ^[48], which produces per-sample reports, MultiQC ^[49], which aggregates these reports into a single summary report, and LongQC ^[50], which is specific for third-generation sequencing data. Software for trimming includes Cutadapt ^[51], fastp ^[52] and Trimmomatic ^[53]. Complete removal, also known as hard-clipping, is usually performed exclusively on the adapter sequences to save storage space and facilitate downstream analysis. Soft-clipping refers to tagging low-quality reads or adapter sequences, so that they can be ignored in later steps of the analysis. Soft-clipping is preferable to hard-clipping, as important information regarding the reads is not completely lost.

Next, the trimmed reads are aligned to FASTA-formatted sequences of their corresponding reference genome. This step is performed using specific alignment software depending on the sequence read length: aligners such as TopHat2 ^[54] and HISAT2 ^[55] are used for short reads, Magic-Blast ^[56], Graphmap2 ^[57], DART ^[58], LAMSA ^[58] and deSALT ^[59] for long reads, and Bowtie 2 ^[60], minimap2 ^[61], STAR ^[62], GMAP ^[63] and BWA-MEM ^[64] for both types of reads. Some aligners can also perform soft-clipping of bases from the left or right end of the read sequence ^[62] and unmapped reads will always be soft-clipped during the alignment step. This process produces a BAM-formatted ^[65] file which contains the mapping of the reads to the reference genome. This output is then combined with a General Feature Format 3 (GFF3) file ^[66] which contains the genomic feature coordinates, to count the gene reads, using programs such as Cufflinks ^[67], featureCounts ^[68] and HTSeq ^[69]. Aligners may also use GFF3 annotations upfront: the exon junctions provided by GFF files accelerate the mapping process and increase the quality of the spliced alignments.

Finally, to calculate the gene expression values, the resulting gene read count data are normalised. Algorithms such as Total Count ^[70], Quantile ^[71] and Upper Quartile ^[72] are purely based on arithmetic calculations concerning the read counts and their distributions in the samples, while TPM ^[73] and RPKM ^[74] take transcript length into account. TMM ^[75] and DESeq ^[76] combine mathematical and biological considerations, and qsmooth ^[77] normalises read counts based on the assumption that the distribution of samples should differ on a global scale, but not in each biological group/tissue. After normalisation, log₂ transformation of expression data is applied (Figure 1). Other software, such as Kallisto ^[78] and Salmon ^[79], use a different approach, pseudoaligning reads to a reference transcriptome and producing gene expression data two orders of magnitude faster than other pipelines. The selection of the normalisation algorithm impacts the quality of the resulting GCNs ^[80]; thus, different normalisation procedures might be chosen for condition-independent or condition-dependent analyses.
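To make the length-aware normalisation step more concrete, the following is a minimal sketch of a TPM calculation followed by log₂ transformation for a single sample. The function names, the toy counts and transcript lengths, and the pseudocount of 1 (added to avoid taking the logarithm of zero) are illustrative assumptions, not part of any of the cited tools.

```python
import math

def tpm(counts, lengths_bp):
    """Transcripts per million for one sample (illustrative helper)."""
    # Reads per kilobase: correct each count for transcript length.
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    # Scale so that the values of the sample sum to one million.
    scale = sum(rpk) / 1_000_000.0
    return [r / scale for r in rpk]

def log2_transform(values, pseudocount=1.0):
    """log2(value + pseudocount); the pseudocount is an assumption."""
    return [math.log2(v + pseudocount) for v in values]

# Toy read counts and transcript lengths for three genes (made-up numbers).
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 1500]

tpm_values = tpm(counts, lengths_bp)
log_values = log2_transform(tpm_values)
print(tpm_values)
print(log_values)
```

In this toy sample the first two genes receive the same TPM value despite different raw counts, because the second transcript is twice as long.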

2.3. Single-Cell RNA-Seq in Coexpression Analysis

Single-cell RNA-Seq (scRNA-Seq) is a recently emerging RNA-Seq-based technology which studies the transcriptome of single cells ^[81]. The pre-processing pipeline of scRNA-Seq data is similar to that of bulk RNA-Seq data. However, there are certain additional steps that need to be performed to account for the high heterogeneity of single-cell data ^[82]. A common phenomenon in scRNA-Seq data is the appearance of a large number of zero counts for genes that are truly expressed in other cells of the same type, known as dropout events ^[83]. In order to fill in the missing values, imputation methods, such as scImpute ^[84], SAVER ^[85] and MAGIC ^[86], have been developed. The produced expression matrix includes the expression values of genes per sample, which in this case refers to a single cell.

2.4. Microarrays vs. RNA-Seq in Coexpression Analysis

The end result of both microarray and RNA-Seq data pre-processing is a file containing gene expression values per sample. Affymetrix-based chips use an outdated default CDF, so several probe sets either do not correspond to any known gene or correspond to more than one gene, and some genes are recognised by no probe set or by more than one probe set. Thus, a custom CDF that better reflects current genomic and transcriptomic knowledge is recommended. One such example is the frequently updated BrainArray CDF ^[87] which ensures that each probe set corresponds to a single gene and vice versa.

RNA-Seq is a rapidly evolving technology with a larger, ever-increasing amount of publicly available data. As opposed to microarrays, RNA-Seq can accurately measure all known genes of an organism and has higher sensitivity. However, the expression estimations of RNA-Seq and microarrays are comparable, especially in genes with average expression ^[88]. Thus, the resulting gene coexpression landscapes which derive from RNA-Seq and microarrays are close ^[89] and biological pathway enrichments are similar ^[90]. The drawbacks of RNA-Seq include the significantly longer execution time of data pre-processing and higher computational resource requirements, as well as the use of pipelines of not yet fully optimised algorithms. In contrast, all steps in microarray pre-processing are performed by a single, quick, light and optimised algorithm (Figure 1).

Irrespective of the transcriptomic technology, pre-processing of existing raw transcriptomic data from public repositories is imperative, as it ensures data uniformity which is essential for subsequent coexpression analysis. Reanalysis of the original primary data with modern normalisation algorithms and genomic annotations can greatly improve the estimation of gene expression and thus the coexpression landscape. This is crucial, especially in the case of microarray data analysis, as it was reported that up to 50% of the genes that were identified as differentially expressed in Affymetrix-based studies where the default CDF was used might be artifacts ^[87].

2.5. Batch Correction

There are many conditions which may vary during the course of an experiment (such as reagents, equipment, personnel, etc.) and may introduce batch effects, which are a common source of variation in both microarray and RNA-Seq data ^[91]. In the case of condition-dependent (tissue-specific) coexpression analysis where data from multiple studies are combined, another layer of batch effects is introduced: experiments from different laboratories. Thus, batch effect identification and subsequent correction is an important step after expression data pre-processing. Usually, the studies that each sample belongs to are used to define the batches, although the date and time of each experiment may be used as batch surrogates. The existence of batch effects is confirmed through visual inspection via principal component analysis (PCA) ^[92] and hierarchical clustering ^[93]. Batch effects are present if samples from the same study which derive from different biological conditions cluster together, whereas the clusters should have been made up of samples of the same conditions, regardless of study source. Batch-corrected microarray-based coexpression analysis using ComBat ^[94] produces combined correlations which are more consistent with each single study’s correlations ^[89], while a larger number of high-quality GCNs are produced when ComBat batch correction is applied to normalised RNA-Seq data ^[80]. While ComBat requires the sources of the batch effects to be specified manually, SVA ^[95] can automatically estimate them, and subsequently applies ComBat correction. SVA is useful in cases where there are indications of technical variations (e.g., observed by PCA) but their source is not evident. scRNA-Seq samples are much more prone to technical variations, due to the low amount of genetic material isolated from each cell ^[82]. In this case, batch effect correction is performed by scRNA-Seq-specific methods, such as f-scLVM ^[96], MNN ^[97] and kBET ^[98].

3. Selection of Coexpression Measure and Construction of Similarity Matrices

After the acquisition of gene expression data, the correlation of expression between each gene pair needs to be calculated. This can be done using a wide variety of approaches:

Distance-based measures calculate the dissimilarity between the expression of a pair of genes. Traditional distance measures are based on Minkowski distances ^[99]:

d_{min} = {(\sum_{i = 1}^{n} {|x_{i} - y_{i}|}^{m})}^{\frac{1}{m}}

where m is a positive integer, n is the number of samples and x_i and y_i are the expression values of x and y genes in the ith sample. Euclidean and Manhattan distances are cases of Minkowski distance, depending on the value of m. In Manhattan distance, m = 1:

d = \sum_{i = 1}^{n} |x_{i} - y_{i}|

In one of the most used distance measures, Euclidean distance, m = 2:

d = \sqrt{\sum_{i = 1}^{n} {(x_{i} - y_{i})}^{2}}
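As a minimal sketch of the Minkowski family above, assuming that the expression profiles of two genes are plain Python lists of equal length, a single function can cover both the Manhattan (m = 1) and the Euclidean (m = 2) case. The function name and the toy expression values are illustrative.

```python
def minkowski_distance(x, y, m):
    """Minkowski distance of order m between two expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same samples")
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

# Toy expression values of two genes in three samples (made-up numbers).
gene_x = [1.0, 2.0, 3.0]
gene_y = [4.0, 6.0, 3.0]

manhattan = minkowski_distance(gene_x, gene_y, 1)  # |−3| + |−4| + 0
euclidean = minkowski_distance(gene_x, gene_y, 2)  # sqrt(9 + 16 + 0)
print(manhattan, euclidean)
```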

Correlation metrics describe the tendency of the expression levels of a pair of genes, to increase or decrease simultaneously across different samples ^[3]^[4]. They produce coefficients ranging from −1 (perfect anti-correlation) to +1 (perfect correlation), with values near 0 indicating no correlation.

The Pearson correlation coefficient (PCC or r) ^[100] is a measure that depicts the linear correlation between two genes, x and y, and is calculated as follows:

r = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{\sum_{i = 1}^{n} {(x_{i} - \bar{x})}^{2} \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}}

where n is the number of samples, x_i and y_i are the expression values of x and y genes in the ith sample, and \bar{x} and \bar{y} are their mean expression values across all samples. PCC is useful for detecting correlation between genes that may have different average expression levels. However, in some cases it is sensitive to outliers ^[3]^[12], resulting in false-positive results when the number of samples is small and pre-processing is based on quantile normalisation ^[101].
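The Pearson formula can be transcribed almost term by term. The sketch below assumes list-based expression profiles; the function name, the toy values and the choice to raise an error for a gene with constant expression (where the denominator is zero) are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / math.sqrt(var_x * var_y)

# gene_y has a much higher average expression than gene_x but rises with it;
# gene_z falls as gene_x rises (made-up numbers).
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [12.0, 14.0, 16.0, 18.0]
gene_z = [8.0, 6.0, 4.0, 2.0]

r_xy = pearson(gene_x, gene_y)
r_xz = pearson(gene_x, gene_z)
print(r_xy, r_xz)
```

The first pair illustrates the point made above: a difference in average expression level does not affect the coefficient.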

Uncentred correlation (Cosine similarity) ^[102] depicts the similarity between the expression of two genes and, in contrast to centred PCC, it does not take into account the mean expression of each gene. It is given by:

cos_{sim} (x, y) = \frac{\sum_{i = 1}^{n} x_{i} y_{i}}{\sqrt{\sum_{i = 1}^{n} {(x_{i})}^{2}} \sqrt{\sum_{i = 1}^{n} {(y_{i})}^{2}}}
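A short sketch of the uncentred correlation, under the same assumption of list-based profiles, shows the practical consequence of not subtracting the means. The function name and the toy values are illustrative.

```python
import math

def cosine_similarity(x, y):
    """Uncentred correlation (cosine similarity) of two profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

gene_x = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]     # gene_x multiplied by 2
shifted = [11.0, 12.0, 13.0]  # gene_x plus a constant offset of 10

cos_scaled = cosine_similarity(gene_x, scaled)
cos_shifted = cosine_similarity(gene_x, shifted)
print(cos_scaled, cos_shifted)
```

A scaled profile keeps a similarity of 1, whereas a constant offset lowers the cosine similarity even though the centred PCC of the same pair would still be 1.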

Spearman’s rank correlation coefficient (ρ) ^[103] is calculated as the PCC of the rankings of the expression values. In cases where there are no ranking ties, ρ can be calculated as follows ^[104]:

ρ (x, y) = 1 - \frac{6 \sum_{j = 1}^{n} D_{j}^{2}}{n (n^{2} - 1)}

where D_j is the difference between the ranks of the corresponding values of genes x and y in the jth sample and n is the number of samples.

As a parametric measure, PCC is used if gene expression values follow normal distributions across samples, otherwise a nonparametric method, such as Spearman’s rank correlation coefficient, should be used. The selection of the algorithm can be based on a normality test. As Spearman’s correlation coefficient uses expression ranks instead of expression values, ρ is less sensitive to extreme data values.
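The two routes to Spearman’s ρ described above can be compared in a small sketch: the general route computes the PCC of the ranks (ties receive their average rank, which is an assumption about tie handling), and the shortcut applies the no-ties formula. All function names and toy values are illustrative.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """General definition: PCC of the rankings."""
    return pearson(average_ranks(x), average_ranks(y))

def spearman_no_ties(x, y):
    """Shortcut formula, valid only when there are no ranking ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# The extreme value 1000 only contributes its rank (5), not its magnitude.
gene_x = [10.0, 20.0, 30.0, 40.0, 1000.0]
gene_y = [1.0, 2.0, 3.0, 5.0, 4.0]

rho = spearman(gene_x, gene_y)
rho_shortcut = spearman_no_ties(gene_x, gene_y)
print(rho, rho_shortcut)
```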

Kendall’s rank correlation coefficient (τ) ^[105] is a measure of nonlinear dependence between two random variables. It is suitable for identifying key genes that increase or decline in monotonic fashions in expression data collected during a biological process or developmental stage ^[106]. For any pair of observations {(x_i, x_j), (y_i, y_j)} of expressions of genes x and y in samples i and j, where i < j: if (x_i > x_j AND y_i > y_j) OR (x_i < x_j AND y_i < y_j), the pair is concordant; if (x_i > x_j AND y_i < y_j) OR (x_i < x_j AND y_i > y_j), the pair is discordant; and if x_i = x_j OR y_i = y_j, the pair is neither concordant nor discordant. Kendall’s correlation coefficient is given by ^[106]:

τ = \frac{n_{c} - n_{d}}{\sqrt{[\frac{n (n - 1)}{2} - \sum_{k} \frac{t_{k} (t_{k} - 1)}{2}] [\frac{n (n - 1)}{2} - \sum_{l} \frac{u_{l} (u_{l} - 1)}{2}]}}

where n is the number of samples, n_c is the number of concordant observation pairs, n_d is the number of discordant pairs, t_k is the number of observations tied at the kth rank of x and u_l is the number of observations tied at the lth rank of y. In cases where there are no tied observations, the following formula is used:

τ = \frac{n_{c} - n_{d}}{\frac{n (n - 1)}{2}}

Since Kendall’s rank correlation coefficient also identifies monotonic relationships, it can be used as an alternative to Spearman’s.
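The pair-counting definition of Kendall’s τ, including the tie correction in the denominator, can be sketched as follows. This is a straightforward quadratic-time illustration with made-up values, and the function name is an assumption; it is not meant as an efficient implementation.

```python
import math
from collections import Counter

def kendall_tau(x, y):
    """Kendall's tau with tie correction, by direct pair counting."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
            # sign == 0: tied in x or y, neither concordant nor discordant
    all_pairs = n * (n - 1) / 2
    ties_x = sum(t * (t - 1) / 2 for t in Counter(x).values())
    ties_y = sum(u * (u - 1) / 2 for u in Counter(y).values())
    return (concordant - discordant) / math.sqrt(
        (all_pairs - ties_x) * (all_pairs - ties_y)
    )

# No ties: 9 concordant and 1 discordant pair out of 10.
tau_no_ties = kendall_tau([1, 2, 3, 4, 5], [1, 2, 3, 5, 4])
# One tie in x: 5 concordant pairs, 0 discordant, 1 tied pair.
tau_with_ties = kendall_tau([1, 2, 2, 3], [1, 2, 3, 4])
print(tau_no_ties, tau_with_ties)
```

Without ties, both tie sums are zero and the expression reduces to the simpler formula above.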

The aforementioned correlation coefficient values are used to compute the Mutual Rank (MR) ^[107] score as follows:

M_{x y} = \sqrt{R_{x y} R_{y x}}

where R_xy is the rank of the correlation of genes x and y in the descending list of all gene correlations of x. Since MR is a distance measure, with smaller values meaning higher correlation, a Logit Score (LS) transformation ^[108] is applied:

L_{x y} = \log_{2} (N - M_{x y}) - \log_{2} (M_{x y})

where N is the total number of genes studied. Higher values of LS indicate stronger correlations.
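The following sketch derives MR and LS values from a small, made-up correlation matrix. It assumes that each gene is ranked only against the other genes (the gene itself is excluded from its own list), which keeps MR below N so that the logarithm is always defined; the cited databases may handle this detail differently. Function names are illustrative.

```python
import math

def rank_matrix(cor):
    """ranks[x][y]: rank of gene y in the descending correlation list of x."""
    n = len(cor)
    ranks = [[0] * n for _ in range(n)]
    for x in range(n):
        partners = sorted(
            (y for y in range(n) if y != x),
            key=lambda y: cor[x][y],
            reverse=True,
        )
        for position, y in enumerate(partners, start=1):
            ranks[x][y] = position
    return ranks

def mutual_rank(ranks, x, y):
    """Geometric mean of the two directional ranks."""
    return math.sqrt(ranks[x][y] * ranks[y][x])

def logit_score(mr, n_genes):
    return math.log2(n_genes - mr) - math.log2(mr)

# Toy symmetric correlation matrix for four genes (made-up values).
cor = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.5, 0.8, 1.0, 0.2],
    [0.1, 0.3, 0.2, 1.0],
]
ranks = rank_matrix(cor)
mr_01 = mutual_rank(ranks, 0, 1)   # both genes rank each other first
mr_03 = mutual_rank(ranks, 0, 3)   # both genes rank each other last
ls_01 = logit_score(mr_01, len(cor))
ls_03 = logit_score(mr_03, len(cor))
print(mr_01, ls_01, mr_03, ls_03)
```

The pair with the smallest MR receives the highest LS, matching the direction of the transformation described above.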

Finally, Mutual Information (MI) is a method that detects the amount of information obtained about the expression of one gene by observing the expression of another gene ^[109]. MI is based on Shannon’s theory of communication ^[110] and is calculated by subtracting the joint entropy of two genes X and Y from the sum of their entropies ^[111]:

I (X, Y) = H (X) + H (Y) - H (X, Y)
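Because entropies are defined on discrete distributions, continuous expression values have to be discretised before the formula can be applied. The sketch below assumes simple equal-width binning with two bins, which is only one of several possible estimators; the function names, bin count and toy values are assumptions.

```python
import math
from collections import Counter

def discretise(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins or 1.0  # guard for constant profiles
    return [min(int((v - lowest) / width), n_bins - 1) for v in values]

def entropy(labels):
    """Shannon entropy in bits of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mutual_information(x, y, n_bins=2):
    bins_x = discretise(x, n_bins)
    bins_y = discretise(y, n_bins)
    joint = list(zip(bins_x, bins_y))
    return entropy(bins_x) + entropy(bins_y) - entropy(joint)

gene_x = [1.0, 2.0, 9.0, 10.0]
gene_y = [5.0, 6.0, 1.0, 2.0]    # low when gene_x is high, and vice versa
gene_z = [1.0, 10.0, 2.0, 9.0]   # unrelated to gene_x after binning

mi_xy = mutual_information(gene_x, gene_y)
mi_xz = mutual_information(gene_x, gene_z)
print(mi_xy, mi_xz)
```

In this toy case the inversely related pair still shares one full bit of information, while the unrelated pair shares none.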

4. Selection of Significance Thresholds for Network Construction

Once a correlation measure has been chosen, a correlation matrix which contains all pairwise gene correlation coefficients cor(x, y) for any x and y genes is constructed. The correlation matrix is a square matrix with M × M dimensions, where M is the number of studied genes. The diagonal values of the matrix are 1, as they correspond to the correlation of any gene to itself, and the matrix is symmetric about the main diagonal; thus, it can also be portrayed as an upper or lower triangular matrix, displaying each gene pair correlation once.

There are several ways to portray the correlation landscape of a large number of genes (Figure 2). The simplest and most common way to study gene coexpression is to produce a list of the genes most coexpressed with a “driver gene”, i.e., the gene of interest. In this coexpressed gene list ^[112], the correlations of the driver gene with all other genes are ranked according to their correlation coefficient, either in descending order to highlight the top positively correlated genes, or in ascending order to highlight the top negatively correlated genes. In effect, a coexpression list contains the ordered values of the correlation matrix row (or column) of the driver gene; thus, it demonstrates singular gene coexpression relationships, without accounting for any interconnections among the coexpressed genes of the list.
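A coexpressed gene list can be sketched as one sorted row of the correlation matrix. The example below assumes PCC as the correlation measure and a small dictionary of made-up expression profiles; the gene names and function names are hypothetical.

```python
import math

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(expression):
    """Symmetric gene-by-gene matrix stored as nested dictionaries."""
    genes = list(expression)
    matrix = {gene: {} for gene in genes}
    for i, gene_x in enumerate(genes):
        matrix[gene_x][gene_x] = 1.0
        for gene_y in genes[i + 1:]:
            r = pearson(expression[gene_x], expression[gene_y])
            matrix[gene_x][gene_y] = r
            matrix[gene_y][gene_x] = r
    return matrix

def coexpression_list(matrix, driver, ascending=False):
    """Row of the driver gene, ordered by correlation coefficient."""
    partners = [(g, r) for g, r in matrix[driver].items() if g != driver]
    return sorted(partners, key=lambda pair: pair[1], reverse=not ascending)

# Hypothetical genes measured in five samples (made-up values).
expression = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 11],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [3, 1, 4, 1, 5],
}
matrix = correlation_matrix(expression)
top_positive = coexpression_list(matrix, "geneA")
top_negative = coexpression_list(matrix, "geneA", ascending=True)
print(top_positive)
print(top_negative)
```

The descending list starts with the most positively correlated partner, and the ascending list starts with the most negatively correlated one.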

/media/item_content/202208/6302da42427f7biology-11-01019-g002.png

Figure 2. Flowchart depicting the steps for performing gene coexpression analysis using gene expression data. Gene pairwise correlations are calculated and regardless of the chosen correlation measure, correlation values need to be transformed to similarity values and then to adjacency values. Gene coexpression can be depicted as lists, dendrograms or networks. Eventually, the results of the coexpression analysis need to be evaluated through enrichment analysis.

To overcome the aforementioned limitation, a more sophisticated way to study gene coexpression is the construction of a GCN, based on an M × M similarity matrix which scales all correlation values between 0 and 1. If the absolute correlation values are used for the construction of the matrix (s_xy = |cor(x, y)|, where s_xy is the similarity between x and y genes), then the similarity matrix is considered “unsigned”. In unsigned similarity matrices, positively and negatively correlated gene pairs cannot be distinguished.
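As a rough sketch of the unsigned similarity described above, and of one simple way to turn similarities into network edges, the code below takes absolute correlations and then applies a hard cut-off. The cut-off value of 0.7 and the function names are assumptions chosen only for illustration; other adjacency definitions exist.

```python
def unsigned_similarity(cor):
    """s_xy = |cor(x, y)| for every gene pair."""
    return [[abs(value) for value in row] for row in cor]

def hard_threshold_adjacency(similarity, threshold):
    """1 where two different genes reach the threshold, otherwise 0."""
    n = len(similarity)
    return [
        [1 if i != j and similarity[i][j] >= threshold else 0 for j in range(n)]
        for i in range(n)
    ]

# Toy correlation matrix for three genes (made-up values):
# gene 0 is positively correlated with gene 1 and negatively with gene 2.
cor = [
    [1.0, 0.85, -0.9],
    [0.85, 1.0, 0.2],
    [-0.9, 0.2, 1.0],
]
similarity = unsigned_similarity(cor)
adjacency = hard_threshold_adjacency(similarity, threshold=0.7)
print(similarity)
print(adjacency)
```

Both the positive pair (0, 1) and the negative pair (0, 2) become edges, which illustrates why the two kinds of relationship cannot be told apart in an unsigned matrix.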

5. Identification of Modules Using Clustering Techniques

Modules in a GCN can be defined as groups of genes that are densely linked ^[113]^[114]^[115]. Highly connected genes within a network are called hub genes. These genes have been shown to be functionally significant ^[116]^[117]. There are two types of hub genes, named intra-modular and inter-modular hubs, that are central to specific modules in the network or central to the entire network, respectively ^[17].
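One simple way to look for highly connected genes is to count the edges of each node. The sketch below assumes that a network is given as a list of undirected edges and that connectivity is measured by node degree; the gene names, the edge list and the function names are hypothetical.

```python
from collections import Counter

def node_degrees(edges):
    """Number of coexpression links per gene in an undirected network."""
    degree = Counter()
    for gene_a, gene_b in edges:
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree

def hub_genes(edges, top_n=1):
    """Genes with the highest degree (ties keep first-seen order)."""
    return [gene for gene, _ in node_degrees(edges).most_common(top_n)]

# Hypothetical edge list of a small coexpression network.
edges = [
    ("g1", "g2"),
    ("g1", "g3"),
    ("g1", "g4"),
    ("g2", "g3"),
    ("g5", "g4"),
]
degrees = node_degrees(edges)
hubs = hub_genes(edges)
print(degrees)
print(hubs)
```

Restricting the edge list to the genes of one module before counting would give a comparable view of connectivity within that module.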

Clustering is a method to group and visualise coexpressed genes, using a distance matrix as input. Genes that have similar expression patterns across multiple samples are grouped to produce sets of coexpressed genes ^[17]^[109]. The most common clustering method is hierarchical clustering whose most popular implementation in gene coexpression is the unweighted pair group method with arithmetic mean (UPGMA) ^[93]. Hierarchical clustering starts by connecting genes that are closest to each other and continues to connect resulting clusters based on their pairwise distances, eventually forming a tree (in this case, a gene coexpression tree). The leaves of the tree represent the genes and the lengths of the branches reflect the distance between genes, thus tree clades represent coexpression modules ^[17]^[109]^[118]. The tree output file is usually in Newick format ^[119].
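The agglomerative procedure described above can be sketched in a few lines. This simplified UPGMA assumes that a full distance matrix is already available (for example a distance derived from correlations, which is an assumption here) and returns only the tree topology in Newick format, without branch lengths. The function name and the toy distances are illustrative.

```python
def upgma(names, dist):
    """Simplified UPGMA: returns the tree topology as a Newick string."""
    # Each cluster id maps to (Newick text, number of genes in the cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    n = len(names)
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Join the two closest clusters.
        (a, b), _ = min(d.items(), key=lambda item: item[1])
        (newick_a, size_a) = clusters.pop(a)
        (newick_b, size_b) = clusters.pop(b)
        updated = {}
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Size-weighted mean equals the average over all gene pairs.
            updated[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        d.update(updated)
        clusters[next_id] = ("(" + newick_a + "," + newick_b + ")", size_a + size_b)
        next_id += 1
    (newick, _), = clusters.values()
    return newick + ";"

# Toy distance matrix for four hypothetical genes (made-up values).
names = ["g1", "g2", "g3", "g4"]
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
tree = upgma(names, dist)
print(tree)
```

In the resulting tree, g1 and g2 form one clade and g3 and g4 another, which corresponds to two candidate coexpression modules.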

Biclustering generates clusters of rows and columns simultaneously ^[120]. In the case of gene expression, rows are genes and columns are samples. Biclustering is usually depicted in the form of a coexpression heatmap. Based on their expression level, genes are mapped into clusters with the main objective to find homogeneous submatrices called biclusters which may overlap, or discover local expression patterns according to certain experimental conditions ^[121]. Due to this process, biological information about these clusters can be extracted. This information refers not only to the correlated genes but also to the identification of genes that do not act the same way in all conditions ^[122].

A popular non-hierarchical clustering method is k-means, a partitioning method that subdivides the genes into a predefined number k of clusters ^[118]^[123]. The k-means method initially sets k points that function as cluster centre points (centroids). Each gene is then assigned to the cluster with the closest centroid. New positions for the cluster centroids are set as the average of the genes of the cluster, and gene assignment begins anew. The previous two steps continue until no more genes change cluster ^[118]^[124]. However, it is difficult to determine the optimal value of k, and multiple runs of the algorithm may result in different gene memberships for each cluster.
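The assignment and update steps of k-means can be sketched as below. To keep the example reproducible, the initial centroids are simply the first k profiles, which is a simplifying assumption; implementations that choose the initial centroids at random are the ones whose runs may differ, as noted above. Function names and toy profiles are illustrative.

```python
def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def k_means(profiles, k, max_iter=100):
    """Plain k-means on expression profiles (lists of equal length)."""
    centroids = [list(p) for p in profiles[:k]]  # naive, deterministic start
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each gene joins the cluster of its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # no gene changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its member genes.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Four hypothetical genes in three samples: two low and two high profiles.
profiles = [
    [1.0, 1.1, 0.9],
    [5.0, 5.2, 4.9],
    [1.2, 0.9, 1.0],
    [4.8, 5.1, 5.0],
]
assignment, centroids = k_means(profiles, k=2)
print(assignment)
print(centroids)
```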

The self-organizing map (SOM) method is closely related to k-means, also starting with a predetermined number of cluster centroids. In the case of SOMs though, the centroids are linked in a prespecified geometrical configuration ^[124]. Each iteration involves randomly selecting a gene and moving the closest centroid in the direction of this gene, as well as its neighbouring centroids on the grid ^[118]. In this fashion, neighbouring centroids in the initial geometry tend to be mapped to nearby centroids in k-dimensional space ^[125]. Clusters that are closest to each other in the initial arrangement, tend to be more similar to each other than those that are further apart ^[124]. The end result is a grid of clusters, in which neighbouring clusters show related expression patterns ^[118].

Gene coexpression trees produced through clustering cannot portray anti-coexpressed genes and are limited to classifying a gene into a single functional cluster, although genes may possess multiple functions and participate in different metabolic pathways ^[126].

6. Gene List Functional Enrichment Analysis

The purpose of a gene coexpression analysis is to discover functional gene partners of a gene of interest. Biological functions can be attributed to genes of unknown role, based on the verified functions of their coexpressed gene partners ^[12], an approach known as “guilt by association”. By identifying the most coexpressed genes to a gene of interest or the subnetwork or subtree that the gene of interest belongs to, from a GCN or a gene coexpression tree, respectively, lists of highly coexpressed genes are created. The predominant biological functions, metabolic pathways, regulating transcription factors, disease associations, etc., for such a gene list can be determined through functional enrichment analysis.

Biological term enrichment categories include: gene ontologies ^[127], biological and metabolic pathways ^[128], protein structures ^[129], gene-disease associations ^[130], regulatory motifs ^[131], experimentally verified transcription factor binding sites ^[132], etc. Public online tools performing enrichment analysis of coexpressed gene lists that result from coexpression analyses include g:Profiler ^[133], Enrichr ^[134], WebGestalt ^[135], FLAME ^[136], DAVID ^[137] and GOnet ^[138]. More specifically, g:Profiler offers enrichment analyses for more than 700 organisms. FLAME can perform many visualisations on the input gene list but its enrichment analysis is based on g:Profiler calculations. Enrichr offers an immense list of available biological term compilations, but is available only for six model species. Compared to the other tools, DAVID and WebGestalt can be used with or without a reference gene list, with WebGestalt allowing for detailed parameter customisation before analysis. Most of the tools also offer integrated functions for gene ID conversions. Finally, GOnet can perform gene ontology enrichment analysis only for human and mouse, but is unique in visualising the input genes and their corresponding enriched gene ontologies as well as the ontology hierarchy and relationships between ontologies as a graph.
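
As an illustration of what an over-representation calculation can look like, the sketch below computes a one-sided hypergeometric p-value for a single biological term. This is one common choice of statistic and is shown here as an assumption; the tools listed above may use different tests, reference gene lists and multiple-testing corrections. The function name `enrichment_p_value` and its parameters are hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_annotated, n_list, n_overlap):
    """One-sided hypergeometric p-value for over-representation of a term.

    n_background: genes in the reference (background) list
    n_annotated:  background genes annotated with the term
    n_list:       genes in the coexpressed gene list
    n_overlap:    genes in the list that carry the term
    Returns P(X >= n_overlap) when n_list genes are drawn at random.
    """
    upper = min(n_annotated, n_list)
    total = comb(n_background, n_list)
    # Sum the probabilities of observing n_overlap or more annotated genes.
    tail = sum(comb(n_annotated, i) * comb(n_background - n_annotated, n_list - i)
               for i in range(n_overlap, upper + 1))
    return tail / total
```

In practice such a p-value would be computed for every term and then corrected for multiple testing, a step that is not shown here.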

7. Coexpression Tools

7.1. Global Coexpression Web Tools

COXPRESdb ^[139] provides gene coexpression relationships for nine animal and two fungal species: Homo sapiens, Mus musculus, Rattus norvegicus, Gallus gallus, Macaca mulatta, Canis lupus familiaris, Danio rerio, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae and Schizosaccharomyces pombe. ATTED-II ^[108] is the sister database to COXPRESdb, providing coexpression data for nine plant species: Arabidopsis thaliana, Brassica rapa, Glycine max, Medicago truncatula, Oryza sativa, Populus trichocarpa, Solanum lycopersicum, Vitis vinifera and Zea mays. COXPRESdb and ATTED-II contain both microarray and RNA-Seq data and are constantly evolving with new features and increasing numbers of samples. Both databases use Logit Score (LS)-transformed mutual ranks as their gene coexpression measure, and RNA-Seq data are processed with their own quantification software, Matataki ^[140], an algorithm optimised for execution speed. The coexpression results are presented as lists of genes coexpressed with the gene of interest, sorted in descending LS order and based on representative gene expression data that combine RNA-Seq and microarrays. Adjacent lists display results from all other available transcriptomic subsets, such as microarray samples from specific conditions. Furthermore, to increase the robustness of the analysis, coexpression results of orthologous genes of closely related species are also displayed. Finally, the top coexpressed partners of a gene of interest are portrayed as coexpression networks in the gene’s information page (Figure 3).
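
The mutual rank underlying the LS measure can be sketched as follows, assuming the commonly used definition in which the mutual rank of genes A and B is the geometric mean of the rank of B in A's coexpression list and the rank of A in B's list. The Logit Score transformation itself is not reproduced here, and the function name `mutual_ranks` and the input layout are hypothetical.

```python
from math import sqrt

def mutual_ranks(corr):
    """Mutual rank for every gene pair.

    corr: dict gene -> dict other_gene -> correlation (symmetric, no self pairs).
    Assumed definition: MR(A, B) = sqrt(rank of B for A * rank of A for B),
    where rank 1 is the most strongly correlated partner. Ties are broken
    arbitrarily by sort order in this sketch.
    """
    rank = {}
    for a, partners in corr.items():
        ordered = sorted(partners, key=partners.get, reverse=True)
        rank[a] = {b: i + 1 for i, b in enumerate(ordered)}
    return {(a, b): sqrt(rank[a][b] * rank[b][a])
            for a in corr for b in corr[a] if a < b}
```

A low mutual rank indicates that two genes rank each other highly, which is less sensitive to the overall correlation level of either gene than the raw correlation value.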

Figure 3. Coexpression results of ATTED-II and COXPRESdb: (a) GCN of the top coexpressed partners to CTL2, found in the gene’s information page; (b) GCN of the top coexpressed gene partners to NRP1, found in the gene’s information page. Coloured circles refer to different KEGG pathways.

Arabidopsis Coexpression Tool (ACT) ^[114]^[115]^[126] studies gene coexpression in 21,273 Arabidopsis thaliana genes using high-quality healthy microarray samples. The latest version of ACT is based on 3500 Affymetrix Arabidopsis ATH1 Genome Array GeneChip samples from ArrayExpress, GEO and NASCArrays. Expression data were produced using the SCAN algorithm along with Brainarray CDF. Genes were clustered using UPGMA hierarchical clustering to create a gene coexpression tree. Using a single gene as input, a subclade containing the driver gene and its coexpressed genes is produced (Figure 4a). The subtree size can be increased or decreased. Multiple biological term enrichment analyses are offered and the coexpression subtree and its corresponding gene list can be exported to various external tools for further downstream analysis.

ACT’s sister web tool for Homo sapiens is Human Gene Correlation Analysis (HGCA) ^[13]. HGCA1.0 is based on 1959 Affymetrix Human Genome U133 Plus 2.0 samples of various cells and tissues. Gene expression data were produced using the MAS5.0 algorithm with default CDF. Pairwise PCCs were measured for all probe sets and were grouped using neighbour joining ^[141]. Similar to ACT, users select a driver probe set which corresponds to the gene of interest. Users can choose between two outputs: a coexpressed gene list or a gene coexpression tree. Over-representation analysis for multiple biological categories is also available. HGCA1.5 is based on the same samples as HGCA1.0. Nevertheless, primary data are processed in a manner identical to that of ACT. HGCA2.0 is a major upgrade as expression data from 55,431 genes were produced from GTEx RNA-Seq gene count data of 3500 samples, using qsmooth normalisation. The downstream data processing is similar to that of HGCA1.5. HGCA1.5 and HGCA2.0 output gene coexpression trees (Figure 4b).
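
The idea of building a UPGMA gene coexpression tree and extracting the subclade around a driver gene can be sketched as below. This is not the ACT or HGCA implementation: the use of 1 − PCC as distance, the nested-tuple tree representation and the names `upgma`, `leaves` and `subclade` are assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def upgma(genes, dist):
    """Build a UPGMA tree as nested tuples; dist(a, b) gives a gene distance."""
    clusters = {i: (g, 1) for i, g in enumerate(genes)}   # id -> (tree, size)
    d = {frozenset((i, j)): dist(genes[i], genes[j])
         for i in clusters for j in clusters if i < j}
    next_id = len(genes)
    while len(clusters) > 1:
        pair = min(d, key=d.get)                 # closest pair of clusters
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            # Size-weighted average keeps the distance an unweighted mean
            # over all gene pairs (the defining property of UPGMA).
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        del d[pair]
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0]

def leaves(tree):
    """All gene names under a (sub)tree; leaves are strings."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def subclade(tree, driver, n_ancestors):
    """Genes under the node n_ancestors (>= 1) levels above the driver gene."""
    path, node = [], tree
    while not isinstance(node, str):
        path.append(node)
        node = node[0] if driver in leaves(node[0]) else node[1]
    if node != driver:
        raise KeyError(driver)
    return leaves(path[max(len(path) - n_ancestors, 0)])
```

Increasing `n_ancestors` enlarges the returned subtree, which loosely mirrors how the subtree size can be increased or decreased around the driver gene.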

Figure 4. Coexpression results of ACT and HGCA2.0: (a) Default coexpression subtree in ACT using CTL2 as driver gene. The subtree contains nine genes (including the driver gene) and possesses five ancestral nodes; (b) Default coexpression subtree in HGCA2.0 using NRP1 as driver gene. The subtree contains 34 genes (including the driver gene) and possesses five ancestral nodes.

EXPath 2.0 ^[142] allows the user to perform various transcriptomic-based analyses for six plant species: Arabidopsis thaliana, Oryza sativa, Zea mays, Solanum lycopersicum, Glycine max, and Medicago truncatula. EXPath 2.0 contains both microarray and RNA-Seq data from various conditions. A single gene analysis returns several outputs for the gene of interest, including its biological terms, sample-specific expression and top correlated or anti-correlated genes. A multiple gene query results in a weighted GCN that includes both positively and negatively coexpressed genes. Finally, GO and pathway enrichment, as well as differential gene expression analyses, are available.

PLANt coEXpression (PLANEX) ^[143] is a coexpression database for eight plant species: Arabidopsis thaliana, Glycine max, Hordeum vulgare, Oryza sativa, Solanum lycopersicum, Triticum aestivum, Vitis vinifera and Zea mays. This database presents a list of coexpressed genes ranked by their PCCs. Positive and negative cut-offs were determined by finding the top 1% of the positively and the top 1% of the negatively correlated gene pairs. Furthermore, a GCN can also be presented. Another functionality is the comparison of the coexpression between any user-selected gene pair. Unlike other similar databases, PLANEX mapped the probes against representative genes by string match instead of BLAST ^[144], so that a positive result was produced only if each base in a probe sequence matched the representative gene sequence perfectly, without any gap. In addition, the PCC was subjected to PCA, for the identification of a gene set with changing expression over different experiments.
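
One possible reading of the cut-off rule described above is sketched below: among all positively correlated pairs the strongest 1% are kept, and likewise for the negatively correlated pairs. This interpretation, the function name `tail_cutoffs` and the handling of very small inputs are assumptions, not details taken from the PLANEX publication.

```python
def tail_cutoffs(pccs, fraction=0.01):
    """Return (positive_cutoff, negative_cutoff) for a list of pairwise PCCs.

    The positive cut-off is the weakest PCC among the top `fraction` of
    positively correlated pairs; the negative cut-off is defined likewise
    for negatively correlated pairs. At least one pair is always kept.
    """
    positive = sorted((r for r in pccs if r > 0), reverse=True)
    negative = sorted(r for r in pccs if r < 0)   # most negative first

    def cutoff(values):
        if not values:
            return None
        n_top = max(1, int(len(values) * fraction))
        return values[n_top - 1]

    return cutoff(positive), cutoff(negative)
```

Gene pairs with a PCC at or above the positive cut-off, or at or below the negative cut-off, would then be retained as coexpressed or anti-coexpressed pairs, respectively.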

Correlation Networks (CorNet) ^[145] is an online tool for network construction in Arabidopsis thaliana. CorNet is based on microarray and RNA-Seq samples and can perform coexpression, protein–protein or regulatory interaction analyses. Using pre-defined or user-uploaded primary datasets, CorNet displays the coexpressed genes to a single gene or a list of genes. Various customisation options are available: selecting between Pearson or Spearman correlation coefficients and setting a correlation threshold, p-value cut-off, the number of resulting coexpressed genes and whether the GCN will contain relationships between the coexpressed genes. The output is either a GCN which is visualised through Cytoscape (Figure 5) or a coexpressed gene list.
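
The kind of query described above, in which a correlation coefficient, a threshold and a maximum number of partners are chosen, can be sketched as follows. The code is not CorNet's implementation and omits the p-value cut-off; the function names and default values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked values."""
    return pearson(ranks(x), ranks(y))

def coexpressed(profiles, query, method="pearson", threshold=0.7, top_n=10):
    """Genes whose correlation with `query` is at least `threshold`."""
    corr = pearson if method == "pearson" else spearman
    hits = [(g, corr(profiles[query], p))
            for g, p in profiles.items() if g != query]
    hits = [(g, r) for g, r in hits if r >= threshold]
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:top_n]
```

Because Spearman correlation depends only on ranks, a gene with a monotonic but non-linear relationship to the query can pass a threshold under `method="spearman"` that it fails under `method="pearson"`.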

Figure 5. GCN of ten coexpressed partners to CTL2 in CorNet, visualised through Cytoscape. The GCN includes the coexpression inter-relationships.

8. General Guidelines for Coexpression Tool Selection

First, the user should decide whether the tool for the species of interest should be global or tissue/cell-type specific. Then, a collection of global or tissue-specific tools, depending on the previous selection, might be run and the user could form a consensus list of coexpressed genes that are present in the results of the majority of the tools. Alternatively, the user might assess the performance of each tool, based on various indicators of how well it depicts the coexpression landscape. First of all, the number of samples used by each tool is an important factor, with higher sample numbers resulting in more reliable coexpression relationships, as a small sample number might introduce spurious correlations ^[3]. Sample variability is equally important to ensure that the dataset is not skewed towards a certain tissue, when global coexpression is studied. In addition, high-quality samples and the application of batch correction increase the quality of coexpression ^[80]^[106]^[146]^[147]^[148].
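
The consensus step suggested above can be sketched in a few lines, assuming that "majority" means a gene is reported by more than half of the tools. The function name `consensus_genes` and the tool and gene labels in any call are hypothetical.

```python
from collections import Counter

def consensus_genes(tool_results):
    """Genes reported by more than half of the tools.

    tool_results: dict tool_name -> iterable of coexpressed gene identifiers.
    Identifiers are assumed to be already converted to a common ID type.
    """
    counts = Counter(gene
                     for genes in tool_results.values()
                     for gene in set(genes))      # count each tool only once
    majority = len(tool_results) / 2
    return sorted(g for g, c in counts.items() if c > majority)
```

Since different tools often report different identifier types, gene ID conversion would normally precede such a comparison.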

Tools that are based on up-to-date genome/transcriptome data or biological terms are preferable, e.g., microarray-based tools using a custom CDF are innately better than those using the default one. The mathematical rigour of the underlying statistics of a coexpression tool may also improve its performance. This might be assessed by the complexity of the correlation calculation method, as well as by the resulting depiction of coexpression. The latter can be evaluated by the ability of the tool to reproduce known biology: the output of each tool could be cross-checked against the existing literature by searching for validated gene partners in the coexpression lists or validated biological processes among the statistically significant enriched biological terms. Enrichment analysis can be performed either internally, by some coexpression tools, or by exporting the coexpressed gene list to external web tools such as WebGestalt, where either pre-set or user-defined reference gene lists may also be used.

References

Schneider, M.V.; Orchard, S. Omics Technologies, Data and Bioinformatics Principles. In Bioinformatics for Omics Data: Methods and Protocols; Mayer, B., Ed.; Humana Press: Totowa, NJ, USA, 2011; pp. 3–30.
Barabasi, A.L.; Oltvai, Z.N. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. 2004, 5, 101–113.
Usadel, B.; Obayashi, T.; Mutwil, M.; Giorgi, F.M.; Bassel, G.W.; Tanimoto, M.; Chow, A.; Steinhauser, D.; Persson, S.; Provart, N.J. Co-expression tools for plant biology: Opportunities for hypothesis generation and caveats. Plant Cell Environ. 2009, 32, 1633–1651.
Emamjomeh, A.; Saboori Robat, E.; Zahiri, J.; Solouki, M.; Khosravi, P. Gene co-expression network reconstruction: A review on computational methods for inferring functional information from plant-based expression data. Plant Biotechnol. Rep. 2017, 11, 71–86.
Pavlopoulos, G.A.; Secrier, M.; Moschopoulos, C.N.; Soldatos, T.G.; Kossida, S.; Aerts, J.; Schneider, R.; Bagos, P.G. Using graph theory to analyze biological networks. BioData Min. 2011, 4, 10.
Pellegrini, M.; Haynor, D.; Johnson, J.M. Protein interaction networks. Expert Rev. Proteom. 2004, 1, 239–249.
Emmert-Streib, F.; Dehmer, M.; Haibe-Kains, B. Gene regulatory networks and their applications: Understanding biological and medical problems in terms of networks. Front. Cell Dev. Biol. 2014, 2, 38.
Albert, R.; DasGupta, B.; Dondi, R.; Kachalo, S.; Sontag, E.; Zelikovsky, A.; Westbrooks, K. A novel method for signal transduction network inference from indirect experimental evidence. J. Comput. Biol. 2007, 14, 927–949.
Jeong, H.; Tombor, B.; Albert, R.; Oltvai, Z.N.; Barabasi, A.L. The large-scale organization of metabolic networks. Nature 2000, 407, 651–654.
Tieri, P.; Farina, L.; Petti, M.; Astolfi, L.; Paci, P.; Castiglione, F. Network Inference and Reconstruction in Bioinformatics. In Encyclopedia of Bioinformatics and Computational Biology; Ranganathan, S., Gribskov, M., Nakai, K., Schönbach, C., Eds.; Academic Press: Oxford, UK, 2019; pp. 805–813.
Fionda, V. Networks in Biology. In Encyclopedia of Bioinformatics and Computational Biology; Ranganathan, S., Gribskov, M., Nakai, K., Schönbach, C., Eds.; Academic Press: Oxford, UK, 2019; pp. 915–921.
Serin, E.A.R.; Nijveen, H.; Hilhorst, H.W.M.; Ligterink, W. Learning from Co-expression Networks: Possibilities and Challenges. Front. Plant Sci. 2016, 7, 444.
Michalopoulos, I.; Pavlopoulos, G.A.; Malatras, A.; Karelas, A.; Kostadima, M.A.; Schneider, R.; Kossida, S. Human gene correlation analysis (HGCA): A tool for the identification of transcriptionally co-expressed genes. BMC Res. Notes 2012, 5, 265.
Petereit, J.; Smith, S.; Harris, F.C., Jr.; Schlauch, K.A. Petal: Co-expression network modelling in R. BMC Syst. Biol. 2016, 10, 51.
He, F.; Maslov, S. Pan- and core- network analysis of co-expression genes in a model plant. Sci. Rep. 2016, 6, 38956.
Liseron-Monfils, C.; Ware, D. Revealing gene regulation and associations through biological networks. Curr. Plant Biol. 2015, 3–4, 30–39.
van Dam, S.; Vosa, U.; van der Graaf, A.; Franke, L.; de Magalhaes, J.P. Gene co-expression analysis for functional classification and gene-disease predictions. Brief. Bioinform. 2018, 19, 575–592.
Leal, L.G.; Lopez, C.; Lopez-Kleine, L. Construction and comparison of gene co-expression networks shows complex plant immune responses. PeerJ 2014, 2, e610.
Peng, J.; Wang, T.; Huc, J.; Wang, Y.; Chen, J. Constructing Networks of Organelle Functional Modules in Arabidopsis. Curr. Genom. 2016, 17, 427–438.
Schena, M.; Shalon, D.; Davis, R.W.; Brown, P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270, 467–470.
Wang, Z.; Gerstein, M.; Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009, 10, 57–63.
Barrett, T.; Wilhite, S.E.; Ledoux, P.; Evangelista, C.; Kim, I.F.; Tomashevsky, M.; Marshall, K.A.; Phillippy, K.H.; Sherman, P.M.; Holko, M.; et al. NCBI GEO: Archive for functional genomics data sets—Update. Nucleic Acids Res. 2013, 41, D991–D995.
Parkinson, H.; Kapushesky, M.; Shojatalab, M.; Abeygunawardena, N.; Coulson, R.; Farne, A.; Holloway, E.; Kolesnykov, N.; Lilja, P.; Lukk, M.; et al. ArrayExpress–A public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2007, 35, D747–D750.
Papatheodorou, I.; Moreno, P.; Manning, J.; Fuentes, A.M.; George, N.; Fexova, S.; Fonseca, N.A.; Fullgrabe, A.; Green, M.; Huang, N.; et al. Expression Atlas update: From tissues to single cells. Nucleic Acids Res. 2020, 48, D77–D83.
Kodama, Y.; Shumway, M.; Leinonen, R.; International Nucleotide Sequence Database Collaboration. The Sequence Read Archive: Explosive growth of sequencing data. Nucleic Acids Res. 2012, 40, D54–D56.
GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 2013, 45, 580–585.
Hutter, C.; Zenklusen, J.C. The Cancer Genome Atlas: Creating Lasting Value beyond Its Data. Cell 2018, 173, 283–285.
Amid, C.; Alako, B.T.F.; Balavenkataraman Kadhirvelu, V.; Burdett, T.; Burgin, J.; Fan, J.; Harrison, P.W.; Holt, S.; Hussein, A.; Ivanov, E.; et al. The European Nucleotide Archive in 2019. Nucleic Acids Res. 2020, 48, D70–D76.
Aoki, K.; Ogata, Y.; Shibata, D. Approaches for extracting practical information from gene co-expression networks in plant biology. Plant Cell Physiol. 2007, 48, 381–390.
Langfelder, P.; Horvath, S. WGCNA Package FAQ. Available online: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html (accessed on 5 June 2022).
Lockhart, D.J.; Dong, H.; Byrne, M.C.; Follettie, M.T.; Gallo, M.V.; Chee, M.S.; Mittmann, M.; Wang, C.; Kobayashi, M.; Horton, H.; et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 1996, 14, 1675–1680.
Wolber, P.K.; Collins, P.J.; Lucas, A.B.; De Witte, A.; Shannon, K.W. The Agilent in situ-synthesized microarray platform. Methods Enzymol. 2006, 410, 28–57.
Kuhn, K.; Baker, S.C.; Chudin, E.; Lieu, M.H.; Oeser, S.; Bennett, H.; Rigault, P.; Barker, D.; McDaniel, T.K.; Chee, M.S. A novel, high-performance random array platform for quantitative gene expression profiling. Genome Res. 2004, 14, 2347–2356.
Hubbell, E.; Liu, W.M.; Mei, R. Robust estimators for expression analysis. Bioinformatics 2002, 18, 1585–1592.
Irizarry, R.A.; Bolstad, B.M.; Collin, F.; Cope, L.M.; Hobbs, B.; Speed, T.P. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31, e15.
Wu, Z.; Irizarry, R.A.; Gentleman, R.; Martinez-Murillo, F.; Spencer, F. A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. J. Am. Stat. Assoc. 2004, 99, 909–917.
Hubbell, E. Affymetrix Technical Notes: Guide to Probe Logarithmic Intensity Error (PLIER) Estimation. Available online: http://tools.thermofisher.com/content/sfs/brochures/plier_technote.pdf (accessed on 5 June 2022).
Piccolo, S.R.; Sun, Y.; Campbell, J.D.; Lenburg, M.E.; Bild, A.H.; Johnson, W.E. A single-sample microarray normalization method to facilitate personalized-medicine workflows. Genomics 2012, 100, 337–344.
Applied Biosystems. Applied Biosystems 3730 and 3730xl DNA Analyzers. Available online: http://tools.thermofisher.com/content/sfs/brochures/cms_042636.pdf (accessed on 5 June 2022).
Jain, M.; Olsen, H.E.; Paten, B.; Akeson, M. The Oxford Nanopore MinION: Delivery of nanopore sequencing to the genomics community. Genome Biol. 2016, 17, 239.
Bentley, D.R.; Balasubramanian, S.; Swerdlow, H.P.; Smith, G.P.; Milton, J.; Brown, C.G.; Hall, K.P.; Evers, D.J.; Barnes, C.L.; Bignell, H.R.; et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456, 53–59.
Margulies, M.; Egholm, M.; Altman, W.E.; Attiya, S.; Bader, J.S.; Bemben, L.A.; Berka, J.; Braverman, M.S.; Chen, Y.J.; Chen, Z.; et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437, 376–380.
Schadt, E.E.; Turner, S.; Kasarskis, A. A window into third-generation sequencing. Hum. Mol. Genet. 2010, 19, R227–R240.
Branton, D.; Deamer, D.W.; Marziali, A.; Bayley, H.; Benner, S.A.; Butler, T.; Di Ventra, M.; Garaj, S.; Hibbs, A.; Huang, X.; et al. The potential and challenges of nanopore sequencing. Nat. Biotechnol. 2008, 26, 1146–1153.
Cock, P.J.; Fields, C.J.; Goto, N.; Heuer, M.L.; Rice, P.M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010, 38, 1767–1771.
Hong, M.; Tao, S.; Zhang, L.; Diao, L.T.; Huang, X.; Huang, S.; Xie, S.J.; Xiao, Z.D.; Zhang, H. RNA sequencing: New technologies and applications in cancer research. J. Hematol. Oncol. 2020, 13, 166.
Macmanes, M.D. On the optimal trimming of high-throughput mRNA sequence data. Front. Genet. 2014, 5, 13.
Andrews, S. FastQC: A Quality Control Tool for High Throughput Sequence Data. Available online: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (accessed on 5 June 2022).
Ewels, P.; Magnusson, M.; Lundin, S.; Kaller, M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016, 32, 3047–3048.
Fukasawa, Y.; Ermini, L.; Wang, H.; Carty, K.; Cheung, M.S. LongQC: A Quality Control Tool for Third Generation Sequencing Long Read Data. G3 Genes Genomes Genet. 2020, 10, 1193–1196.
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet. J. 2011, 17, 3.
Chen, S.; Zhou, Y.; Chen, Y.; Gu, J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018, 34, i884–i890.
Bolger, A.M.; Lohse, M.; Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 2014, 30, 2114–2120.
Kim, D.; Pertea, G.; Trapnell, C.; Pimentel, H.; Kelley, R.; Salzberg, S.L. TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14, R36.
Kim, D.; Paggi, J.M.; Park, C.; Bennett, C.; Salzberg, S.L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 2019, 37, 907–915.
Boratyn, G.M.; Thierry-Mieg, J.; Thierry-Mieg, D.; Busby, B.; Madden, T.L. Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinform. 2019, 20, 405.
Marić, J.; Sović, I.; Križanović, K.; Nagarajan, N.; Šikić, M. Graphmap2—Splice-aware RNA-seq mapper for long reads. bioRxiv 2019.
Lin, H.N.; Hsu, W.L. DART: A fast and accurate RNA-seq mapper with a partitioning strategy. Bioinformatics 2018, 34, 190–197.
Liu, B.; Liu, Y.; Li, J.; Guo, H.; Zang, T.; Wang, Y. deSALT: Fast and accurate long transcriptomic read alignment with de Bruijn graph-based index. Genome Biol. 2019, 20, 274.
Langmead, B.; Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 2012, 9, 357–359.
Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34, 3094–3100.
Dobin, A.; Davis, C.A.; Schlesinger, F.; Drenkow, J.; Zaleski, C.; Jha, S.; Batut, P.; Chaisson, M.; Gingeras, T.R. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29, 15–21.
Wu, T.D.; Reeder, J.; Lawrence, M.; Becker, G.; Brauer, M.J. GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality. Methods Mol. Biol. 2016, 1418, 283–334.
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv 2013, arXiv:1303.3997.
Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R.; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25, 2078–2079.
Stein, L. Generic Feature Format Version 3 (GFF3). Available online: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md (accessed on 5 June 2022).
Trapnell, C.; Williams, B.A.; Pertea, G.; Mortazavi, A.; Kwan, G.; van Baren, M.J.; Salzberg, S.L.; Wold, B.J.; Pachter, L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010, 28, 511–515.
Liao, Y.; Smyth, G.K.; Shi, W. featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014, 30, 923–930.
Anders, S.; Pyl, P.T.; Huber, W. HTSeq—A Python framework to work with high-throughput sequencing data. Bioinformatics 2015, 31, 166–169.
Dillies, M.A.; Rau, A.; Aubert, J.; Hennequet-Antier, C.; Jeanmougin, M.; Servant, N.; Keime, C.; Marot, G.; Castel, D.; Estelle, J.; et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 2013, 14, 671–683.
Bolstad, B.M.; Irizarry, R.A.; Astrand, M.; Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19, 185–193.
Bullard, J.H.; Purdom, E.; Hansen, K.D.; Dudoit, S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinform. 2010, 11, 94.
Wagner, G.P.; Kin, K.; Lynch, V.J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012, 131, 281–285.
Mortazavi, A.; Williams, B.A.; McCue, K.; Schaeffer, L.; Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 2008, 5, 621–628.
Robinson, M.D.; Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010, 11, R25.
Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
Hicks, S.C.; Okrah, K.; Paulson, J.N.; Quackenbush, J.; Irizarry, R.A.; Bravo, H.C. Smooth quantile normalization. Biostatistics 2018, 19, 185–198.
Bray, N.L.; Pimentel, H.; Melsted, P.; Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 2016, 34, 525–527.
Patro, R.; Duggal, G.; Love, M.I.; Irizarry, R.A.; Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 2017, 14, 417–419.
Vandenbon, A. Evaluation of critical data processing steps for reliable prediction of gene co-expression from large collections of RNA-seq data. PLoS ONE 2022, 17, e0263344.
Tang, F.; Barbacioru, C.; Wang, Y.; Nordman, E.; Lee, C.; Xu, N.; Wang, X.; Bodeau, J.; Tuch, B.B.; Siddiqui, A.; et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 2009, 6, 377–382.
Hwang, B.; Lee, J.H.; Bang, D. Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 2018, 50, 1–14.
Chen, G.; Ning, B.; Shi, T. Single-Cell RNA-Seq Technologies and Related Computational Data Analysis. Front. Genet. 2019, 10, 317.
Li, W.V.; Li, J.J. An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 2018, 9, 997.
Huang, M.; Wang, J.; Torre, E.; Dueck, H.; Shaffer, S.; Bonasio, R.; Murray, J.I.; Raj, A.; Li, M.; Zhang, N.R. SAVER: Gene expression recovery for single-cell RNA sequencing. Nat. Methods 2018, 15, 539–542.
Van Dijk, D.; Sharma, R.; Nainys, J.; Yim, K.; Kathail, P.; Carr, A.J.; Burdziak, C.; Moon, K.R.; Chaffer, C.L.; Pattabiraman, D.; et al. Recovering Gene Interactions from Single-Cell Data Using Data Diffusion. Cell 2018, 174, 716–729.e727.
Dai, M.; Wang, P.; Boyd, A.D.; Kostov, G.; Athey, B.; Jones, E.G.; Bunney, W.E.; Myers, R.M.; Speed, T.P.; Akil, H.; et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005, 33, e175.
Chen, L.; Sun, F.; Yang, X.; Jin, Y.; Shi, M.; Wang, L.; Shi, Y.; Zhan, C.; Wang, Q. Correlation between RNA-Seq and microarrays results using TCGA data. Gene 2017, 628, 200–204.
Malatras, A.; Michalopoulos, I.; Duguez, S.; Butler-Browne, G.; Spuler, S.; Duddy, W.J. MyoMiner: Explore gene co-expression in normal and pathological muscle. BMC Med. Genom. 2020, 13, 67.
Obayashi, T.; Aoki, Y.; Tadaka, S.; Kagaya, Y.; Kinoshita, K. ATTED-II in 2018: A Plant Coexpression Database Based on Investigation of the Statistical Property of the Mutual Rank Index. Plant Cell Physiol. 2018, 59, e3.
Leek, J.T.; Scharpf, R.B.; Bravo, H.C.; Simcha, D.; Langmead, B.; Johnson, W.E.; Geman, D.; Baggerly, K.; Irizarry, R.A. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 2010, 11, 733–739.
Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. 1901, 2, 559–572.
Sokal, R.R.; Michener, C.D. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 1958, 38, 1409–1438.
Johnson, W.E.; Li, C.; Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007, 8, 118–127.
Leek, J.T.; Storey, J.D. A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. USA 2008, 105, 18718–18723.
Buettner, F.; Pratanwanich, N.; McCarthy, D.J.; Marioni, J.C.; Stegle, O. f-scLVM: Scalable and versatile factor analysis for single-cell RNA-seq. Genome Biol. 2017, 18, 212.
Haghverdi, L.; Lun, A.T.L.; Morgan, M.D.; Marioni, J.C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 2018, 36, 421–427.
Buttner, M.; Miao, Z.; Wolf, F.A.; Teichmann, S.A.; Theis, F.J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 2019, 16, 43–49.
Minkowski, H. Geometrie Der Zahlen; Teubner: Leipzig, Germany, 1910.
Pearson, K. VII. Note on regression and inheritance in the case of two parents. Proc. R. Soc. Lond. 1895, 58, 240–242.
Amaratunga, D.; Cabrera, J. Analysis of Data From Viral DNA Microchips. J. Am. Stat. Assoc. 2001, 96, 1161–1170.
Jaskowiak, P.A.; Campello, R.J.; Costa, I.G. On the selection of appropriate distances for gene expression data clustering. BMC Bioinform. 2014, 15, S2.
Spearman, C. ‘General intelligence’, objectively determined and measured. Am. J. Psychol. 1904, 15, 201–292.
Myers, J.L.; Well, A.D. Research Design and Statistical Analysis, 2nd ed.; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2003.
Kendall, M.G. A new measure of rank correlation. Biometrika 1938, 30, 81–93.
Kumari, S.; Nie, J.; Chen, H.S.; Ma, H.; Stewart, R.; Li, X.; Lu, M.Z.; Taylor, W.M.; Wei, H. Evaluation of gene association methods for coexpression network construction and biological knowledge discovery. PLoS ONE 2012, 7, e50411.
Obayashi, T.; Hayashi, S.; Saeki, M.; Ohta, H.; Kinoshita, K. ATTED-II provides coexpressed gene networks for Arabidopsis. Nucleic Acids Res. 2009, 37, D987–D991.
Obayashi, T.; Hibara, H.; Kagaya, Y.; Aoki, Y.; Kinoshita, K. ATTED-II v11: A Plant Gene Coexpression Database Using a Sample Balancing Technique by Subagging of Principal Components. Plant Cell Physiol. 2022, 63, 869–881.
Bansal, M.; Belcastro, V.; Ambesi-Impiombato, A.; di Bernardo, D. How to infer gene networks from expression profiles. Mol. Syst. Biol. 2007, 3, 78.
Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423+623–656.
Steuer, R.; Kurths, J.; Daub, C.O.; Weise, J.; Selbig, J. The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics 2002, 18, S231–S240.
Obayashi, T.; Kinoshita, K. Coexpression landscape in ATTED-II: Usage of gene list and gene network for various types of pathways. J. Plant Res. 2010, 123, 311–319.
Langfelder, P.; Horvath, S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinform. 2008, 9, 559.
Jen, C.H.; Manfield, I.W.; Michalopoulos, I.; Pinney, J.W.; Willats, W.G.; Gilmartin, P.M.; Westhead, D.R. The Arabidopsis co-expression tool (ACT): A WWW-based tool and database for microarray-based gene expression analysis. Plant J. 2006, 46, 336–348.
Manfield, I.W.; Jen, C.H.; Pinney, J.W.; Michalopoulos, I.; Bradford, J.R.; Gilmartin, P.M.; Westhead, D.R. Arabidopsis Co-expression Tool (ACT): Web server tools for microarray-based gene expression analysis. Nucleic Acids Res. 2006, 34, W504–W509.
Chen, P.; Wang, F.; Feng, J.; Zhou, R.; Chang, Y.; Liu, J.; Zhao, Q. Co-expression network analysis identified six hub genes in association with metastasis risk and prognosis in hepatocellular carcinoma. Oncotarget 2017, 8, 48948–48958.
Yuan, L.; Chen, L.; Qian, K.; Qian, G.; Wu, C.L.; Wang, X.; Xiao, Y. Co-expression network analysis identified six hub genes in association with progression and prognosis in human clear cell renal cell carcinoma (ccRCC). Genom. Data 2017, 14, 132–140.
D’Haeseleer, P. How does gene expression clustering work? Nat. Biotechnol. 2005, 23, 1499.
Olsen, G. The “Newick’s 8:45” Tree Format Standard. Available online: https://evolution.genetics.washington.edu/phylip/newick_doc.html (accessed on 5 June 2022).
Hartigan, J.A. Direct Clustering of a Data Matrix. J. Am. Stat. Assoc. 1972, 67, 123–129.
Padilha, V.A.; Campello, R.J.G.B. A systematic comparative evaluation of biclustering techniques. BMC Bioinform. 2017, 18, 55.
Eren, K.; Deveci, M.; Kucuktunc, O.; Catalyurek, U.V. A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform. 2012, 14, 279–292.
Hartigan, J. Clustering Algorithms; John Wiley & Sons: New York, NY, USA, 1975.
Heyer, L.J.; Kruglyak, S.; Yooseph, S. Exploring expression data: Identification and analysis of coexpressed genes. Genome Res. 1999, 9, 1106–1115.
Tamayo, P.; Slonim, D.; Mesirov, J.; Zhu, Q.; Kitareewan, S.; Dmitrovsky, E.; Lander, E.S.; Golub, T.R. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 1999, 96, 2907–2912.
Zogopoulos, V.L.; Saxami, G.; Malatras, A.; Angelopoulou, A.; Jen, C.H.; Duddy, W.J.; Daras, G.; Hatzopoulos, P.; Westhead, D.R.; Michalopoulos, I. Arabidopsis Coexpression Tool: A tool for gene coexpression analysis in Arabidopsis thaliana. iScience 2021, 24, 102848.
Gene Ontology Consortium. The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 2021, 49, D325–D334.
Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017, 45, D353–D361.
Mistry, J.; Chuguransky, S.; Williams, L.; Qureshi, M.; Salazar, G.A.; Sonnhammer, E.L.L.; Tosatto, S.C.E.; Paladin, L.; Raj, S.; Richardson, L.J.; et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021, 49, D412–D419.
Pinero, J.; Ramirez-Anguita, J.M.; Sauch-Pitarch, J.; Ronzano, F.; Centeno, E.; Sanz, F.; Furlong, L.I. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2020, 48, D845–D855.
Castro-Mondragon, J.A.; Riudavets-Puig, R.; Rauluseviciute, I.; Berhanu Lemma, R.; Turchi, L.; Blanc-Mathieu, R.; Lucas, J.; Boddie, P.; Khan, A.; Manosalva Perez, N.; et al. JASPAR 2022: The 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2022, 50, D165–D173.
Encode Project Consortium. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 2020, 583, 699–710.
Raudvere, U.; Kolberg, L.; Kuzmin, I.; Arak, T.; Adler, P.; Peterson, H.; Vilo, J. g:Profiler: A web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 2019, 47, W191–W198.
Kuleshov, M.V.; Jones, M.R.; Rouillard, A.D.; Fernandez, N.F.; Duan, Q.; Wang, Z.; Koplev, S.; Jenkins, S.L.; Jagodnik, K.M.; Lachmann, A.; et al. Enrichr: A comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016, 44, W90–W97.
Liao, Y.; Wang, J.; Jaehnig, E.J.; Shi, Z.; Zhang, B. WebGestalt 2019: Gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 2019, 47, W199–W205.
Thanati, F.; Karatzas, E.; Baltoumas, F.A.; Stravopodis, D.J.; Eliopoulos, A.G.; Pavlopoulos, G.A. FLAME: A Web Tool for Functional and Literature Enrichment Analysis of Multiple Gene Lists. Biology 2021, 10, 665.
Huang, D.W.; Sherman, B.T.; Lempicki, R.A. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37, 1–13.
Pomaznoy, M.; Ha, B.; Peters, B. GOnet: A tool for interactive Gene Ontology analysis. BMC Bioinform. 2018, 19, 470.
Okamura, Y.; Aoki, Y.; Obayashi, T.; Tadaka, S.; Ito, S.; Narise, T.; Kinoshita, K. COXPRESdb in 2015: Coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res. 2015, 43, D82–D86.
Okamura, Y.; Kinoshita, K. Matataki: An ultrafast mRNA quantification method for large-scale reanalysis of RNA-Seq data. BMC Bioinform. 2018, 19, 266.
Saitou, N.; Nei, M. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987, 4, 406–425.
Tseng, K.C.; Li, G.Z.; Hung, Y.C.; Chow, C.N.; Wu, N.Y.; Chien, Y.Y.; Zheng, H.Q.; Lee, T.Y.; Kuo, P.L.; Chang, S.B.; et al. EXPath 2.0: An Updated Database for Integrating High-Throughput Gene Expression Data with Biological Pathways. Plant Cell Physiol. 2020, 61, 1818–1827.
Yim, W.C.; Yu, Y.; Song, K.; Jang, C.S.; Lee, B.M. PLANEX: The plant co-expression database. BMC Plant Biol. 2013, 13, 83.
Altschul, S.F.; Gish, W.; Miller, W.; Myers, E.W.; Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410.
De Bodt, S.; Hollunder, J.; Nelissen, H.; Meulemeester, N.; Inze, D. CORNET 2.0: Integrating plant coexpression, protein-protein interactions, regulatory interactions, gene associations and functional annotations. New Phytol. 2012, 195, 707–720.
Ostlund, G.; Sonnhammer, E.L. Avoiding pitfalls in gene (co)expression meta-analysis. Genomics 2014, 103, 21–30.
Michiels, S.; Koscielny, S.; Hill, C. Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet 2005, 365, 488–492.
Huang, J.; Vendramin, S.; Shi, L.; McGinnis, K.M. Construction and Optimization of a Large Gene Coexpression Network in Maize Using RNA-Seq Data. Plant Physiol. 2017, 175, 568–583.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Information

Subjects: Biochemistry & Molecular Biology

Contributors: Vasileios L. Zogopoulos, Georgia Saxami, Apostolos Malatras, Konstantinos Papadopoulos, Ioanna Tsotra, Vassiliki A. Iconomidou, Ioannis Michalopoulos

Update Date: 22 Aug 2022
