2.2. RNA-Seq Data Analysis
Since its introduction, RNA-Seq has been steadily increasing as the method of choice to measure gene expressions accurately. The RNA-Seq technology that studies the aggregated mRNA of cell populations or tissue parts is also referred as bulk RNA-Seq. RNA-Seq is based on next-generation sequencing (NGS) where the length of the reads does not exceed 700 bps
[39] and third-generation sequencing where the read length can be more than 150,000 bps
[40]. Next-generation sequencing technologies include Illumina
[41], 454 Life Science
[42], etc, while third-generation sequencers include PacBio
[43], Nanopore
[44], etc. The raw data produced by RNA-Seq experiments are FASTQ
[45] files, containing the sequence reads, as well as a quality value for each base. The pre-processing of RNA-Seq data
[46] consists of:
-
quality control and trimming of sequence reads
-
mapping reads to a reference genome or transcriptome
-
producing gene read counts
-
normalisation
The first step of the pre-processing pipeline includes the quality assessment of the sequence reads and subsequent trimming of the adapter sequences and low-quality reads
[47]. Software for quality control includes FastQC
[48] which produces per-sample reports and MultiQC
[49] which aggregates these reports, producing a single summary report and LongQC
[50] which is specific for third-generation sequencing data. Software for trimming includes Cutadapt
[51], fastp
[52] and Trimmomatic
[53]. Complete removal, also known as hard-clipping, is usually performed exclusively on the adapter sequences to save up storage space and facilitate downstream analysis. Soft-clipping refers to tagging low-quality reads or adapter sequences, so that they can be ignored in later steps of the analysis. Soft-clipping is preferrable to hard-clipping, as important information regarding the reads is not completely lost. Next, the trimmed reads are aligned to FASTA-formatted sequences of their corresponding reference genome. This step is performed using specific alignment software depending on the sequence read length: Aligners such as TopHat2
[54] and HISAT2
[55] are used for short reads, Magic-Blast
[56], Graphmap2
[57], DART
[58] LAMSA
[58] and deSALT
[59] for long reads and Bowtie 2
[60], minimap2
[61], STAR
[62] GMAP
[63] and BWA-MEM
[64] for both types of reads. Some aligners can also perform soft-clipping of bases from the left or right end of the read sequence
[62] and unmapped reads will always be soft-clipped during the alignment step. This process produces a BAM-formatted
[65] file which contains the mapping of the reads to the reference genome. This output is then combined with a General Feature Format 3 (GFF3) file
[66] which contains the genomic feature coordinates, to count the gene reads, using programs such as Cufflinks
[67], featureCounts
[68] and HTSeq
[69]. Aligners may also use GFF3 annotations upfront. The exon joints provided by GFF files, accelerate the mapping process and increase the quality of the spliced alignments. Finally, to calculate the gene expression values, the resulting gene read count data are normalised. Algorithms such as Total Count
[70], Quantile
[71] and Upper Quartile
[72], are purely based on arithmetic calculations concerning the read counts and their distributions in the samples, while TPM
[73] and RPKM
[74] take transcript length into account. TMM
[75] and DESeq
[76] use a mathematical and biological combination and qsmooth
[77] normalises read counts based on the assumption that the distribution of samples should differ on a global scale, but not in each biological group/tissue. After normalisation, log
2 transformation of expression data is applied (
Figure 1). Other software, such as Kallisto
[78] and Salmon
[79], use a different approach, pseudoaligning reads to a reference transcriptome, producing gene expression data two orders of magnitude faster than other pipelines. The selection of the normalisation algorithm impacts the quality of the resulting GCNs
[80], thus, different normalisation procedures might be chosen for condition-independent or condition-dependent analyses.
2.3. Single-Cell RNA-Seq in Coexpression Analysis
Single-cell RNA-Seq (scRNA-Seq) is a recently emerging RNA-Seq-based technology which studies the transcriptome of single cells
[81]. The pre-processing pipeline of scRNA-Seq data is similar to that of bulk RNA-Seq data. However there are certain additional steps that need to be performed, to account for the high heterogeneity of single-cell data
[82]. A common phenomenon in scRNA-Seq data, is the appearance of a large amount of zero counts of genes that are truly expressed in other cells of the same type, known as dropout events
[83]. In order to fill in the missing values, imputation methods, such as scImpute
[84], SAVER
[85] and MAGIC
[86], have been developed. The produced expression matrix includes the expression values of genes per sample which in this case refers to a single cell.
2.4. Microarrays vs. RNA-Seq in Coexpression Analysis
The end result of both microarray and RNA-Seq data pre-processing is a file containing gene expression values per sample. Affymetrix-based chips use an outdated default CDF, so several probe sets either do not correspond to any known gene or correspond to more than one genes, and some genes are recognised by no probe set or by more than one probe sets. Thus, a custom CDF that better reflects current genomic and transcriptomic knowledge is recommended. One such example is the frequently updated BrainArray CDF
[87] which ensures that each probe set corresponds to a single gene and vice versa.
RNA-Seq is a rapidly evolving technology with a larger, ever-increasing amount of publicly available data. As opposed to microarrays, RNA-Seq can accurately measure all known genes of an organism and has higher sensitivity. However, the expression estimations of RNA-Seq and microarrays are comparable, especially in genes with average expression
[88]. Thus, the resulting gene coexpression landscapes which derive from RNA-Seq and microarrays are close
[89] and biological pathway enrichments are similar
[90]. The drawbacks of RNA-Seq include the significantly longer execution time of data pre-processing and higher computational resource requirements, as well as the use of pipelines of not yet fully optimised algorithms. On the contrary, all steps in microarray pre-processing are performed by a single, quick, light and optimised algorithm (
Figure 1).
Irrespective of the transcriptomic technology, pre-processing of existing raw transcriptomic data from public repositories is imperative, as it ensures data uniformity which is essential for subsequent coexpression analysis. Reanalysis of the original primary data with modern normalisation algorithms and genomic annotations, can highly improve the estimation of gene expressions and thus, the coexpression landscape. This is crucial, especially in the case of microarray data analysis, as it was reported that up to 50% of the genes that were identified as differentially expressed in Affymetrix-based studies where default CDF was used, might be artifacts
[87].
2.5. Batch Correction
There are many conditions which may vary during the course of an experiment (such as reagents, equipment, personnel, etc.) and may introduce batch effects, which is a common source of variation in both microarray and RNA-Seq data
[91]. In the case of condition dependent (tissue-specific) coexpression analysis where data from multiple studies are combined, another layer of batch effects is introduced: experiments from different laboratories. Thus, batch effect identification and subsequent correction is an important step after expression data pre-processing. Usually, the studies that each sample belongs to, are used to define the batches, although the date and time of each experiment may be used as batch surrogates. Existence of batch effects is confirmed through visual inspection via principal component analysis (PCA)
[92] and hierarchical clustering
[93]. Batch effects are present if samples from the same study which derive from different biological conditions are clustered together, whereas the clusters should have been made up of the samples of the same conditions, regardless of study source. Batch-corrected microarray-based coexpression analysis using ComBat
[94], produces combined correlations which are more consistent with each single study’s correlations
[89], while a larger number of high quality GCNs are produced when ComBat batch correction is applied to normalised RNA-Seq data
[80]. While ComBat requires manual denoting of the sources of the batch effects, SVA
[95] can automatically estimate them, and subsequently applies ComBat correction. SVA is useful in cases where there are indications of technical variations (e.g., observed by PCA) but their source is not evident. scRNA-Seq samples are much more prone to technical variations, due to the low amount of genetic material isolated from each cell
[82]. In this case, batch effect correction is perfomed by scRNA-Seq specific methods, such as f-scLVM
[96], MNN
[97] and kBET
[98].
3. Selection of Coexpression Measure and Construction of Similarity Matrices
After the acquisition of gene expression data, the correlation of expression between each gene pair needs to be calculated. This is performed through a vast variety of approaches:
Distance-based measures calculate the dissimilarity between the expression of a pair of genes. Traditional distance measures are based on Minkowski distances
[99]:
where m is a positive integer and xi and yi are the expression values of x and y genes in the ith sample. Euclidean and Manhattan distances are cases of Minkowski distance, depending on the value of m. In Manhattan distance, m = 1:
In one of the most used distance measures, Euclidean distance, m = 2:
Correlation metrics describe the tendency of the expression levels of a pair of genes, to increase or decrease simultaneously across different samples
[3][4]. They produce coefficients ranging from −1 (perfect anti-correlation) to +1 (perfect correlation), with values near 0 indicating no correlation.
The Pearson correlation coefficient (PCC or
r)
[100] is a measure that depicts the linear correlation between two genes,
x and
y, and is calculated as follows:
Correlation metrics describe the tendency of the expression levels of a pair of genes, to increase or decrease simultaneously across different samples
[3][4]. They produce coefficients ranging from −1 (perfect anti-correlation) to +1 (perfect correlation), with values near 0 indicating no correlation.
The Pearson correlation coefficient (PCC or
r)
[100] is a measure that depicts the linear correlation between two genes,
x and
y, and is calculated as follows:
where
n is the number of samples and
xi and
yi are the expression values of
x and
y genes in the
ith sample. PCC is useful for detecting correlation between genes that may have different average expression levels, however in some cases it is sensitive to outliers
[3][12] resulting in false-positive results when the number of samples is small and pre-processing is based on quantile normalisation
[101].
Uncentred correlation (Cosine similarity)
[102] depicts the similarity between the expression of two gene pairs and, in contrast to centred PCC, it does not take into account the mean expression of each gene. It is given by:
Spearman’s rank correlation coefficient (
ρ)
[103] is calculated as the PCC of the rankings of the expression values. In cases where there are no ranking ties,
ρ can be calculated as follows
[104]:
where Dj is the difference between the ranks of the corresponding values of genes x and y.
As a parametric measure, PCC is used if gene expression values follow normal distributions across samples, otherwise a nonparametric method, such as Spearman’s rank correlation coefficient, should be used. The selection of the algorithm can be based on a normality test. As Spearman’s correlation coefficient uses expression ranks instead of expression values, ρ is less sensitive to extreme data values.
Kendall’s rank correlation coefficient (
τ)
[105] is a measure of nonlinear dependence between two random variables. It is suitable for identifying key genes that increase or decline in monotonic fashions in expression data collected during a biological process or developmental stage
[106]. For any pair of observations
{(xi,xj), (yi,yj)}
of expressions of genes
x and
y in samples
i and
j, where
i <
j, if (
xi > xj AND
yi > yj) OR (
xi < xj AND
yi < yj), the pair is concordant, if (
xi > xj AND
yi < yj) OR (
x i<
xj AND
yi > yj) the pair is discordant, or if
xi = xj OR
yi = yj, the pair is neither concordant nor discordant. Kendall’s correlation coefficient is given by
[106]:
where n is the number of samples, nc is the number of concordant observation pairs, nd the number of discordant pairs, tk is the number of observations tied at k rank of x and ul is the number of observations tied at l rank of y. In cases where there are no tied observations, the following formula is used:
Since Kendall’s rank correlation coefficient is used to identify monotonic relationships, it is used as an alternative to Spearman’s.
The aforementioned correlation coefficient values are used to compute the Mutual Rank (MR)
[107] score as follows:
where
Rxy is the rank of the correlation of genes
x and
y in the descending list of all gene correlations of
x. Since MR is a distance measure, with smaller values meaning higher correlation, a Logit Score (LS) transformation
[108] is applied:
where N is the total number of genes studied. Higher values of LS indicate stronger correlations.
Finally, Mutual Information (MI) is a method that detects the amount of information obtained about the expression of one gene by observing the expression of another gene
[109]. MI is based on Shannon’s theory of communication
[110] and is calculated by subtracting the joint entropy of two genes
X and
Y from the sum of their entropies
[111]:
4. Selection of Significance Thresholds for Network Construction
Once a correlation measure has been chosen, a correlation matrix which contains all pairwise gene correlation coefficients cor(x,y) for any x and y genes, is constructed. The correlation matrix is a square matrix with M × M dimensions, where M is the number of studied genes. The diagonal values of the matrix are 1, as they correspond to the correlation of any gene to itself and the matrix is symmetric to the main diagonal, thus it can also be portrayed as an upper or lower triangular matrix, displaying each gene pair correlation once.
There are several ways to portray the correlation landscape of a large number of genes (
Figure 2). The simplest and commonest way to study gene coexpression, is by producing a list of most coexpressed genes to a “driver gene” i.e., the gene of interest. In this coexpressed gene list
[112], the correlations of the driver gene with all other genes are ranked according to their correlation coefficient, either in descending order to highlight the top positively correlated genes, or in ascending order to highlight the top negatively correlated genes. In effect, a coexpression list contains the ordered values of the correlation matrix row (or column) of the driver gene, thus it demonstrates singular gene coexpression relationships, without accounting for any interconnections among the coexpressed genes of the list.
Figure 2. Flowchart depicting the steps for performing gene coexpression analysis using gene expression data. Gene pairwise correlations are calculated and regardless of the chosen correlation measure, correlation values need to be transformed to similarity values and then to adjacency values. Gene coexpression can be depicted as lists, dendrograms or networks. Eventually, the results of the coexpression analysis need to be evaluated through enrichment analysis.
To overcome the aforementioned limitation, a more sophisticated way to study gene coexpression is the construction of a GCN, based on an M × M similarity matrix which scales all correlation values between 0 and 1. If the absolute correlation values are used for the construction of the matrix (sxy=|cor(x,y)|, where sxy is the similarity between x and y genes), then the similarity matrix is considered “unsigned”. In unsigned similarity matrices, positively and negatively correlated gene pairs cannot be distinguished.
5. Identification of Modules Using Clustering Techniques
Modules in a GCN can be defined as a group of genes that are densely linked
[113][114][115]. Highly connected genes within a network are called hub genes. These genes have been shown to be functionally significant
[116][117]. There are two types of hub genes named intra-modular and inter-modular hubs that are central to specific modules in the network or central to the entire network, respectively
[17].
Clustering is a method to group and visualise coexpressed genes, using a distance matrix as input. Genes that have similar expression patterns across multiple samples are grouped to produce sets of coexpressed genes
[17][109]. The most common clustering method is hierarchical clustering whose most popular implementation in gene coexpression is the unweighted pair group method with arithmetic mean (UPGMA)
[93]. Hierarchical clustering starts by connecting genes that are closest to each other and continues to connect resulting clusters based on their pairwise distances, eventually forming a tree (in this case, a gene coexpression tree). The leaves of the tree represent the genes and the lengths of the branches reflect the distance between genes, thus tree clades represent coexpression modules
[17][109][118]. The tree output file is usually in Newick format
[119].
Biclustering generates clusters of rows and columns simultaneously
[120]. In the case of gene expression, rows are genes and columns are samples. Biclustering is usually depicted in the form of a coexpression heatmap. Based on their expression level, genes are mapped into clusters with the main objective to find homogeneous submatrices called biclusters which may overlap, or discover local expression patterns according to certain experimental conditions
[121]. Due to this process, biological information about these clusters can be extracted. This information refers not only to the correlated genes but also to the identification of genes that do not act the same way in all conditions
[122].
A popular non-hierarchical clustering method is
k-means, a partitioning method that subdivides the genes into a predefined
k number of clusters
[118][123]. The
k-means method initially sets
k points that function as cluster centre points (centroids). Each gene is then assigned to the cluster with the closest centroid. New positions for the cluster centroids are set as the average of the genes of the cluster, and gene assignment begins anew. The previous two steps continue until no more genes change cluster
[118][124]. However, it is difficult to determine the optimal number of
k points and multiple runs of the algorithm may result in different components for each cluster.
The self-organizing map (SOM) method is closely related to
k-means, also starting with a predetermined number of cluster centroids. In the case of SOMs though, the centroids are linked in a prespecified geometrical configuration
[124]. Each iteration involves randomly selecting a gene and moving the closest centroid in the direction of this gene, as well as its neighbouring centroids on the grid
[118]. In this fashion, neighbouring centroids in the initial geometry tend to be mapped to nearby centroids in
k-dimensional space
[125]. Clusters that are closest to each other in the initial arrangement, tend to be more similar to each other than those that are further apart
[124]. The end result is a grid of clusters, in which neighbouring clusters show related expression patterns
[118].
Gene coexpression trees produced through clustering cannot portray anti-coexpressed genes and are limited to classifying a gene into a single functional cluster, although genes may possess multiple functions and participate in different metabolic pathways
[126].
6. Gene List Functional Enrichment Analysis
The purpose of a gene coexpression analysis is to discover functional gene partners to a gene of interest. Biological functions can be attributed to genes of unknown role, based on the verified functions of their coexpressed gene partners
[12], an approach known as “guilt by association”. By identifying the most coexpressed genes to a gene of interest or the subnetwork or subtree that the gene of interest belongs to, from a GCN or a gene coexpression tree, respectively, lists of highly coexpressed genes are created. The predominant biological functions, metabolic pathways, regulating transcription factors, disease associations, etc, for such a gene list can be determined through functional enrichment analysis.
Biological term enrichment categories include: gene ontologies
[127], biological and metabolic pathways
[128], protein structures
[129], gene-disease associations
[130], regulatory motifs
[131], experimentally verified transcription factor binding sites
[132], etc. Public online tools performing enrichment analysis of coexpressed gene lists that result from coexpression analyses include g:Profiler
[133], Enrichr
[134], WebGestalt
[135], FLAME
[136], DAVID
[137] and GOnet
[138]. More specifically, g:Profiler offers enrichment analyses for more than 700 organisms. FLAME can perform many visualisations on the input gene list but its enrichment analysis is based on g:Profiler calculations. Enrichr offers an immense list of available biological term compilations, but is available only for six model species. Compared to the other tools, DAVID and WebGestalt can be used with or without a reference gene list, with WebGestalt allowing for detailed parameter customisation before analysis. Most of the tools also offer integrated functions for gene ID conversions. Finally, GOnet can perform gene ontology enrichment analysis only for human and mouse, but is unique in visualising the input genes and their corresponding enriched gene ontologies as well as the ontology hierarchy and relationships between ontologies as a graph.
7. Coexpression Tools
7.1. Global Coexpression Web Tools
COXPRESdb [139] provides gene coexpression relationships, for nine animal and two fungal species:
Homo sapiens,
Mus musculus,
Rattus norvegicus,
Gallus gallus,
Macaca mulatta,
Canis lupus familiaris,
Danio rerio,
Drosophila melanogaster,
Caenorhabditis elegans,
Saccharomyces cerevisiae and
Schizosaccharomyces pombe.
ATTED-II [108] is the sister database to COXPRESdb, providing coexpression data for nine plant species:
Arabidopsis thaliana,
Brassica rapa,
Glycine max,
Medicago truncatula,
Oryza sativa,
Populus trichocarpa,
Solanum lycopersicum,
Vitis vinifera and
Zea mays. COXPRESdb and ATTED-II contain both microarray and RNA-Seq data and are constantly evolving with new features and increasing numbers of samples. The databases use the Logit Score transformed mutual ranks as a gene coexpression measure and RNA-Seq data are processed with their own Matataki
[140] quantification software, an algorithm optimised for execution speed. The coexpression results are portrayed as coexpressed gene lists, sorted in descending LS order of coexpressed genes with the gene of interest, based on representative gene expression data combining both RNA-Seq and microarrays. Adjacent lists display results from all other available transcriptomic subsets, such as microarray samples from specific conditions, etc. Furthermore, to increase the robustness of the analysis, coexpression results of orthologous genes of closely related species are also displayed. Finally, the top coexpressed partners to a gene of interest are portrayed as coexpression networks in the gene’s information page (
Figure 3).
Figure 3. Coexpression results of ATTED-II and COXPRESdb: (a) GCN of the top coexpressed partners to CTL2, found in the gene’s information page; (b) GCN of the top coexpressed gene partners to NRP1, found in the gene’s information page. Coloured circles refer to different KEGG pathways.
Arabidopsis Coexpression Tool (ACT) [114][115][126] studies gene coexpression in 21,273
Arabidopsis thaliana genes using high-quality healthy microarray samples. The latest version of ACT is based on 3500 Affymetrix Arabidopsis ATH1 Genome Array GeneChip samples from ArrayExpress, GEO and NASCArrays. Expression data were produced using the SCAN algorithm along with Brainarray CDF. Genes were clustered using UPGMA hierarchical clustering to create a gene coexpression tree. Using a single gene as input, a subclade containing the driver gene and its coexpressed genes is produced (
Figure 4a). The subtree size can be increased or decreased. Multiple biological term enrichment analyses are offered and the coexpression subtree and its corresponding gene list can be exported to various external tools for further downstream analysis. ACT’s sister web tool for
Homo sapiens is
Human Gene Correlation Analysis (HGCA) [13]. HGCA1.0 is based on 1959 Affymetrix Human Genome U133 Plus 2.0 samples of various cells and tissues. Gene expression data were produced using the MAS5.0 algorithm with default CDF. Pairwise PCCs were measured for all probe sets and were grouped using neighbour joining
[141]. Similar to ACT, users select a driver probe set which corresponds to the gene of interest. Users can choose between two outputs: a coexpressed gene list or a gene coexpression tree. Over-representation analysis for multiple biological categories is also available. HGCA1.5 is based on the same samples as HGCA1.0. Nevertheless, primary data are processed in a manner identical to that of ACT. HGCA2.0 is a major upgrade as expression data from 55,431 genes were produced from GTEx RNA-Seq gene count data of 3500 samples, using qsmooth normalisation. The downstream data processing is similar to that of HGCA1.5. HGCA1.5 and HGCA2.0 output gene coexpression trees (
Figure 4b).
Figure 4. Coexpression results of ACT and HGCA2.0: (a) Default coexpression subtree in ACT using CTL2 as driver gene. The subtree contains nine genes (including the driver gene) and possesses five ancestral nodes; (b) Default coexpression subtree in HGCA2.0 using NRP1 as driver gene. The subtree contains 34 genes (including the driver gene) and possesses five ancestral nodes.
EXPath 2.0 [142] allows the user to perform various transcriptomic-based analyses for six plant species:
Arabidopsis thaliana,
Oryza sativa,
Zea mays,
Solanum lycopersicum,
Glycine max, and
Medicago truncatula. EXPath 2.0 contains both microarray and RNA-Seq data from various conditions. Single gene analysis in EXPath 2.0 has multiple outputs: EXPath offers information for a gene of interest, including its biological terms, sample-specific expression and top correlated or anti-correlated genes. A multiple gene query results in a weighted GCN that includes both positively and negatively coexpressed genes. Finally, GO and pathway enrichment, as well as differential expression gene analyses are available.
PLAΝt coEXpression (PLANEX) [143] is a coexpression database for eight plant species:
Arabidopsis thaliana,
Glycine max,
Hordeum vulgare,
Oryza sativa,
Solanum lycopersicum,
Triticum aestivum,
Vitis vinifera and
Zea mays. This database presents a list of coexpressed genes ranked by their PCCs. Positive and negative cut-offs were determined by finding the top 1% of the positive and the top 1% of the negatively correlated gene pairs. Furthermore, a GCN can also be presented. Another functionality is the comparison of the coexpression between any user-selected gene pair. Compared with other similar databases, in PLANEX’s case the probes were mapped against representative genes by string match instead of BLAST
[144], thus producing positive results if each base in a probe sequence matched perfectly with the representative gene sequence without any gap. In addition, the PCC was subjected to PCA, for the identification of a gene set with changing expression over different experiments.
Correlation Networks (CorNet) [145] is an online tool for network construction in
Arabidopsis thaliana. CorNet is based on microarray and RNA-Seq samples and can perform coexpression, protein–protein or regulatory interaction analyses. Using pre-defined or user-uploaded primary datasets, CorNet displays the coexpressed genes to a single gene or a list of genes. Various customisation options are available: selecting between Pearson or Spearman correlation coefficients and setting a correlation threshold,
p-value cut-off, the number of resulting coexpressed genes and whether the GCN will contain relationships between the coexpressed genes. The output is either a GCN which is visualised through Cytoscape (
Figure 5) or a coexpressed gene list.
Figure 5. GCN of ten coexpressed partners to CTL2 in CorNet, visualised through Cytoscape. The GCN includes the coexpression inter-relationships.
8. General Guidelines for Coexpression Tool Selection
At first, the user should decide whether the tool for the species of interest should be global or tissue/cell-type specific. Then, a collection of global or tissue-specific tools, depending on the previous selection, might be run for analysis and the user could form a consensus list of coexpressed genes that are present in the results of the majority of the tools. Alternatively, the user might assess the performance of each tool, based on various indications for an efficient depiction of the coexpression landscape. First of all, the number of samples used by each tool is an important factor, with higher sample numbers resulting in more reliable coexpression relationships, as a small sample number might introduce sparse correlations
[3]. Sample variability is equally important to ensure that the dataset is not skewed towards a certain tissue, when global coexpression is studied. In addition, high-quality samples and the application of batch correction increases the quality of coexpression
[80][106][146][147][148].
Tools that are based on up-to-date genome/transcriptome data or biological terms are preferable, e.g., microarray-based tools using a custom CDF are innately better than those using the default one. The mathematical rigor of the underlying statistics of a coexpression tool may also improve its performance. This might be assessed by the complexity of the correlation calculation method, as well as by the resulting depiction of coexpression. The latter can be evaluated by the ability of the tool to reproduce known biology: The output of each tool could be cross-checked with the existing bibliography by searching for validated gene partners in the coexpression lists or validated biological processes in the statistically significant enriched biological terms. Enrichment analysis can be performed either internally, by some coexpression tools, or by exporting the coexpressed gene list to external webtools such as WebGestalt, where either pre-set or user-defined reference gene lists may also be used.