2. Bulk Tumor and Single-Cell Multi-Omics Data Analysis
2.1. Multi-Omics Data Processing and Integration
With the rapid development of high-throughput technology, genomic, transcriptomic, proteomic, and metabolomic profiles have provided ample sources of information for researchers to understand molecular disease mechanisms. Nevertheless, data generated from various commercially available platforms and customized arrays pose tremendous challenges for processing, analysis, and integration. The Genome Analysis Toolkit (GATK) is the industry standard for processing multi-omics data in bulk tumors and single cells, including identifying single nucleotides (SNPs) and indels, somatic short variants, copy number variations (CNV), and structural variations (SV) in germline DNA and RNAseq data
[42]. In addition to data generated from current sequencing technology, a huge amount of high-throughput data was generated from legacy DNA microarrays. A research group from the FDA reported that biomarkers and predictive models derived from legacy microarray data can accurately predict phenotypes in samples profiled with RNA sequencing, whereas RNA-seq-based models are less accurate in predicting microarray data
[43]. This section provides a brief overview of some software packages and methods used for bulk tumor multi-omics data processing and integration.
2.1.1. Copy Number Variation
Copy number variation (CNV) is a structural variation that is either a duplication or deletion event affecting a large number of base pairs. Deletions, amplifications, gains, and losses collectively termed CNVs, are found in all humans and other mammals
[44]. The number of CNVs can make up as much as 5–15% of the human genome
[45]. CNVs are a significant source of genomic diversity and driver of somatic and hereditary human diseases including cancer. However, compared to single-nucleotide variations (SNVs), CNVs are still under-investigated, despite their evolutionary significance and clinical relevance. This is a consequence of the inherent challenges in identifying CNVs in diverse populations of cells at low-to-intermediate frequencies
[46]. Using a recent method of a fluorescent gene functioning as a single-cell CNV reporter, CNVs are found to occur frequently and undergo selection with predictable dynamics across independently evolving replicate populations
[46]. CNVs have been applied in the molecular diagnosis of many diseases and non-invasive prenatal care. Nevertheless, CNVs have not reached their full potential as emerging biomarkers. Cancer immunotherapy targets, including
PD1, PDL1,
CD27, and
CD20 have more CNVs than SNVs in NSCLC tumors in The Cancer Genome Atlas (TCGA)
[47]. Tumor mutation burdens are used in cancer management, but not CNVs. The screening, diagnosis, prognosis, and monitoring of several illnesses, including cancer and cardiovascular disease, are likely to be significantly impacted by CNVs
[48].
Genomic alterations in DNA might interfere with the normal function of the genes. The genomic instability and structural dynamics of cancer cells require that CNV data be examined to discover the underlying associations between CNVs, gene/protein expression, and functional aberrations. Different platforms were used to profile genome-scale CNVs, including high-resolution SNP arrays (GeneChip Mapping 250K-Nsp array, Affymetrix), whole-genome tiling path aCGH (BCCRC whole genome tilling path array v2), and whole exome sequencing (SOLiD 5500xl)
[49]. Various CNV data processing methods were developed as described below.
PennCNV-Affy
[50], a Perl/C-based software tool, is the most commonly used method for CNV calling for data produced with SNP genotyping arrays. The first step is to process the raw CEL files and generate the signal intensity data. The second step is to split the signal file generated from step 1 into individual files. After the file splitting is completed, CNV calls will be generated by PennCNV. The output provides information on the CN state for each SNP probe. Normally, a CN < 2 indicates a deletion in copy number, and a CN > 2 indicates a duplication. For the SNP probes located within the same gene, the probe with the maximum intensity is used to represent the CN state for the gene.
Bioconductor packages CGHbase
[51] and CGHcall
[52] are often used to call the CNV in the aCGH data. The log
2 normalized ratios of Cy3/Cy5 are used as inputs. In CGHcall, the number of output classes can be selected among 3 classes (loss, normal, gain), 4 classes (loss, normal, gain, amplification), or 5 classes (double deletion, loss, normal, gain, amplification).
GISTIC2.0 is a pipeline used to find genes targeted by somatic copy-number alterations (SCNAs) in human cancers
[53]. GISTIC2.0 uses an iterative optimization algorithm to deconstruct each segmented copy-number profile into its most likely set of SCNAs. Compared with other methods, GISTIC2.0 is advantageous in separating arm-level and focal SCNAs explicitly by length.
CNV data generated by various platforms provide the corresponding chromosome location of each SNP. To harmonize the CNV data from various platforms, we can convert the genome assembly version from earlier versions, such as hg17, to hg38 by using the Python package
CruzDB, a fast and intuitive tool for the UCSC genome browser
[54]. Using the latest reference genome is an important step to ensure compatibility in the CNV data integration.
2.1.2. Categorization of Gene Regulation
Cancer is caused by dysregulated tumor suppressor genes or oncogenes. Due to genetic mutations or alterations in gene regulation, such genes are switched on or off and are expressed at abnormally high or low levels in tumor initiation and progression. It is important to define the up-regulation, normal, and down-regulation ranges by categorizing the gene expression data generated from high-throughput microarray or RNA sequencing. Housekeeping genes are generally used to categorize gene expression data.
Housekeeping genes are essential for the existence of the cell, regardless of their specific role in the tissue or organism. Housekeeping genes are expressed in all cells of an organism regardless of conditions (normal or pathophysiological), tissue type, developmental stage, cell cycle status, or external signals. Unlike in qRT-PCR, housekeeping genes are not generally used for normalization in RNA sequencing analysis. Therefore, the variation in gene expression measurements due to different sample preparation techniques is not accounted for in the RNA expression analysis. A set of stably expressed housekeeping genes in particular tissue types should be used for the corresponding research. For instance, a set of housekeeping genes were used for NSCLC
[55][56][57][58][59], including
ACTB,
B2M,
CDKN1B,
ESD,
FLOT2,
GAPDH,
GRB2,
GUSB,
HMBS,
HPRT1,
HSP90AB1,
IPO8,
LDHA,
NONO,
PGK1,
POLR2A,
PPIA,
PPIH,
PPP1CA,
RHOA,
RPL13A,
SDCBP,
TBP,
TFRC,
UBC,
YAP1, and
YWHAZ to define the threshold of gene expression level in multi-omics regulatory network studies
[47][60]. Specifically, the total percentage of up-regulated and down-regulated samples was fixed for all the housekeeping genes to be 30%, and the average standard deviation of the normal range for the selected housekeeping genes was calculated. This average standard deviation was applied to all other genes in the genome to define their normal, up-regulation, or down-regulation ranges
[47][60]. “Half SAM score” is recommended for differential gene expression analysis of data generated from microarrays and next-generation sequencing (NGS)
[61]. DEseq2 is commonly used for fold change and differential gene expression analysis of NGS data
[62].
2.1.3. Categorization of Protein Regulation
Protein expression represents how proteins are synthesized, modified, and regulated in an organism. The synthesis and regulation of proteins depend on the functional requirements in the cell. The blueprint for proteins is stored in DNA and decoded by a highly regulated transcriptional process that produces messenger RNA (mRNA). The information encoded by mRNA is subsequently translated into proteins as functional units of biological processes. Protein expression data generated from AQUA
[55] and Nano-LC-MS/MS
[63] are often log-transformed for differential expression analysis and Cox survival modeling.
The up-regulation, normal, and down-regulation ranges of protein expression also need to be defined, similar to gene expression. In a regulatory network analysis of NSCLC tumors
[63], the categorization of protein regulation was performed by using the normal range defined with NSCLC housekeeping genes
[55][56][57][58][59], including
B2M,
ESD,
FLOT2,
GAPDH,
GRB2,
HPRT1,
HSP90AB1,
LDHA,
NONO,
POLR2A,
PPP1CA,
RHOA,
SDCBP, and
TFRC, based on their protein expression in NSCLC tumors and non-cancerous adjacent tissues in Xu’s cohort
[64]. The total percentage of up-regulated and down-regulated samples was fixed for all the housekeeping proteins, and the average standard deviation of the normal range for the selected housekeeping proteins was calculated and applied to all other proteins in the genome to define their normal, up-regulation, or down-regulation ranges
[63].
2.2. Single-Cell Muti-Omics Data Processing
Each cell type has its distinct function. The single-cell analysis allows the study within a cell population to reveal how cell networks function
[65][66]. Ginkgo
[67] is an open-source web-based platform for single-cell CNV analysis. Single-cell transcriptomics simultaneously measures the gene expression level of individual cells in a given population
[68]. Single-cell whole-genome analyses by Linear Amplification via Transposon Insertion (LIANTI) can generate sufficient DNA for next-generation sequencing
[69]. In processing the single-cell gene expression data from Illumina HiSeq 2000, gene features are counted with the
featureCounts method
[70] based on the Gencode v19 transcriptome annotation. In processing the data from Illumina HiSeq 4000, the reads are mapped with
STAR aligner
[71] based on human genome reference GRCh38, and SAMtools
[72] is used to sort and index the mapped reads.
The dropout phenomenon, i.e., the RNA in the cell is not detected due to limitations of current experimental protocols, is severe in single-cell transcriptomic data. As a result, a large number of genes are expressed with a value of 0 in many cells. This makes it difficult to classify single-cell gene expression as in bulk tumors, and the housekeeping gene technique described above cannot achieve usable results. Thus, single-cell gene expression data is generally classified into two categories, “not expressed” for genes with a feature count of 0, and “expressed” for genes with a future count greater than 0 in regulatory networks
[73]. DEsingle
[74] in Bioconductor is a common method for single-cell differential expression analysis.
3. Hub Genes in Multi-Omics and Single-Cell Networks
Some hub genes in multi-omics networks were shown to be promising cancer biomarkers and therapeutic targets
[75][76]. Nevertheless, there were insufficient genome-scale investigations on multi-omics network hub genes and their biological and clinical relevance in human cancers. Graph theory centrality metrics can characterize hub genes. Common metrics include degree centrality (in-degree and out-degree centralities)
[77], eigenvector centrality
[78][79][80], betweenness centrality
[81][82], closeness centrality
[83][84][85], and VoteRank centrality
[86]. Degree centrality is simply the total number of neighbors of each node. The eigenvector centrality of a node is the sum of the centrality of its neighbors. Betweenness centrality is the frequency of a node appearing on the shortest paths of all node pairs in the entire network. Closeness centrality is the average length of the shortest paths between the node and all other nodes in the network. VoteRank centrality is selected with a voting score that is calculated by the sum of all neighbors’ voting abilities. Degree centrality and eigenvector centrality are also classified as local centrality metrics because only neighbors of each node are included in the calculation. Betweenness centrality, closeness centrality, and VoteRank centrality are categorized as global centrality metrics since the connectivity of the entire graph is used in the metrics computation. These centrality metrics are correlated in many cases
[85][87]. A Python package NetworkX
[88][89][90][91] calculates centrality metrics.
A barrier to this systematic evaluation is the limitations of existing computational methodologies in constructing genome-scale multi-omics GRNs, as summarized above. In a recent study
[89], our Prediction Logic Boolean Implication Networks (PLBINs) were used to construct 12 genome-scale GRNs of CNV, mRNA, and protein expression in NSCLC tumors. Seven centrality metrics were correlated with NSCLC tumorigenesis measured with T-statics in differential gene/protein expression between tumors vs. non-cancerous adjacent tissues (NATs), proliferation quantified with dependency scores from CRISPR-Cas9/RNAi screening of human NSCLC cell lines, and patient survival with hazard ratios from Cox modeling of The Cancer Genome Atlas (TCGA)
[89]. Hub genes in multi-omics networks involving gene/protein expression were found to be associated with oncogenic, proliferative potentials and poor patient survival. Hub genes with higher co-occurrences of CNV aberrations seemed to have tumor-suppressive and anti-proliferative properties. Regulated genes in hubs were linked to proliferative potential and worse patient survival, whereas regulatory genes in hubs were linked to anti-proliferative potential and better patient survival. Established cancer immunotherapy targets PD1, PDL1, CTLA4, and CD27 were top hub genes in most constructed multi-omics GRNs
[89]. These results show that multi-omics network centrality in bulk tumors can be used in the prioritization of biomarkers and therapeutic targets.
Similarly, our PLBINs
[73] were applied to genome-wide transcriptomic profiles of B cells from tumors and NATs
[90], T cells from peripheral blood lymphocytes (PBL)
[91], and tumor-infiltrating T-cell gene expression data of NSCLC patients. In each cell sample, a gene was defined as expressed (with a feature count > 0) or not-expressed (with a feature count = 0). The details of single-cell network construction were provided in our previously published study
[73].
Figure 1 shows the centrality distribution of the ICIs that were within the top 10th percentile in the constructed networks.
PD1 was ranked as a top hub gene in the T-cell PBL gene co-expression network in healthy donors.
CTLA4 was ranked as a top hub gene in the T-cell PBL gene co-expression network in NSCLC tumors.
CD27 was ranked as a top hub gene in the T-cell PBL gene co-expression network in NSCLC patients. These results are consistent with their functional involvements in T-cell immunity.
PDL1 was not ranked within the top 10th percentile of any of the examined centrality metrics in the constructed networks. None of these ICIs were ranked as top hub genes in B-cell gene co-expression networks in tumors or NATs.
Figure 1. Distribution of centrality metrics in single-cell gene co-expression networks with CD27, CTLA4, or PD1 ranked within the top 10th percentile. Each subplot represented a centrality metric: (A). Degree centrality; (B). Closeness centrality; (C). Betweenness centrality; (D). VoteRank centrality. Each violin plot showed the distribution of the centrality metric in one specific network: I. T-cell PBL gene co-expression network in normal samples. II. T-cell PBL gene co-expression network in NSCLC patients. III. T-cell gene co-expression network in NSCLC tumors.
In a previous study, we identified a gene co-expression network missing in NSCLC tumor B cells using PLBINs
[73][92]. Genes in this network either promote proliferation in human NSCLC epithelial cells or are indicative of NSCLC patient outcomes at both mRNA and protein expression levels in bulk tumors. These network genes were associated with drug response to 10 therapeutic regimens in 135 human NSCLC cell lines. Based on this single B-cell co-expression network, we discovered tyrosine kinase inhibitor lestaurtinib as a new drug option for treating NSCLC
[73]. Here, we examined if this clinically relevant single B-cell network had higher average centrality compared with 1000 random networks with the same number of genes selected from the genome. The results showed that the previously published B-cell network had significantly higher average centrality (
p < 0.05) than 1000 random networks selected from genome-scale single B-cell networks in tumors and NATs, single T-cell PBL networks in NSCLC patients and healthy donors, and T-cell network in NSCLC tumors (
Figure 2). These results support the relevance of single-cell network hub genes in tumor biology.
Figure 2. The comparison of centrality metrics of our published single B-cell network vs. randomly selected networks with the same number of genes. The p values showed the percentage of randomly selected genes having a higher ranked average centrality metric than the clinically relevant single B-cell network. Each column in the plot showed a centrality metric: I. Degree centrality; II. Eigenvector centrality; III. Closeness centrality; IV. Betweenness centrality; V. VoteRank centrality. Each row represented a single-cell gene co-expression network constructed in normal PBL T cells, NSCLC PBL T cells, tumor infiltrating T cells, normal B cells, and tumor infiltrating B cells, respectively. NS: not statistically significant.