Large-Scale Transcriptomes from Multiple Cancer Types

Large-Scale Transcriptomes from Multiple Cancer Types: History

Please note this is an old version of this entry, which may differ significantly from the current revision.

Subjects: Biology

Contributor: Baoting Nong

Various abnormalities of transcriptional regulation revealed by RNA sequencing (RNA-seq) have been reported in cancers. However, strategies to integrate multi-modal information from RNA-seq, which would help uncover more disease mechanisms, are still limited. Here, we present PipeOne, a cross-platform one-stop analysis workflow for large-scale transcriptome data. It was developed based on Nextflow, a reproducible workflow management system. PipeOne is composed of three modules, data processing and feature matrices construction, disease feature prioritization, and disease subtyping. It first integrates eight different tools to extract different information from RNA-seq data, and then used random forest algorithm to study and stratify patients according to evidences from multiple-modal information. Its application in five cancers (colon, liver, kidney, stomach, or thyroid; total samples n = 2024) identified various dysregulated key features (such as PVT1 expression and ABI3BP alternative splicing) and pathways (especially liver and kidney dysfunction) shared by multiple cancers. Furthermore, we demonstrated clinically-relevant patient subtypes in four of five cancers, with most subtypes characterized by distinct driver somatic mutations, such as TP53, TTN, BRAF, HRAS, MET, KMT2D, and KMT2C mutations. Importantly, these subtyping results were frequently contributed by dysregulated biological processes, such as ribosome biogenesis, RNA binding, and mitochondria functions. PipeOne is efficient and accurate in studying different cancer types to reveal the specificity and cross-cancer contributing factors of each cancer.It could be easily applied to other diseases and is available at GitHub.

TCGA
RNA-seq workflow
feature prioritization
cancer subtyping
somatic mutation
alternative splicing
ribosome
mitochondria

1. Introduction

RNA sequencing (RNA-seq) has been widely used in functional genomics studies [1,2,3]. Various information can be obtained from RNA-seq, including gene expression levels, alternative splicing (AS), alternative polyadenylation (APA), gene fusions, RNA-editing, and single nucleotide polymorphisms (SNP). More than 90% of human genes undergo AS [4,5], which largely increases the complexity of human transcriptome and proteome [6]. AS deregulation may lead to diseases [7], including cancer [8]. About 70% of pre-mRNAs undergo APA and produce multiple transcript isoforms with various lengths of 3′ untranslated regions (UTR) [9,10,11]. Gene fusions create chimeric genes, usually resulting from chromosomal rearrangements [12]. Some fusions are cancer drivers, therapeutic targets, and diagnostic biomarkers [13]. Large scale analyses of RNA-seq data from the Genotype-Tissue Expression (GTEx) project [14] and The Cancer Genome Atlas (TCGA) project [15] suggest that adenosine-to-inosine (A-to-I) RNA editing events are prevalent in normal tissues [16] and in cancer [17].

Besides mRNA, a large number of non-coding RNAs can be detected by RNA-seq, such as circular RNAs (circRNA) [18], long non-coding RNA (lncRNA, linear) [19]. CircRNAs are generated by a mechanism called back-splicing, in contrast to canonical splicing for linear RNAs, and these two splicing mechanisms may compete with each other [20]. One of the functions of circRNAs is serving as miRNA sponges [21]. LncRNAs have been demonstrated to be functional in different cellular activities and dysregulated in various cancers [22,23]. For example, lncRNA PVT1 drives oncogene MYC expression in various types of cancer cells [24]. Furthermore, retrotransposons are a large group of mobile DNA in the genome [25] that have the potential to be transcribed (retrotranscriptome) [26,27] and may be involved in many diseases including cancer [28]. In particular, human endogenous retroviruses (a type of retrotransposons) are stage-specifically transcribed during human embryonic development [29].

Biological processes in the cell interact with each other. For example, RNA-editing can affect AS and circular RNA biogenesis [30], RNA regulators may affect AS and APA [31], and gene fusions may dramatically change the transcriptome [32]. Focusing on only one single type of information may result in failure to identify critical factors underlying diseases. Therefore, combining multi-modal information in one model is critical to pinpoint the key players in pathological conditions. However, such a tool integrating all types of RNA-seq analyses is still lacking, although numerous analysis packages have been developed to perform specific analysis aforementioned [33]. PipeOne is a one-stop RNA-seq analysis pipeline that can integrate multi-modal information from large scale RNA-seq data to systematically identify key factors underlying diseases and stratify disease subtypes. Its application in five cancer types revealed shared cancer driver genes and pathways, and clinically-relevant cancer subtypes with genetic support from somatic mutations. PipeOne is freely available at https://github.com/nongbaoting/PipeOne. (version 1.1.0, accessed on 7 September 2020).

2. Current Insights

During RNA-seq raw data processing, PipeOne not only included classical procedures of data analysis, e.g., quality control, alignment, transcriptome reconstruction, gene quantification, but also contained novel lncRNA prediction, circRNA prediction, RNA editing prediction, fusion prediction, retrotranscriptome quantification, alternative splicing event detection, variants calling. Compared to RNAcocktail and VIPER, PipeOne harbored more functions, including prediction of novel lncRNAs and circRNAs, retrotranscriptome, and alternative splicing event detection. These will greatly enrich the information derived from RNA-seq data for downstream analysis, in which PipeOne focused on integrating multi-modal information to perform feature prioritization and disease subtyping. To the best of our knowledge, existing tools did not utilize such broad aspect information from RNA-seq for integration analysis. PipeOne did not implement many functions provided in VIPER, however, most of those procedures in RNA-seq analysis are classical, and could be performed by using common tools. For example, differential expression analysis could be perform by edgeR [65] or DESeq2 [66]. PipeOne was built based on Docker and Nextflow [55], making installation and management of workflow easy. Considering that users may not have root permission, PipeOne also provided the option of using Conda [67] for installation. These will alleviate users from tedious installation, configuration, and management. By using the Nextflow management system, advanced users could modify PipeOne to adapt their own research purposes.

One limitation of PipeOne is that it currently focuses on RNA-seq data only. Still, it may also be expanded to include features from other high-throughput data, for example, genome sequencing, DNA methylation profiling by microarray or sequencing, and proteomics by mass spectrometry, where any of these data are available Another limitation is subtyping evaluation. This study only assessed the subtyping clusters by Silhouette values and log-rank test-based overall survival analysis. In the future, more choices would be implemented to allow other ways to evaluate subtypes, for example, by comparing patient drug responses when this information is available. That will enable subtyping analysis for other diseases without survival information. In addition, future version of PipeOne could include features from other high-throughput data as mentioned above to perform multi-omics base subtype analysis and proper reduce noise effects may help to better subtyping. For example, DefFusion [68] can make better survival predictions by taking the noise effect into account when integrating multiple omics data.

This entry is adapted from the peer-reviewed paper 10.3390/genes12121865

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.