Large-Scale Transcriptomes from Multiple Cancer Types

Large-Scale Transcriptomes from Multiple Cancer Types: History

Please note this is an old version of this entry, which may differ significantly from the current revision.

Subjects: Biology

Contributor:

Various abnormalities of transcriptional regulation revealed by RNA sequencing (RNA-seq) have been reported in cancers. However, strategies to integrate multi-modal information from RNA-seq, which would help uncover more disease mechanisms, are still limited. Here, we present PipeOne, a cross-platform one-stop analysis workflow for large-scale transcriptome data. It was developed based on Nextflow, a reproducible workflow management system. PipeOne is composed of three modules, data processing and feature matrices construction, disease feature prioritization, and disease subtyping. It first integrates eight different tools to extract different information from RNA-seq data, and then used random forest algorithm to study and stratify patients according to evidences from multiple-modal information. Its application in five cancers (colon, liver, kidney, stomach, or thyroid; total samples n = 2024) identified various dysregulated key features (such as PVT1 expression and ABI3BP alternative splicing) and pathways (especially liver and kidney dysfunction) shared by multiple cancers. Furthermore, we demonstrated clinically-relevant patient subtypes in four of five cancers, with most subtypes characterized by distinct driver somatic mutations, such as TP53, TTN, BRAF, HRAS, MET, KMT2D, and KMT2C mutations. Importantly, these subtyping results were frequently contributed by dysregulated biological processes, such as ribosome biogenesis, RNA binding, and mitochondria functions. PipeOne is efficient and accurate in studying different cancer types to reveal the specificity and cross-cancer contributing factors of each cancer.It could be easily applied to other diseases and is available at GitHub.

TCGA
RNA-seq workflow
feature prioritization
cancer subtyping
somatic mutation
alternative splicing
ribosome
mitochondria

1. Introduction

RNA sequencing (RNA-seq) has been widely used in functional genomics studies ^[1]^[2]^[3]. Various information can be obtained from RNA-seq, including gene expression levels, alternative splicing (AS), alternative polyadenylation (APA), gene fusions, RNA-editing, and single nucleotide polymorphisms (SNP). More than 90% of human genes undergo AS ^[4]^[5], which largely increases the complexity of human transcriptome and proteome ^[6]. AS deregulation may lead to diseases ^[7], including cancer ^[8]. About 70% of pre-mRNAs undergo APA and produce multiple transcript isoforms with various lengths of 3′ untranslated regions (UTR) ^[9]^[10]^[11]. Gene fusions create chimeric genes, usually resulting from chromosomal rearrangements ^[12]. Some fusions are cancer drivers, therapeutic targets, and diagnostic biomarkers ^[13]. Large scale analyses of RNA-seq data from the Genotype-Tissue Expression (GTEx) project ^[14] and The Cancer Genome Atlas (TCGA) project ^[15] suggest that adenosine-to-inosine (A-to-I) RNA editing events are prevalent in normal tissues ^[16] and in cancer ^[17].

Besides mRNA, a large number of non-coding RNAs can be detected by RNA-seq, such as circular RNAs (circRNA) ^[18], long non-coding RNA (lncRNA, linear) ^[19]. CircRNAs are generated by a mechanism called back-splicing, in contrast to canonical splicing for linear RNAs, and these two splicing mechanisms may compete with each other ^[20]. One of the functions of circRNAs is serving as miRNA sponges ^[21]. LncRNAs have been demonstrated to be functional in different cellular activities and dysregulated in various cancers ^[22]^[23]. For example, lncRNA PVT1 drives oncogene MYC expression in various types of cancer cells ^[24]. Furthermore, retrotransposons are a large group of mobile DNA in the genome ^[25] that have the potential to be transcribed (retrotranscriptome) ^[26]^[27] and may be involved in many diseases including cancer ^[28]. In particular, human endogenous retroviruses (a type of retrotransposons) are stage-specifically transcribed during human embryonic development ^[29].

Biological processes in the cell interact with each other. For example, RNA-editing can affect AS and circular RNA biogenesis ^[30], RNA regulators may affect AS and APA ^[31], and gene fusions may dramatically change the transcriptome ^[32]. Focusing on only one single type of information may result in failure to identify critical factors underlying diseases. Therefore, combining multi-modal information in one model is critical to pinpoint the key players in pathological conditions. However, such a tool integrating all types of RNA-seq analyses is still lacking, although numerous analysis packages have been developed to perform specific analysis aforementioned ^[33]. PipeOne is a one-stop RNA-seq analysis pipeline that can integrate multi-modal information from large scale RNA-seq data to systematically identify key factors underlying diseases and stratify disease subtypes. Its application in five cancer types revealed shared cancer driver genes and pathways, and clinically-relevant cancer subtypes with genetic support from somatic mutations. PipeOne is freely available at https://github.com/nongbaoting/PipeOne. (version 1.1.0, accessed on 7 September 2020).

2. Current Insights

During RNA-seq raw data processing, PipeOne not only included classical procedures of data analysis, e.g., quality control, alignment, transcriptome reconstruction, gene quantification, but also contained novel lncRNA prediction, circRNA prediction, RNA editing prediction, fusion prediction, retrotranscriptome quantification, alternative splicing event detection, variants calling. Compared to RNAcocktail and VIPER, PipeOne harbored more functions, including prediction of novel lncRNAs and circRNAs, retrotranscriptome, and alternative splicing event detection. These will greatly enrich the information derived from RNA-seq data for downstream analysis, in which PipeOne focused on integrating multi-modal information to perform feature prioritization and disease subtyping. To the best of our knowledge, existing tools did not utilize such broad aspect information from RNA-seq for integration analysis. PipeOne did not implement many functions provided in VIPER, however, most of those procedures in RNA-seq analysis are classical, and could be performed by using common tools. For example, differential expression analysis could be perform by edgeR ^[34] or DESeq2 ^[35]. PipeOne was built based on Docker and Nextflow ^[36], making installation and management of workflow easy. Considering that users may not have root permission, PipeOne also provided the option of using Conda ^[37] for installation. These will alleviate users from tedious installation, configuration, and management. By using the Nextflow management system, advanced users could modify PipeOne to adapt their own research purposes.

One limitation of PipeOne is that it currently focuses on RNA-seq data only. Still, it may also be expanded to include features from other high-throughput data, for example, genome sequencing, DNA methylation profiling by microarray or sequencing, and proteomics by mass spectrometry, where any of these data are available Another limitation is subtyping evaluation. This study only assessed the subtyping clusters by Silhouette values and log-rank test-based overall survival analysis. In the future, more choices would be implemented to allow other ways to evaluate subtypes, for example, by comparing patient drug responses when this information is available. That will enable subtyping analysis for other diseases without survival information. In addition, future version of PipeOne could include features from other high-throughput data as mentioned above to perform multi-omics base subtype analysis and proper reduce noise effects may help to better subtyping. For example, DefFusion ^[38] can make better survival predictions by taking the noise effect into account when integrating multiple omics data.

This entry is adapted from the peer-reviewed paper 10.3390/genes12121865

References

Emrich, S.J.; Barbazuk, W.B.; Li, L.; Schnable, P.S. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 2007, 17, 69–73.
Wang, Z.; Gerstein, M.; Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009, 10, 57–63.
Stark, R.; Grzelak, M.; Hadfield, J. RNA sequencing: The teenage years. Nat. Rev. Genet. 2019, 20, 631–656.
Pan, Q.; Shai, O.; Lee, L.J.; Frey, B.J.; Blencowe, B.J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 2008, 40, 1413–1415.
Wang, E.T.; Sandberg, R.; Luo, S.; Khrebtukova, I.; Zhang, L.; Mayr, C.; Kingsmore, S.F.; Schroth, G.P.; Burge, C.B. Alternative isoform regulation in human tissue transcriptomes. Nature 2008, 456, 470–476.
Nilsen, T.W.; Graveley, B.R. Expansion of the eukaryotic proteome by alternative splicing. Nature 2010, 463, 457–463.
Scotti, M.M.; Swanson, M.S. RNA mis-splicing in disease. Nat. Rev. Genet. 2016, 17, 19–32.
Singh, B.; Eyras, E. The role of alternative splicing in cancer. Transcription 2017, 8, 91–98.
Shi, Y.; Di Giammartino, D.C.; Taylor, D.; Sarkeshik, A.; Rice, W.J.; Yates, J.R., 3rd; Frank, J.; Manley, J.L. Molecular architecture of the human pre-mRNA 3’ processing complex. Mol. Cell 2009, 33, 365–376.
Tian, B.; Manley, J.L. Alternative polyadenylation of mRNA precursors. Nat. Rev. Mol. Cell Biol. 2017, 18, 18–30.
Derti, A.; Garrett-Engele, P.; Macisaac, K.D.; Stevens, R.C.; Sriram, S.; Chen, R.; Rohl, C.A.; Johnson, J.M.; Babak, T. A quantitative atlas of polyadenylation in five mammals. Genome Res. 2012, 22, 1173–1183.
Latysheva, N.S.; Babu, M.M. Discovering and understanding oncogenic gene fusions through data intensive computational approaches. Nucleic Acids Res. 2016, 44, 4487–4503.
Yoshihara, K.; Wang, Q.; Torres-Garcia, W.; Zheng, S.; Vegesna, R.; Kim, H.; Verhaak, R.G. The landscape and therapeutic relevance of cancer-associated transcript fusions. Oncogene 2015, 34, 4845–4854.
Lonsdale, J.; Thomas, J.; Salvatore, M.; Phillips, R.; Lo, E.; Shad, S.; Hasz, R.; Walters, G.; Garcia, F.; Young, N.; et al. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 2013, 45, 580–585.
Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Shaw, K.R.; Ozenberger, B.A.; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, J.M. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 2013, 45, 1113–1120.
Tan, M.H.; Li, Q.; Shanmugam, R.; Piskol, R.; Kohler, J.; Young, A.N.; Liu, K.I.; Zhang, R.; Ramaswami, G.; Ariyoshi, K.; et al. Dynamic landscape and regulation of RNA editing in mammals. Nature 2017, 550, 249–254.
Han, L.; Diao, L.; Yu, S.; Xu, X.; Li, J.; Zhang, R.; Yang, Y.; Werner, H.M.J.; Eterovic, A.K.; Yuan, Y.; et al. The Genomic Landscape and Clinical Relevance of A-to-I RNA Editing in Human Cancers. Cancer Cell 2015, 28, 515–528.
Gao, Y.; Wang, J.; Zhao, F. CIRI: An efficient and unbiased algorithm for de novo circular RNA identification. Genome Biol 2015, 16, 4.
Iyer, M.K.; Niknafs, Y.S.; Malik, R.; Singhal, U.; Sahu, A.; Hosono, Y.; Barrette, T.R.; Prensner, J.R.; Evans, J.R.; Zhao, S.; et al. The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 2015, 47, 199–208.
Ashwal-Fluss, R.; Meyer, M.; Pamudurti, N.R.; Ivanov, A.; Bartok, O.; Hanan, M.; Evantal, N.; Memczak, S.; Rajewsky, N.; Kadener, S. circRNA biogenesis competes with pre-mRNA splicing. Mol. Cell 2014, 56, 55–66.
Piwecka, M.; Glazar, P.; Hernandez-Miranda, L.R.; Memczak, S.; Wolf, S.A.; Rybak-Wolf, A.; Filipchyk, A.; Klironomos, F.; Cerda Jara, C.A.; Fenske, P.; et al. Loss of a mammalian circular RNA locus causes miRNA deregulation and affects brain function. Science 2017, 357, aam8526.
Slack, F.J.; Chinnaiyan, A.M. The role of non-coding rnas in oncology. Cell 2019, 179, 1033–1055.
Yan, X.; Hu, Z.; Feng, Y.; Hu, X.; Yuan, J.; Zhao, S.D.; Zhang, Y.; Yang, L.; Shan, W.; He, Q.; et al. Comprehensive Genomic Characterization of Long Non-coding RNAs across Human Cancers. Cancer Cell 2015, 28, 529–540.
Tseng, Y.Y.; Moriarity, B.S.; Gong, W.; Akiyama, R.; Tiwari, A.; Kawakami, H.; Ronning, P.; Reuland, B.; Guenther, K.; Beadnell, T.C.; et al. PVT1 dependence in cancer with MYC copy-number increase. Nature 2014, 512, 82–86.
De Koning, A.P.J.; Gu, W.; Castoe, T.A.; Batzer, M.A.; Pollock, D.D. Repetitive Elements May Comprise Over Two-Thirds of the Human Genome. PLoS Genet. 2011, 7, e1002384.
Bendall, M.L.; de Mulder, M.; Iñiguez, L.P.; Lecanda-Sánchez, A.; Pérez-Losada, M.; Ostrowski, M.A.; Jones, R.B.; Mulder, L.C.F.; Reyes-Terán, G.; Crandall, K.A.; et al. Telescope: Characterization of the retrotranscriptome by accurate estimation of transposable element expression. PLoS Comput. Biol. 2019, 15, e1006453.
Goodier, J.L. Restricting retrotransposons: A review. Mobile DNA 2016, 7, 16.
Payer, L.M.; Burns, K.H. Transposable elements in human genetic disease. Nat. Rev. Genet. 2019, 20, 760–772.
Göke, J.; Lu, X.; Chan, Y.S.; Ng, H.H.; Ly, L.H.; Sachs, F.; Szczerbinska, I. Dynamic transcription of distinct classes of endogenous retroviral elements marks specific populations of early human embryonic cells. Cell Stem Cell 2015, 16, 135–141.
Eisenberg, E.; Levanon, E.Y. A-to-I RNA editing-immune protector and transcriptome diversifier. Nat. Rev. Genet. 2018, 19, 473–490.
Yee, B.A.; Pratt, G.A.; Graveley, B.R.; Van Nostrand, E.L.; Yeo, G.W. RBP-Maps enables robust generation of splicing regulatory maps. RNA 2019, 25, 193–204.
Modi, H.; McDonald, T.; Chu, S.; Yee, J.K.; Forman, S.J.; Bhatia, R. Role of BCR/ABL gene-expression levels in determining the phenotype and imatinib sensitivity of transformed human hematopoietic cells. Blood 2007, 109, 5411–5421.
Conesa, A.; Madrigal, P.; Tarazona, S.; Gomez-Cabrero, D.; Cervera, A.; McPherson, A.; Szczesniak, M.W.; Gaffney, D.J.; Elo, L.L.; Zhang, X.; et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016, 17, 13.
Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140.
Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550.
Di Tommaso, P.; Chatzou, M.; Floden, E.W.; Barja, P.P.; Palumbo, E.; Notredame, C. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 2017, 35, 316–319.
Grüning, B.; Dale, R.; Sjödin, A.; Chapman, B.A.; Rowe, J.; Tomkins-Tinch, C.H.; Valieris, R.; Köster, J.; The Bioconda, T. Bioconda: Sustainable and comprehensive software distribution for the life sciences. Nat. Methods 2018, 15, 475–476.
Wang, W.; Zhang, X.; Dai, D.-Q. DeFusion: A denoised network regularization framework for multi-omics integration. Brief. Bioinform. 2021, 22, bbab057.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.