Colorectal cancer (CRC) is the third most prevalent cancer worldwide, with nearly two million newly diagnosed cases each year. The survival of patients with CRC greatly depends on the cancer stage at the time of diagnosis, with worse prognosis for more advanced cases. Consequently, considerable effort has been directed towards improving population screening programs for early diagnosis and identifying prognostic markers that can better inform treatment strategies.
1. Introduction
Cancer is among the most prominent life-threatening diseases in the world, with over 19 million newly diagnosed cases in 2020 alone
[1]. Notably, colorectal cancer (CRC) is one of the most commonly diagnosed cancers, with nearly 2 million new cases each year (~10% of all new cancer cases). Moreover, CRC is the second leading cause of all cancer-related deaths, claiming almost 1 million lives in 2020
[1]. Although CRC is extremely deadly in its advanced stages, its development is gradual. It begins with the pathological transformation of normal colonic epithelium into an adenomatous polyp and ultimately progresses to invasive cancer
[2]. CRC progression is generally categorized into five stages (0 to IV), depending on the extent of the disease and its clinical features
[3]. The lethality of CRC is largely correlated with the stage of the disease at diagnosis. At its early stages (stages 0–II), CRC is very treatable, with 5-year survival rates as high as 90%. However, only 38% of CRC cases are diagnosed at an early, localized stage. By the later stages, the 5-year survival rate dramatically decreases to as low as 14%
[4]. Hence, early diagnosis of CRC is key to saving lives.
While CRC has historically affected older populations, recent trends show an increase in cases in those under 50 years old. This has resulted in a decrease in the median age of patients with CRC from 72 to 66 years. This is particularly concerning, as early-onset CRC is more often diagnosed at advanced (less treatable) stages than CRC in the traditional patient population
[5]. Despite these negative trends, overall CRC mortality and incidence rates have consistently improved each year, reflecting the rise in preventative screenings, new testing, and targeted treatments
[4].
Given the prominence and severity of this disease, considerable recent research has sought to better understand its causes, prognosis, and outcome. However, further research is still needed to improve prevention and treatment, including the identification of diagnostic and prognostic biomarkers. Compared to traditional diagnostic methods, such as colonoscopy, and prognostic methods, such as measuring tumor size and metastasis, the detection of biomarkers is less invasive. Unlike traditional tests, biomarker analysis can be carried out using urine, fecal, plasma, saliva, or serum samples
[6]. Thus, biomarkers have the potential to distinguish between benign and cancerous tumors (polyps vs. carcinomas) less invasively, while providing more accurate predictions of disease progression, likelihood of relapse, and even chance of onset
[6].
Biomarkers have been highly successful in informing diagnosis, prognosis, treatment, and preventative measures in other cancers
[7,8,9,10]. One of the most famous examples, mutations in
BRCA1/BRCA2 genes, have given patients the ability to reliably assess their risk of developing breast cancer in their lifetime, as well as their risk of relapse after a first bout of breast cancer. Additionally, mutations in
BRCA1/BRCA2 genes can be used for informing patient care and treatment strategies after disease onset
[7]. Similarly, the diagnostic lncRNA biomarker,
PCA3, has been approved for clinical use in suspected cases of prostate cancer
[11,12]. The ratio of
PCA3, which encodes a prostate-specific RNA, to prostate-specific antigen (PSA) is measured in urine samples and used to increase the specificity of the diagnosis
[11,12,13]. In CRC, many different molecules have been identified as potential biomarkers, including lncRNAs. LncRNAs are RNA molecules longer than 200 nucleotides that do not code for proteins. They have been broadly classified into sense, antisense, bidirectional, intronic, and intergenic lncRNAs, depending on their position relative to protein-coding genes
[14]. The vast majority of characterized lncRNAs are synthesized by Polymerase II, and subsequently spliced and 5′-capped. Additionally, some lncRNAs are also polyadenylated
[15]. LncRNAs are poorly conserved, contain fewer exons than protein-coding genes, and generally have limited expression. Many lncRNAs are localized in the cell nucleus, where they exert regulatory functions by binding to DNA or DNA-associated proteins
[16]. Other lncRNAs are transported to the cytosol, where they can interact with other cytosolic molecules. LncRNA mechanisms of action are generally classified into the following four main groups: chromatin regulation, gene regulation, scaffolding and condensation, and post-transcriptional regulation, as illustrated in
Figure 1 [17]. LncRNA expression is generally restricted in space and time, as it is often tissue- or cell-type-specific. Alterations in the pattern of expression of lncRNAs have been recurrently reported in cancer, where they can act as either oncogenes or tumor suppressors
[18]. Overexpression and/or downregulation of lncRNAs in tumors is often associated with additional epigenetic alterations, such as DNA (de)methylation of promoters or enhancers
[19,20,21]. It has been shown that differential expression of a subset of lncRNAs is associated with heterogeneous features of CRC and with functional pathways that mediate CRC, such as the TGF-β and WNT pathways, immunity, epithelial-mesenchymal transition (EMT), and angiogenesis
[22]. The biomarker potential of lncRNAs has been increasingly studied in CRC in recent years
[23]. However, most of these candidate lncRNAs currently lack proper experimental validation or characterization to be considered promising targets.
Figure 1. lncRNAs in CRC. Schematic description of the most frequent mechanisms of action (from A–F) reported for the list of the candidate lncRNAs biomarkers (indicated by numbered circles) listed in Table 2, with strong supporting evidence in CRC. Highlighted in blue are the names of the lncRNAs with more than one mechanism reported. Created using Biorender.com.
Figure 2 provides an overview of the discovery and validation approaches for identifying clinically relevant lncRNAs, as described in detail in this review.
Figure 2. Methods for detecting and validating lncRNA biomarkers in CRC. After detection of candidate lncRNAs via the methods shown above (see text for full descriptions), validation techniques are used to characterize and assess which candidate lncRNAs are the most suitable for future studies. Regulatory assays may be carried out in cell lines, organoids, or in vivo models. Well characterized lncRNAs can be used as diagnostic/prognostic tools, as targets in future therapies, and as subjects of mechanistic studies. Created with Biorender.com.
2. Approaches to Identify Relevant lncRNAs in CRC
Traditionally, approaches for determining the diagnosis and prognosis of CRC cases have been limited to non-molecular factors. In regard to diagnosis, colonoscopy and subsequent biopsy have been the gold standard in CRC screening
[24]. Similarly, prognostic tools have been dominated by clinical and histological criteria including measurements of tumor size, tumor grade (stages described previously), and patient age among others
[25]. However, recent research has started to move away from these approaches. Instead, focus has shifted towards identifying the prognostic and therapeutic potential of molecular biomarkers, including lncRNAs. LncRNAs comprise the majority of noncoding RNAs, many of which have unknown functions
[14,26]. With the rise of next-generation sequencing (NGS) technologies and the subsequent ability to collect and analyze large volumes of data, many lncRNAs with prognostic and therapeutic potential in CRC have been identified. However, many of these candidates result from large-scale approaches that do not constitute conclusive proof, and therefore require further validation. Here, the most commonly used methods for identifying lncRNAs involved in CRC are summarized, and the advantages and disadvantages of each are discussed.
2.1. RNA Sequencing
With the increased accessibility of NGS, many researchers have begun to study transcriptional alterations in cancer through RNA sequencing (RNA-Seq). This technique allows researchers to reconstruct and quantify the expression of transcripts present in biological samples
[27]. RNA-Seq studies that compare a CRC tumor and healthy tissue from the same patient can be used to uncover the differential expression and somatic mutations of various lncRNAs. These differentially expressed or mutated lncRNAs have the potential to be involved in the onset and progression of CRC, warranting further study.
Due to its untargeted approach, RNA-Seq uniquely allows for the discovery of novel lncRNAs. However, RNA-Seq does have some disadvantages. One significant challenge in detecting lncRNAs through RNA-Seq is their low relative abundance. Compared to protein-coding genes, lncRNAs are expressed at very low levels, constituting a minute fraction of the total RNA transcripts in a sample. To address this issue, target enrichment techniques utilizing probe-based strategies have been developed, enabling more effective lncRNA detection
[28,29]. Given the large amounts of data that RNA-Seq analyses produce, there is also a risk of false positive detection of transcripts. This can result from noisy expression or from transcripts that encode proteins
[27]. In fact, one study that investigated the reproducibility of differential expression results from identical replicates found that up to 8% of differentially expressed (DE) genes identified by RNA-Seq were false positives, even when using stringent identification parameters
[30]. Perhaps most concerning, some studies have questioned the reproducibility of RNA-Seq results, as the resulting analyses often depend on the quality control, alignment, and quantification tools used in the analysis pipeline
[31,32]. Moreover, the lack of normalization in statistical analysis methods often means that results are overdispersed and replicate-dependent
[32,33].
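The paired tumor versus healthy tissue comparison and the normalization step discussed above can be illustrated with a minimal sketch. It is not the pipeline of any cited study: the transcript names and read counts are invented, counts-per-million scaling is only one of several possible normalizations, and a real analysis would use a dedicated statistical package with replicate-aware dispersion estimates and multiple-testing correction.

```python
import math
from statistics import mean


def counts_per_million(counts):
    """Rescale one sample's raw read counts so that they sum to one million."""
    total = sum(counts.values())
    return {transcript: 1e6 * n / total for transcript, n in counts.items()}


def paired_log2_fold_changes(tumor_samples, normal_samples, pseudocount=1.0):
    """Average the per-patient log2(tumor / normal) ratio for each transcript.

    tumor_samples[i] and normal_samples[i] must come from the same patient.
    The pseudocount avoids division by zero for undetected transcripts.
    """
    ratios = {}
    for tumor, normal in zip(tumor_samples, normal_samples):
        tumor_cpm = counts_per_million(tumor)
        normal_cpm = counts_per_million(normal)
        for transcript, value in tumor_cpm.items():
            ratio = (value + pseudocount) / (normal_cpm[transcript] + pseudocount)
            ratios.setdefault(transcript, []).append(math.log2(ratio))
    return {transcript: mean(values) for transcript, values in ratios.items()}


# Hypothetical read counts for two patients (invented for illustration only).
tumors = [
    {"LNC_A": 40, "GENE_B": 5000, "GENE_C": 4960},
    {"LNC_A": 80, "GENE_B": 10000, "GENE_C": 9920},
]
normals = [
    {"LNC_A": 10, "GENE_B": 5000, "GENE_C": 4990},
    {"LNC_A": 20, "GENE_B": 10000, "GENE_C": 9980},
]

log2_fc = paired_log2_fold_changes(tumors, normals)
for transcript, value in sorted(log2_fc.items()):
    print(f"{transcript}: {value:+.2f}")
```

In this toy data set the second patient was sequenced twice as deeply as the first, yet both yield the same fold change for the hypothetical transcript LNC_A after depth normalization. Without that step, differences in sequencing depth would be mistaken for differences in expression.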
2.2. Microarrays
Microarrays are another genome-wide screening approach that can be used to identify lncRNAs involved in CRC. Microarrays are glass slides carrying selected DNA oligonucleotide sequences at known locations, which hybridize to specific lncRNAs (converted to cDNA) from a biological sample. Complementary base pairing between the sample and the oligonucleotide sequences on the chip produces light proportional to gene expression. Thus, hybridization allows for the detection of gene expression changes in a previously selected group of candidate lncRNAs in cancer cells
[34,35]. Compared to RNA-Seq, microarrays are often less costly and require less complex and extensive bioinformatic analysis. However, they do require a preselection of lncRNAs that are suspected to have a biologically relevant function in CRC. Consequently, microarrays do not allow for the discovery of novel lncRNAs
[27]. Recent work has also called into question the reliability and reproducibility of microarrays, due to unstable surface deposition chemistries
[36]. A comparison of five different microarray data platforms revealed that there is poor concordance between systems in their output of results
[37,38]. Additionally, poor sensitivity in detecting lowly expressed molecules, such as lncRNAs, limits detection efficiency
[39].
2.3. CRISPR-Cas9 Screening
CRISPR-Cas9 screening for lncRNAs is one of the most optimizable modes of discovery for identifying candidate lncRNAs because of the many variables that can be altered when building a screen
[40,41]. CRISPR-Cas9 screens can be designed with cells perturbed through loss-of-function techniques (inhibition or deletion) or gain-of-function techniques (activation). Screens can also target a small number of lncRNAs or larger lncRNA pools. After selecting a perturbation method and the target lncRNAs, a single-guide RNA (sgRNA) library must be selected. NGS data of sgRNA counts are then used to identify lncRNAs associated with diseases such as CRC
[42,43]. Compared to RNA-Seq or microarray analyses, there is no need to assume that differential expression implies function. Instead, CRISPR-Cas9 screening in a variety of cell lines can be used to identify lncRNAs through cancerous phenotypes, such as proliferation or drug resistance, independent of differential expression
[42]. A major benefit of CRISPR-Cas9 screening is its customizability and effective targeting of lncRNAs
[44]. However, statistical analyses of these screens are often hindered by the use of limited replicates
[43]. Given the recent implementation of this technique, there is also a need to establish benchmarks to improve evaluation and reproducibility
[42].
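As a rough illustration of how sgRNA counts from NGS can be turned into per-lncRNA scores in a proliferation screen, the sketch below compares guide abundance before and after selection. Everything in it is an assumption made for illustration: the guide names, the counts, the `CONTROL` label for non-targeting guides, and the choice of a median log2 fold change centered on those controls. Dedicated screen-analysis tools use more elaborate statistics and replicate handling.

```python
import math
from statistics import median


def log2_fold_changes(before, after, pseudocount=1.0):
    """Depth-normalized log2(after / before) for every sgRNA."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    result = {}
    for sgrna, n_before in before.items():
        cpm_before = 1e6 * n_before / total_before
        cpm_after = 1e6 * after.get(sgrna, 0) / total_after
        result[sgrna] = math.log2((cpm_after + pseudocount) / (cpm_before + pseudocount))
    return result


def score_targets(before, after, sgrna_to_target, control="CONTROL"):
    """Median sgRNA log2 fold change per target, centered on control guides."""
    lfc = log2_fold_changes(before, after)
    baseline = median(v for sg, v in lfc.items() if sgrna_to_target[sg] == control)
    per_target = {}
    for sgrna, value in lfc.items():
        per_target.setdefault(sgrna_to_target[sgrna], []).append(value - baseline)
    return {target: median(values) for target, values in per_target.items()}


# Hypothetical library: three guides against one lncRNA, three non-targeting controls.
guides = {
    "sgX_1": "LNC_X", "sgX_2": "LNC_X", "sgX_3": "LNC_X",
    "sgC_1": "CONTROL", "sgC_2": "CONTROL", "sgC_3": "CONTROL",
}
counts_before = {sgrna: 1000 for sgrna in guides}
counts_after = {
    "sgX_1": 250, "sgX_2": 200, "sgX_3": 300,
    "sgC_1": 1000, "sgC_2": 1100, "sgC_3": 900,
}

scores = score_targets(counts_before, counts_after, guides)
for target, score in sorted(scores.items()):
    print(f"{target}: {score:+.2f}")
```

A strongly negative score means that cells carrying guides against that lncRNA were depleted during growth, which is the kind of phenotype-based evidence that does not depend on differential expression. Taking the median over several guides per target limits the influence of a single inefficient or off-target guide.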
2.4. Bioinformatic Approaches
Several bioinformatic approaches that analyze genomic or transcriptomic data can be used to identify lncRNAs with prognostic or therapeutic potential in CRC. One such approach is the “detecting lncRNA cancer association” (DRACA) method, which was developed to evaluate potential molecular biomarkers by predicting lncRNA-cancer associations
[45,46]. DRACA uses matrix factorization to consider interactions between lncRNAs, cancer prognosis, and other factors, such as miRNAs and genes, to predict lncRNA-cancer association. Using already available data on cancer prognosis, this approach results in both novel and biologically meaningful discoveries
[46]. Likewise, the tool OncodriveFML analyzes known somatic mutations in genomic elements (such as lncRNAs) to predict which elements have undergone positive selection during tumorigenesis. LncRNAs with a high functional mutation bias are candidates for a role in CRC
[47]. While innovative in its approach, OncodriveFML does have some limitations. OncodriveFML relies on the characterization of the functional impact of mutations in its calculations, meaning less characterized mutations in noncoding regions will not be counted. It also only predicts based on nucleotide substitutions, forgoing the identification of lncRNAs with insertions or deletions
[47]. Another tool, ExInAtor, also relies on mutational patterns of tumoral DNA, rather than changes in gene expression, to identify tumor driver lncRNAs
[48]. This approach is especially advantageous for its specificity, rapid computation, and its ability to identify lncRNAs involved in tumorigenesis and not solely in upstream regulatory processes. However, ExInAtor does not evaluate the functional impact of mutations in lncRNAs, limiting the sensitivity of this approach and leaving room for many false negatives. Candidate lncRNAs identified by this approach also unexpectedly harbored many repeats and had lower GC content, indicating a possible bias of the tool
[48]. An additional resource is the RNA Atlas, expanded with non-coding RNAs, which covers more than 3310 novel lncRNAs from RNA-Seq experiments performed in more than 300 human tissues and cell lines, including CRC
[49].
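The idea behind mutation-based driver detection can be illustrated with a simplified burden test. This sketch is not the published ExInAtor or OncodriveFML algorithm. It assumes a plain one-sided binomial model in which, under neutrality, each mutation falls in the exons with probability equal to the exons' share of the total sequence length. The function name, mutation counts, and region lengths are all hypothetical.

```python
from math import comb


def exon_burden_p_value(exon_mutations, exon_length,
                        background_mutations, background_length):
    """One-sided binomial p-value for an excess of mutations in exons.

    Null model: every observed mutation lands in the exons with probability
    exon_length / (exon_length + background_length).
    """
    n = exon_mutations + background_mutations
    p = exon_length / (exon_length + background_length)
    # Probability of observing at least `exon_mutations` exonic hits by chance.
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(exon_mutations, n + 1)
    )


# Hypothetical lncRNA with 2,000 nt of exons and 18,000 nt of local background.
enriched = exon_burden_p_value(12, 2000, 18, 18000)  # 12 of 30 mutations in exons
neutral = exon_burden_p_value(3, 2000, 27, 18000)    # 3 of 30, as expected by length

print(f"enriched candidate: p = {enriched:.2e}")
print(f"neutral candidate:  p = {neutral:.2f}")
```

Because the test only counts mutations, it says nothing about their functional impact, which mirrors the sensitivity limitation described above for approaches that do not score mutation effects. It also shows why such methods need sufficient mutation counts per lncRNA to reach significance.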