Using a modified RNA-sequencing (RNA-seq) approach, we discovered a new family of unusually short RNAs mapping to the 5.8S ribosomal RNA (rRNA), which we named dodecaRNAs (doRNAs), in reference to the number of core nucleotides (12 nt) their members contain.
RNA sequencing;small RNA;non-coding RNA;RT-qPCR;5.8S rRNA
High-throughput sequencing (HTS) technologies revolutionized molecular biology and genetics by allowing the sequencing of entire genomes and transcriptomes [1,2,3,4,5,6]. The use of RNA sequencing (RNA-seq), in particular, markedly expanded the repertoire of non-coding RNA (ncRNA) species [7
which are now recognized as critical regulators of gene expression. ncRNAs have been reported, among other things, to control the binding of transcription factors and to regulate alternative splicing and messenger RNA (mRNA) translation [9
which allows cells to rapidly adjust their gene expression programming to respond and adapt to a changing environment, including cellular stress conditions. Notably, ncRNAs are involved in cell proliferation and development, and thus, their impairment likely contributes to the etiology of various diseases, including cancer [17,18,19,20].
ncRNAs are classified according to their origin, length, and/or function [21,22,23,24]. Small RNAs (sRNAs) comprise transcripts less than 200 nucleotides (nt) in length. Small RNA sequences may be transcribed from dedicated sequences and promoters, or derived from several pre-existing RNA species, including mRNA introns or exons, transfer RNAs (tRNAs) or ribosomal RNAs (rRNAs) [26,27]. This yields an extremely diverse population of sRNAs most often involved in the specific recognition of nucleic acid targets through complementary base pairing [28,29].
Historically, the discovery of microRNAs (miRNAs, 19 to 24 nt) was delayed by half a century because of the dogma prevailing at the time, which held that such short RNAs could not be biologically relevant and led researchers to perceive them as mere degradation products [30,31]. Despite the now recognized importance of sRNAs and the lessons learned from the past, a comparable belief persists today that arbitrarily restricts function or interest to sRNAs longer than 16 nt. Endogenous sequences shorter than 16 nt are assumed to lack specificity, to be impossible to map to the genome with confidence, or to have no biological significance. They are therefore systematically discarded, either prior to library construction or from sequencing datasets, in order to improve the signal-to-noise ratio, increase the depth of sequencing or facilitate downstream computational analyses, allegedly without the risk of losing important information. These beliefs are strong and have shaped the standardized pipelines of most, if not all, HTS platforms and procedures currently available to researchers worldwide.
The very high abundance of rRNAs in biological samples (they may represent ~80% of the total RNA of a cell) is also viewed negatively. To improve the sensitivity of HTS experiments, most researchers choose to eliminate rRNAs by using rRNA removal kits, which also eliminates the possibility of obtaining information on sRNAs derived from the most abundant cellular RNA. Known as small ribosomal RNAs (srRNAs), these sRNAs form an emerging family of ncRNAs [27,33] that display essential functions in gene regulation and development [34,35]. Whether endogenous sRNA species shorter than 16 nt or derived from rRNA exist remains unknown.
2. sRNA-Seq Analyses Revealed the Existence of 12-nt and 13-nt sRNAs
In a previous study, we used sRNA-seq analysis to investigate the sRNA profile (8–30 nt) of 11 samples from six different species (H. sapiens, M. musculus, D. melanogaster, A. thaliana, S. pombe, S. cerevisiae), and revealed the existence of very small RNAs of discrete sizes, a large part of which came from rRNA (Lambert et al., manuscript submitted).
Together, these RNA-seq data (Figure 1A) showed a relatively high abundance of 12-nt and 13-nt RNAs in human, mouse, Drosophila and S. pombe samples, but not in A. thaliana and S. cerevisiae. Notably, 13-nt RNAs were more abundant than 12-nt RNAs in human and S. pombe samples, whereas the two sizes appeared to be equally represented in mouse samples. In contrast, 12-nt RNAs were more abundant than 13-nt RNAs in Drosophila (Figure 1A,B). In fact, in these four organisms, RNAs of these two sizes (12 and 13 nt) represented between 22% and 74% of all RNAs sequenced in the 8- to 30-nt window of RNA length (Figure 1B), and the mouse neuronal N2a cell line was the most enriched in 12-nt and 13-nt RNAs combined.
Figure 1. Relative abundance of 12 and 13 nt sRNA sequences obtained by sRNA-seq analyses of 11 different biological samples derived from 6 different species. (A) RPM abundance of RNA of 8 to 30 nt from 11 samples. (B) Relative abundance of 12-nt, 13-nt and other RNAs, expressed as RPM. (C) Relative proportion of the most abundant 12-nt RNA (RNA a) and 13-nt RNA (RNA b), compared with the other 12-nt and 13-nt RNAs detected by sRNA-seq (% of total reads). PMN, polymorphonuclear leukocytes; PMP, platelet-derived microparticles; HUVEC, human umbilical vein endothelial cells; HEK293, human embryonic kidney 293 cells; OC3, Old Cerebellum 3; N2a, mouse neuroblastoma cells; NIH/3T3, mouse embryonic fibroblast cells.
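The length profile summarized in Figure 1 can be illustrated with a minimal Python sketch. The function names (`rpm_by_length`, `window_share`) and the toy reads are hypothetical, and the sketch assumes that reads per million (RPM) are computed relative to the reads retained in the 8- to 30-nt window; the normalization actually used in the study may differ.

```python
from collections import Counter


def rpm_by_length(reads, min_len=8, max_len=30):
    """Return {read length: reads per million} for reads inside the size window.

    Assumption: the RPM denominator is the number of reads kept in the window.
    """
    kept = [r for r in reads if min_len <= len(r) <= max_len]
    total = len(kept)
    if total == 0:
        return {}
    counts = Counter(len(r) for r in kept)
    return {length: n * 1_000_000 / total for length, n in sorted(counts.items())}


def window_share(rpm, lengths=(12, 13)):
    """Fraction of the window (0 to 1) contributed by the given read lengths."""
    return sum(rpm.get(length, 0.0) for length in lengths) / 1_000_000


if __name__ == "__main__":
    # Toy reads only (homopolymers), not data from the study.
    toy_reads = ["A" * 12] * 3 + ["C" * 13] + ["G" * 22] * 4 + ["T" * 5] * 2
    profile = rpm_by_length(toy_reads)
    print(profile)
    print(window_share(profile))
```

Keeping the lower bound of the window as a parameter makes explicit that reads shorter than 16 nt are retained here rather than discarded.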
Further bioinformatics analyses revealed that a single 12-nt RNA sequence represented at least 70% of all 12-nt RNAs in H. sapiens, M. musculus and D. melanogaster. We observed the same results for 13-nt RNAs (Supplementary Figure S1). This abundant 13-nt sequence was the same as the 12-nt RNA, but with an extra cytosine (C) at its 5′ end. The sequences of the human and mouse 12-nt and 13-nt RNAs were identical. In total, two sequences accounted for 90% of the RNA reads (Figure 1C). In contrast, while 12-nt and 13-nt RNAs were detected in S. pombe, no specific sequence was more abundant than the others.
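As a rough illustration of this kind of analysis, the sketch below finds the most frequent sequence within one length class and checks whether a 13-nt read is a 12-nt read carrying one extra 5′ nucleotide. The function names and the toy sequences are hypothetical placeholders; they are not the sequences reported in the study.

```python
from collections import Counter


def dominant_sequence(reads, length):
    """Return (sequence, fraction) for the most frequent read of a given length."""
    same_len = [r for r in reads if len(r) == length]
    if not same_len:
        return None, 0.0
    seq, n = Counter(same_len).most_common(1)[0]
    return seq, n / len(same_len)


def is_five_prime_extension(longer, shorter):
    """True if `longer` is `shorter` with exactly one extra nucleotide at its 5' end."""
    return len(longer) == len(shorter) + 1 and longer[1:] == shorter


if __name__ == "__main__":
    # Toy reads only; placeholder sequences, not real doRNA sequences.
    toy_reads = (
        ["AAAACCCCGGGG"] * 7
        + ["AAAACCCCGGGT"] * 3
        + ["TAAAACCCCGGGG"] * 4
        + ["CAAAACCCCGGGG"]
    )
    seq12, frac12 = dominant_sequence(toy_reads, 12)
    seq13, frac13 = dominant_sequence(toy_reads, 13)
    print(seq12, frac12)
    print(seq13, frac13)
    print(is_five_prime_extension(seq13, seq12))
```

A sample without a dominant sequence, as described for S. pombe, would show up as a low fraction for every candidate in that length class.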
Our sRNA-seq data support the existence of very small RNAs that are 12 nt or 13 nt long, more abundant than microRNAs, and that show species and cell specificity. In particular, two main sequences represented most of these 12-nt and 13-nt RNAs in human, mouse and Drosophila samples, but were absent from A. thaliana, S. pombe and S. cerevisiae.
3. The Two Most Abundant 12-nt and 13-nt Sequences Likely Derive from 5.8S Ribosomal RNA
First, screening of these 12-nt and 13-nt RNA sequences against genomic and transcript databases for vector, adaptor, linker and primer contamination did not yield any positive match, excluding the possibility that they represent an artifact. Using NCBI BLAST, we mapped the two most abundant 12-nt and 13-nt RNA sequences to the transcriptome of each organism. In human, both sequences perfectly matched ribosomal RNAs: the 5.8S rRNA and its longer 45S rRNA precursor. The same results were obtained when mapping these sequences to the murine transcriptome, as both are conserved between the two species. In Drosophila, despite the difference in nucleotide composition, the corresponding, equivalently abundant 12-nt and 13-nt RNA sequences also matched the Drosophila 5.8S and 45S rRNAs (Table 1). The orthologous 12-nt Drosophila RNA differed from the corresponding human and mouse sequences by 2 nt, whereas the 13-nt Drosophila RNA harbors an adenine (A) at its 5′ end, instead of a cytosine (C) in human and mouse. In both cases, the extra 5′ nucleotide matched the corresponding nucleotide on the longer 45S rRNA (Figure 2).
Figure 2. doRNA and C-doRNA sequences map to the 5′ end of the 5.8S rRNA. Schematic representation of doRNA and C-doRNA sequence alignment on the 45S rRNA in humans, mice and flies, using NCBI Nucleotide Reference Sequence (RefSeq) database. ETS, external transcribed spacer; ITS, internal transcribed spacer; rRNA, ribosomal RNA.
Table 1. Mapping of the doRNA and C-doRNA sequences to the human, mouse and fly transcriptomes using the NCBI nucleotide Basic Local Alignment Search Tool (BLAST). Searches were run with the blastn program (BLASTN 2.12.0+) against the standard “nucleotide collection (nr/nt)” database, restricted to the species “Homo sapiens,” “Mus musculus,” or “Drosophila melanogaster.” Only results with 100% identity and 100% query coverage are shown in the table.
|Homo sapiens RNA, 5.8S ribosomal N3 (RNA5-8SN3), rRNA
|Homo sapiens RNA, 45S pre-ribosomal N2 (RNA45SN2), rRNA
|Mus musculus 5.8S rRNA
|Mus musculus 18S rRNA, 5.8S rRNA and 28S rRNA
|Drosophila melanogaster pre-rRNA (pre-rRNA:CR45847), preRNA
|Drosophila melanogaster pre-rRNA (pre-rRNA:CR45846), preRNA
|Drosophila melanogaster pre-rRNA (pre-rRNA:CR45845), preRNA
|Drosophila melanogaster 5.8S rRNA (5.8SrRNA:CR45852)
|Drosophila melanogaster 5.8S and 2S rRNA
|Pre-rRNA, rRNA precursor; rRNA, ribosomal RNA.
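The 100% identity and 100% query coverage criterion used for Table 1 can be approximated locally by a plain exact-match search. The sketch below is not BLAST and does not reproduce the NCBI search; it is a simplified stand-in that scans the given strand only. The function names, reference names and sequences are all hypothetical toy values.

```python
def to_dna(seq):
    """Upper-case a sequence and convert RNA (U) to DNA (T) notation."""
    return seq.upper().replace("U", "T")


def find_exact_hits(query, references):
    """Return [(reference name, 1-based start)] for every exact match of `query`.

    An exact substring match corresponds to 100% identity over 100% of the query.
    """
    q = to_dna(query)
    hits = []
    for name, ref in references.items():
        r = to_dna(ref)
        start = r.find(q)
        while start != -1:
            hits.append((name, start + 1))
            start = r.find(q, start + 1)
    return hits


if __name__ == "__main__":
    # Toy references only; these are not real rRNA sequences.
    toy_refs = {
        "toy_precursor": "GGGTTTACGTACGTAACCAAA",
        "toy_mature": "ACGTACGTAACCAAA",
        "toy_unrelated": "TTTTTTTTTT",
    }
    for hit in find_exact_hits("ACGUACGUAACC", toy_refs):
        print(hit)
```

In this toy case the query matches both the mature sequence and the longer precursor that contains it, which mirrors the pattern of hits to both 5.8S rRNA and pre-rRNA entries in Table 1.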
We propose to name this new RNA family dodecaRNAs (doRNAs), in reference to the number of core nucleotides (12 nt) their members contain. The most abundant 13-nt variant of doRNA harbors, in human and mouse, a C at its 5′ end and was consequently termed C-doRNA. We found that the doRNA sequence mapped directly to the 5′ end of the 5.8S rRNA (Figure 2). Thus, doRNAs might be formed through a specific and controlled cleavage of the 5.8S rRNA or of transient rRNA precursors leading to 5.8S rRNA (e.g., 45S, 36S, 32S, 12S, 8S rRNAs). As rRNAs are the most abundant RNAs in cells (80% of all RNAs), it is very likely that the doRNA and C-doRNA sequences, which are similarly overly abundant (e.g., compared to microRNAs) in the 8- to 30-nt window of RNA sizes, are sRNA fragments derived from rRNA (rRFs).
This possibility is reinforced by the presence, in human and mouse 5.8S rRNA sequences, of a recurrent 2′-O-ribose methylation of the uracil (U) positioned immediately downstream of the 3′ nucleotide of doRNAs. This feature suggests that the modified U may be a signal or a determinant for the generation of their 3′ extremity (Figure 2), and further supports the idea that doRNAs originate from the 5.8S rRNA or longer precursors containing it.
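The positional argument made here (a fragment lying at the 5′ end of a longer RNA, with a specific nucleotide immediately upstream and downstream) can be sketched as a small lookup. The function name `flanking_context` and the toy sequences are hypothetical; the example does not use the real 5.8S rRNA or doRNA sequences and says nothing about nucleotide modifications, which cannot be read from sequence alone.

```python
def flanking_context(reference, fragment):
    """Locate the first exact match of `fragment` in `reference`.

    Returns a dict with 1-based start/end coordinates and the nucleotides
    immediately upstream (5') and downstream (3') of the match, or None
    if the fragment is not found. Missing neighbours are reported as None.
    """
    ref = reference.upper().replace("U", "T")
    frag = fragment.upper().replace("U", "T")
    i = ref.find(frag)
    if i == -1:
        return None
    end = i + len(frag)  # 0-based index just past the match, also the 1-based end
    return {
        "start": i + 1,
        "end": end,
        "upstream": ref[i - 1] if i > 0 else None,
        "downstream": ref[end] if end < len(ref) else None,
    }


if __name__ == "__main__":
    # Toy reference and fragment; placeholders, not real rRNA sequences.
    toy_reference = "CACGTACGTAACCTGG"
    toy_fragment = "ACGUACGUAACC"
    print(flanking_context(toy_reference, toy_fragment))
```

With real sequences, the upstream nucleotide would correspond to the extra 5′ base of the 13-nt variant and the downstream nucleotide to the position discussed above as a possible determinant of the 3′ extremity.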