1. Introduction
RNA editing is a widespread post-transcriptional modification in eukaryotes that includes several types of biochemical changes such as deletions, insertions, and base conversions
[1]. Adenosine-to-inosine (A-to-I) RNA editing is the most prevalent type of RNA editing in metazoans, and is catalyzed by adenosine deaminase acting on the RNA (ADAR) enzyme
[2], which acts on double-stranded RNA (dsRNA). There are three ADAR genes in the mammalian genome: (1) ADAR1, which encodes for two protein isoforms, ADAR1p150 and ADAR1p110; (2) ADAR2; and (3) the catalytically inactive ADAR3
[2][3]. In addition to A-to-I editing, there is also cytidine-to-uridine (C-to-U) RNA editing, which is catalyzed by apolipoprotein B and mainly found in plants, although it is also present in mammals
[4].
Among the RNA modifications that have been identified to date
[5], A-to-I and C-to-U RNA editing are the only ones that can be readily detected from RNA sequencing (RNA-seq) data since the detection of other RNA modifications requires chemical modifications (e.g., bisulfite RNA-seq to detect 5-methylcytosine (m5C) sites
[6][7]) or immunoprecipitation before the sequencing library preparation (e.g., N6-methyladenosine sequencing (m6A-seq)
[8]). RNA editing sites (RES) can be identified by comparing matching RNA-seq and DNA sequencing (DNA-seq)
[9]. In this approach, RES can be identified as adenosine to guanine (A > G) or cytidine to thymidine (C > T) mismatches for A-to-I or C-to-U RNA editing, respectively
[10]. However, this approach may not be cost-effective or feasible due to the requirement of matching RNA- and DNA-seq data from the same sample of cells or tissues. To circumvent this problem, computational strategies to detect RES from RNA-seq data alone have been developed. To correctly identify RES using these approaches, several filters must be implemented within the associated pipeline to reduce the number of false positives. These filters are designed to minimize the presence of sequencing errors and known nucleotide differences among individuals (e.g., single nucleotide polymorphisms (SNPs))
[11]. Additionally, the usage of strand-specific RNA-seq protocols is preferred as it facilitates the identification of RES in regions with overlapping transcripts generated from opposite strands
[12][13].
With over one million predicted A-to-I RNA editing sites in human cells
[14], the biological effects of RNA editing are diverse, including apoptosis and cell survival
[15][16], neural functions
[17][18], and immune responses
[19][20], such as the recognition of self to non-self dsRNA
[21]. Given that RES can be readily detected from RNA-seq data, interest in analyzing RNA editing patterns has significantly increased in recent years
[22]. To cope with such increased demands, there are several bioinformatic tools available with different approaches for the identification and prediction of RES.
2. Availability of RNA Editing Detection Tools
In total, 10 major RNA editing detection tools were compared according to the following criteria: required formats for input files, handling of strandedness and/or replicate samples, and the availability of read mappers or usage of the specific mappers. A detailed overview of the features of each tool is provided in the following text.
REDItools
[23] was the earliest RNA editing detection tool to be introduced in the field and can be used for both the detection of known RES registered in RNA editing databases (e.g., REDIportal
[24]) and the de novo prediction of RES without requiring a priori RNA editing information. This can be achieved either by using RNA-seq data alone or by comparing RNA- and DNA-seq data. If matching DNA-seq data are not available, variants are compared to an empirical substitution distribution and only statistically significant positions are reported. This tool takes pre-aligned reads in BAM format as the input. Several filters and quality checks can be applied to remove positions according to different parameters, such as base and mapping quality score, position coverage, and the removal of substitution in homopolymeric regions (i.e., regions that include stretches of the same nucleotide (e.g., AAAAA or TTTTT)). The same authors modified REDItools by introducing an optimized and parallel multi-node version, REDItools2
[25]. REDItools2 provides two modes of analysis: (1) a serial and (2) a parallel mode, which requires the installation of a message-passing interface (MPI) implementation.
GIREMI
[26] identifies RES from RNA-seq data alone based on allelic linkage
[27]; that is, the tendency of genetic markers physically near to each other to be inherited together during meiosis chromosomal crossover. Two nearby SNPs should maintain the same haplotype, while RES are likely to vary among reads relative to nearby SNPs. Based on this assumption, this method calculates the mutual information between publicly available SNP sites (such as those found in the dbSNP database
[28]) and uncharacterized RNA variants in the RNA-seq sample. In addition to allelic linkage, a generalized linear model was employed to enhance the predictive power. GIREMI takes BAM files and a list of filtered single nucleotide variants (SNVs) as the input and if several BAM files are provided, they are treated as replicates and are combined into one dataset. The list of SNVs must include both SNVs available in databases (e.g., dbSNP database) and SNVs present in the RNA-seq samples, obligating users to perform genotype-calling analysis. Additionally, variant call format (VCF) files generated by variant callers must be adapted to the input format specified by the authors. Due to the increased demand for long reads to cover the full-length transcriptome (developed by Pacific Bioscience (PacBio) (in Menlo Park, California, United States of America) and Oxford Nanopore Technologies (ONT) (in Oxford Science Park, Oxford, UK)), the same laboratory developed a new method called L-GIREMI
[29], which uses a similar approach as GIREMI for the prediction of RES from long-read RNA-seq data.
RES-Scanner
[30] performs genome-wide identification and annotation of RES for any species, although its usage is limited by the availability of matching RNA- and DNA-seq data for the same sample, which is not always possible. The annotation of RES is only possible if annotation files with relevant features are provided. RES-Scanner uses a combination of different statistical models (e.g., Bayesian, binomial, and frequency) for homozygous genotype calling and filters (e.g., mapping quality, removal of duplicates, and depth of reads) to remove potential false-positive RES. RES-Scanner uses either raw reads in FASTQ format or pre-aligned reads in BAM format as the input. In the former case, BWA
[31] is used as a read mapper to align FASTQ files with a combination of the reference genome and exonic sequences surrounding known splice junctions. A new version of RES-Scanner, RES-Scanner2, is now available, which can identify hyper-editing sites; that is, regions with high editing levels due to the tendency of ADAR1 to edit sites in clusters
[32][33].
RNAEditor
[11] is a fully automated pipeline that takes uncompressed FASTQ files as the input and performs all the necessary steps to predict RES. This tool has both a command line and a graphical user interface (GUI) version. The approach of RNAEditor consists of three steps: (1) read mapping via BWA; (2) SNP calling and purification via the GATK tool
[34]; and (3) the annotation of RES. To reduce the number of false-positive RES, known SNP sites are filtered out using the information gathered from the dbSNP database, 1000 Genomes Project
[35], and HAPMAP project
[36]. Additionally, a clustering algorithm is used to detect highly edited regions, termed editing islands, because ADAR1 catalyzes deamination in clusters of dsRNA sequences. RNAEditor has been updated to support more recent versions of some dependent software programs, including Python3 and PyQt5, although it still requires an old version of GATK (v3.7) and Java (v8), obligating users to downgrade these dependent software products. Furthermore, RNAEditor uses discontinued datasets, such as the HAPMAP project (last release: Ensemble 97). Thus, compatibility with the existing and most up-to-date operating systems and programs is low.
JACUSA
[37] detects SNVs by comparing RNA-DNA or RNA-RNA sequencing samples and integrating information from replicate experiments. The identification of position-specific RES using RNA- and DNA-seq data is straightforward, as RNA editing is a post-transcriptional modification that is not present in genomic DNA sequences. For RNA-RNA comparisons, RNA-seq data from two different conditions (e.g., control and knockdown samples) are compared to identify differential RES. In addition, JACUSA removes commonly known artefacts such as those produced by mapping programs or sequencing technologies. JACUSA2
[38], its successor, has a shorter running time and captures more complex read signatures, including substitutions, insertions, deletions, and read truncations. Additionally, it includes a new mode of analysis to detect read arrest events in pair-end read samples. Read arrest events lead to shorter reads and can happen during the library preparation due to the premature termination of the reverse transcriptase because of RNA degradation or structures or the presence of pseudouridines
[39]. Both JACUSA and JACUSA2 use BAM files as the input. JACUSA2 is implemented with a complementary R package called JACUSA2helper. This R package was developed to assist with the downstream processing of JACUSA2 results, such as filtering and plotting. For example, JACUSA2helper removes the sites that have been marked by JACUSA2 as artefacts. It also filters sites according to their coverage and variants.
SPRINT
[40] identifies RES and hyper-editing sites without any pre-processing step to remove known SNPs. This is advantageous because not all SNP sites should be considered as false-positive RES, as individual differences (e.g., one person compared to another) may be present on each SNP site. The analysis of SPRINT consists of three steps: (1) read processing; (2) SNV calling; and (3) the identification of RES. The last step is performed by clustering SNV duplets, which are defined as consecutive SNVs that have the same type of variation, such as two consecutive SNVs corresponding to A > G differences in RNA-seq reads compared to the reference genome. The assumption here is that a pair of consecutive SNVs tend to be true RES if they are within 400 nucleotides (nt) from each other, while a pair of consecutive SNVs tend to be SNPs if they are within 1600 nt. These assumptions are because ADAR enzymes tend to act in clusters. SPRINT can take uncompressed FASTQ files or BAM files as the input. In the former case, BWA is used as a read mapper. It should be noted that the identification of hyper-editing sites is only available if raw FASTQ files are used for the analysis.
RESIC
[41] can classify and identify both RES and hyper-edited regions for any organisms and any number of input datasets by using an alignment graph model and multiple filtering steps. This tool accepts uncompressed FASTQ files as the input, although the usage of RESIC is not very convenient, as users cannot provide file names as command options, but must manually edit the Python script specifying the input data. Optionally, users can specify a negative dataset, i.e., FASTQ files that do not exhibit the desired editing phenomena, such as those produced in RNA-seq experiments from cells with inactivated ADAR genes (i.e., ADAR ablated samples). Negative datasets are used to exclude changes associated with SNPs. If negative datasets are not available, a list of SNPs in variant call format (VCF), such as those available in the dbSNP database, can be used. However, negative datasets and the VCF file from the dbSNP database cannot be provided at the same time. In the first part of the analysis, FASTQ files are aligned to the reference genome using bowtie
[42] as a read mapper. Then, a graph aligner model is used to identify and classify RES into different categories (e.g., non-repetitive and hyper-non-repetitive A to C). During the identification of RES, several filters are employed, such as the removal of ambiguous reads and low-coverage sites.
Machine learning (ML) is increasingly being adopted within computer science and genomics and several ML-based tools for RES detection have been developed in recent years. For example, RDDpred
[43] is the first pipeline developed based on an ML approach for the prediction of RES. RDDpred uses the information from RNA-editing databases (i.e., RADAR
[44] and DARNED
[45]) and the mapping errors set (MES) method
[46] to generate a positive and a negative training set of RES, respectively. The MES method identifies the regions that are likely to generate SNP site errors by simulating randomly mutated sequencing reads that are aligned to the reference genome and analyzed using variant callers. In the first part of the analysis, positive and negative RES are identified using these training sets, which are then employed in the next part of the analysis using the prediction model. The prediction model, which is a random forest predictor, analyses the remaining sites to predict RES that are not registered in the RNA editing databases. RDDpred requires BAM files and several software programs (e.g., SAMtools and BCFtools
[47], WEKA
[48], and BAMtools
[49]). If several BAM files are provided, they are treated as replicates. To maximize reproducibility, RDDpred includes these external software programs within the package, making its installation easier. However, most of these are outdated versions and may require the re-compilation of the source code in some systems. It should be noted that the information provided to predict RES (i.e., from the positive and negative datasets) was generated using the old human assembly release 19 (GRCh37) and has not been updated. In addition, BAM files must be generated using the same version of the human assembly.
RED-ML
[50] incorporates information from different features (e.g., reads, sequencing and alignment artefacts) and the properties of RES (i.e., the sequence context, whether the candidate site is in Alu regions, and the editing type) in order to make predictions using a logistic regression classifier to identify RES in a genome-wide manner. If DNA-seq data are available, SNPs can be specified from DNA-seq data using variant callers (e.g., GATK) and incorporated into the analysis. Although RED-ML takes BAM files as the input, a special reference should be employed for the BAM files generated from the BWA read aligner. This reference file should combine the human assembly release 19 (GRCh37) and exonic sequences surrounding all known splice junctions. The usage of this tool is constrained by several limitations, such as its exclusive applicability to human RNA-seq data, dependency on BAM files generated using the outdated human assembly release 19 (GRCh37), and the ability to identify only those sites with relatively high editing levels.
DeepRed
[51] uses deep learning and ensemble learning to directly identify candidate RES. To perform the analysis, this tool only needs a list of SNVs that contain information about the chromosomal positions, the reference allele, and the alternative allele. For this purpose, a processing step with variant callers (e.g., GATK of BCFtools) is mandatory, although there is no requirement for filtering steps or annotation. One benefit of this tool is its ability to process RNA-seq datasets from various species. Although DeepRed has relatively few dependent software requirements, it is dependent on MATLAB, which requires a paid license.
In addition to the above-mentioned stand-alone tools, there are also several online resources for detecting RES, such as DREAM (Detection of RES associated with miRNAs)
[52], AIRlINER (Assessment of editing sites in non-repetitive regions)
[53], and REP (Prediction of editing sites and their effect in humans)
[54]. The advantage of using online tools is that the users do not need high computational power and disk space, although the uploading time and the capacity of handling multiple datasets can be challenging.
Lastly, there are databases for RES. The two previously mentioned databases, RADAR (last updated in 2014) and DARNED (last updated in 2012), include RES for humans, mice, and fruit flies. Other databases are (1) REDIdb (last updated in 2018)
[55] for plant organellar genomes; (2) REDIportal (last updated in 2020)
[24] for humans and mice; (3) PED (last updated in 2019)
[56] for plants; (4) EDK (last updated in 2018)
[57] for disease-related editing events in humans; and (5) TCEA (last updated in 2018)
[58] for cancer-related editing events. One should note that many of the tools and databases mentioned above have been outdated for several years, which calls for major updates for these tools and the introduction of newer tools for detecting RES and categorizing the identified RES in a database format.