Strategies for Identification of Neoantigens

Subjects: Oncology

This review provides an overview of currently available approaches to the discovery of neoantigens, tumor-specific peptides that arise from the mutation process and distinguish tumors from normal tissues. Focusing on genomics-based approaches and computational pipelines, we cover all steps required for selecting appropriate candidate peptides starting from NGS-derived data. Additional approaches, such as mass-spectrometry-based and structure-based methods, are also discussed, with their advantages and disadvantages highlighted. This review also describes the available complex bioinformatics pipelines that automate data processing and output a list of neoantigens. We propose a possible ideal pipeline that could be implemented in the neoantigen identification process, and we discuss integrating the results generated by different approaches to improve the accuracy of neoantigen selection.

neoantigens
whole-exome and RNA sequencing
peptide-MHC binding predictions
pMHC-TCR binding predictions

1. Genomics-Based Approaches and Current Bioinformatics Pipelines

Currently, genomics-based strategies are among the most promising in the field of neoantigen development. The widespread use of NGS-based techniques stimulates the development of bioinformatics tools, including those that are implemented in clinical practice. In the context of neoantigen discovery, it is important to note that the accuracy of peptide selection depends significantly on the bioinformatics pipelines applied to the processing of WES and RNA-seq data. A limited number of complex pipelines for these purposes have been developed and described during the last several years [64–68]. For detailed information on selected currently available genomics-centered pipelines, see Table 1. Most of them combine essentially the same set of tools (in terms of tool class) to carry out the main steps of analysis: raw data pre-processing to remove low-quality data, mapping to the reference genome, somatic mutation calling, extraction of mutated peptide sequences, and peptide ranking according to their predicted capacity to be presented on the cell surface by the MHC and recognized by the TCR. A general overview of these steps is shown in Figure 1. Current comprehensive best practices in bioinformatics for neoantigen identification are presented in [50]. In that report, the authors not only describe the appropriate tools for each analysis step but also provide fundamental guidelines that could serve as a basis for creating standardized consensus rules for neoantigen research.

In general, the preliminary data that can be extracted during WES/WGS and RNA-seq analysis consist of bam-files containing sequenced reads aligned to the reference genome, a set of germline/somatic mutations, the patient's HLA allotypes, estimates of gene expression levels as well as information regarding the abundance of transcript isoforms. Somatic mutation data and RNA-seq alignments are also used to determine mutant protein sequences.

Somatic variant identification is one of the most important and, at the same time, one of the most delicate parts of all pipelines. It is now firmly established that not only neoantigens arising from single nucleotide variants (SNV) could be candidates for vaccines. Other mutation types are also considered to be sources of neoantigens. Among them are INDELs (short insertions and deletions) [69], gene fusions [70], exon-exon junctions [71], intron retentions [72] and some other alternative splicing events [46]. RNA transcription and splicing errors [73], as well as RNA editing examples [46], could also be recognized as neoantigen sources. Non-coding genome regions such as non-coding exons, UTRs, non-coding RNAs, and others could also be neoantigen sources [74]. This list could potentially be extended by V(D)J recombination and somatic hypermutation events [75,76] that are important for blood malignancies, and sequences of viruses that are associated with some tumors [77,78]. Proteasome-generated spliced peptides, as well as peptides bearing tumor-specific post-translational modifications, could also be a source of neoantigens [60,79], but they are out of the scope of NGS-based approaches. It was reported that neoantigens resulting from non-SNV variants could make up to 15% of all neoantigens [80]. Some authors state that non-coding regions could be the main source of neoepitopes [74]. Moreover, recent proteogenomics studies of ovarian cancer revealed that the composition of tumor-specific antigens resulting from non-mutated non-exonic regions includes 29% of intronic and 22% of intergenic sequences, and most importantly, many of them are shared across tumors [81]. Thus the variety of mutation types makes it necessary to select the right tool for the identification of each type if this tool is available. Tools for the identification of some of the mutation types listed above are discussed in [50]; additionally, comprehensive comparisons can be found in [63,82,83]. 
Mutect2 and Strelka2 are the most reliable somatic variant callers for SNV identification [63]. It is advisable to run several somatic callers simultaneously, which could potentially improve calling accuracy [84]. It is also good practice to verify somatic mutation caller results manually by viewing them in genomic browsers and to carry out additional validation using targeted sequencing approaches [50]. Other neoantigen sources can be identified with tools such as Strelka [85] and EBCall [86], which are designed for INDEL calling, and Pindel [87], which is a specialized tool for calling large INDELs. A variety of tools for gene fusion identification have also been developed, such as INTEGRATE [44] (and the INTEGRATE-neo pipeline [88]), STAR-Fusion [45], etc. There is now a clear demand for tools that can properly identify all the neoantigen sources listed above.
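The recommendation to run several somatic callers can be illustrated with a simple voting scheme. The sketch below is only one possible way to combine call sets: it assumes each caller's output has already been parsed into a set of (chromosome, position, ref, alt) tuples, and the function name, the `min_callers` threshold and the toy variants are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def consensus_calls(calls_by_caller: Dict[str, Set[Variant]],
                    min_callers: int = 2) -> Set[Variant]:
    """Return the variants reported by at least `min_callers` callers."""
    # Each caller contributes one vote per variant because its calls are a set.
    support = Counter(v for calls in calls_by_caller.values() for v in calls)
    return {v for v, n in support.items() if n >= min_callers}


# Toy, made-up call sets (not real data).
calls = {
    "mutect2": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "strelka2": {("chr1", 100, "A", "T"), ("chr3", 300, "C", "A")},
}
print(sorted(consensus_calls(calls)))
```

Requiring agreement favors precision over sensitivity; lowering `min_callers` to 1 gives the union of all call sets instead.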

Furthermore, after detecting all the variants of interest, one wants to know whether they could, in principle, yield a neoantigen that has a chance to bind to the MHC molecule. Firstly, it is well known that the immunoproteasome has a limited specificity, which means that not every possible mutated peptide will be produced during protein degradation [89]. Secondly, not all peptides produced by the proteasome reach the cellular compartment in which they could, in principle, interact with the MHC. Before being presented by MHC class I, peptides are first transported into the endoplasmic reticulum (ER) by dedicated transporters known as TAP (transporter associated with antigen processing) and then trimmed by ER aminopeptidases (ERAP) [90]. There are several tools for assessing TAP transport efficiency for peptides [91–94] and a number of tools that take proteasome cleavage specificity into account, such as NetChop20S and ProteaSMM [93,95] for the MHC class I pathway and PepCleaveCD4 and MHC II NP [96,97] for the MHC class II pathway. It should also be taken into consideration that genes encoding components of the antigen-presenting machinery, such as TAP1, TAP2, B2M, etc., can carry mutations that influence their activity, and that these genes can have different expression levels in various tumor types, which has an additional impact on peptide presentation [98,99]. Thus, taking proteasome cleavage specificity and TAP transport limitations into account, the final list of peptides based on identified somatic variants should be created and subjected to subsequent prioritization procedures.
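Before any cleavage or TAP filtering can be applied, the candidate peptides themselves have to be enumerated. The following is a minimal sketch for the simplest case only, a single amino acid substitution: it lists every window of a given length that contains the mutated residue. The function name, the 0-based `pos` convention, the default lengths of 8 to 11 residues and the toy protein are assumptions made for illustration; INDELs, fusions and splicing events would need different handling.

```python
from typing import List, Sequence


def mutant_peptides(protein: str, pos: int, alt: str,
                    lengths: Sequence[int] = (8, 9, 10, 11)) -> List[str]:
    """List all k-mers of the mutated protein that contain the substituted residue.

    `pos` is the 0-based index of the substituted residue in `protein`.
    """
    if not 0 <= pos < len(protein):
        raise ValueError("pos is outside the protein sequence")
    mutated = protein[:pos] + alt + protein[pos + 1:]
    peptides = []
    for k in lengths:
        first = max(0, pos - k + 1)           # leftmost window still covering pos
        last = min(pos, len(mutated) - k)     # rightmost window that fits
        for start in range(first, last + 1):
            peptides.append(mutated[start:start + k])
    return peptides


# Toy sequence (not a real protein): substitute M at index 10 with A.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
nonamers = mutant_peptides(toy_protein, 10, "A", lengths=(9,))
print(len(nonamers), nonamers[0])
```

The resulting list would then be passed to cleavage, TAP transport and MHC-binding predictors for filtering and ranking.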

As mentioned above, currently available epitope prediction algorithms are based on the idea that the affinity of the peptide to a given MHC class molecule is the dominant contributor to neoantigen immunogenicity, and thus this parameter is considered to be the primary factor for peptide prioritization. It relies on the observation that only about 1 of 10,000 peptides resulting from protein degradation will be presented by the MHC [100]. It is also well-known that different MHC allotypes differ in specificity with respect to peptide binding. Therefore, it is crucial to know the HLA type before ranking peptides. The gold standard for HLA allotype determination is clinical HLA typing by sequence-specific PCR [101,102]. However, currently available HLA typers based on WES/RNA-seq data provide a high enough accuracy rate and can also be used for HLA allotype identification when a clinical HLA type is unavailable. Although HLA class I typing algorithms can reach an accuracy of up to 99% [103,104], HLA class II typers remain less effective and require additional development. It is no less important to estimate HLA locus gene expression as well as to determine somatic mutation patterns in this locus, as they both can be a cause of neoantigen presentation loss leading to resistance to immunotherapy [105–107].

Prediction of peptide-MHC binding affinity is the most critical step of the neoantigen discovery process. Many tools for such analysis exist [57,108–110]. These tools utilize large-scale peptide-MHC binding affinity data derived from biochemical measurements and eluted ligand data obtained by high-throughput mass-spectrometry analysis of the MHC ligandome [57,111] to train machine learning-based classifiers that can identify binders and non-binders and calculate affinity scores. The machine learning approaches include linear regression (LR) and artificial neural networks (ANN). Depending on the experimental data used to train them, these algorithms can be classified into binding affinity (BA)-trained methods, eluted ligand (EL)-trained methods, and mixed methods trained on both BA and EL datasets. Since the performance of different algorithms varies, a number of comprehensive benchmarking studies were carried out to compare the accuracy of these tools [48,49,112,113]. For instance, according to [49], where a dataset covering 32 HLA class I and 24 HLA class II molecules was used, ANN-based approaches showed better performance than LR-based ones, and among 19 predictors that were benchmarked, MHCflurry (AUC = 0.911 ± 0.010) and ann_align (AUC = 0.911 ± 0.004) showed the highest accuracy in terms of the AUC (area under the ROC curve) for MHC class I 9-mers and MHC class II 15-mers, respectively, in binding versus non-binding classification. In another benchmarking study [48], using an experimentally validated dataset with binding affinity data for 743 peptides (8- to 11-mers) derived from the HPV16 E6 and E7 proteins, none of the algorithms outperformed the others overall. However, different algorithms showed better performance for particular HLA types and peptide lengths [48].
In one of the most recent benchmarking studies [114], the performance of 15 algorithms was tested on a dataset described in [115], which contains 220 naturally processed vaccinia virus (VACV) peptides that were eluted from VACV-infected cells and tested for T cell immune response in infected C57Bl/6 mice. ANN-based NetMHCpan 4.0-L (AUC = 0.977), NetMHCpan 4.0-B (AUC = 0.975) and MHCflurry-L (AUC = 0.973) were reported to achieve the best performance, which was in general agreement with the results previously reported in [49]. More recently, improved versions of NetMHCpan (v.4.1) and NetMHCIIpan (v.4.0) as well as MHCflurry (v.2.0) were presented [57,109]. In [57], NNAlign_MA was used to update NetMHCpan and NetMHCIIpan, and the updated versions outperformed the current state-of-the-art methods, including NetMHCpan 4.0 and MHCflurry. O’Donnell et al. incorporated an antigen processing predictor that uses data on MHC ligands identified by mass spectrometry into MHCflurry 2.0 [109], allowing it to achieve better accuracy than the currently available tools. It seems logical that the simultaneous use of several MHC-binding predictors could improve peptide prioritization. It should be noted that currently available MHC-binding predictors suffer from inadequate support for rare MHC alleles and poor performance for MHC class II molecules. Another significant inherent weakness of this approach is the failure to consider the effect of post-translational modifications on binding affinity. Despite these weaknesses, this approach is the gold standard in the prediction of MHC-peptide interactions.
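The benchmarks cited above compare predictors by AUC in binder versus non-binder classification. As a reminder of what that number measures, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder, with ties counted as one half. The function name and the toy scores are hypothetical and are not taken from any of the cited studies.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve; labels are 1 for binders and 0 for non-binders.

    Higher scores are assumed to mean "more likely to bind".
    """
    binders = [s for s, y in zip(scores, labels) if y == 1]
    non_binders = [s for s, y in zip(scores, labels) if y == 0]
    if not binders or not non_binders:
        raise ValueError("need at least one binder and one non-binder")
    # Count binder/non-binder pairs ranked correctly; a tie counts as half.
    wins = sum((b > n) + 0.5 * (b == n) for b in binders for n in non_binders)
    return wins / (len(binders) * len(non_binders))


# Made-up predictor scores for four peptides.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
print(roc_auc(toy_scores, toy_labels))
```

This pairwise form is quadratic in the number of peptides, which is fine for illustration; for large benchmarks a sorted-rank implementation would normally be preferred. If a predictor outputs IC50-like values, where lower means stronger binding, the scores should be negated first.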

It is well-known that not all peptides presented by the MHC (pMHC complexes) trigger T cell activation [116,117]. For instance, in [117], the authors summarized data on candidate neoantigens predicted to be MHC-binders from 13 suitable published works, which included information about assessing the peptides’ immunogenic potential. It turned out that only 53 of 1948 neopeptide-MHC combinations elicited T cell response. In [118] it was reported that among 50 long peptides (MHC-binding prediction was performed using NetMHC 3.0) that were selected based on 563 non-synonymous somatic mutations in genes that are expressed in B16F10 murine melanoma, only one-third were immunogenic, and 60% of them elicited immune response directed against the mutated sequences. According to [119], only 25 of 66 27-mer peptides selected by predicted binding affinity to MHC I and MHC II and expression level were immunogenic according to IFN-γ ELISpot assay. Remarkably, in mouse models, the majority of immunogenic neoantigens (up to 90%) were associated with CD4⁺ T cell response [118–120]. Since the primary goal of neoantigen identification (in the context of cancer vaccine development) is to select those that would trigger or boost T-cell-mediated immune response (preferably CD8⁺ T cell response), it is essential to know which of the peptides with a high MHC binding affinity will be recognized by T cells. This brings about the challenge of determining the specificity of MHC-epitope-TCR interactions, which could be an additional layer of the neoantigen ranking process. It is an established fact that T cells recognize pMHC complexes predominantly by the complementarity determining region 3 (CDR3) loops of the TCR [121].
Based on the fact that different individuals having different TCR repertoires can recognize the same epitopes arising from the same agents (e.g., immunodominant viral epitopes [122–124]), one may suggest that such epitopes have intrinsic patterns that make them more recognizable by the TCR. On the other hand, it was observed that TCR repertoires that are specific to the same epitope have similarities in their core sequences [125]. Such reasoning allows us to suggest that it is possible to perform a simulation based on sequences of peptides and TCR repertoires. Several approaches to predicting epitope-TCR binding were developed (e.g., TCRex [126], NetTCR [127], Repitope [128], ERGO [129], Deepwalk approach [130]). For instance, TCRex is based on the principle that similar TCR sequences often target the same epitope [126], whereas Repitope is based on the idea that sequences of epitopes contain some intrinsic hidden pattern that is prone to activating T cell response [128]. Unfortunately, this class of tools is at the initial stage of development, and their predictive power suffers from insufficient training data on TCR–epitope interactions. Meanwhile, other strategies are being successfully implemented to improve the immunogenicity of neoantigens [131,132]. For example, in [131] the weak B16F10 neoantigens described in [118] were fused to the transmembrane domain of diphtheria toxin (DTT), significantly enhancing their ability to elicit CD8⁺ T cell response and inhibit tumor growth. A bi-adjuvant vaccine containing a neoantigen supplemented with two adjuvants, the Toll-like receptor (TLR) 7/8 agonist R848 and the TLR9 agonist CpG, boosted the immunogenicity of the neoantigen due to efficient co-delivery and synergism of adjuvants [132].
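The principle that similar TCR sequences often target the same epitope can be made concrete with a deliberately naive nearest-neighbour lookup. This is not how TCRex or any of the tools named above is implemented; it is a toy sketch that assumes a small table of CDR3 sequences with known epitopes and uses Hamming distance between equal-length sequences as the similarity measure. All sequences, labels and the `max_mismatches` cut-off are made up.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> Optional[int]:
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def guess_epitope(query_cdr3: str, known: Dict[str, str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the epitope of the closest known CDR3 within `max_mismatches`."""
    best_epitope, best_distance = None, max_mismatches + 1
    for cdr3, epitope in known.items():
        d = hamming(query_cdr3, cdr3)
        if d is not None and d < best_distance:
            best_epitope, best_distance = epitope, d
    return best_epitope


# Made-up CDR3 sequences and placeholder epitope labels.
known_tcrs = {
    "CASSLGQAYEQYF": "epitope_A",
    "CASSPDRGNTEAFF": "epitope_B",
}
print(guess_epitope("CASSLGQGYEQYF", known_tcrs))
```

A lookup of this kind can only generalize to epitopes that are already represented in the reference table, which mirrors the limitation noted above: predictive power is bounded by the amount of available TCR–epitope training data.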

2. Mass Spectrometry-Based Approaches

Genomics-based approaches represent the gold standard that is applied for neoantigen vaccine development, including in silico peptide prediction. Neoantigen candidate selection relies on the spectra of somatic mutations identified by WES/RNA-seq. This approach suffers from a lack of direct experimental evidence that the predicted epitopes are actually present on the cell surface in complex with MHC molecules [153,154]. The missing data could be obtained using high-throughput mass spectrometry techniques [153,154], which at present allow us to analyze large numbers of peptides or whole proteins simultaneously. This review does not aim to give a detailed characterization of MS-based approaches; for a comprehensive review on this topic, the reader could refer to [56,153,154].

A typical MS workflow (IP-based) starts with immunoprecipitation of MHC-peptide complexes using beads conjugated with MHC-specific antibodies, with beads bound to dummy antibodies serving as negative controls. Subsequent washing steps ensure the removal of unbound and non-specifically bound peptides, whereupon the eluted material is subjected to MS analysis. Another strategy is mild acid elution (MAE) of MHC-bound peptides from the cell surface by treatment under mildly acidic conditions [155], followed by MS analysis. This method has a significant false-positive rate and low specificity due to contamination with a large quantity of non-specific peptides. A detailed comparison of IP- and MAE-based approaches is presented in [156]. For information on MHC peptidome identification by MS, the reader could refer to Zhang et al. [154], where the authors provide a summary of 40 studies that were carried out from 1990 to 2019.

Unlike the genomics-based approach, which provides only neoantigen predictions, mass spectrometry allows us to take a real snapshot of the total MHC-bound protein interactome. Additionally, it could reveal not only neoantigens that originate from somatic mutation variants but also those that arise due to proteasome-mediated peptide splicing [157,158]. Using mass spectrometry, it was shown that the proportion of spliced peptides relative to peptides displayed by HLA class I varies from the 2–6% reported in [159] to the 30% reported in [60]. Moreover, MS allows us to identify the post-translational modifications (PTM) of peptides bound to the MHC, thus shedding light on the importance of PTM for binding affinity [59]. Mass spectrometry-derived data served for the development of the first tool able to predict the interaction between HLA class I molecules and phosphorylated peptides [160]. In addition, MS-based profiling of the HLA peptidome could generate high-quality training data that could potentially significantly improve current prediction models [57,111,161], and could also be used for benchmarking available tools.

Nevertheless, MS also has some limitations. They include low sensitivity and reproducibility. These problems are especially acute for low-abundance peptides, including tumor-specific neoantigens. Moreover, the washing stages of MHC-peptide complexes during IP could result in a loss of bound peptides. These issues impose a limitation on the initial quantities of biological material. For typical experiments, 1 g of tumor tissue or anywhere from a hundred million to billions of cells are required [156]. It should also be noted that cancer cells and tumor tissues have different HLA molecules; thus, peptides that were identified from this type of material are relevant for different HLA molecules, adding the problem of specificity of the HLA ligandome.

In summary, combining genomics-based predictions with high-throughput HLA-ligandome mass-spectrometry data could significantly enhance the performance of neoantigen discovery procedures. For instance, the currently available ProGeo-neo pipeline [150] utilizes LC-MS/MS data to verify NGS-derived neoantigen candidates.
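At its simplest, verifying NGS-derived candidates against MS data amounts to checking which predicted peptides were also observed in the eluted ligandome. The sketch below shows only that cross-referencing step under strong assumptions: peptide identifications are taken as exact sequence strings, and spectrum matching, false discovery control and modified residues are ignored. It does not reproduce the ProGeo-neo implementation; the function name and the peptide lists are illustrative.

```python
from typing import Dict, Iterable


def annotate_ms_support(predicted: Iterable[str],
                        ms_identified: Iterable[str]) -> Dict[str, bool]:
    """Map each predicted peptide to True if it was also identified by MS."""
    # Normalize case so that "siinfekl" and "SIINFEKL" are treated as equal.
    observed = {p.upper() for p in ms_identified}
    return {p: p.upper() in observed for p in predicted}


# Illustrative peptide lists (not experimental data).
predicted_candidates = ["SIINFEKL", "AAAAAAAAA"]
ms_peptides = ["siinfekl", "GILGFVFTL"]
print(annotate_ms_support(predicted_candidates, ms_peptides))
```

Candidates with MS support could then be ranked above those supported by prediction alone, while the absence of MS evidence should be read cautiously given the sensitivity limits described above.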

3. Structure-Based Approaches

Structure-based predictions are another option that can improve the state of the art in the context of neoantigen discovery [61,162]. While the genomics-based approach utilizes sequence-based methods, the structure-based prediction is additionally capable of uncovering the significance of peptide structure and physicochemical properties, as well as the importance of post-translational modifications, such as phosphorylation [163], citrullination [164], and glycosylation [165], for peptide binding to the MHC and the TCR. Moreover, structure-based approaches could yield predictions that will be applicable to all types of MHC and TCR receptors, mitigating the limitations of small training datasets for rare MHC alleles, which are required for machine learning-based predictions.

Despite the slow progress in the development of structure-based approaches due to the need for serious computational resources and high-resolution models, some attempts in this direction were made. In 2000, Schueler-Furman et al. [166] developed an approach utilizing a pairwise potential matrix that can be applied to a wide range of MHC I molecules for predicting peptide binding. In the following years, new algorithms for the prediction of peptide-MHC binding were developed. PePSSI (peptide-MHC prediction of structure through solvated interfaces) [167] is an approach for predicting the structure of peptides bound to HLA-A2. It includes sampling of peptide backbone conformations and flexible movement of MHC side chains, and it can explicitly take water molecules at the pMHC interface into account. Initially, PePSSI was tested by predicting the conformation of eight peptides bound to HLA-A2 for which crystallography data are available. The predicted structures were found to be in good agreement with the structures derived from X-ray models. In [168], a method based on molecular dynamics simulations and estimation of the free energy of binding between peptides and HLA molecules was proposed. Another approach, HLAffy, is based on the strength of a mechanistic model of peptide-HLA recognition [169]. It can predict epitopes for any class I HLA by assessing the binding affinity of peptide-HLA complexes through learning pair potentials that are important for peptide binding. Notably, this list of methods and descriptions of structure-based approaches is not exhaustive. For a more comprehensive review of this topic, please refer to [162].

As was mentioned above, some neoantigens that have a high binding affinity to MHC will not be effectively recognized by the TCR [170,171], which makes them unable to trigger T cell-mediated immune response. This fact suggests the existence of some peptide features that determine their recognition by the TCR independently of MHC binding. In recently published works, it was reported that immunogenic peptides are enriched in hydrophobic and aromatic amino acids at positions interacting with the TCR [172,173]. Other parameters that are believed to influence TCR binding are amino acid charge and bulkiness, WT and mutant sequence divergence, and sequence entropy [65,174]. Currently available tools attempt to solve these challenges by considering these features in the context of the peptide sequence [65,173–175]. However, it is evident that the impact of properties such as amino acid charge and size and the composition of hydrophobic residues should be assessed in the context of the conformation of the peptide bound to the MHC. In this connection, structure-based predictions could be one of the possible ways to determine the impact of physicochemical features of peptides on their immunogenic potential [61,176,177]. In [177], the authors developed a flexible backbone docking protocol called TCRFlexDock utilizing RosettaDock and ZRANK and benchmarked it using 20 structures of TCR/pMHC complexes (17 for MHC class I and 3 for MHC class II) for which resolved structures of unbound components are available. Testing revealed that protein–protein docking algorithms are able to produce accurate structural models of TCR/pMHC based on unbound component structures [177]. In [176], the authors used a force-field approach utilizing refined versions of FoldX and Rosetta force fields to perform prediction of related targets of the TCR.
A TCR:p:MHCII complex-based benchmark comprising epitope-containing and non-epitope-containing pMHC complexes was developed, and immunogenicity was estimated by calculating interaction energies between the TCR and each of the p:MHCII complexes. It was found that the predictive power of this approach depends on the ability to predict protein-MHC complex binding and model the structure of the TCR:p:MHC complex [176]. Riley et al. [61] developed a procedure for accurate and rapid modeling of the structure of nonameric peptides bound to the common class I MHC type HLA-A2 and applied it to analyze a dataset containing thousands of immunogenic, non-immunogenic and non-HLA-A2-binding peptides. After that, they trained a neural network (NN) on structural features that affect TCR and peptide binding energies. It was shown that the structurally parameterized NN outperformed other methods that do not include explicit structural or energetic properties in the assessment of CD8⁺ T cell response to HLA-A2-bound nonameric peptides [61]. Thus, a combination of MHC-binding prediction based on NGS data with a structure-based approach could significantly improve the accuracy of immunogenic peptide selection, which is of special importance in the context of peptide-based cancer vaccine development.
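The sequence-level features mentioned in this section, such as hydrophobicity at TCR-facing positions, are simple to compute, which is part of why sequence-based tools rely on them. The sketch below averages Kyte-Doolittle hydropathy values over the central residues of a nonamer. Treating positions 4 to 6 (1-based) as the TCR-facing positions is an assumption made here for illustration, and so are the function name and the toy peptides; as discussed above, which residues actually contact the TCR depends on the conformation of the bound peptide, which is precisely what structure-based methods aim to capture.

```python
from typing import Sequence

# Kyte-Doolittle hydropathy scale (higher values are more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def central_hydropathy(peptide: str,
                       positions: Sequence[int] = (4, 5, 6)) -> float:
    """Mean hydropathy over the given 1-based positions of the peptide.

    The default positions are an illustrative assumption, not a rule.
    """
    if max(positions) > len(peptide) or min(positions) < 1:
        raise ValueError("position outside the peptide")
    values = [KYTE_DOOLITTLE[peptide[i - 1]] for i in positions]
    return sum(values) / len(values)


# Toy nonamers: hydrophobic versus charged central residues.
print(central_hydropathy("AAAILVAAA"))
print(central_hydropathy("AAADKEAAA"))
```

A score like this could serve as one input feature among many; on its own it says nothing about whether the peptide binds the MHC or how its side chains are oriented.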

This entry is adapted from the peer-reviewed paper 10.3390/cancers12102879

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.