Comparative Method for Measuring Peptide Structure

Please note this is a comparison between Version 4 by Robert Friedman and Version 3 by Robert Friedman.

Small peptides are an important component of the vertebrate immune system. They are important molecules for distinguishing proteins that originate from the host versus proteins derived from a pathogenic organism, such as a virus or bacterium. Therefore, these peptides are central to the vertebrate host response to intracellular and extracellular pathogens. Computational models for peptide prediction are based on a narrow sample of data with an emphasis on the position and chemical properties of the amino acids. In prior studies, this approach has led to models with higher predictability as compared to models that rely on the geometrical arrangement of atoms. However, protein structure data from experiment and prediction are a source for building models, and, therefore, for knowledge on the role of small peptides in the vertebrate immune system.

protein structure
small peptides
vertebrate immunity
pathogenic organisms
adaptive immunity

1. Background

The adaptive immune system of vertebrates is a system of cells and molecules whose role is to distinguish self from non-self. Therefore, a vertebrate host has the potential to detect and clear pathogenic organisms. A major component of this system involves the small peptides, which are short linear chains of amino acids. The small peptide is of interest since the host immune system relies on it as a marker for determining whether a protein originates from the host itself or from a foreign source, such as a virus or bacterium. This system can also identify its own cells as foreign if they are genetically altered by a process that leads to the production of non-self molecules^[1][2].

These small peptides of interest are formed by cleavage of proteins in the cells of the host, and they form the basis for the cellular processes of immune surveillance, and for identifying pathogens along with cells that operate outside their expected genetic programming^[1][3]. However, the adaptive immune system can falsely identify a peptide as originating from non-self in cases where it is a self-peptide, the phenomenon of autoimmunity. An example is where a subset of T cells, named for their development in the thymus, falsely detects small peptides presented on the surface of cells as originating from non-self, and subsequently signals the immune system to eliminate these cells^[4][5][6].

The mechanism for small peptide detection is reliant on molecular binding between the peptide and a major histocompatibility complex (MHC) receptor molecule that is expressed in the majority of cells of a vertebrate. Further, this mechanism is refined by a process of training the T cell population, which disfavors individual T cells that attack normal cells while favoring the proliferation of those that attack non-self. This is not a deterministic process, however. The dictates of probability are present in biological systems, including in the generation of genetic diversity for the different MHC receptor types, the cleavage process for generating small peptides from a protein, the timeliness of the immune response to molecular evidence of a pathogen, the binding strength of peptides to an MHC receptor, and the requisite sample of peptides for pathogen detection. This system is in contrast to a human-designed system, where the structure and function originate by artificial design, along with low tolerance for this kind of variability.

In prior experimental studies, the sampling of small peptide data has not been uniformly distributed^[1]. For example, only a small percentage of MHC molecules have been studied in regard to their association with small peptides. The problem is the allelic distribution that corresponds to these molecules. While there are around a dozen genetic loci in clusters that code for an MHC receptor, the number of alleles among these loci is very high as compared to the other genetic loci across the typical vertebrate genome. In the human population, the total number of alleles across the MHC loci is estimated as a value in the thousands^[1]. These are active genetic sites of evolutionary change and generation of diversity, and, unlike the other regions of the genome, have been unhindered at the genetic level by the putative bottleneck that reduced our effective population to mere thousands of individuals^[7]. Likewise, the study of these small peptides is generally restricted to human, along with the animal models that serve as their proxy in the study of biomedicine.

Moreover, there is a preference in scientific analysis that corresponds to MHC class type. The class 1 MHC receptor is generally favored over that of class 2 for modeling the MHC-peptide association, partly because in class 1 MHC some of the amino acids of the peptide are confined within pockets of the MHC molecule^[1][8]. This has led to predictive models of MHC-peptide binding that parameterize the position and chemical types of the amino acids of these peptides. These models have exceeded the predictiveness of those based solely on geometrical data of atomic arrangements^[1]. As a result, it has been difficult to reliably model the association between the class 2 MHC receptor and a peptide^[1]. Therefore, the geometrical features are expected to provide insight and an important contribution for building models of peptide binding to MHC, particularly where future studies lead to a broader sampling of data.

Recently, deep learning and the related machine learning approaches have led to advances in knowledge of protein structure and the potential for modeling the association between proteins and other molecules^[9][10][11]. These methods are capable of highly predictive models that incorporate disparate kinds of data, such as the use of both geometrical and chemical features in estimating the binding affinity of an MHC receptor to a small peptide. Moreover, these methods are highly efficient in the case where the modeling is dependent on a very large number of parameters, as in many cases of molecular interactions in a biological system. Consequently, the deep learning approaches have led to successes in the prediction of protein structure across the many clades of life^[12]. These approaches are complemented by the analysis of interpretable metrics for estimating the geometrical similarity among proteins^[13][14][15].

Immunogenetics relies on collecting data samples and building models as expected in the pursuit of any scientific knowledge. Ideally, this is expected to lead to a meaningful synthesis that is unmired by the collector's fondness for naming schemes and ungrounded collations of terms and studies^[16]. The latter perspective resembles the practice of creating the images of science, akin to an art form, while not achieving the aim of extending knowledge by the purposeful modeling of natural phenomena^[17].

2. Metrics of Peptide Structure Similarity

2.1. TM-score and the RMSD metric

There are a large number of methods for measuring the geometrical similarity among proteins^[18]. In particular, one method is the template modeling score (TM-score), which is based on an algorithm, and a performant implementation in code, for measuring the similarity between any two protein molecules^[13]. Further, the compiled program from this open source code computes a root-mean-square deviation (RMSD) metric, a similar measure to the TM-score. By comparison, the TM-score is relatively less sensitive to the non-local interactions of a molecular topology, along with the added advantage of invariance to the size of the protein. However, the RMSD metric is also interpretable.

Given a cellular protein, there is empirical support for a range of TM-score values that are meaningful in the context of protein structure similarity, and, likewise, for dissimilarity. A value above 0.5 is considered a significant result of similarity, while a value below 0.17 represents a comparison that is indistinguishable from a comparison of randomly selected proteins^[19]. The values for this metric are further bounded by the values of 0 and 1.
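
As a minimal sketch of how the two metrics relate, the Python code below computes an RMSD and a TM-score value for two sets of alpha-carbon coordinates that are assumed to be already paired and superposed. It follows the form of the published scoring function^[13], where each paired distance d_i contributes 1 / (1 + (d_i / d_0)^2) and the scale d_0 depends on the protein length. The function names are hypothetical, and the lower bound of 0.5 on d_0 for short chains is an assumption of this sketch. The TM-score program also searches for the optimal superposition, which is omitted here.

```python
import math


def d0_scale(length, floor=0.5):
    """Length-dependent distance scale d0 of the TM-score formula.

    The floor for short chains is an assumption of this sketch, since the
    formula yields non-positive values for chains of fewer than 19 residues.
    """
    if length <= 15:
        return floor
    return max(floor, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)


def paired_distances(coords_a, coords_b):
    """Euclidean distance between each pair of matched atoms."""
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation over the matched atoms."""
    dists = paired_distances(coords_a, coords_b)
    return math.sqrt(sum(d * d for d in dists) / len(dists))


def tm_score_fixed(coords_a, coords_b, target_length=None):
    """TM-score sum for a fixed superposition (no search for the optimum)."""
    dists = paired_distances(coords_a, coords_b)
    length = target_length or len(dists)
    d0 = d0_scale(length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in dists) / length
```

For a peptide of 9 residues the formula for d_0 is not usable as written, so the choice of the floor value directly sets the scale of the score. This is one reason why the significance thresholds established for full-length proteins may not transfer to small peptides.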

The TM-score metric is applicable to the analysis of small peptides. For instance, it is possible to sample protein structure data^[12], select all unordered pairs of proteins for comparison, and then find an empirical distribution of values of this metric. This practice may be used to establish significance values for this metric, based on the above sampling procedure. The next subsection describes a method to obtain data for the predictions of the three dimensional structure of cellular proteins.
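
The sampling procedure can be sketched in Python as follows. The scoring function is supplied by the caller and stands in for a call to the TM-score program. The function names and the choice of the 95th percentile as a candidate threshold are assumptions for illustration only.

```python
import math
from itertools import combinations


def pairwise_scores(items, score):
    """Score every unordered pair of items exactly once."""
    return [score(a, b) for a, b in combinations(items, 2)]


def empirical_threshold(scores, quantile=0.95):
    """Nearest-rank quantile of the observed scores."""
    ordered = sorted(scores)
    rank = max(0, math.ceil(quantile * len(ordered)) - 1)
    return ordered[rank]
```

A sample of n structures yields n(n - 1)/2 unordered pairs, so the number of comparisons grows quadratically with the sample size. A random subsample of pairs may be preferable for a large collection.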

2.2. Peptide Structure Data Files

Predictions of protein structure in PDB formatted files are available across the many forms of cellular life: https://alphafold.ebi.ac.uk/download. These files are stored in an archival file type (.tar), so the "tar" program is useful for extracting the files contained within the archive. The archive files are generally large, so an alternative to a conventional web-based retrieval is to use the "curl" program at the command line, which is capable of resuming a partial file transfer, as can result from loss of the network connection. The "-C -" option requests that curl continue from a partially downloaded file when one is present. An example is below for the mouse data:

curl -C - -O https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/UP000000589_10090_MOUSE_v4.tar

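The above archive file (tar file format) has both PDB and mmCIF formatted file types for each protein. A command line is shown below for restricting the file extraction process to PDB files only. The file name pattern is quoted so that the shell passes it unchanged to the "tar" program; GNU versions of "tar" may also require the "--wildcards" option to match the pattern against the names of the archive members:

tar -xvf UP000000589_10090_MOUSE_v4.tar '*.pdb.gz'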

As noted above by the "gz" file extension, each PDB formatted file has been compressed to a smaller file size in a binary format (gzip file format), so a decompression operation will lead to a plain text file with the PDB protein structure data (text file format). To decompress these files in a single operation, the "gzip" program is commonly used:

gzip -d *.gz

Since the archive file (tar file format) itself is not compressed, but the data files contained within the archive are compressed, the unarchival operation to extract these files to a directory will use disk space similar to that of the original archive file size. However, the decompression of the individual compressed data files (gzip file format) will occupy a much larger space on the storage device because each of these data files is compressed to about 25% of its original size. Therefore, if the compressed data files of interest occupy 4 gigabytes of disk storage, then the decompressed files are expected to occupy about 16 gigabytes of disk space. Since this example corresponds to a single organism and its protein structure data files, the storage requirement is consistent with a desktop computer. However, extending a study to other organisms, or the inclusion of protein structure predictions across the Swiss-Prot database, will lead to a very large disk space requirement. The file system of the disk storage is also expected to show robustness to the handling of a very large number of files since the above method can lead to the creation of a few thousand to over a million data files.
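
Where the command line tools are not available, the same two steps can be sketched with the Python standard library. The code below is an illustration under assumptions, not the procedure of the source: the function name and the output directory are hypothetical. It reads the archive member by member, keeps only names ending in ".pdb.gz", and writes each one in decompressed form, so the compressed copies are never written to disk.

```python
import gzip
import os
import shutil
import tarfile


def extract_pdb_members(archive_path, out_dir):
    """Write decompressed copies of the *.pdb.gz members of a tar archive."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with tarfile.open(archive_path, "r") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".pdb.gz")):
                continue
            # Keep only the base name so a member cannot escape out_dir.
            name = os.path.basename(member.name)[:-len(".gz")]
            target = os.path.join(out_dir, name)
            with archive.extractfile(member) as packed, \
                    gzip.open(packed, "rb") as unpacked, \
                    open(target, "wb") as out:
                shutil.copyfileobj(unpacked, out)
            written.append(target)
    return written
```

A call such as extract_pdb_members("UP000000589_10090_MOUSE_v4.tar", "pdb") would place the plain text PDB files in a directory named "pdb". The disk space estimate above still applies to the decompressed output.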

2.2.1. Parsing the PDB Data Files

The PDB formatted data files are expected by software libraries to conform to the standardized format as described at the following web site:

https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html

Each PDB formatted data file can contain more than one model of protein structure. It may also describe more than one protein chain per model. Below is Python code for parsing the PDB data files by these two features, which then prints output that associates the model and protein chain names with each data file^[20]:

# for a directory of PDB formatted files with
# extension name "pdb" (.pdb)
import glob

from Bio.PDB.PDBParser import PDBParser

# create the parser object
parser = PDBParser()

# input files
for file in glob.glob('./*.pdb'):
    print("file: ", file)

    # retrieve PDB structure
    structure = parser.get_structure(file, file)

    # iterate over models and chains in file
    for model in structure:
        print("model: ", model)
        for chain in model:
            print("chain: ", chain)

Other code samples are available at a GitHub web site for processing PDB data files^[20], including a template for splitting a PDB data file into multiple files, where each split file represents a window of 9 amino acid residues. The window is shifted by 1 residue per newly created file, so the procedure is equivalent to sliding a window along the sequence of amino acid residues of a protein, and writing data to a file that corresponds to the window of 9 amino acids and their associated protein structure data. However, this procedure leads to disk space usage that is orders of magnitude more than the disk space occupied by the original PDB formatted data files. The count of files likewise increases by the same factor. Python is also not a performant language for these file operations, given that it is expected to run the code along a single thread per process. Furthermore, the code in this example is not necessarily translated by the Python interpreter to highly efficient machine code, whereas a low level programming language is designed for computational efficiency at the level of machine code.
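
The sliding window itself can be illustrated with a short Python sketch. This is not the template from the GitHub site^[20]; the function name is hypothetical, and the code assumes that the input lines follow the fixed-column PDB layout, with the chain identifier in column 22 and the residue number and insertion code in columns 23 to 27. It groups the ATOM records by residue and yields one list of lines per window, leaving the writing of files to the caller.

```python
def residue_windows(pdb_lines, size=9, step=1):
    """Yield lists of ATOM lines, one list per window of consecutive residues."""
    residues = []      # one list of ATOM lines per residue, in file order
    last_key = None
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # chain identifier, then residue number with insertion code
        key = (line[21], line[22:27])
        if key != last_key:
            residues.append([])
            last_key = key
        residues[-1].append(line)
    for start in range(0, len(residues) - size + 1, step):
        window = residues[start:start + size]
        yield [line for residue in window for line in residue]
```

A chain of N residues yields N - 8 windows of 9 residues with a step of 1, which accounts for the growth in the number of files. In this sketch a window can span two chains when a file holds more than one chain, so multi-chain files would need to be split by chain first.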

2.2.2. Format of Data Files for TM-score

Below is another code sample. In this case, it resets the amino acid residue number, an index, in each PDB data file, since by default the TM-score program expects the residues in the two input data files to follow the same index numbering scheme:

import os

# directory that contains the PDB formatted files
directory = 'C:/Protein/data'

for file in os.listdir(directory):
    if not file.endswith('.pdb'):
        continue
    print(file)

    # build the full path so the script runs from any working directory
    pdb_file = os.path.join(directory, file)
    with open(pdb_file, 'r') as f:
        lines = f.readlines()

    current_residue = None
    start_residue = 1
    current_residue_number = start_residue - 1

    for i, line in enumerate(lines):
        if line.startswith('ATOM'):
            # columns 23-26 hold the residue sequence number
            residue = line[22:26]
            if residue != current_residue:
                current_residue = residue
                current_residue_number += 1
            lines[i] = line[:22] + \
                str(current_residue_number).rjust(4) + line[26:]
        elif line.startswith('TER') and len(line) > 26:
            # the TER record takes the number of the last residue
            lines[i] = line[:22] + \
                str(current_residue_number).rjust(4) + line[26:]

    # overwrite the original file with the renumbered records
    with open(pdb_file, 'w') as f:
        f.writelines(lines)

3. Peptide Structure Analysis in Immunogenetics

3.1. Significance Levels for TM-score

The TM-score metric is a powerful tool for measuring the structural similarity among proteins^[13]. This metric, along with available protein structure data, can be applied to the study of small peptides. However, the significance levels are not yet established for the expectation on the TM-score values in the case of small peptides. These levels can be estimated by computational analysis of randomly selected pairs of small peptides, such as by a sliding window analysis of protein structure data by residue, as described above, or by simulation of the amino acid sequence of peptides. This knowledge would provide the groundwork for analysis of small peptides as derived from clinical data and other empirical data sources, such as in the case of a pathogen that has evolved to escape detection from the immune defenses of members of a host population, allowing for a reference on the expected numbers and types of amino acid changes in a pathogen for evasion of host immunity. The above sections refer to a linearly sampled peptide of the immune system, such as detected by a T cell, but the B cell is a separate question where the effective sampling of amino acids of proteins is according to their geometric proximity, and, therefore, not reliant on a linear arrangement of amino acids for detection of non-self molecules.

3.2. Local and Global Factors of Protein Structure

A complementary approach is to survey the world of possible small peptides as sampled from the protein structure data and test whether their geometrical structure is more influenced by physical factors at the local level as opposed to the global level of the molecule. The null hypothesis would be that any two small peptides with the same amino acid sequence, but sampled from different non-homologous proteins, would not show similarity in their protein structure, as measured by TM-score^[13]. However, this test is based on the prior assumption that the TM-score has a previously established level of significance for rejecting this null hypothesis, and this assumption may be met by the suggestion in an above section. Another assumption is on the availability of small peptide data for this test. For a peptide sequence of 9 amino acids, where there are 20 types of amino acids, the naive probability that any two randomly selected peptides have matching sequences is 1 in 20 to the 9th power, which resolves to 1 in 512 billion. However, sampling smaller peptides, with fewer residues, will lead to finding identical pairs of peptides in a large database of protein structure. This test is also reliant on the utility of the TM-score metric at these peptide lengths. If local factors of protein structure are in fact predictive of the structure of a small peptide, then it is possible to apply this knowledge to prediction of immunogenic peptides in Nature, and the geometrical distance between a known peptide and a predicted peptide is measurable.
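
The size of the sequence space and the search for identical peptides can be sketched in Python. With 20 amino acid types, the count of distinct sequences of 9 residues is 20^9 = 512,000,000,000, under the naive assumption that residues are independent and equally frequent. The function names and the input format, a dictionary of protein names and one-letter sequences, are assumptions for illustration. The check for non-homology between the source proteins is not handled here.

```python
from collections import defaultdict

AMINO_ACID_TYPES = 20


def sequence_space(length):
    """Number of distinct peptide sequences of the given length."""
    return AMINO_ACID_TYPES ** length


def shared_peptides(proteins, length):
    """Map each peptide found in two or more proteins to those protein names.

    `proteins` is a dict of name -> amino acid sequence (one-letter codes).
    """
    found = defaultdict(set)
    for name, sequence in proteins.items():
        for start in range(len(sequence) - length + 1):
            found[sequence[start:start + length]].add(name)
    return {pep: names for pep, names in found.items() if len(names) > 1}
```

Shorter windows reduce the sequence space sharply, for example 20^5 = 3,200,000 for peptides of 5 residues, which is why identical pairs become common in a large database as the peptide length decreases.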

3.3. Geometrical Analysis of Small Peptides

As described in the introduction section, the applicable set of current models is dependent on position and chemical type of amino acids in the study of immunogenic peptides. It is possible to further explore geometric-based models, and models based on geometry and chemical properties, to make predictions on small peptides as they are processed and detected by the immune system of vertebrates. With sufficient data availability, the deep learning approaches are of interest to model these immunological systems, particularly where traditional approaches based on interpretable parameters are not successful. Since the small peptides are a basic mechanism of adaptive immunity in the jawed vertebrates, it is essential to collect data, beyond a narrowly sampled effort, or a limited survey, for building models. Without these models, there is low predictability on the evolution of pathogens and the practitioners of the science will tend to form a misperception of the system.

References

1. Bjoern Peters; Morten Nielsen; Alessandro Sette. T Cell Epitope Predictions. Annu. Rev. Immunol. 2020, 38, 123-145.
2. Victor H. Engelhard. Structure of peptides associated with MHC class I molecules. Curr. Opin. Immunol. 1994, 6, 13-23.
3. Thomas Serwold; Federico Gonzalez; Jennifer Kim; Richard Jacob; Nilabh Shastri. ERAAP customizes peptides for MHC class I molecules in the endoplasmic reticulum. Nature 2002, 419, 480-483.
4. K. Maude Ashby; Kristin A. Hogquist. A guide to thymic selection of T cells. Nat. Rev. Immunol. 2023, 1-15.
5. Jason T. George; David A. Kessler; Herbert Levine. Effects of thymic selection on T cell recognition of foreign and tumor antigenic peptides. Proc. Natl. Acad. Sci. 2017, 114, E7875-E7881.
6. Dorinda A. Smith; Dori R. Germolec. Introduction to Immunology and Autoimmunity. Environ. Health Perspect. 1999, 107, 661.
7. L.B. Jorde. Genetic variation and human evolution. American Society of Human Genetics 2003, 7, 28-33.
8. Dinler A. Antunes; Didier Devaurs; Mark Moll; Gregory Lizée; Lydia E. Kavraki. General Prediction of Peptide-MHC Binding Modes Using Incremental Docking: A Proof of Concept. Sci. Rep. 2018, 8, 4327.
9. Muhammad Saqib Sohail; Syed Faraz Ahmed; Ahmed Abdul Quadeer; Matthew R. McKay. In silico T cell epitope identification for SARS-CoV-2: Progress and perspectives. Adv. Drug Deliv. Rev. 2021, 171, 29-47.
10. Ehsan Raoufi; Maryam Hemmati; Samane Eftekhari; Kamal Khaksaran; Zahra Mahmodi; Mohammad M. Farajollahi; Monireh Mohsenzadegan. Epitope Prediction by Novel Immunoinformatics Approach: A State-of-the-art Review. Int. J. Pept. Res. Ther. 2019, 26, 1155-1163.
11. Philip Bradley. Structure-based prediction of T cell receptor:peptide-MHC interactions. eLife 2023, 12, e82813.
12. John Jumper; Richard Evans; Alexander Pritzel; Tim Green; Michael Figurnov; Olaf Ronneberger; Kathryn Tunyasuvunakool; Russ Bates; Augustin Žídek; Anna Potapenko; et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583-589.
13. Yang Zhang; Jeffrey Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Struct. Funct. Bioinform. 2004, 57, 702-710.
14. Adam Zemla. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 2003, 31, 3370-3374.
15. Julia Koehler Leman; Pawel Szczerbiak; P. Douglas Renfrew; Vladimir Gligorijevic; Daniel Berenberg; Tommi Vatanen; Bryn C. Taylor; Chris Chandler; Stefan Janssen; Andras Pataki; et al. Sequence-structure-function relationships in the microbial protein universe. Nat. Commun. 2023, 14, 1-11.
16. Kristin Johnson. Natural history as stamp collecting: a brief history. Arch. Nat. Hist. 2007, 34, 244-258.
17. Michael Frede. Plato’s Sophist on False Statements. In The Cambridge Companion to Plato; Richard Kraut, Ed.; Cambridge University Press: Cambridge, United Kingdom, 1992; pp. 397-424.
18. S A Bero; A K Muda; Y H Choo; N A Muda; S F Pratama. Similarity Measure for Molecular Structure: A Brief Review. 2017, 892, 012015.
19. Jinrui Xu; Yang Zhang. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 2010, 26, 889-895.
20. Python code to help process files of 3d protein structure (PDB format). GitHub (accessed on 19 August 2023). Retrieved 2023-8-21.