2. Matrix Representations of Whole Sets of N-Plets (or N-Mers)
The genetic code system is based on sets or alphabets of N-plets (or N-mers) such as:
- the set of $4^1 = 4$ monoplets (in DNA: A, C, G, T; in RNA, uracil U replaces thymine T);
- the set of $4^2 = 16$ duplets (AA, AC, AG, AT, …);
- the set of $4^3 = 64$ triplets (AAA, AAC, ACA, ACG, ACT, …);
- etc.
Each whole set of $4^N$ N-plets coincides with the whole set of $4^N$ entries in a $(2^N \times 2^N)$-matrix, which belongs to the Kronecker family of genetic matrices [A G; C T](N), where (N) means Kronecker (or tensor) power. The accompanying figure shows the first three members of this Kronecker family for N = 1, 2, 3. It also shows that, inside such a matrix [A G; C T](N), each N-plet has its individual binary coordinates (or the corresponding coordinates in decimal notation) due to biochemical attributes of N-plets. This is explained in detail below.
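The construction of this Kronecker family can be sketched in a few lines of Python. This is only an illustration, not the authors' software: the helper names `kron_symbolic` and `genetic_matrix` are hypothetical, and the symbolic Kronecker product is assumed to concatenate letters, so that the N-th Kronecker power of [A G; C T] lists all $4^N$ N-plets.

```python
def kron_symbolic(a, b):
    """Kronecker product of two matrices of strings.

    Assumption: the "product" of two symbolic entries is their concatenation.
    """
    return [
        [x + y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def genetic_matrix(n, kernel=(("A", "G"), ("C", "T"))):
    """Return the n-th Kronecker power of the kernel as a list of rows."""
    base = [list(row) for row in kernel]
    result = base
    for _ in range(n - 1):
        result = kron_symbolic(result, base)
    return result


# Print the (4 x 4)-matrix of duplets row by row.
for row in genetic_matrix(2):
    print(" ".join(row))
```

With this convention, the second row of `genetic_matrix(3)` consists of the triplets AAC, AAT, AGC, AGT, GAC, GAT, GGC, GGT, which agrees with the row example discussed in the text.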
The four nitrogenous bases—adenine A, guanine G, cytosine C, thymine T (or uracil U in RNA)—represent specific poly-nuclear constructions with special bio-chemical properties. The set of these four constructions is not absolutely heterogeneous, but it bears the substantial symmetric system of distinctive-uniting attributes (or, more precisely, pairs of an “attribute–antiattribute”). This system of pairs of opposite attributes divides the genetic four-letter alphabet into three pairs of letters in all possible ways; letters of each such pair are equivalent to each other in accordance with one of these attributes or with its absence.
The system of such attributes divides the genetic four-letter alphabet into three pairs of letters, which are equivalent from the viewpoint of one of these attributes or its absence: (1) C = T and A = G (according to the binary-opposite attributes “pyrimidine” or “non-pyrimidine”, that is, purine); (2) A = C and G = T (according to the binary-opposite attributes “keto” or “amino” [38]); (3) C = G and A = T (according to the attributes of three or two hydrogen bonds, that is, the strong–weak division, realized in these complementary pairs). The possibility of such a division of the genetic alphabet into three binary sub-alphabets is known from the work [38]. We utilize these known sub-alphabets by means of the following approach in the field of matrix genetics. We attach the appropriate binary symbol, “0” or “1”, to each of the genetic letters from the viewpoint of each of these sub-alphabets. Then we use these binary symbols for the binary numbering of the columns and rows of the genetic matrices of the Kronecker family.
Let us mark the above three kinds of binary-opposite attributes with the numbers N = 1, 2, 3 and ascribe to each of the four genetic letters the symbol “$0_N$” (the symbol “$1_N$”) in the case of presence (or absence, correspondingly) of the attribute number “N” in this letter. As a result, we obtain a representation of the genetic four-letter alphabet in the system of its three “binary sub-alphabets to attributes”, summarized in the accompanying table.
The table shows that each of the letters A, C, G, T/U possesses three “faces” or meanings in the three binary sub-alphabets. On the basis of each kind of attribute, the genetic four-letter alphabet is reduced to a two-letter alphabet. For example, on the basis of the first kind of binary-opposite attributes we have, instead of the four-letter alphabet, an alphabet of two letters $0_1$ and $1_1$, which one can name “the binary sub-alphabet to the first kind of the binary attributes”.
Accordingly, any genetic message as a sequence of the four letters C, A, G, T consists of three parallel and different binary texts, that is, three different sequences of zeros and ones (such binary sequences are used for storage and transfer of information in computers). Each of these parallel binary texts, based on objective biochemical attributes, can provide its own genetic function in organisms.
In view of these three binary sub-alphabets, any nucleotide sequence can be represented as three different binary sequences. For example, the sequence ATGGC… is represented as:
- 10110… (in accordance with the first sub-alphabet; its decimal equivalent can be located on the “X” axis of a Cartesian system of coordinates);
- 01110… (in accordance with the second sub-alphabet; its decimal equivalent can be located on the “Y” axis of a Cartesian system of coordinates);
- 11000… (in accordance with the third sub-alphabet; its decimal equivalent can be located on the “Z” axis of a Cartesian system of coordinates).
For an unambiguous determination of a nucleotide sequence, it is sufficient to know its binary representations in any two of the three sub-alphabets [31,32,37]. In particular, in this example of the sequence ATGGC…, to get its third binary representation 11000… (in accordance with the third sub-alphabet) it is enough to add its other two representations 10110… and 01110… (obtained in accordance with the first two sub-alphabets) bit by bit by means of modulo-2 addition.
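The three binary readings and the modulo-2 relation between them can be illustrated with a short Python sketch. The 0/1 assignments in `SUB_ALPHABETS` are read off the worked example ATGGC in the text (pyrimidine, amino and three-hydrogen-bond letters receive 0); the function names `encode`, `xor_bits` and `decode` are hypothetical and serve only this illustration.

```python
# Assumed 0/1 assignments, inferred from the ATGGC example in the text.
SUB_ALPHABETS = {
    1: {"A": "1", "G": "1", "C": "0", "T": "0"},  # purine = 1, pyrimidine = 0
    2: {"A": "0", "C": "0", "G": "1", "T": "1"},  # amino = 0, keto = 1
    3: {"C": "0", "G": "0", "A": "1", "T": "1"},  # 3 H-bonds = 0, 2 H-bonds = 1
}


def encode(seq, k):
    """Binary reading of a nucleotide sequence in sub-alphabet k (1, 2 or 3)."""
    table = SUB_ALPHABETS[k]
    return "".join(table[ch] for ch in seq.upper().replace("U", "T"))


def xor_bits(a, b):
    """Bitwise modulo-2 addition of two equally long binary strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def decode(bits1, bits2):
    """Recover the sequence from its first and second binary readings."""
    letters = {("1", "0"): "A", ("1", "1"): "G", ("0", "0"): "C", ("0", "1"): "T"}
    return "".join(letters[pair] for pair in zip(bits1, bits2))


first, second = encode("ATGGC", 1), encode("ATGGC", 2)
print(first, second, xor_bits(first, second), decode(first, second))
```

The round trip through `decode` shows why two readings suffice: each pair of bits (first reading, second reading) identifies exactly one of the four letters.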
In genetic matrices of the Kronecker family, each row has its individual binary number, because all N-plets inside this row have an identical binary representation from the point of view of the first sub-alphabet. For example, in the (8 × 8)-matrix [A G; C T](3), the second row has the binary number 110 because each of its triplets (AAC, AAT, AGC, AGT, GAC, GAT, GGC, GGT) is a “purine–purine–pyrimidine” sequence, which corresponds to the binary number 110 from the point of view of the first sub-alphabet. Analogously, each column has its individual binary number, because all N-plets inside this column have an identical binary representation from the point of view of the second sub-alphabet. For example, in the (8 × 8)-matrix [A G; C T](3), the third column has the binary number 010 because each of its triplets (AGA, AGC, ATA, ATC, CGA, CGC, CTA, CTC) is an “amino–keto–amino” sequence, which corresponds to the binary number 010 from the point of view of the second sub-alphabet. Accordingly, each N-plet, which is located in the appropriate genetic matrix at the crossing of a column and a row, obtains its individual 2-dimensional coordinates on the basis of the binary numbers of its column and row. For example, the triplet AGC, which is located at the crossing of the mentioned column and row, obtains its individual binary coordinates (010, 110), or (2, 6) in decimal notation.
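The row and column numbering can be checked with a small Python sketch. The bit assignments (purine = 1 and pyrimidine = 0 for rows; keto = 1 and amino = 0 for columns) are inferred from the examples in the text, and the function names are hypothetical. The helper `cell_position` additionally assumes, in line with the examples, that row numbers decrease from top to bottom and column numbers increase from left to right.

```python
ROW_BIT = {"A": "1", "G": "1", "C": "0", "T": "0"}  # first sub-alphabet
COL_BIT = {"A": "0", "C": "0", "G": "1", "T": "1"}  # second sub-alphabet


def binary_coordinates(nplet):
    """Return (column number, row number) of an N-plet as binary strings."""
    col = "".join(COL_BIT[ch] for ch in nplet)
    row = "".join(ROW_BIT[ch] for ch in nplet)
    return col, row


def decimal_coordinates(nplet):
    """Return (column number, row number) of an N-plet in decimal notation."""
    col, row = binary_coordinates(nplet)
    return int(col, 2), int(row, 2)


def cell_position(nplet):
    """Zero-based (row index from the top, column index from the left)."""
    col, row = decimal_coordinates(nplet)
    return 2 ** len(nplet) - 1 - row, col


print(binary_coordinates("AGC"), decimal_coordinates("AGC"), cell_position("AGC"))
```

For AGC the sketch gives the binary coordinates (010, 110), the decimal coordinates (2, 6), and the position "second row, third column", as in the example above.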
Any long nucleotide sequence can be divided into equal pieces of arbitrary length, and the binary record of each of these fragments can be read in decimal notation. Then any long nucleotide sequence is represented in the form of three different sequences of decimal numbers, and for its unique identification it is sufficient to know its decimal representations in any two sub-alphabets.
If one divides a long nucleotide sequence into equal fragments of length N (N-mers or N-plets), then each of these fragments is defined by its two binary representations (from the points of view of two sub-alphabets) or by their equivalents in decimal notation. For example, the 5-mer ATGGC is represented as 10110 (in accordance with the first sub-alphabet) and 01110 (in accordance with the second sub-alphabet). The corresponding decimal values are 22 and 14. In this way, the 5-mer ATGGC can be represented not only as the cell with coordinates (22, 14) inside the genomatrix [A G; C T](5) but also as the point with decimal coordinates (22, 14) in the orthogonal Cartesian system of coordinates (x, y). Taking into account the chosen connection between each sub-alphabet and one of the X, Y, Z axes of the Cartesian system of coordinates, the following correspondence exists between Kronecker families of genomatrices and the 2-dimensional planes (x, y), (x, z) and (y, z) of the Cartesian system:
- the plane (x, y) corresponds to matrices [A G; C T](N), whose rows and columns are binary numerated from the point of view of the first sub-alphabet and the second sub-alphabet respectively;
- the plane (x, z) corresponds to matrices [G A; C T](N), whose rows and columns are binary numerated from the point of view of the first sub-alphabet and the third sub-alphabet respectively;
- the plane (y, z) corresponds to matrices [G T; C A](N), whose rows and columns are binary numerated from the point of view of the second sub-alphabet and the third sub-alphabet respectively.
Taking into account this 2-dimensional representation of each N-plet, one can introduce a notion of the Euclidean distance R between any pair of N-plets $V(a_1, b_1)$ and $W(a_2, b_2)$:
$$R = \sqrt{(a_1 - a_2)^2 + (b_1 - b_2)^2}$$
One can also introduce notions of distance of other types.
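As a minimal illustration, the Euclidean distance between two N-plets can be computed directly from their decimal coordinates. The function name `euclidean_distance` is hypothetical. The sample points are the coordinates (22, 14) of the 5-mer ATGGC from the text and the coordinates (31, 0) that the 5-mer AAAAA receives under the same two sub-alphabets (11111 and 00000).

```python
import math


def euclidean_distance(v, w):
    """Euclidean distance between two points given as (a, b) coordinate pairs."""
    (a1, b1), (a2, b2) = v, w
    return math.hypot(a1 - a2, b1 - b2)


# Distance between ATGGC at (22, 14) and AAAAA at (31, 0).
print(euclidean_distance((22, 14), (31, 0)))
```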
The method, which is described below, uses many variants of a division of a nucleotide sequence into fragments of equal length (N-plets). Each whole set of N-plets, which contains $4^N$ members, is located inside one of the matrices of a Kronecker family such as [A G; C T](N). Correspondingly, this method is closely connected with Kronecker multiplication of matrices, which is widely used in mathematics, informatics, physics, etc. and which is one of the main mathematical operations in the field of matrix genetics [32,33,34,35,36,37]. Kronecker multiplication of matrices is used when one needs to go from spaces of smaller dimension into associated spaces of higher dimension. If one uses the mathematical language of vector spaces for modeling the ontogenetic complication of a living organism, it is natural to apply the ideology of a gradual transition from spaces of low dimension into spaces of higher dimension. Such a gradual transition is described by means of a series of Kronecker multiplications of matrices.
3. The Description of the Matrix Method for Long Nucleotide Sequences
In a general case, the proposed method includes the following algorithmic steps:
- Any long nucleotide sequence, which contains K nucleotides, is divided into equal fragments of length N (N-plets or N-mers), where N takes different values: N = 1, 2, 3, …, K; as a result, a set of different symbolic representations of this sequence as a chain of N-plets appears;
- Each N-plet in each of these representations of the sequence is transformed into three kinds of N-bit binary numbers by reading it from the point of view of the three sub-alphabets. Each of these binary numbers is transformed into its decimal equivalent. As a result, a set of different decimal representations of the initial symbolic sequence appears in the form of three kinds of sequences of decimal numbers, which serve respectively as non-negative integer coordinates on the Cartesian axes X, Y, Z (or as numbers of rows and columns of the appropriate genetic matrices).
- Any two of the obtained numeric sequences define a sequence of pairs of non-negative integer coordinates of points on a 2-dimensional Cartesian plane (or coordinates of cells inside the appropriate genetic matrix of a Kronecker family). On the basis of these pairs of coordinates, a set of corresponding points is built on the 2-dimensional Cartesian plane (or a set of corresponding cells is shown in black inside the respective genetic matrix of a Kronecker family, in contrast to the other cells, which remain white).
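The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' program: fragments are taken without overlap, an incomplete tail shorter than N is dropped, the 0/1 assignments in `BITS` are inferred from the ATGGC example, and all function names are hypothetical. The text rendering draws y downward, which is an arbitrary choice for this illustration.

```python
# Assumed 0/1 assignments of the three sub-alphabets (inferred from the text).
BITS = {
    1: {"A": "1", "G": "1", "C": "0", "T": "0"},
    2: {"A": "0", "C": "0", "G": "1", "T": "1"},
    3: {"C": "0", "G": "0", "A": "1", "T": "1"},
}


def split_into_nplets(seq, n):
    """Step 1: non-overlapping fragments of length n (incomplete tail dropped)."""
    return [seq[i:i + n] for i in range(0, len(seq) - n + 1, n)]


def to_decimal(nplet, k):
    """Step 2: read an N-plet in sub-alphabet k and convert it to decimal."""
    return int("".join(BITS[k][ch] for ch in nplet), 2)


def occupied_cells(seq, n, axes=(1, 2)):
    """Step 3: set of coordinate pairs for the chosen pair of sub-alphabets."""
    a, b = axes
    return {(to_decimal(f, a), to_decimal(f, b)) for f in split_into_nplets(seq, n)}


def render(cells, n):
    """Text mosaic: '#' for present N-plets, '.' for missing ones."""
    size = 2 ** n
    return "\n".join(
        "".join("#" if (x, y) in cells else "." for x in range(size))
        for y in range(size)
    )


cells = occupied_cells("ATGGCATTAGCCGATA", 2)
print(render(cells, 2))
```

For realistic values such as N = 16 the grid has $2^{16} \times 2^{16}$ cells, so an actual implementation would plot only the occupied points instead of rendering every cell.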
As a result of these algorithmic steps, different black-and-white mosaics arise as representations of any long nucleotide sequence in different cases of its division into N-plets. The figure shows examples of fractal-like and other visual patterns, which have been obtained on the basis of the described method for some long nucleotide sequences.
The numbered patterns correspond to the following sequences:
- Homo sapiens contactin associated protein-like 2 (CNTNAP2), RefSeqGene on chromosome 7 (N = 63).
- Homo sapiens contactin associated protein-like 2 (CNTNAP2), RefSeqGene on chromosome 7 (N = 63).
- Sorangium cellulosum So0157-2, complete genome (N = 63).
- Burkholderia multivorans ATCC 17616 genomic DNA, complete genome, chromosome 2 (N = 63).
- Thermofilum sp. 1910b, complete genome (N = 63).
- Thermofilum sp. 1910b, complete genome (N = 63).
- Dinoroseobacter shibae DFL 12, complete genome (N = 8).
- Escherichia coli LY180, complete genome (N = 24).
- Francisella tularensis subsp. tularensis SCHU S4 complete genome (N = 24).
- Halomonas elongata DSM 2581, complete genome (N = 24).
- Helicobacter mustelae 12198 complete genome (N = 24).
- Helicobacter mustelae 12198 complete genome (N = 12).
- Invertebrate iridovirus 22 complete genome (N = 8).
- Methanosalsum zhilinae DSM 4017, complete genome (N = 12).
- Methanosalsum zhilinae DSM 4017, complete genome (N = 12).
- Mycobacterium abscessus subsp. bolletii INCQS 00594 INCQS00594_scaffold1, whole genome shotgun sequence (N = 12).
- Penicillium chrysogenum Wisconsin 54-1255 complete genome, contig Pc00c12 (N = 32).
- Riemerella anatipestifer DSM 15868, complete genome (N = 12).
- Riemerella anatipestifer DSM 15868, complete genome (N = 12).
- Burkholderia multivorans ATCC 17616 genomic DNA, complete genome, chromosome 2 (N = 8).
This mosaic pattern shows the phenomenology of “presence and absence” of different N-plets. Note that a single variant of equal fragmentation of a long nucleotide sequence (for example, a division into 16-plets) does not provide an unambiguous definition of this sequence; such a single division represents the sequence as a set of fragments without reflecting their order in the sequence (any permutation of these fragments gives a new sequence with the same set of N-plets). To get an unambiguous definition of the sequence, one should take into consideration all (or many) possible variants of its equal partitions (N = 1, 2, 3, …). In practice, for many tasks of comparative analysis and classification of different long nucleotide sequences, it is enough to consider some chosen variants of fragmentation of these sequences, for example, variants with N = 16, 32, 64.
Another possible way to get an unambiguous representation of a long nucleotide sequence in the case of its division with a certain value of N (for example, with N = 8) is connected with the construction of additional visual patterns, which reflect the order of N-plets in the sequence.
The figure shows two examples of such mosaic patterns for Homo sapiens chromosome 22 genomic scaffold and for Arabidopsis thaliana mitochondrion in the case of their representations as sets of 16-mers. On these mosaics, white places correspond to the positions, on the corresponding 2-dimensional plane, of those 16-mers that are missing in such representations of the sequences. The mosaic pattern depends on the concrete choice of two of the three sub-alphabets. The two mosaic patterns are shown on the 2-dimensional Cartesian planes (x, y) and (y, z); they are identical to the black-and-white mosaics of the genetic matrices [A G; C T](16) and [G T; C A](16) respectively, where cells with existing 16-plets of the sequence are shown in black and cells with missing 16-plets are shown in white.
The figure shows one of the interesting patterns obtained by the described method.
Binary representations of N-mers are expressed in the form of N-bit binary numbers, of which there are $2^N$ different kinds. For example, the set of 3-bit binary numbers contains $2^3 = 8$ members: 000, 001, 010, 011, 100, 101, 110, 111 (their equivalents in decimal notation are 0, 1, 2, 3, 4, 5, 6, 7). The decimal equivalent of the biggest member in a set of N-bit binary numbers is equal to $2^N - 1$. Such sets of N-bit binary numbers are named “dyadic groups” (see details in [6]).
The most interesting application of this matrix method is realized in the case of long nucleotide sequences, which are divided into relatively long N-mers (N = 8, 9, 10, …). The reasons for this are the following:
- a long nucleotide sequence, which is divided into relatively short N-mers (N = 1, 2, 3, 4), usually contains all possible kinds of such short N-mers; correspondingly, its visual pattern is trivial because it contains all possible points with non-negative integer coordinates (x, y) inside the appropriate numeric range;
- a long nucleotide sequence, which is divided into relatively long N-mers (N = 8, 9, 10, …), usually generates a regular non-trivial mosaic of a fractal-like or other character. This was detected using a special computer program in the course of initial investigations of different long nucleotide sequences by means of the described method.
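A simple counting argument makes the difference between short and long N-mers concrete. A sequence of K nucleotides yields at most K // N fragments when it is cut without overlap (an assumption of this sketch), while the matrix has $4^N$ cells. The hypothetical helper `max_fill_fraction` below returns the resulting upper bound on the share of black cells; it says nothing about the shape of the pattern.

```python
def max_fill_fraction(k, n):
    """Upper bound on the fraction of occupied cells for K nucleotides, N-mers.

    Assumes non-overlapping fragments, so there are at most k // n of them,
    and the matrix of all N-plets has 4 ** n cells.
    """
    fragments = k // n
    cells = 4 ** n
    return min(1.0, fragments / cells)


# Sequence length of the chromosome 22 scaffold mentioned in the text.
K = 648_059
for n in (2, 4, 8, 12, 16):
    print(n, K // n, 4 ** n, max_fill_fraction(K, n))
```

For K = 648,059 the bound is 1 for N = 4 (256 cells, more than 160,000 fragments), whereas for N = 16 at most 40,503 of the $4^{16}$ cells can be occupied.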
The lower level of the figure also illustrates that, in a certain range of values of N, the visual fractal mosaics for different N approximately repeat each other (see Section 5 about this “stability” of the fractal-like mosaics).
Fractal patterns, which are obtained by means of the described matrix method, sometimes resemble fractal patterns of long nucleotide sequences and amino acid sequences that were previously obtained by means of the known “Chaos Game Representation” method (CGR-method) in work [12], though the two methods are quite different in their algorithmic essence. In particular, the CGR-method represents nucleotide sequences or other long sequences by means of the four numbers 0, 1, 2, 3 rather than the binary numbers 0, 1. In addition, our new method seems to be simpler for biologists to understand and use.
5. Kronecker Multiplication, Fractal Lattices and the Problem of Coding an Organism on Different Stages of Its Ontogenesis
Previous sections have shown that the described method gives very different types of visual patterns for random nucleotide sequences (where non-regular patterns arise) and for real nucleotide sequences (where fractal-like patterns have been revealed, as in the figures above). The authors note that in many cases these fractal-like patterns of long nucleotide sequences resemble fractal lattices, which are automatically generated for matrices of Kronecker families. We should explain this in more detail.
Let us take a square (k × k)-matrix M, whose entries are equal only to 0 or 1. Any integer Kronecker power (N) of this matrix generates a new $(k^N \times k^N)$-matrix M(N) with a fractal location of entries 0 and 1 inside it. These fractal mosaics inside such matrices of Kronecker families are called “fractal lattices.” The theme of “Kronecker multiplication and fractal lattices” is accurately described in a previous book [39]. Such fractal lattices are generated due to the general definition of Kronecker multiplication of matrices as a special mathematical operation.
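How a fractal lattice arises from Kronecker powers can be seen in a short pure-Python sketch. The (2 × 2) kernel [1 1; 1 0] used here is only an illustrative choice, not a matrix taken from this paper, and the names `kron` and `kron_power` are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two numeric matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def kron_power(m, n):
    """n-th Kronecker power of matrix m (n >= 1)."""
    result = [row[:] for row in m]
    for _ in range(n - 1):
        result = kron(result, m)
    return result


# Illustrative kernel (an assumption for this example only).
kernel = [[1, 1], [1, 0]]
for row in kron_power(kernel, 4):
    print("".join("#" if v else "." for v in row))
```

Each zero of the kernel produces a whole block of zeros at every scale, which is the source of the self-similar mosaic. For a 0/1 kernel with q ones, the N-th Kronecker power contains $q^N$ ones.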
One should note that, in many cases, significant features of fractal-like patterns of real nucleotide sequences can be simulated by means of fractal lattices of matrices of a Kronecker family, if the matrix kernel of the Kronecker family is adequately chosen. For example, let us take the pattern of the nucleotide sequence Homo sapiens chromosome 22 genomic scaffold, which has 648,059 nucleotides [40,41] and which is divided into a sequence of 16-mers. If this pattern is covered by a uniform (8 × 8)-grid, 8 cells of this grid are almost white in contrast to the remaining 56 cells (upper level, left side of the figure). In such a case this black-and-white (8 × 8)-mosaic is similar to the mosaic of the genetic (8 × 8)-matrix [A G; C T](3) of 64 triplets in which the 8 triplets located in the same places are missing; these triplets are marked in red (upper level, right side). Let us replace these 8 missing triplets by the number 0 and all other 56 triplets by the number 1. This transforms this variant of the symbolic matrix [A G; C T](3) into a numeric matrix S (bottom level, left side).
Kronecker exponentiation of the matrix S generates matrices S(2), S(3), …, whose visual patterns illustrate the corresponding fractal lattices; the lattice for the matrix S(2) is shown in the figure (bottom level, right side). The numeric matrix S(16) contains the whole set of 16-plets with an appropriate fractal lattice, which resembles the visual pattern of the real nucleotide sequence Homo sapiens chromosome 22 genomic scaffold. One should note that the visual pattern of this real sequence contains more white places than the matrix S(16), because many additional 16-plets are absent since the sequence has a finite length of 648,059 nucleotides.
Fractal-like lattices in visual patterns of long nucleotide sequences testify in favor of the significance of Kronecker multiplication for the structuring of these genetic sequences. This is not an isolated fact about the genetic significance of Kronecker multiplication. Previously we have provided other evidence for the biological significance of Kronecker multiplication of matrices in the phenomenology of natural ensembles of molecular-genetic alphabets [32,33,34,35,36,37] and also in the structure of Punnett squares in the field of Mendelian genetics in connection with the Mendelian laws of independent inheritance of traits [33].