Mathematical Biology & Bioinformatics | Volume 16 Issue 2 Year 2021

Principal Components of Genetic Sequences: Correlations and Significance

Efimov V.M.¹,²,³,⁴, Efimov K.V.⁵, Kovaleva V.Yu.², Matushkin Yu.G.¹

¹Institute of Cytology and Genetics SB RAS, Novosibirsk, Russia
²Institute of Systematics and Ecology of Animals SB RAS, Novosibirsk, Russia
³Novosibirsk State University, Novosibirsk, Russia
⁴Tomsk State University, Tomsk, Russia
⁵Higher School of Economics (HSE), Moscow, Russia

Abstract. Any numerical series can be decomposed into principal components using singular spectrum analysis (SSA). We have recently proposed a new analysis method, PCA-Seq, which allows numerical principal components to be calculated for a sequence of elements of any type. In particular, the sequence may be composed of nucleotides or amino acid residues. Two questions inevitably arise: how to interpret the obtained principal components and how to assess their reliability. To interpret the principal components of a symbolic sequence, it is reasonable to evaluate their correlations with numerical characteristics of the sequence elements. When assessing the significance of correlations between sequences, one should bear in mind that standard significance criteria assume independent observations, an assumption that, as a rule, does not hold for real sequences. The article discusses the use for these purposes of the anchor bootstrap technique, also previously developed by the authors. This approach assumes that the objects can be represented by points of a metric space which, taken together, make up some fixed structure in it, in particular a sequence. The objects are assigned the same random integer weights as in the classical bootstrap. This is sufficient to obtain the bootstrap distribution of the correlation coefficients and to assess their significance. The coding sequence of the SLC9A1 gene (synonyms APNH, NHE1, PPP1R143) was taken as an example of using the anchor bootstrap technique in genetic sequence analysis. Significant correlations of the first principal component were found with the hydrophobicity/“transmembraneity” of the corresponding fragments of the amino acid sequence, with their phenylalanine content, and with the difference between the T and A content of the corresponding nucleotide fragments. A similar pattern was found earlier by other authors for other genes, so it is very likely of a more general nature.

Key words: SSA, PCA-Seq, SLC9A1 (NHE1) gene, CDS, protein secondary structure, external factors, anchor bootstrap.
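
The weighting step described in the abstract can be illustrated with a small sketch. It is a simplified illustration, not the authors' implementation: it only shows how random integer weights of the classical bootstrap (the counts from n draws with replacement) yield a bootstrap distribution of a weighted correlation coefficient. It does not represent the metric-space structure that the anchor bootstrap keeps fixed. All function names, the percentile interval, and the toy data are assumptions introduced here for illustration.

```python
import math
import random
from collections import Counter


def weighted_corr(x, y, w):
    """Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    if sxx == 0 or syy == 0:
        return float("nan")  # degenerate replicate: no variance left
    return sxy / math.sqrt(sxx * syy)


def bootstrap_weights(n, rng):
    """Integer weights as in the classical bootstrap:
    how often each of n objects is hit by n draws with replacement."""
    counts = Counter(rng.choices(range(n), k=n))
    return [counts.get(i, 0) for i in range(n)]


def bootstrap_corr(x, y, n_boot=1000, seed=0):
    """Bootstrap distribution of the weighted correlation (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(n_boot):
        r = weighted_corr(x, y, bootstrap_weights(n, rng))
        if not math.isnan(r):
            reps.append(r)
    return reps


def percentile_interval(reps, alpha=0.05):
    """Simple percentile interval; one of several possible choices (assumption)."""
    s = sorted(reps)
    lo = s[int(math.floor(alpha / 2 * (len(s) - 1)))]
    hi = s[int(math.ceil((1 - alpha / 2) * (len(s) - 1)))]
    return lo, hi


if __name__ == "__main__":
    # Toy data: a made-up component score and a made-up numeric property.
    demo_rng = random.Random(1)
    pc1 = [math.sin(i / 5.0) for i in range(60)]
    prop = [v + demo_rng.gauss(0.0, 0.5) for v in pc1]
    lo, hi = percentile_interval(bootstrap_corr(pc1, prop))
    print("95% percentile interval:", lo, hi)
    print("interval excludes zero:", lo > 0 or hi < 0)
```

Under these assumptions, a correlation would be treated as significant when the interval excludes zero. The weights leave each object in place and only change how much it counts, which is why the sketch never reorders or resamples the sequence itself.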