Biological Language Model. Qiwen Dong
Figure 2-1 shows the comparative n-gram analysis of Human (A) for n = 3 and R_norvegicus (B) for n = 4. The x-axis represents the ranked n-grams of a specific organism, and the y-axis represents the corresponding frequency. The sorted n-grams of the organism of choice are shown as the bold line; thin lines indicate the frequencies of the n-grams with the given rank in other organisms. Table 2-1 lists the 20 organisms used in this book.
In natural language, some words are used frequently and others rarely; similarly, in proteins the 20 amino acids are used with different frequencies. In the uni-gram plots of the 20 organisms, Leucine is one of the most frequent amino acids, ranking among the top three. Tryptophan and Cysteine, on the other hand, are among the rarest amino acids, ranking among the last three. In language, frequent words are usually not closely related to the actual meaning of the sentence, whereas rare words often are. The same may hold for the rare amino acids, which may be important for the structure and function of the protein.
Another statistical feature of n-grams is that there are organism-specific “phrases” in the protein sequences. Examples are shown in Fig. 2-1. In Human (Fig. 2-1(A)), the phrases “PPP”, “PGP” and “SSP” are among the top 20 most frequently used 3-grams, but they are used with very low frequencies in other organisms. Similarly, in R_norvegicus (Fig. 2-1(B)), such phrases are “HTGE”, “GEKP”, “CGKA”, “GKAF”, “IHTG” and “PYKC”. These highly idiosyncratic n-grams suggest that protein sequences contain organism-specific “phrases”.
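The counting behind such a comparison can be sketched in a few lines of Python. This is a minimal illustration only: the helper names (`ngram_counts`, `relative_frequencies`, `ranked`) and the toy sequences are assumptions made for the example, not the procedure or data used for Fig. 2-1.

```python
from collections import Counter


def ngram_counts(sequences, n):
    """Count overlapping n-grams over a list of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Convert raw n-gram counts into relative frequencies."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def ranked(counts):
    """Return (n-gram, count) pairs by descending count, ties alphabetical."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy sequences invented for illustration; real input would be whole proteomes.
organism_a = ["MPPPGPSSPPPP", "APPPLSSP"]
organism_b = ["MKTAYIAKQR", "GEKPYKCGKAF"]

top_a = ranked(ngram_counts(organism_a, 3))[:5]
freq_b = relative_frequencies(ngram_counts(organism_b, 3))

# For each top 3-gram of organism A, look up its frequency in organism B.
for gram, count in top_a:
    print(gram, count, freq_b.get(gram, 0.0))
```

An n-gram that ranks high in one organism but has a frequency near zero in the others is a candidate organism-specific “phrase” in the sense described above.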
Table 2-1 Organism names used in the plot.
Organism | Organism |
A_thaliana | Human |
Aeropyrum_pernix | Methanopyrus_kandleri |
arabidopsis | Streptomyces_avermitilis |
Archaeoglobus_fulgidus | Mycoplasma_genitalium |
Bacillus_anthracis_Ames | Neisseria_meningitidis_MC58 |
Bifidobacterium_longum | Pasteurella_multocida |
Borrelia_burgdorferi | R_norvegicus |
Buchnera_aphidicola_Sg | s_pombe |
Encephalitozoon_cuniculi | Worm |
Fusobacterium_nucleatum | Yeastpom |
Figure 2-1 Comparative n-gram analysis of Human (A) for n = 3 and R_norvegicus (B) for n = 4.
2.3 The Zipf Law Analysis
Claiming Zipf’s law in a data set seems simple enough: if n values $x_i$ (i = 1, 2, ..., n) are ranked so that $x_1 \ge x_2 \ge \dots \ge x_r \ge \dots \ge x_n$, Zipf’s law [15] states that
$$x_r = \frac{C}{r^{\alpha}},$$
where $x_r$ is the value in the data set whose rank is r, and C and α are constants that characterize the law. It can be rewritten as
$$\log x_r = \log C - \alpha \log r.$$
This equation implies that the plot of $x_r$ versus r on a log–log scale is a straight line with slope −α.
In natural language, word frequencies and their ranks follow Zipf’s law. In English in particular, Zipf’s law applies to words, parts of speech, sentences and so on.
Whether n-grams follow Zipf’s law has been analyzed using the n-gram statistics. Figure 2-2 shows the log–log plot of n-gram frequency versus rank for A_thaliana (A) and Human (B). When n is larger than 4, the plot is close to a straight line and the value of α is close to 0.5. We can therefore claim that the n-grams of whole-genome protein sequences approximately follow Zipf’s law when n is larger than 4.
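One way to estimate α from ranked frequencies is an ordinary least-squares fit of log frequency against log rank, assuming the usual power-law form in which the frequency at rank r is C divided by r to the power α. The sketch below is illustrative only: the function name `zipf_fit` and the synthetic data are assumptions, and the text does not state which fitting procedure was used for Fig. 2-2.

```python
import math


def zipf_fit(frequencies):
    """Least-squares fit of log x_r = log C - alpha * log r.

    Returns the pair (alpha, C). Zero frequencies are dropped because
    their logarithm is undefined.
    """
    values = sorted((x for x in frequencies if x > 0), reverse=True)
    if len(values) < 2:
        raise ValueError("need at least two positive frequencies")
    log_r = [math.log(r) for r in range(1, len(values) + 1)]
    log_x = [math.log(x) for x in values]
    mean_r = sum(log_r) / len(log_r)
    mean_x = sum(log_x) / len(log_x)
    sxx = sum((a - mean_r) ** 2 for a in log_r)
    sxy = sum((a - mean_r) * (b - mean_x) for a, b in zip(log_r, log_x))
    slope = sxy / sxx
    intercept = mean_x - slope * mean_r
    return -slope, math.exp(intercept)


# Synthetic data that follow x_r = 1000 / r**0.5 exactly (not real counts).
synthetic = [1000 / r ** 0.5 for r in range(1, 201)]
alpha, c = zipf_fit(synthetic)
print(round(alpha, 3), round(c, 1))
```

On real n-gram counts the low-rank and high-rank ends of the curve often deviate from the line, so the fitted α depends on which range of ranks is included.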
Figure 2-2 Zipf’s Law analysis for A_thaliana (A) and Human (B).
A statistical measure that gives partial information about the degree of complexity of a symbolic sequence can be obtained by calculating the n-gram entropy of the analyzed text. The Shannon n-gram entropy is defined as
$$H_n = -\sum_{i=1}^{\lambda^n} P_i \log_2 P_i,$$
where $P_i$ is the frequency of the i-th n-gram and λ is the number of letters of the alphabet.
From the n-gram entropy, one can obtain the redundancy R represented in any text. The redundancy is given as
$$R = 1 - \lim_{n \to \infty} \frac{H_n}{nK},$$
where $K = \log_2 \lambda$. The redundancy is a manifestation of the flexibility of the underlying language.
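As a rough numerical illustration, the sketch below computes the n-gram entropy from observed n-gram frequencies and a finite-n redundancy. It assumes the standard Shannon form (minus the sum of P_i log2 P_i over observed n-grams) and the finite-n expression 1 − H_n/(nK) with K = log2 λ; the function names are hypothetical and not taken from the text.

```python
import math
from collections import Counter


def ngram_entropy(sequence, n):
    """Shannon entropy (in bits) of the observed n-gram distribution."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    # Unobserved n-grams have P_i = 0 and contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_redundancy(sequence, n, alphabet_size=20):
    """Finite-n redundancy 1 - H_n / (n * K), with K = log2(alphabet_size)."""
    k = math.log2(alphabet_size)
    return 1.0 - ngram_entropy(sequence, n) / (n * k)


# Toy check on a two-letter alphabet: "ABAB" uses both letters equally,
# so its uni-gram entropy is 1 bit and its uni-gram redundancy is 0.
print(ngram_entropy("ABAB", 1))
print(ngram_redundancy("ABAB", 1, alphabet_size=2))
```

For large n the number of possible n-grams (λ to the power n) quickly exceeds the number of n-grams present in any real sequence, so the estimated entropy is biased downward and the redundancy upward; comparisons are only meaningful between sequences of the same length.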
To test whether the n-gram Zipf law could be explained by chance sampling, random genome protein sequences were generated with the same sequence length and amino acid frequencies as the natural genome. The process used to generate these random sequences is the same as the one used by Chatzidimitriou [3].
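One simple way to obtain a control sequence with exactly the same length and amino acid composition is to shuffle the residues of the natural sequence. The sketch below takes that approach as an assumption; it is not necessarily the procedure of the cited reference, and the name `shuffled_control` is hypothetical.

```python
import random
from collections import Counter


def shuffled_control(sequence, seed=None):
    """Return a random permutation of the residues in `sequence`.

    Shuffling preserves the length and the uni-gram composition exactly
    while destroying any higher-order (n > 1) structure.
    """
    rng = random.Random(seed)
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


# Toy sequence invented for illustration.
natural = "MPPPGPSSPPPPAPPPLSSP"
control = shuffled_control(natural, seed=0)

# Same length and same amino acid counts as the input.
print(len(control) == len(natural), Counter(control) == Counter(natural))
```

An alternative is to draw each residue independently from the observed amino acid frequencies, which matches the composition only on average rather than exactly.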
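The n-gram redundancy of the natural and artificial genome protein sequences has been calculated for different values of n (see Fig. 2-3); for finite n, the n-gram redundancy can be approximately expressed as
$$R_n \approx 1 - \frac{H_n}{nK},$$
where $H_n$ is the Shannon n-gram entropy and $K = \log_2 \lambda$. Here the alphabet consists of the amino acids, so the value of λ is 20.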
From Fig. 2-3, one can see that the n-gram redundancy of the natural genome is larger than that of the artificial genome. This means that the n-gram entropy of the natural genome is smaller than that of a random sequence with the same composition, which suggests that a “language” may exist in the protein sequence.
2.4 Distinguishing the Organisms by Uni-Gram Model
Here, perplexity is used to distinguish the different organisms. Perplexity measures the predictive ability of a language model on a test text: the lower the perplexity, the better the model predicts the text. Let W = w[1], w[2], ..., w[n] denote the sequence of words in the test text. Let ck(i) be the context that the language model chooses for the prediction of the word w[i]. Furthermore, p(w[i] | ck(i)) denotes the probability assigned to the i-th word by the model.
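For a uni-gram model the context is empty, so each amino acid is predicted from its overall frequency alone. The sketch below assumes the standard definition of perplexity, 2 raised to the negative average log2 probability of the test symbols, and add-one smoothing for amino acids not seen in training; the function names, the smoothing choice and the toy sequences are assumptions, not details given in the text.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def train_unigram(sequences, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Estimate smoothed uni-gram probabilities from training sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


def perplexity(model, sequence):
    """Perplexity 2 ** (-(1/n) * sum of log2 p(w[i])) of a test sequence.

    Raises KeyError for symbols outside the model alphabet (e.g. 'X').
    """
    log_sum = sum(math.log2(model[w]) for w in sequence)
    return 2 ** (-log_sum / len(sequence))


# Toy training data invented for illustration.
model_a = train_unigram(["LLLLLLLA", "LLALLSLL"])

# A Leucine-rich test sequence is predicted better (lower perplexity)
# by this model than a Tryptophan-rich one.
print(perplexity(model_a, "LLLL") < perplexity(model_a, "WWWW"))
```

To assign a test sequence to an organism under this scheme, one would train one model per organism and pick the model with the lowest perplexity on the sequence.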
Figure 2-3 The n-gram redundancy comparison of a natural