Biological Language Model. Qiwen Dong
of “C” is 01000000000000000000, and so on. Since protein sequences may contain some unknown amino acids, it should be noted that one more bit is needed to represent the unknown amino acid type in some cases, and the dimension of its binary vector will be 21.11
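The one-hot scheme described above can be sketched as follows. This is a minimal illustration, not the implementation used in the cited studies: the residue ordering (the 20 standard amino acids sorted by one-letter code, which places "C" at the second position as in the text) and the function names `one_hot` and `encode_sequence` are assumptions made for the example.

```python
# Assumed ordering: the 20 standard amino acids sorted by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue, include_unknown=False):
    """Return a 20-dimensional one-hot vector for a residue.

    With include_unknown=True the vector has 21 dimensions and any
    residue outside the 20 standard types sets the last bit.
    """
    dim = len(AMINO_ACIDS) + (1 if include_unknown else 0)
    vec = [0] * dim
    if residue in INDEX:
        vec[INDEX[residue]] = 1
    elif include_unknown:
        vec[-1] = 1
    else:
        raise ValueError(f"non-standard residue: {residue!r}")
    return vec


def encode_sequence(seq, include_unknown=False):
    """Encode a protein sequence as an L x 20 (or L x 21) list of rows."""
    return [one_hot(r, include_unknown) for r in seq]
```

For a sequence of length L, `encode_sequence` returns L rows, each containing exactly one nonzero entry.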
Because one-hot encoding is a high-dimensional and sparse vector representation, a simplified binary encoding method based on conservative replacements through evolution has been proposed.12 Derived from the point accepted mutation (PAM) matrices,13 the 20 standard amino acids are divided into six groups: [H, R, K], [D, E, N, Q], [C], [S, T, P, A, G], [M, I, L, V] and [F, Y, W]. Six-dimensional binary vectors are used to represent amino acids based on their groups. Another low-dimensional binary encoding scheme is the binary 5-bit encoding introduced by White and Seffens.14 Theoretically, a binary 5-bit code can represent 32 (2^5 = 32) possible amino acid types. To represent the 20 standard amino acids, the code consisting of all 0s, the code consisting of all 1s and the codes containing exactly one or exactly four 1s (5 + 5 = 10) are removed, leaving 20 codes (32 − 1 − 1 − 10 = 20). This binary 5-bit encoding replaces the 20-dimensional vector of one-hot encoding with a 5-dimensional binary vector, which may reduce model complexity.5
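Both low-dimensional schemes can be illustrated with a short sketch. The six groups are taken from the text. The 5-bit codes are obtained by enumerating all 32 codes and keeping those with exactly two or three 1s, which reproduces the count of 20. The assignment of a particular code to a particular amino acid is not given here, so the alphabetical assignment below is a purely hypothetical choice, and the names `group_encode` and `FIVE_BIT` are assumptions made for the example.

```python
from itertools import product

# Six conservative-replacement groups, as listed in the text.
GROUPS = ["HRK", "DENQ", "C", "STPAG", "MILV", "FYW"]
GROUP_INDEX = {aa: i for i, members in enumerate(GROUPS) for aa in members}


def group_encode(residue):
    """Return the 6-dimensional binary vector of the residue's group."""
    vec = [0] * len(GROUPS)
    vec[GROUP_INDEX[residue]] = 1  # KeyError for non-standard residues
    return vec


# Enumerate all 2**5 = 32 codes and drop those with 0, 1, 4 or 5 ones,
# which leaves the codes containing exactly two or three ones.
FIVE_BIT_CODES = [bits for bits in product((0, 1), repeat=5)
                  if sum(bits) in (2, 3)]

# Hypothetical assignment (NOT the published mapping): pair the codes
# with the amino acids in alphabetical order of their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
FIVE_BIT = dict(zip(AMINO_ACIDS, FIVE_BIT_CODES))
```

The enumeration confirms the arithmetic in the text: 10 codes with two 1s plus 10 codes with three 1s give the 20 usable codes.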
3.2.2 Physicochemical properties encoding
From the perspective of molecular composition, a typical amino acid contains a central carbon atom (C) to which an amino group (NH2), a hydrogen atom (H), a carboxyl group (COOH) and a side chain (R) are attached. The side chains (R) are usually carbon chains or rings (except for proline) which are attached to various functional groups.5 The physicochemical properties of those components play critical roles in the formation of protein structures and functions; thus, these properties can also be used as features for protein structure and function prediction.15
Among various physicochemical properties, the hydrophobicity of an amino acid is believed to play a fundamental role in organizing the self-assembly of a protein.16 Based on the propensity of the amino acid side chain to be in contact with a polar solvent like water, the 20 amino acids can be classified as either hydrophobic or hydrophilic. The free energy of transferring an amino acid side chain from cyclohexane to water can be used to quantify its hydrophobicity.6 If the free energy is positive, the amino acid is hydrophobic, while negative values indicate hydrophilic amino acids. In protein three-dimensional structures, hydrophobic amino acids are usually buried inside the protein core, while hydrophilic amino acids preferentially cover the protein surface. The hydrophilic amino acids are also called polar amino acids. In a typical biological environment, some polar amino acids carry a charge: lysine (+), histidine (+), arginine (+), aspartate (−) and glutamate (−), while other polar amino acids (asparagine, glutamine, serine, threonine and tyrosine) are neutral.17 A detailed classification of the hydrophobic properties of the 20 standard amino acid side chains is shown in Table 3-1. Besides hydrophobicity, the codon diversity and size of amino acids are also used as features. The codon diversity of an amino acid is reflected by the number of codons coding for the amino acid, and the size of an amino acid denotes its molecular volume.15
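The sign rule and the charge classes above translate directly into simple features. The sketch below is illustrative only. No free-energy values are hard-coded, so the caller must supply them from a published scale. The treatment of a value of exactly zero and the function names are assumptions made for the example.

```python
# Charge of the polar amino acids named in the text (one-letter codes):
# K, H, R are positive; D, E are negative; N, Q, S, T, Y are neutral.
POLAR_CHARGE = {
    "K": +1, "H": +1, "R": +1,
    "D": -1, "E": -1,
    "N": 0, "Q": 0, "S": 0, "T": 0, "Y": 0,
}


def hydrophobicity_class(transfer_free_energy):
    """Classify a side chain by the sign of its cyclohexane-to-water
    transfer free energy (any consistent energy unit)."""
    if transfer_free_energy > 0:
        return "hydrophobic"
    if transfer_free_energy < 0:
        return "hydrophilic"
    # Assumption: the text does not say how a value of zero is treated.
    return "undetermined"


def polar_charge(residue):
    """Return +1, -1 or 0 for the polar residues listed in the text,
    or None for residues the text does not list as polar."""
    return POLAR_CHARGE.get(residue)
```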
Table 3-1 The hydrophobic properties of the 20 standard amino acid side chains.
Some physicochemical property-based amino acid encodings have been proposed in previous studies. Fauchère et al.18 established 15 physicochemical descriptors of side chains for 20 natural and 26 non-coded amino acids which reflect hydrophobic, steric, electronic, and other properties of amino acid side chains. Radzicka and Wolfenden19 obtained digitized indications of the tendencies of amino acids to leave water and enter a truly nonpolar condensed phase in their experiments. Lohman et al.20 represented amino acids by using seven physicochemical properties to predict transmembrane protein sequences, and the properties are hydrophobicity, hydrophilicity, polarity, volume, surface area, bulkiness and refractivity. Atchley et al.15 used multivariate statistical analyses to produce multi-dimensional patterns of attribute covariation for the 20 standard amino acids, which reflect the polarity, secondary structure, molecular volume, codon diversity and electrostatic charge of amino acids.
3.2.3 Evolution-based encoding
The evolution-based encoding methods extract the evolutionary information of residues from sequence alignments or phylogenetic trees to represent amino acids, mainly by using the amino acid substitution probability. These evolution-based encoding methods can be categorized into two groups based on position relevance: position-independent methods and position-dependent methods.
The position-independent methods encode amino acids by using fixed encodings, regardless of the amino acid position in the sequence and the amino acid composition of the sequence. The most commonly used position-independent encoding methods are the PAM matrices and the BLOSUM matrices, and a common flowchart is shown in Fig. 3-1. The point accepted mutation (PAM) matrices represent the replacement probabilities for a change from one amino acid to another in homologous protein sequences,13 and thus focus on the evolutionary process of proteins. The PAM matrices are calculated from protein phylogenetic trees and related protein sequence pairs. The PAM matrices assume that an accepted mutation yields an amino acid similar in physical and chemical properties to the one it replaces, and that the likelihood of amino acid X replacing Y is the same as that of Y replacing X; thus, the PAM matrices are 20 × 20 symmetric matrices in which each row and column represents one of the 20 standard amino acids. Different PAM matrices can be generated for different lengths of evolutionary time. The PAM250 matrix, which describes the amino acid replacements expected after 250 PAM units of evolutionary change (250 accepted point mutations per 100 residues), was found by the authors to be an effective scoring matrix for detecting distant relationships,13 and it is now widely used in related research.21,22 The blocks amino acid substitution matrices (BLOSUM)23 are amino acid substitution matrices derived from conserved regions (blocks) constructed by PROTOMAT24 from non-redundant protein groups. The values in the BLOSUM matrices represent the probabilities that amino acid pairs will exchange places with each other. To reduce the contributions of the most closely related protein sequences, the sequences are clustered within blocks. Different BLOSUM matrices can be generated by using different sequence identity thresholds for clustering, and the BLOSUM62 matrix was reported to perform better overall.23
Figure 3-1 The flowchart of position-independent amino acid encoding methods. First, the target proteins are selected (step 1). Then, the sequence alignments are constructed based on some criteria (step 2). Finally, the mutation matrix is calculated and is regarded as the amino acid encoding (step 3).
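Step 3 of this flowchart can be sketched for a BLOSUM-style matrix. The example below is a simplified illustration under stated assumptions, not the published procedure: it takes one ungapped alignment block, counts amino acid pairs in each column, and converts the observed pair frequencies to log-odds scores against the frequencies expected by chance. Sequence clustering and the scaling and rounding used for the published matrices are omitted, and the function name `substitution_scores` and the toy block are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def substitution_scores(block):
    """Compute symmetric log-odds scores from one ungapped alignment block.

    block: list of equal-length sequences (rows of the alignment).
    Returns a dict mapping (a, b) to log2(observed / expected).
    Pairs never observed in the block receive no score.
    """
    # Count unordered residue pairs within every alignment column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())

    # Background frequency of each residue, derived from the pair counts.
    residue_freq = Counter()
    for (a, b), n in pair_counts.items():
        q = n / total
        if a == b:
            residue_freq[a] += q
        else:
            residue_freq[a] += q / 2
            residue_freq[b] += q / 2

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        # Store both orders so the matrix is symmetric by construction.
        scores[(a, b)] = scores[(b, a)] = math.log2(observed / expected)
    return scores


# Toy block (hypothetical data): four sequences, two aligned columns.
toy_scores = substitution_scores(["AC", "AC", "AC", "CC"])
```

In the toy block, identical pairs are observed more often than expected by chance and receive positive scores, while the A/C mismatch receives a negative score. Each row of the resulting matrix can then serve as the fixed encoding of one amino acid.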
In contrast to the position-independent matrices, the position-dependent methods encode amino acids at different positions by using different encodings, even if the amino acid types are the same. The position-dependent encodings are derived from the multiple sequence alignments (MSAs) of target sequences; the flowchart for this is shown in Fig. 3-2. The position-specific scoring matrix (PSSM) is the most widely used position-dependent encoding method. The PSSM is also called the position weight matrix (PWM), which represents the log-likelihoods of the occurrence probabilities of all possible molecule types at each location in a given biological sequence.25 Generally, the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST)26 is used to execute the sequence alignment and generate the MSA for the target protein sequence. The corresponding PSSM is then calculated from the MSA. For a protein sequence of length L, its PSSM is an L × 20 matrix, in which each row represents the log-likelihoods of the probabilities of the 20 amino acids occurring at the corresponding position. Besides PSI-BLAST, the HMM-HMM alignment algorithm HHblits is also widely used to generate the probability profile, which is more sensitive than the sequence-profile alignment algorithm PSI-BLAST,