Biological Language Model. Qiwen Dong

as demonstrated by Remmert et al.27

      Figure 3-2 The flowchart of position-dependent amino acid encoding methods. First, the target protein sequence is selected (step 1). Then, multiple sequence alignments are constructed by searching the protein sequence database (steps 2 and 3). Finally, position weights are calculated column by column and used as the encodings of the corresponding amino acids (step 4).
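
      As a rough illustration of step 4, the sketch below computes per-column amino acid frequencies from a small alignment. The function name `column_frequencies`, the pseudocount, and the toy alignment are assumptions made for illustration; practical position-dependent encodings such as PSSM typically use log-odds scores against background frequencies, which are omitted here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Return one 20-dimensional frequency vector per alignment column.

    `alignment` is a list of equal-length aligned sequences; gap
    characters and non-standard residues are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        counts = Counter(
            row[col] for row in alignment if row[col] in AMINO_ACIDS
        )
        # Pseudocounts keep unseen residues from getting zero weight.
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]
        )
    return profile


# Hypothetical toy alignment; the first row plays the target sequence.
toy_alignment = ["MKV", "MRV", "MKI", "M-V"]
profile = column_frequencies(toy_alignment)
```

      Each row of `profile` then serves as the encoding of the residue at that position of the target sequence, so the same amino acid can receive different encodings at different positions.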

       3.2.4 Structure-based encoding

      The structure-based amino acid encoding methods, also called statistical-based methods, encode amino acids using structure-related statistical potentials, mainly inter-residue contact energies.28 The basic assumption is that, over a large number of native protein structures, the average potentials of inter-residue contacts reflect the differences in interaction between residue pairs,29 which play an important role in the formation of protein backbone structures.28 The inter-residue contact energies of the 20 amino acids are usually estimated from amino acid pairing frequencies in native protein structures.28 The typical procedure for calculating the contact energies comprises three steps. First, a protein structure set is constructed from known native protein structures. Then, the inter-residue contacts of the 20 amino acids observed in those structures are counted. Finally, the contact energies are deduced from the amino acid contact frequencies using a predefined energy function; different contact energies reflect different contact potentials of amino acids in native structures.
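
      The counting and energy steps can be sketched as follows. This is a minimal example that assumes the contacts have already been extracted from a structure set and that uses a plain log-odds (inverse Boltzmann) form as the energy function; it is not the specific energy function of any of the cited studies. The function name `contact_energies` and the toy contact list are hypothetical.

```python
import math
from collections import Counter


def contact_energies(contacts):
    """Estimate pairwise contact energies (in kT units) from contacts.

    `contacts` is a list of (residue_a, residue_b) pairs observed in a
    set of structures. The energy of a pair is -ln(observed / expected),
    where the expected count assumes residues pair at random according
    to how often each residue type takes part in any contact.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)

    # How often each residue type appears at either end of a contact.
    end_counts = Counter()
    for a, b in contacts:
        end_counts[a] += 1
        end_counts[b] += 1

    n_contacts = len(contacts)
    n_ends = 2 * n_contacts

    energies = {}
    for (a, b), observed in pair_counts.items():
        f_a = end_counts[a] / n_ends
        f_b = end_counts[b] / n_ends
        # An unordered pair of two different types can arise two ways.
        multiplicity = 1 if a == b else 2
        expected = n_contacts * f_a * f_b * multiplicity
        energies[(a, b)] = -math.log(observed / expected)
    return energies


# Hypothetical toy contact list standing in for a real structure set.
toy_contacts = [
    ("L", "L"), ("L", "L"), ("L", "V"),
    ("K", "E"), ("K", "E"), ("K", "L"),
]
energies = contact_energies(toy_contacts)
```

      Pairs observed more often than expected receive negative (favorable) energies, and pairs observed less often receive positive ones. Only pairs that occur at least once are returned; unobserved pairs would need a pseudocount or a larger structure set.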

      Many previous studies have focused on structure-based encodings. To account for the medium- and long-range interactions that determine protein folding conformations, Tanaka and Scheraga28 evaluated empirical standard free energies for amino acid contacts from the contact frequencies. Employing a lattice model, Miyazawa and Jernigan29 estimated contact energies using the quasi-chemical approximation with an approximate treatment of the effects of chain connectivity. Later, they reevaluated the contact energies based on a larger set of protein structures and also estimated an additional repulsive packing energy term to provide an estimate of the overall energies of inter-residue interactions.30 To investigate the validity of the quasi-chemical approximation, Skolnick et al.31 estimated the expected number of contacts using two reference states: the first treats the protein as a Gaussian random coil polymer, and the second includes the effects of chain connectivity, secondary structure and chain compactness. The comparison showed that the quasi-chemical approximation is, in general, sufficient for extracting the amino acid pair potentials. To recognize native-like protein structures, Simmons et al.32 used distance-dependent statistical contact potentials to develop energy functions. Zhang and Kim33 estimated 60 residue contact energies that mainly reflect hydrophobic interactions and show strong dependence on the three secondary structural states. According to their test results, these energies were effective in threading and three-dimensional contact prediction. Later, Cristian et al. set up an iterative scheme to extract the optimal interaction potentials between the amino acids.34

       3.2.5 Machine-learning encoding

      Unlike the earlier manually defined encoding methods, machine-learning-based encoding methods learn amino acid encodings from protein sequence or structure data, typically using artificial neural networks. To reduce the complexity of the model, the neural network for learning amino acid encodings shares its weights across the 20 amino acids. In general, the neural network contains three layers: the input layer, the hidden layer and the output layer. The input layer corresponds to the original encoding of the target amino acid, which can be one-hot encoding, physicochemical encoding, etc. The output layer likewise corresponds to the original encoding of the related amino acids. The hidden layer, which represents the new encoding of the target amino acid, usually has a lower dimension than the original encoding.
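
      To make the role of the hidden layer concrete, the sketch below shows only the forward pass from a one-hot input to a low-dimensional hidden layer, with one weight matrix shared across all positions of a sequence window. It assumes a linear hidden layer, a hidden size of 3, and random untrained weights; the names `init_shared_weights`, `hidden_encoding` and `encode_window` are hypothetical, and the output layer and training loop are left out.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(aa):
    """Return the 20-dimensional one-hot vector of an amino acid."""
    return [1.0 if aa == x else 0.0 for x in AMINO_ACIDS]


def init_shared_weights(hidden_dim=3, seed=0):
    """Create one 20 x hidden_dim weight matrix (random, untrained)."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        for _ in AMINO_ACIDS
    ]


def hidden_encoding(aa, weights):
    """Linear hidden-layer activation for a one-hot input.

    With a one-hot input this selects the weight row of that residue,
    so after training each row is the learned encoding of one residue.
    """
    x = one_hot(aa)
    hidden_dim = len(weights[0])
    return [
        sum(x[i] * weights[i][j] for i in range(len(x)))
        for j in range(hidden_dim)
    ]


def encode_window(window, weights):
    """Encode every residue in a window with the SAME weight matrix."""
    encoded = []
    for aa in window:
        encoded.extend(hidden_encoding(aa, weights))
    return encoded


weights = init_shared_weights()
window_code = encode_window("MKV", weights)  # 3 residues x 3 dimensions
```

      Because the matrix is shared, the number of parameters does not grow with the window length, and a three-residue window is represented by 9 numbers instead of 60.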

      To our knowledge, the earliest concept of learning-based amino acid encodings was proposed by Riis and Krogh.35 To reduce the redundancy of one-hot encoding, they used a 20 × 3 weight-sharing neural network to learn a 3-dimensional real-valued representation of the 20 amino acids from one-hot encoding. Later, Jagla and Schuchhardt36 also used a weight-sharing artificial neural network to learn a 2-dimensional encoding of amino acids for human signal peptide cleavage site recognition. Meiler et al.37 used a symmetric neural network to learn reduced representations of amino acids from amino acid physicochemical and statistical properties. The parameter representations were reduced from five and seven dimensions, respectively, to 1, 2, 3 or 4 dimensions, and these reduced representations were then used for ab initio prediction of protein secondary structure. Lin et al.8 used an artificial neural network to derive encoding schemes of amino acids from protein three-dimensional structure alignments, with each amino acid described by the values taken from the hidden units of the neural network.

      In recent years, several new machine-learning-based encoding methods9,38,39 have been proposed with reference to distributed word representation in natural language processing. In natural language processing, the distributed representation of words has proven to be an effective strategy in many tasks.40 The basic assumption is that words sharing similar contexts have similar meanings; these methods therefore train a neural network model either to predict the context words from the target word or to predict the target word from its context words. After training on unlabeled datasets, the weights of the hidden units for each word are used as its distributed representation. In protein-related studies, a similar strategy has been used by treating protein sequences as sentences and amino acids or sub-sequences as words. In previous studies, these distributed representations of amino acids or sub-sequences have shown potential in protein family classification and disordered protein identification,9 protein function site prediction,38 protein functional property prediction,39 etc.
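
      The "sequences as sentences" idea can be illustrated by generating the (target, context) training pairs that a model predicting context words from a target word would consume. This is a sketch under stated assumptions: words are overlapping sub-sequences of length 3 and the context window is 2 words on each side, both chosen only for illustration; the cited methods may split sequences differently. The function names are hypothetical.

```python
def kmer_words(sequence, k=3):
    """Split a protein sequence into overlapping k-residue 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(words, window=2):
    """List (target, context) pairs within `window` words of a target."""
    pairs = []
    for i, target in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, words[j]))
    return pairs


# Hypothetical toy sequence treated as one 'sentence'.
words = kmer_words("MKVLA", k=3)
pairs = skipgram_pairs(words, window=2)
```

      A neural network trained on such pairs from a large unlabeled sequence set would then supply one hidden-weight vector per word as its distributed representation.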

      In this section, we discuss amino acid encoding methods from a theoretical perspective. First, we investigate the classification criteria of amino acid encoding methods; second, we discuss the theoretical basis of these methods and analyze their advantages and limitations. Finally, we review and discuss the criteria for measuring an amino acid encoding method.

      As introduced above, amino acid encoding methods have been divided into five categories according to their information sources and methodologies. However, the methods in one category are not completely different from those in others, and there are some similarities between encoding methods belonging to different categories. For example, the 6-bit one-hot encoding method proposed by Wang et al.12 is a dimension-reduced representation of the common one-hot encoding, but it is based on the six amino acid exchange groups derived from PAM matrices.13 Another classification criterion is based on position relevance. As discussed in an earlier section, evolution-based encoding methods are divided into two categories: position-independent methods and position-dependent methods. All amino acid encoding methods can be grouped in the same way. The position-specific scoring matrix (PSSM) and similar encodings that extract evolutionary features from multiple sequence alignments are position-dependent; all other amino acid encoding methods are position-independent. The position-dependent methods can capture homologous information, while the position-independent ones reflect the basic properties of amino acids. To some extent, these two types of methods are complementary. In practice, a combination of position-independent and position-dependent encodings is often used, such as combining one-hot and PSSM,41 combining physicochemical property encoding and PSSM,42 etc.
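
      One simple way to combine the two kinds of encoding is to concatenate them residue by residue, as in the sketch below. It assumes the PSSM is already available as one 20-value row per residue; the function name `combine_encodings` and the placeholder PSSM values are hypothetical, and the cited studies may combine the features differently.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(aa):
    """Position-independent encoding: 20-dimensional one-hot vector."""
    return [1.0 if aa == x else 0.0 for x in AMINO_ACIDS]


def combine_encodings(sequence, pssm):
    """Concatenate one-hot and PSSM features for each residue.

    `pssm` holds one row of 20 scores per residue of `sequence`
    (the position-dependent part). Each output row has 40 values.
    """
    if len(sequence) != len(pssm):
        raise ValueError("PSSM must have one row per residue")
    return [one_hot(aa) + list(row) for aa, row in zip(sequence, pssm)]


# Placeholder PSSM rows, not real scores.
toy_pssm = [[0.0] * 20, [1.0] * 20]
features = combine_encodings("MK", toy_pssm)
```

      The first 20 values of each row depend only on the residue type, while the last 20 depend on its position in the alignment, so the two parts carry complementary information.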

      Theoretically, the functions of a protein are closely related to its tertiary structure, and its tertiary structure is mostly determined by the physicochemical properties of its amino acid sequence.43 From

