and h_fre is the corresponding amino acid emission frequency; h_fre is set to 0 if the original entry is an asterisk.
Table 3-2 A brief introduction to the 16 selected amino acid encoding methods.
3.4.2 Benchmark datasets for protein secondary structure prediction
Following several representative protein secondary structure prediction works11,42,51 published in recent years, we use the CullPDB dataset52 as training data and four widely used test sets — the CB513 dataset,53 the CASP10 dataset,54 the CASP11 dataset55 and the CASP12 dataset56 — to evaluate the performance of different features. The CullPDB dataset is a large non-homologous sequence set produced with the PISCES server,52 which culls subsets of protein sequences from the Protein Data Bank according to sequence identity and structural quality criteria. Here, we retrieved a subset of sequences whose structures are resolved at better than 1.8 Å resolution and that share less than 25% sequence identity with each other. We also removed sequences sharing more than 25% identity with any sequence in the test datasets to ensure there is no homology between the training and test data; the final CullPDB dataset contains 5748 protein sequences with lengths ranging from 18 to 1455. The CB513 dataset contains 513 proteins with less than 25% sequence similarity. The Critical Assessment of techniques for protein Structure Prediction (CASP) is a highly recognized community experiment to determine state-of-the-art methods for protein structure prediction from amino acid sequences56; the recently released CASP10, CASP11 and CASP12 datasets are adopted as test datasets. It should be noted that the CASP targets used here are defined at the protein domain level. Specifically, the CASP10 dataset contains 123 protein domains with sequence lengths ranging from 24 to 498, the CASP11 dataset contains 105 protein domains with sequence lengths ranging from 34 to 520, and the CASP12 dataset contains 55 protein domains with sequence lengths ranging from 55 to 463.
Protein secondary structure labels are inferred with the DSSP program57 from the corresponding experimentally determined structures. DSSP assigns one of 8 secondary structure states to each residue; here, we adopt 3-state secondary structure prediction as the benchmark task by converting the 8 assigned states into 3 states: G, H, and I to H; B and E to E; and S, T, and C to C.
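To make the mapping explicit, the following minimal Python sketch converts a string of 8-state DSSP labels into the 3-state alphabet used in this benchmark; the function name and the fallback of treating unassigned residues as coil are our own assumptions.

```python
# Minimal sketch of the 8-to-3 state reduction described above:
# G/H/I -> H, B/E -> E, S/T/C -> C.
EIGHT_TO_THREE = {
    "G": "H", "H": "H", "I": "H",   # helix-like states
    "B": "E", "E": "E",             # strand-like states
    "S": "C", "T": "C", "C": "C",   # coil-like states
}

def to_three_state(dssp_labels: str) -> str:
    """Convert per-residue 8-state DSSP labels to 3-state labels."""
    # Residues without an assignment (e.g. blanks) are treated as coil here,
    # which is an assumption rather than part of the original protocol.
    return "".join(EIGHT_TO_THREE.get(s, "C") for s in dssp_labels)

print(to_three_state("HHHGGTTEEBSC"))  # -> HHHHHCCEEECC
```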
3.4.3 Performance comparison by using the Random Forests method
In order to exploit the information of neighboring residues, many previous protein secondary structure prediction methods apply a sliding window scheme and have demonstrated considerably good results.48 Following those methods, we also use the sliding window scheme to evaluate the different amino acid encoding methods; the corresponding diagram is shown in Fig. 3-3. The evaluation is based on the Random Forests method from the Scikit-learn toolbox,58 with a window size of 13 and 100 trees in the forest. The comparison results are shown in Table 3-3.
Figure 3-3 The diagram of the sliding window scheme using the Random Forests classifier for protein secondary structure prediction. The two target residues are Leu (L) and Ile (I), respectively; the input for each target residue is independent.
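To make the evaluation setup concrete, the sketch below illustrates the sliding-window scheme with scikit-learn's RandomForestClassifier, using the window size (13) and number of trees (100) stated above. The per-residue encoding matrix, the zero-padding at the sequence ends, the toy data and all variable names are illustrative assumptions rather than details of the original experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

WINDOW = 13           # window size used in the text
HALF = WINDOW // 2

def window_features(encoding: np.ndarray) -> np.ndarray:
    """Stack the encodings of the WINDOW residues centred on each position."""
    n_res, dim = encoding.shape
    # Pad both ends with zero vectors so terminal residues also get a full window.
    padded = np.vstack([np.zeros((HALF, dim)), encoding, np.zeros((HALF, dim))])
    return np.array([padded[i:i + WINDOW].ravel() for i in range(n_res)])

# Toy example: one "protein" of 50 residues with a 20-dimensional per-residue encoding.
rng = np.random.default_rng(0)
encoding = rng.random((50, 20))
labels = rng.choice(list("HEC"), size=50)   # per-residue 3-state labels

X = window_features(encoding)               # shape: (50, 13 * 20)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)
print(clf.predict(X[:5]))
```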
First, we analyze and discuss the performance of different methods within the same category. Among the binary encoding methods, one-hot encoding is the most widely used; the one-hot (6-bit) encoding and the binary 5-bit encoding are two dimension-reduced representations of it. As can be seen from Table 3-3, the best performance is achieved by the one-hot encoding, which indicates that some effective information is lost during the artificial dimension reduction of the one-hot (6-bit) and binary 5-bit encodings. For the physicochemical property encodings, the hydrophobicity matrix contains only hydrophobicity-related information and performs poorly, while the Meiler parameters and the Atchley factors are constructed from multiple physicochemical information sources and perform better. This shows that integrating multiple physicochemical properties and parameters is valuable. For the evolution-based encodings, the position-dependent encodings (PSSM and HMM) are clearly much more powerful than the position-independent encodings (PAM250 and BLOSUM62), which shows that homology information is strongly associated with protein structure. The two structure-based encodings have comparable performances. For the machine-learning encodings, ANN4D performs better than AESNN3 and ProtVec, while the ProtVec-3mer encoding achieves performance similar to that of ProtVec. Second, on the whole, the position-dependent evolution-based encoding methods (PSSM and HMM) achieve the best performance. This result suggests that the evolutionary information extracted from the MSAs is more conserved than the global information extracted from other sources. Third, the performances of the different encoding methods show a certain degree of correlation with the encoding dimension: the low-dimensional encodings, i.e. the one-hot (6-bit), binary 5-bit and two of the machine-learning encodings, perform worse than the high-dimensional encodings. This correlation could be due to the sliding window scheme and the Random Forests algorithm; a larger feature dimension is more conducive to recognizing the secondary structure states, but an excessively large dimension leads to poor performance (ProtVec and ProtVec-3mer).
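For reference, the sketch below shows the standard 20-dimensional one-hot encoding discussed above (the 6-bit and 5-bit variants compress these vectors into fewer dimensions); the alphabet ordering and the all-zero treatment of non-standard residues are our own assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return an (L, 20) matrix with a single 1 per residue."""
    enc = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        if aa in AA_INDEX:          # non-standard residues are left as all-zero rows
            enc[pos, AA_INDEX[aa]] = 1.0
    return enc

print(one_hot_encode("MKLI").shape)  # -> (4, 20)
```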
Table 3-3 Protein secondary structure prediction accuracy of 16 amino acid encoding methods by using the Random Forests method.
3.4.4 Performance comparison by using the BRNN method
In recent years, deep learning-based methods for protein secondary structure prediction have achieved significant improvements.48 One of the most important advantages of deep learning methods is that they can capture both neighboring and long-range interactions, which avoids the shortcomings of sliding window methods with a handcrafted window size. For example, Heffernan et al.42 achieved state-of-the-art performance by using long short-term memory (LSTM) bidirectional recurrent neural networks. Therefore, to exclude the potential influence of the handcrafted window size, we also perform an assessment using bidirectional recurrent neural networks (BRNN) with long short-term memory cells. The model used here is similar to the model used in Heffernan's work,42 as shown in Fig. 3-4: it contains two BRNN layers with 256 LSTM cells each and two fully connected (dense) layers with 1024 and 512 nodes, and it is implemented with the open-source deep learning library TensorFlow.59
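A minimal TensorFlow/Keras sketch of a model in the spirit of this architecture is given below: two bidirectional LSTM layers with 256 cells each, followed by two dense layers with 1024 and 512 nodes and a 3-state softmax output. The Keras API calls, the assumed per-residue feature dimension, the TimeDistributed wrappers, and the optimizer and loss are our own choices, not necessarily those of the original implementation.

```python
import tensorflow as tf

FEATURE_DIM = 20   # per-residue encoding dimension (assumed)
NUM_CLASSES = 3    # H, E, C

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, FEATURE_DIM)),  # variable-length sequences
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1024, activation="relu")),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(512, activation="relu")),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```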
The corresponding comparison results of the 16 selected encoding methods are shown in Table 3-4. Overall, the BRNN-based method achieves better performance than the Random Forests-based method, but there are also some specific similarities and differences between them. For the binary encoding methods, one-hot encoding still shows the best performance, which once again confirms the information loss of the one-hot (6-bit) and binary 5-bit encodings. For the physicochemical property encodings, the Meiler parameters do not perform as well as the Atchley factors, suggesting that the Atchley factors are more efficient for deep learning methods. For the evolution-based encodings, the PSSM encoding achieves the best accuracy, while the HMM encoding achieves only about the same accuracy as the position-independent encodings (PAM250 and BLOSUM62). The difference could be due to the different levels of homologous sequence identity: the HMM encoding is extracted from the UniProt20 database with 20% sequence identity, while the PSSM encoding is extracted from the UniRef90 database with 90% sequence identity. Therefore, for a certain protein sequence, its MSA from the UniProt20 database mainly