Biological Language Model
Qiwen Dong
Xiuzhen Hu
Xiaoyang Jing
Aoying Zhou
Acknowledgments
This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000905) and the National Natural Science Foundation of China (Grant Nos. U1401256, U1711262, U1811264, 61672234, 61961032, 31260203, 61402177).
We would like to thank everyone who contributed to this book and offered valuable suggestions, especially Bin Liu, Ming Gao, Dingjiang Huang and Daocheng Hong. We would also like to express our sincere thanks to the people at University Press for their generous help throughout the publication preparation process.
Contents
East China Normal University Scientific Reports
1.3 Organization of the Book Content
2. Linguistic Feature Analysis of Protein Sequences
2.2 Comparative n-gram Analysis
2.4 Distinguishing the Organisms by Uni-Gram Model
3. Amino Acid Encoding for Protein Sequence
3.4 The Assessment of Encoding Methods for Protein Secondary Structure Prediction
3.5 Assessments of Encoding Methods for Protein Fold Recognition
4.2 Related Work
4.3 Latent Semantic Analysis
4.4 Auto-cross Covariance Transformation
4.5 Conclusions
References
5.2 Related Work
5.3 Domain Boundary Prediction
5.4 Building Blocks of Protein Local Structure
5.5 Characterization of Protein Flexibility Based on Structural Alphabets
5.6 Novel Nonlinear Knowledge-based Mean Force Potentials
5.7 Conclusions
References
6.2 Profile-level Interface Propensities for Binding Site Prediction
6.3 Gene Ontology-Based Protein Function Prediction
6.4 Prediction of Protein–Protein Interaction from Primary Sequences
6.5 Identifying the Missing Proteins using the Biological Language Model
6.6 Conclusions
References
7. Summary and Future Perspectives
Chapter 1
Introduction
1.1 Background and Motivation
The sequencing of the human genome was completed in 2003, and the life sciences have since entered the post-genome era. The focus of research is gradually shifting from accumulating data to interpreting it, i.e., to extracting structural and functional information from sequence data. Post-genome research includes comparative genomics, structural genomics, functional genomics, proteomics, holistic biology and pharmacogenomics.
The proteome is a dynamic concept: it differs not only among the tissues and cells of a single organism but also changes continually across that organism's developmental stages until its death. Complex patterns of gene expression give rise to a variety of complex life activities. In fact, each form of activity at each stage of life results from a different combination of specific proteins that appear at different times and in different locations. The sequence of the