Biological Language Model. Qiwen Dong

Qiwen Dong

Xiuzhen Hu

Xiaoyang Jing

Aoying Zhou

Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000905 and by the National Natural Science Foundation of China (Grant Nos. U1401256, U1711262, U1811264, 61672234, 61961032, 31260203 and 61402177).

We would like to thank everyone who contributed to this book and offered valuable suggestions, especially Bin Liu, Ming Gao, Dingjiang Huang and Daocheng Hong. We would also like to express our sincere thanks and appreciation to the people at University Press for their generous help throughout the preparation of this publication.

Contents

East China Normal University Scientific Reports

Preface

Acknowledgments

1. Introduction

1.1 Background and Motivation

1.2 Related Topics

1.3 Organization of the Book Content

References

2. Linguistic Feature Analysis of Protein Sequences

2.1 Motivation and Basic Idea

2.2 Comparative n-gram Analysis

2.3 The Zipf Law Analysis

2.4 Distinguishing the Organisms by Uni-Gram Model

2.5 Conclusions

References

3. Amino Acid Encoding for Protein Sequence

3.1 Motivation and Basic Idea

3.2 Related Work

3.3 Discussion

3.4 The Assessment of Encoding Methods for Protein Secondary Structure Prediction

3.5 Assessments of Encoding Methods for Protein Fold Recognition

3.6 Conclusions

References

4. Remote Homology Detection

4.1 Motivation and Basic Idea

4.2 Related Work

4.3 Latent Semantic Analysis

4.4 Auto-cross Covariance Transformation

4.5 Conclusions

References

5. Structure Prediction

5.1 Motivation and Basic Idea

5.2 Related Work

5.3 Domain Boundary Prediction

5.4 Building Blocks of Protein Local Structure

5.5 Characterization of Protein Flexibility Based on Structural Alphabets

5.6 Novel Nonlinear Knowledge-based Mean Force Potentials

5.7 Conclusions

References

6. Function Prediction

6.1 Motivation and Basic Idea

6.2 Profile-level Interface Propensities for Binding Site Prediction

6.3 Gene Ontology-Based Protein Function Prediction

6.4 Prediction of Protein–Protein Interaction from Primary Sequences

6.5 Identifying the Missing Proteins using the Biological Language Model

6.6 Conclusions

References

7. Summary and Future Perspectives

Index

Chapter 1

Introduction

1.1 Background and Motivation

The sequencing of the human genome was completed in 2003, and the life sciences have since entered the post-genomic era. The focus of research is gradually shifting from accumulating data to interpreting it, i.e. to extracting structural and functional information from sequence data. Post-genomic research includes comparative genomics, structural genomics, functional genomics, proteomics, holistic biology and pharmacogenomics.

The proteome¹ is a dynamic concept: it differs not only among the tissues and cells of the same organism but also changes continually throughout that organism's developmental stages, until the organism dies. Complex patterns of gene expression give rise to a variety of complex life activities. In fact, each activity in the stages of life results from different combinations of specific groups of proteins that appear at different times and in different places. The sequence of the
