Molecular Biotechnology. Bernard R. Glick
of a particular ecosystem or location (Fig. 2.43). The entire library is sequenced using a massively parallel approach and assembled into contigs as described above with the aim of determining the sequence of as many different genomes as possible and identifying both novel gene sequences and those that are similar (homologous) to known gene sequences. For example, a massive study that included 50 ocean samples from locations in the North Atlantic through the Panama Canal to the South Pacific yielded 6.3 billion bp of sequence. Analysis of the assembled and nonassembled sequences indicated that there might be as many as 400 new bacterial species among the samples with about 1 × 10⁶ genes that lack significant sequence similarity with any known gene. The analysis also revealed sequences encoding potentially novel forms of many proteins including proteins for repair of ultraviolet light-induced DNA damage and RuBisCO (ribulose bisphosphate carboxylase), an enzyme that is important for carbon fixation.
Figure 2.43 Construction of metagenomic libraries. Bacteria and/or viruses in samples from various environments or tissues are concentrated before extracting and then fragmenting the DNA. Libraries containing the DNA fragments are sequenced or screened for novel genes.
Genomics
Genome sequence determination is only a first step in understanding an organism. The next steps require identification of the features encoded in a sequence and investigations of the biological functions of the encoded RNA, proteins, and regulatory elements that determine the physiology and ecology of the organism. The area of research that generates, analyzes, and manages the massive amounts of information about genome sequences is known as genomics.
Sequence data are deposited and stored in databases that can be searched using computer algorithms to retrieve sequence information (data mining or bioinformatics). Public databases such as GenBank (National Center for Biotechnology Information, Bethesda, MD), the European Molecular Biology Laboratory Nucleotide Sequence Database, and the DNA Data Bank of Japan receive sequence data from individual researchers and from large sequencing facilities and share the data as part of the International Nucleotide Sequence Database Collaboration. Sequences can be retrieved from these databases via the Internet. Many specialized databases also exist, for example, for storing genome sequences from individual organisms, protein coding sequences, regulatory sequences, sequences associated with human genetic diseases, gene expression data, protein structures, protein-protein interactions, and many other types of data.
One of the first analyses to be conducted on a new genome sequence is the identification of descriptive features, a process known as annotation. Annotated features include protein coding sequences (open reading frames), sequences that encode functional RNA molecules (e.g., rRNA and tRNA), regulatory elements, and repetitive sequences. Annotation relies on algorithms that identify features based on conserved sequence elements such as translation start and stop codons, intron-exon boundaries, promoters, and transcription factor-binding sites, and on similarity to known genes (Fig. 2.44). It is important to note that annotations are often predictions of sequence function based on homology to sequences of known function. In many cases, the function of the sequence remains to be verified through experimentation.
Figure 2.44 Genome annotation utilizes conserved sequence features. Predicting protein coding sequences (open reading frames) in prokaryotes (A) and eukaryotes (B) requires identification of sequences that correspond to potential translation start (ATG or, more rarely, GTG or TTG) and stop (TAA, shown; also TAG or TGA) codons in mRNA. The number of nucleotides between the start and stop codons must be a multiple of three (i.e., triplet codons) and must be a reasonable size to encode a protein. In prokaryotes, a conserved ribosome-binding site (RBS) is often present 4 to 8 nucleotides upstream of the start codon (A). Prokaryotic transcription regulatory sequences such as an RNA polymerase recognition (promoter) sequence and binding sites for regulatory proteins can often be predicted based on similarity to known consensus sequences. Transcription termination sequences are not as readily identifiable but are often GC-rich regions downstream of a predicted translation stop codon. In eukaryotes, protein coding genes typically have several intron sequences in primary RNA that are delineated by GU and AG and contain a pyrimidine-rich tract. Introns are spliced from the primary transcript to produce mRNA (B). Transcription regulatory elements such as the TATA and CAAT boxes that are present in the promoters of many eukaryotic protein coding genes can sometimes be predicted. Sequences that are important for regulation of transcription are often difficult to predict in eukaryotic genome sequences; for example, enhancer elements can be thousands of nucleotides upstream and/or downstream from the coding sequence that they regulate.
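The open reading frame criteria listed in the legend (a start codon, an in-frame stop codon, and a reasonable length) can be illustrated with a minimal sketch. It is not the method used by any particular annotation program. The function name `find_orfs` and the `min_codons` threshold are hypothetical choices made for illustration, and the sketch scans only the forward strand of a prokaryote-like sequence, ignoring ribosome-binding sites and introns.

```python
# Start and stop codons as listed in the figure legend (DNA alphabet).
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) tuples for candidate ORFs on the forward strand.

    start is the 0-based index of the first base of the start codon and
    end is the index one past the last base of the stop codon.
    min_codons is an assumed, arbitrary cutoff for a "reasonable size";
    it counts codons from the start codon up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one triplet at a time in this frame,
        # so any ORF found is automatically a multiple of three in length.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume searching after this stop codon
    return orfs


# A made-up toy sequence: ATG, five GCT codons, then TAA, offset by two bases.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=5))
```

A real annotation pipeline would also scan the reverse complement, weigh codon usage and upstream signals, and, for eukaryotes, model splice sites, so predictions from a scan like this are only candidates.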
Comparison of a genome sequence to other genome sequences can reveal interesting and important sequence features. Comparisons among closely related genomes may reveal polymorphisms and mutations based on sequence differences. Association of specific polymorphisms with diseases can be used to predict, diagnose, and treat human diseases. Traditionally, cancer genetic research has investigated specific genes that were hypothesized to play a role in tumorigenesis based on their known cellular functions, for example, genes encoding transcription factors that control expression of cell division genes. Although important, this candidate gene approach gives an incomplete view of the genetic basis for cancers. Sequencing of tumor genomes and comparing the sequences to those of normal cells have revealed point mutations, copy number mutations, and structural rearrangements associated with specific cancers. For instance, comparison of the genome sequences from acute myeloid leukemia tumor cells and normal skin cells from the same patient revealed eight previously unidentified mutations in protein coding sequences that are associated with the disease. Comparison of the genomes of bacterial pathogens with those from closely related nonpathogens has led to the identification of virulence genes. Unique sequences can be used for pathogen detection, and genes encoding proteins that are unique to a pathogen are potential targets for antimicrobial drugs and vaccine development.
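The simplest kind of difference between two closely related sequences, a point substitution, can be illustrated with a short sketch. It assumes that the two sequences have already been aligned and are the same length. The function name `point_differences` and the example sequences are made up, and the sketch does not detect insertions, deletions, copy number changes, or rearrangements, which require alignment and variant-calling software.

```python
def point_differences(reference, sample):
    """List single-base differences between two pre-aligned, equal-length sequences.

    Returns (position, reference_base, sample_base) tuples with 1-based positions.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(reference.upper(), sample.upper())
    return [(pos, r, s) for pos, (r, s) in enumerate(pairs, start=1) if r != s]


# Hypothetical aligned fragments from a normal cell and a tumor cell.
normal = "ACGTACGT"
tumor = "ACGAACCT"
print(point_differences(normal, tumor))  # [(4, 'T', 'A'), (7, 'G', 'C')]
```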
Genome comparisons among distantly related organisms enable scientists to make predictions about evolutionary relationships. For example, the Genome 10K Project aims to sequence and analyze the genomes of 10,000 vertebrate species, roughly 1 per genus. Comparison of these sequences will contribute to our understanding of the genetic changes that led to the diversity in morphology, physiology, and behavior in this group of animals.
Another goal of genomic analysis is to understand the function of sequence features. Gene function can sometimes be inferred from the pattern of transcription. Transcriptomics is the study of gene transcription profiles either qualitatively, to determine which genes are expressed, or quantitatively, to measure changes in the levels of transcription of genes. Proteomics is the study of the entire protein populations of various cell types and tissues and the numerous interactions among proteins. Some proteins, particularly enzymes, are involved in biochemical pathways that produce metabolites for various cellular processes. Metabolomics aims to characterize metabolic pathways by studying the metabolite profiles of cells. All of these “-omic” subdisciplines of genomics use a genome-wide approach to study the function of biological molecules in cells, tissues, or organisms, at different developmental stages, or under different physiological or environmental conditions.
Transcriptomics
Transcriptomics (gene expression profiling) aims to measure the levels of transcription of genes on a whole-genome basis under a given set of conditions. Transcription may be assessed as a function of medical conditions, as a consequence of mutations, in response to natural or toxic agents, in different cells or tissues, or at different times during biological processes such as cell division or development of an organism. Often, the goal of gene expression studies is to identify the genes that are up- or downregulated in response to a change in a particular condition. Two major experimental approaches for measuring RNA transcript levels on a whole-genome basis are DNA microarray analysis and high-throughput next-generation RNA sequencing.
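Identifying up- and downregulated genes usually comes down to comparing transcript levels between two conditions. The sketch below expresses that comparison as a log2 ratio (fold change). The gene names, the counts, the pseudocount of 1, and the twofold cutoff are all assumptions chosen for illustration. A real analysis would first normalize the data and would use replicate samples and statistical tests rather than a fixed cutoff.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return log2(treated / control) for each gene present in both conditions.

    The pseudocount (an assumed value) avoids division by zero for
    genes with no detected transcripts in one condition.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


def classify(fold_changes, threshold=1.0):
    """Label each gene 'up', 'down', or 'unchanged'.

    A threshold of 1.0 on the log2 scale corresponds to a twofold change.
    """
    labels = {}
    for gene, lfc in fold_changes.items():
        if lfc >= threshold:
            labels[gene] = "up"
        elif lfc <= -threshold:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Made-up transcript counts for three hypothetical genes.
control = {"geneA": 7, "geneB": 31, "geneC": 15}
treated = {"geneA": 31, "geneB": 7, "geneC": 15}
print(classify(log2_fold_changes(control, treated)))
```

Working on the log2 scale makes the measure symmetric: a fourfold increase and a fourfold decrease give values of +2 and -2, respectively.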
DNA Microarrays
A DNA microarray (DNA