Molecular Biotechnology. Bernard R. Glick
Figure 2.40 Features of an adaptor used for preparation and sequencing of genomic DNA fragments. A single 3′ deoxythymidine monophosphate overhang facilitates ligation of the adaptor to the ends of A-tailed genomic DNA fragments (Fig. 2.39D). The adaptor has a sequence that anneals to an oligonucleotide primer which captures and amplifies the genomic sequence on a solid support, a sequence that anneals to a sequencing primer that primes the sequencing reaction, and a unique barcode sequence that is used to tag a genomic library when multiple libraries are combined for sequencing (multiplexing).
High-Throughput Next-Generation Sequencing
Most of the current commercially available, high-throughput next-generation sequencing strategies use PCR to generate clusters containing millions of copies of each DNA sequencing template. In one strategy, single-stranded sequencing templates (denatured library fragments) are captured by hybridization via the adaptor sequence that is complementary to oligonucleotides covalently bound to a solid surface, such as a glass slide (Fig. 2.41A). The oligonucleotides also act as primers for DNA polymerase to synthesize the strand complementary to the captured template strand. The resulting double-stranded DNA is denatured and the original template is washed away. The remaining strand is anchored to the solid support via the bound oligonucleotide primer at one end and at the other end carries the adaptor sequence that is complementary to an oligonucleotide primer that is adjacent on the glass slide, to which it hybridizes (Fig. 2.41B). Following extension of the second primer (bridge PCR) and denaturation of the double-stranded product, two single-stranded DNA molecules are anchored to the solid support (Fig. 2.41C). The process is repeated many times to generate clusters of about a thousand copies of each sequencing template (Fig. 2.41D).
Figure 2.41 Generation of clusters of sequencing templates. Denatured genomic DNA library fragments are captured on a glass slide by annealing to a bound oligonucleotide via a complementary adaptor sequence (A). The oligonucleotide primes the synthesis of the complementary strand. The resulting double-stranded DNA is denatured and the original library fragment is washed away. The strand that remains anchored to the glass slide at one end binds to an adjacent oligonucleotide primer by the adaptor sequence at the other end (B). The complementary strand is synthesized by extension of the second primer in a process known as bridge amplification. Denaturation of the double-stranded product results in two single-stranded DNA molecules that are bound to the glass slide (C). The process is repeated many times to generate clusters of about a thousand copies of each sequencing template (D).
The nucleotide sequence of each template may be acquired by addition of a sequencing primer that is complementary to an adaptor sequence, DNA polymerase, and four differentially labeled fluorescent reversible chain terminators (Fig. 2.37). A single sequencing cycle consists of addition of these reagents to each cluster containing the PCR-amplified copies of a fragment of genomic DNA and then capturing the fluorescent signal generated by addition of a single nucleotide (as described above). The spectrum of fluorescence corresponds to the nucleotide added, and the cycle is repeated to determine the sequence of 50 to 150 nucleotides from one or both ends of each template. This process occurs simultaneously for hundreds of millions of clusters anchored to the solid support.
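The idea that the fluorescence captured in each cycle identifies the nucleotide added can be illustrated with a toy base caller. This is only a minimal sketch: the function name `call_bases`, the channel order A, C, G, T, and the intensity values are hypothetical, and real instruments apply signal corrections that are not modeled here.

```python
BASES = "ACGT"  # assumed channel order; one fluorescent label per nucleotide

def call_bases(cycles):
    """Return the read implied by per-cycle fluorescence intensities.

    `cycles` is a list with one entry per sequencing cycle; each entry holds
    four intensities in the order A, C, G, T. The base whose channel gives
    the strongest signal is taken as the nucleotide added in that cycle.
    """
    read = []
    for intensities in cycles:
        if len(intensities) != len(BASES):
            raise ValueError("each cycle needs exactly four intensities")
        strongest = max(range(len(BASES)), key=lambda i: intensities[i])
        read.append(BASES[strongest])
    return "".join(read)

# Hypothetical intensities for one cluster over four cycles.
example_cycles = [
    (0.90, 0.05, 0.03, 0.02),
    (0.04, 0.08, 0.85, 0.03),
    (0.02, 0.03, 0.05, 0.90),
    (0.10, 0.80, 0.06, 0.04),
]
print(call_bases(example_cycles))  # AGTC
```

On an instrument, the same decision is made in parallel for every cluster on the solid support, one nucleotide position per cycle.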
The accuracy of the sequence is assessed using a Phred quality score (Q), which indicates the probability (P) that a base is identified incorrectly, as described in the following equation: Q = −10 log₁₀ P. For example, a Q score of 30 (Q30) indicates that there is a 1 in 1,000 chance that a base is called incorrectly, or that the base call accuracy is 99.9%.
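The relationship between Q and P can be checked numerically. The following is a small sketch of the equation and its inverse; the function names are illustrative only and do not come from any particular software package.

```python
import math

def phred_from_error_prob(p):
    """Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)

def error_prob_from_phred(q):
    """Inverse of the equation above: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# Error probability and base call accuracy for a few common Q scores.
for q in (10, 20, 30):
    p = error_prob_from_phred(q)
    print(f"Q{q}: P = {p:g}, accuracy = {100 * (1 - p):g}%")
```

Because the scale is logarithmic, each increase of 10 in Q corresponds to a 10-fold decrease in the probability of an incorrect base call.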
Genome Sequence Assembly
A genome sequence can be assembled by aligning the sequences of DNA reads with sequences from a previously determined and highly related (reference) genome. For example, reads from resequenced human genomes, that is, genomes from different individuals, are mapped to a reference human genome. Alternatively, when a reference sequence is not available, the reads can be assembled de novo using a computer program that aligns the matching ends of different reads. The process of generating successive overlapping sequences produces long, contiguous stretches of nucleotides called contigs. The presence of repetitive sequences in a genome can result in erroneous matching of overlapping sequences. This problem can be overcome by using the sequences from both ends of a DNA fragment (paired-end reads), which are a known distance apart (when genomic DNA fragments are size selected prior to sequencing), to order and orient the reads and to assemble the contigs into larger scaffolds (Fig. 2.42). Many overlapping reads are required to ensure that the nucleotide sequence is accurate and assembled correctly. Each nucleotide site in a genome is generally sequenced many times from different fragments. The extent of sequencing redundancy, called coverage or depth of coverage, varies from 10-fold to more than 100-fold, depending on the error rate of the sequencing method, the read length (shorter reads require greater coverage), the complexity of the genome, the assembly method, and the goal of the sequencing project. The assembly process generates a draft sequence; however, small gaps may remain between contigs. Although a draft sequence is sufficient for many purposes, for example, in resequencing projects that map a sequence onto a reference genome, in some cases it is preferable to close the gaps to complete the genome sequence. For de novo sequencing of genomes from organisms that lack a reference genome, gap closure is desirable.
The gaps can be closed by PCR amplification of high-molecular-weight genomic DNA across each gap, followed by sequencing of the amplification product, or by obtaining short sequences from primers designed to anneal to sequences adjacent to a gap. Sequencing of additional libraries containing fragments of different sizes may be required to complete the overall sequence.
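The depth of coverage discussed above is commonly estimated as the total number of nucleotides sequenced divided by the size of the genome. The sketch below applies that standard estimate; the function names and the example figures (a 5-million-base-pair genome and 150-nucleotide reads) are hypothetical and chosen only for illustration.

```python
import math

def mean_coverage(num_reads, read_length, genome_size):
    """Average number of times each nucleotide site is sequenced."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

# Hypothetical project: 5,000,000-bp genome, 150-nucleotide reads, 30-fold coverage.
genome_size = 5_000_000
read_length = 150
print(reads_needed(30, read_length, genome_size))            # 1000000
print(mean_coverage(1_000_000, read_length, genome_size))    # 30.0
```

This is an average only. Actual coverage varies from site to site, which is one reason some regions remain as gaps even when the mean coverage is high.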
Figure 2.42 Genome sequence assembly. Sequence data generated from both ends of a DNA fragment are known as paired ends (paired ends are shown in blue for each fragment, and the distance between them is represented by a thin, black line). A large number of reads are generated and assembled into longer contiguous sequences (contigs) using a computer program that matches overlapping sequences. Paired ends help to determine the order and orientation of contigs as they are assembled into scaffolds. Shown is a scaffold consisting of three contigs.
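The matching of overlapping read ends to build contigs can be sketched with a simple greedy procedure that repeatedly merges the two sequences with the longest overlap. This is a toy illustration under strong assumptions, not the method used by any particular assembly program: the reads are error free, all come from the same strand, and contain no repeats, and the function names and example reads are hypothetical.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` nucleotides exists.
    """
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Merge reads into contigs, always joining the best-overlapping pair."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_length, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(left, right, min_overlap)
                if length > best_length:
                    best_length, best_i, best_j = length, i, j
        if best_length == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# Three hypothetical reads that overlap end to end.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

If the reads do not all overlap, the function returns more than one contig. Ordering and orienting such contigs into scaffolds is the step for which paired-end information is needed, and it is not modeled in this sketch.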
Sequencing Metagenomes
For more than 100 years, the identification of microorganisms and characterization of their biological functions required cultivating each strain in the laboratory. In the 1990s, with the emergence of techniques to extract DNA directly from environmental samples such as soil and seawater, researchers began to examine the sequence diversity of bacteria using the universal 16S ribosomal RNA gene as a taxonomic marker. These studies revealed that fewer than 1% of all bacterial species could be cultured, and therefore, novel genes that might be of considerable interest for basic and applied research were inaccessible using methods that depended on growth of bacteria in the laboratory. Considering the wealth of biotechnologically important genes and proteins that had been obtained from the relatively few culturable microorganisms, the possibility of harvesting useful genes from the much greater number of unculturable microorganisms was exciting, if not daunting. With the development of high-throughput next-generation sequencing and algorithms for assembling genome sequences, it has become possible to access the genomes of uncultured organisms from complex environmental and clinical samples. The study of the collective genomes in these samples is known as metagenomics.
The primary objective of a metagenomic project is to construct a comprehensive DNA library from all