Principles of Virology. Jane Flint
Today the same result could be achieved in less than one hour.
The difference is a consequence of the development of second- and third-generation sequencing methods, spurred by the desire to sequence larger and larger virus and cell genomes. These methods were originally called next-generation sequencing, because they followed the very first sequencing methods. The first of these new methods to be developed, 454 sequencing, was released in 2005 and could produce 200,000 reads of 110 base pairs. Other technologies soon followed (Solexa/Illumina, SOLiD, and Ion Torrent) that generated larger numbers of reads, although each read was much shorter. These technologies relied on amplification of the target DNA and optical detection of incorporated fluorescent nucleotides. Third-generation sequencing methods can not only detect single molecules (i.e., amplification is not required) but also carry out sequencing in real time. PacBio instruments can achieve maximum read lengths of 20 kb, and those from Illumina can generate 1.8 terabytes of sequence per run. The latter reduces the cost of sequencing a human genome to below $1,000, a 10,000-fold reduction in price since 2004, when the first human genome was deciphered.
These technologies have not only made sequencing of DNA cheaper and faster but also helped create innovative experimental approaches to study genome organization, function, and evolution. Their use has led to the discovery of new viruses and has given birth to the field of metagenomics, the analysis of sequences directly from clinical or environmental samples. These sequencing technologies can be used to study the virome, the genomes of all viruses in a specific environment, such as sewage, the human body, or the intestinal tract. While these virus detection technologies are extremely powerful, the results obtained must be interpreted with caution. It is very easy to detect traces of a viral contaminant when searching for new agents of human disease (Box 2.9).
Metagenomics is not limited to DNA viruses. Nucleic acids extracted from clinical or environmental samples may be treated with DNase, and the remaining RNAs converted to DNA with reverse transcriptase for sequencing and identification.
EXPERIMENTS
Pathogen de-discovery
High-throughput sequencing of nucleic acids has accelerated the pace of virus discovery, but at a cost: contaminants are much easier to detect.
During a search for the causative agent of seronegative hepatitis (disease not caused by hepatitis A, B, C, D, or E virus) in Chinese patients, a new virus with a single-stranded DNA genome was discovered in sera by high-throughput sequencing. Seventy percent of 90 patient serum samples were positive for viral DNA by PCR, and sera from 45 healthy controls were negative. Furthermore, 84% of patients were positive for antibodies against the virus. Among healthy controls, 78% were antibody positive. The authors concluded that this virus was highly prevalent in some patients with seronegative hepatitis. A second independent laboratory identified the same virus in sera from patients in the United States with non-A-to-E hepatitis, while a third group identified the virus in diarrheal stool samples from Nigeria.
The first clue that something was amiss was the observation that the virus genomes identified in all three laboratories shared 99% nucleotide and amino acid identity: this similarity would not be expected in viruses from such geographically, temporally, and clinically diverse samples. Another problem was that in the U.S. non-A-to-E hepatitis study, all pools of patient sera were positive for viral sequences. These observations suggested the possibility of viral contamination.
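The identity figure that raised suspicion can be illustrated with a short calculation. The following is a minimal sketch in Python, assuming two sequences that have already been aligned to equal length; the function name `percent_identity` and the choice to skip gapped columns are assumptions made for illustration, not the method used in the studies described here.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of ungapped alignment columns in which the two residues match.

    Both inputs must be pre-aligned strings of equal length; '-' marks a gap.
    Columns containing a gap in either sequence are skipped (one of several
    possible conventions, chosen here only for simplicity).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a.upper(), aligned_b.upper()):
        if residue_a == "-" or residue_b == "-":
            continue  # ignore gapped columns
        compared += 1
        if residue_a == residue_b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy sequences (invented): 9 of 10 positions match.
    print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```

Genomes recovered in different countries, at different times, and from different diseases would be expected to have diverged, so values close to 100 across all of them point toward a single common source such as a contaminated reagent.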
When nucleic acids were repurified from the U.S. non-A-to-E hepatitis samples using a different method, none were positive for the new virus. The presence of the virus was traced to the use of column-based purification kits manufactured by Qiagen, Inc. (pictured). Nearly the entire viral genome could be detected by deep sequencing of sterile water that was passed through these columns. The nucleic acid purification columns contaminated with the new virus were used to purify nucleic acid from patient samples. These columns, produced by a number of manufacturers, are typically an inch in length and contain a silica gel membrane that binds nucleic acids. The clinical samples are added to the column, which is then centrifuged briefly to remove liquids (hence the name “spin” columns). The nucleic acid adheres to the silica gel membrane. Contaminants are washed away, and the nucleic acids are then released from the silica by the addition of a buffer.
Why were the Qiagen spin columns contaminated with viral DNA? A search of the publicly available environmental metagenomic data sets revealed the presence of sequences highly related to this virus (87 to 99% nucleotide identity). The data sets containing these sequences were obtained from seawater collected off the Pacific coast of North America and coastal regions of Oregon and Chile. The source of contamination could be explained if the silica in the Qiagen spin columns was produced from ocean-dwelling diatoms that were infected with the virus.
In retrospect, it was easy to be fooled into believing that the novel virus might be a human pathogen because viral DNA was detected only in patients and not in healthy controls. Why antibodies to the virus were detected in samples from both sick and healthy patients remains to be explained. However, the virus is not likely to be associated with any human illness: when non-Qiagen spin columns were used, the viral sequences were not found in any patient sample.
The lesson to be learned from this story is clear: high-throughput sequencing is a very powerful and sensitive method but must be applied with great care. Every step of the virus discovery process must be carefully controlled, from the water used to the plastic reagents. Most importantly, laboratories carrying out pathogen discovery must share their sequence data, something that took place during this study.
Naccache SN, Greninger AL, Lee D, Coffey LL, Phan T, Rein-Weston A, Aronsohn A, Hackett J, Jr, Delwart EL, Chiu CY. 2013. The perils of pathogen discovery: origin of a novel parvovirus-like hybrid genome traced to nucleic acid extraction spin columns. J Virol 87:11966–11977.
Xu B, Zhi N, Hu G, Wan Z, Zheng X, Liu X, Wong S, Kajigaya S, Zhao K, Mao Q, Young NS. 2013. Hybrid DNA virus in Chinese patients with seronegative hepatitis discovered by deep sequencing. Proc Natl Acad Sci U S A 110:10264–10269.
Computational biology. The generation of nucleotide sequences at an unprecedented rate has spawned a new branch of bioinformatics to develop algorithms for assembling sequence reads into continuous strings and to determine whether they are from a new or previously discovered virus. Storing, analyzing, and sharing massive quantities of data constitute an immense challenge: the number of bases in GenBank, an open-access, annotated collection of all publicly available nucleotide sequences produced and maintained by the National Center for Biotechnology Information, has doubled every 18 months since 1982. As of June 2019 GenBank held 329,835,282,370 bases.
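The consequence of an 18-month doubling time can be made concrete with a short calculation. This is a minimal sketch that assumes idealized exponential growth; the function names are hypothetical, and the projection is only an illustration of the arithmetic, not a forecast of actual GenBank content.

```python
DOUBLING_MONTHS = 18                     # doubling time quoted in the text
GENBANK_JUNE_2019 = 329_835_282_370      # bases, as quoted in the text


def fold_increase(months, doubling_months=DOUBLING_MONTHS):
    """Fold growth after `months`, assuming a constant doubling time."""
    return 2 ** (months / doubling_months)


def project_bases(current_bases, months_ahead, doubling_months=DOUBLING_MONTHS):
    """Projected number of bases if the doubling rate simply continued."""
    return current_bases * fold_increase(months_ahead, doubling_months)


if __name__ == "__main__":
    # Growth per year implied by an 18-month doubling time.
    print(f"fold increase per year: {fold_increase(12):.2f}")
    # Two doublings (36 months) past the June 2019 figure.
    print(f"projected bases after 36 months: {project_bases(GENBANK_JUNE_2019, 36):.3e}")
```

Under this assumption the database grows by roughly 59% each year, which is why storage and analysis capacity must expand continuously rather than in occasional steps.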
Computational problems must be solved at multiple steps during the process of genome sequencing. The initial problem is that sequence reads are typically short, and there are many of them (i.e., high throughput). These short sequences must be overlapped and, if possible, mapped to a genome. Many computer programs have been developed to address this problem. Some carry out alignment of sequence reads to a reference genome, while others perform this process de novo, i.e., in the absence of a reference genome.
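The two strategies can be sketched on a toy example. The Python code below is an illustration only, assuming error-free reads from a single strand: `map_to_reference` places reads by exact string matching, and `greedy_assemble` merges reads de novo by repeatedly joining the pair with the longest suffix-prefix overlap. All names and the toy sequences are hypothetical; real assemblers and aligners tolerate sequencing errors and use indexed data structures and far more sophisticated algorithms.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """De novo sketch: merge the best-overlapping pair until none remain."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


def map_to_reference(reads, reference):
    """Reference-based sketch: 0-based position of each exact match, or -1."""
    return {read: reference.find(read) for read in reads}


if __name__ == "__main__":
    reference = "ATGGCGTACGTTAGC"                     # toy genome (invented)
    reads = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]     # toy error-free reads
    print(map_to_reference(reads, reference))
    print(greedy_assemble(reads))
```

Mapping requires that a related genome already be known, whereas the de novo route needs only the reads themselves, which is why it is the approach available when the sample may contain a previously undescribed virus.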
When clinical or environmental samples are subjected to high-throughput sequencing for pathogen discovery, it is essential to identify viral sequences in what is typically a mix of host, bacterial, and fungal sequences. This task relies on alignment of sequences to reference viral databases. However, such databases are limited because most of the sequences retrieved in metagenomic studies are unknown (so-called “dark matter”) and therefore cannot be annotated. Consequently, computational pipelines have been designed to analyze high-throughput