Cell Biology. Stephen R. Bolsover
hemoglobin α and hemoglobin β. Different members of a family sometimes encode proteins that carry out the same specialized function but at different times during development. The α‐ and β‐globin gene families, illustrated in Figure 4.7, are an example. The α‐globin gene cluster is on human chromosome 16 while the β‐globin gene cluster is on human chromosome 11. Hemoglobin is composed of two α globins and two β globins. The different globin proteins are produced at different stages of gestation, embryo to fetus to adult, to cope with the different oxygen transport requirements at each step. The duplication of genes and their subsequent divergence allows the expansion of the gene repertoire, the production of new protein molecules and the elaboration of ever more specialized gene functions during evolution.
Some sections of DNA are very similar in sequence to other members of their gene family but do not produce mRNA. These are known as pseudogenes. There are two in the α‐globin gene cluster and one in the β‐globin gene cluster (Ψ in Figure 4.7). Pseudogenes may be former genes that have mutated to such an extent that they can no longer be transcribed into RNA. Some pseudogenes have arisen because an mRNA molecule has been copied back into DNA by an enzyme called reverse transcriptase, which is found in some viruses (Medical Relevance 3.1 on page 39). Such pseudogenes, which are called processed pseudogenes, are immediately recognizable because some or all of their introns were spliced out before the integration occurred. Some also retain, within the genomic sequence, the poly(A) tail characteristic of intact mRNA (page 76).
IN DEPTH 4.1 THERE ARE MORE PROTEINS THAN GENES IN MULTI‐CELLULAR ORGANISMS
As genomes of more and more organisms were sequenced, the most surprising feature to emerge was just how few genes supposedly complex organisms possess (Table 3.1 on page 39). The first eukaryotic genome to be sequenced was that of the budding yeast, Saccharomyces cerevisiae, the simple unicellular fungus that we use to make bread and beer. S. cerevisiae has 6091 genes. By contrast the fruit fly, Drosophila melanogaster, a much more complex organism with a brain, nervous and digestive systems, and the ability to fly and navigate, has 14 133 genes, only a little more than twice the number in yeast. Even more surprising was the finding that humans have only about 19 116 protein‐coding genes. However, humans make many more than 19 116 proteins and it is these that contribute to the complexity of an organism such as ourselves. How is it possible to have so few genes and yet make hundreds of thousands of different proteins? It is the arrangement of human genes into exons and introns (page 59) that provides the solution. Alternative splicing (page 76) allows the cell to “cut and paste” exons in different ways to produce many different mRNAs from the same gene. The most extreme case known is the human gene called SLO, which encodes a protein found in some potassium channels (page 152). This gene has 35 exons, which can be combined in 40 320 different ways. It is estimated that about 50% of human genes show alternative splicing, with the pattern of splicing (and hence the range of proteins produced) varying from tissue to tissue. Drosophila genes also show alternative splicing but those of yeast, which contain few introns, do not.
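The "cut and paste" idea can be illustrated with a short Python sketch. It is a deliberately simplified model, not a description of how any real gene is spliced: the five-exon gene, the exon names E1 to E5, and the function name splice_isoforms are all hypothetical, and the sketch assumes that each optional exon is included or skipped independently of the others. Real splicing is regulated and uses other patterns as well, so the sketch makes no attempt to reproduce the figure quoted for SLO.

```python
from itertools import product


def splice_isoforms(exons, optional):
    """List every exon combination obtained by independently keeping or
    skipping each optional exon. Exon order is always preserved."""
    optional = set(optional)
    # A constitutive exon has one choice (keep); an optional exon has two.
    choices = [(True, False) if name in optional else (True,) for name in exons]
    isoforms = []
    for keep in product(*choices):
        isoforms.append(tuple(name for name, k in zip(exons, keep) if k))
    return isoforms


# Hypothetical gene: E2 and E4 are optional, the other exons are always used.
exons = ["E1", "E2", "E3", "E4", "E5"]
isoforms = splice_isoforms(exons, optional=["E2", "E4"])

for isoform in isoforms:
    print("-".join(isoform))
print(len(isoforms), "different mRNAs from one gene")
```

Under this assumption the number of possible mRNAs doubles with every optional exon, so even a modest number of optional exons gives one gene a large repertoire of products.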
Sometimes DNA that encodes RNA is repeated as a series of copies that follow one after the other along the chromosome. Such genes are said to be tandemly repeated. The genes that code for ribosomal RNAs (about 250 copies/cell), transfer RNAs (50 copies/cell), and histone proteins (20–50 copies/cell) are tandemly repeated. The products of these genes are required in large amounts.
This still leaves about 75% of our nuclear genome that lacks a very clearly understood function. A large proportion of this extragenic DNA is made up of repetitive DNA sequences that are repeated many times in the genome. Some sequences are repeated more than a million times and are called satellite DNA. The repeating unit is usually several hundred base pairs long, and many copies are often lined up next to each other in tandem repeats. Most of the satellite DNA is found in a region called the centromere, which plays a role in the physical movement of the chromosomes that occurs at cell division (page 235), and one theory is that it has a structural function.
Our genome also contains minisatellite DNA where the tandem repeat is about 25 bp long. Minisatellite DNA stretches can be up to 20 000 bp in length and are often found near the ends of chromosomes, a region called the telomere. Microsatellite DNA has an even smaller repeat unit of about 4 bp or fewer. Again, the function of these repeated sequences is unknown, but because the number of repeats in a microsatellite varies between individuals, microsatellites have proved very useful in DNA testing (page 130). Other extragenic sequences, known as LINEs (long interspersed nuclear elements) and SINEs (short interspersed nuclear elements), occur in our genome. There are about 50 000 copies of LINEs in a mammalian genome and they make up about 17% of the human genome.
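The way in which microsatellite repeat number distinguishes individuals can be sketched in a few lines of Python. The sequences below are invented for illustration, as are the repeat unit GATA, the flanking bases, and the function name longest_tandem_run; the sketch only counts back‐to‐back copies of a repeat unit and is not a description of a laboratory DNA testing procedure.

```python
def longest_tandem_run(sequence, unit):
    """Return the largest number of back-to-back copies of `unit`
    found anywhere in `sequence` (0 if the unit never occurs)."""
    best = 0
    for start in range(len(sequence) - len(unit) + 1):
        count = 0
        position = start
        # Count consecutive copies of the unit beginning at `start`.
        while sequence.startswith(unit, position):
            count += 1
            position += len(unit)
        best = max(best, count)
    return best


# Hypothetical sequences from the same locus in two individuals:
# identical flanking DNA, different numbers of the 4 bp repeat.
individual_a = "CCTG" + "GATA" * 7 + "TTCA"
individual_b = "CCTG" + "GATA" * 11 + "TTCA"

print("Individual A:", longest_tandem_run(individual_a, "GATA"), "repeats")
print("Individual B:", longest_tandem_run(individual_b, "GATA"), "repeats")
```

Comparing repeat counts at several such loci, rather than at one, is what makes a profile informative, because two unrelated people are unlikely to match at every locus.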
GENE NOMENCLATURE
One of the great difficulties that has arisen out of genome‐sequencing projects is how to name the genes and the proteins they encode. This has not been easy and a number of committees have been set up to deal with this problem. In general, each gene is designated by an abbreviation, written in capitalized italics. For example, type 1 collagen (the commonest form in the human body) is a trimer formed of two molecules of collagen 1 α1 and one molecule of collagen 1 α2. The abbreviated names of these proteins are COL1A1 and COL1A2 respectively, using normal capitals, while the names of the genes coding for these proteins use capitalized italics: COL1A1 and COL1A2. It is mutations in COL1A1 that give rise to osteogenesis imperfecta (Medical Relevance 3.2 on page 47).
There are many instances where, for historical reasons, the correlation between the protein and gene names is not so simple. For example, the proteins connexin 43, 46, and 50 (page 28) are named for their relative molecular masses (43 kDa, etc.) and have the abbreviated names Cx43, Cx46, and Cx50. However, the genes that encode these proteins are called GJA1, GJA3, and GJA8 respectively, where GJ stands for gap junction.
IN DEPTH 4.2 GENOME PROJECTS
The publication in 1996 of the sequence of the genome of the single‐celled yeast S. cerevisiae was a milestone in biology. Not only did scientists have before them the complete genetic blueprint of a eukaryotic organism, but the technology for obtaining and curating huge amounts of genetic data was established. The genomes of other simple organisms such as the tiny nematode worm Caenorhabditis elegans, with just 959 body cells, and the fruit fly D. melanogaster, were published soon after, followed by more complex organisms such as the mouse and, of course, humans. Today, the sequence of the genomes of nearly 60 000 organisms, including 15 000 eukaryotic species, has been determined. Genomes from every branch of the tree of life are now available for study, including the platypus, our most distant mammalian relative, and both the nuclear and mitochondrial genomes of the Neanderthal, the hominid most closely related to present‐day humans.
Sophisticated databases have been created to store and analyze base sequence information from the various genome projects. Computer programs analyze the data for exon sequences and compare the sequence of one genome to that of another. In this way sequences encoding related proteins (proteins that share stretches of similar amino acids) can be identified. The genome data from patients can be used to identify mutations and inform clinical decision‐making. Some important programs that can be easily accessed through the internet are BLASTN for the comparison of a nucleotide sequence to other sequences stored in a nucleotide database and BLASTP, which compares an amino acid sequence to protein sequence databases. Programs such as Clustal, MAFFT, MUSCLE, and T‐Coffee can be used to compare multiple DNA or multiple protein sequences
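The principle of comparing a query sequence with the entries in a database can be sketched in Python. This is not the algorithm used by BLASTN or BLASTP, which rely on fast heuristics; it is a minimal local alignment score computed by dynamic programming (the Smith–Waterman approach). The scoring values, the three database entries, and the function name local_alignment_score are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.
    Scoring values are illustrative, not those of any real program."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                        # start a new local alignment
                previous[j - 1] + pair,   # align a[i-1] with b[j-1]
                previous[j] + gap,        # gap in b
                current[j - 1] + gap,     # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# Hypothetical query and a tiny hypothetical "database".
query = "GATTACA"
database = {
    "seq_A": "CCGATTACAGG",  # contains the query exactly
    "seq_B": "TTTTCCCC",     # little similarity
    "seq_C": "GGATTTACAT",   # query with one extra base inserted
}

# Rank database entries from most to least similar to the query.
hits = sorted(
    database,
    key=lambda name: local_alignment_score(query, database[name]),
    reverse=True,
)
for name in hits:
    print(name, local_alignment_score(query, database[name]))
```

Scoring every entry in this way becomes slow for databases holding millions of sequences, which is why practical search programs use shortcuts to find likely matches first.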