In April 2003, the Human Genome Project was declared complete, ending a 13-year scientific endeavor. What is the Human Genome Project? What is inside the human genome? And how was the human genome sequenced? These are some of the questions that we will try to address in this lecture.

The Human Genome Project is an international research effort to sequence and map all of the genes of Homo sapiens. The project began in 1990, initially headed by James Watson at the U.S. National Institutes of Health. Largely because of his disagreement with his boss, Bernadine Healy, over the issue of patenting genes, he was forced to resign in 1992. He was replaced by Francis Collins in April 1993, and the name of the Center (until then the National Center for Human Genome Research) was changed to the National Human Genome Research Institute in 1997. A working draft of the genome was released in 2000 and a complete one in 2003, with further analysis still being published. In May 2006, another milestone was passed on the way to completion of the project, when the sequence of the last chromosome was published in the journal Nature.

The Human Genome Project gave us a complete representative sequence of the human genome. The representative sequence is a composite from several people who donated blood samples. Originally, close to 100 people volunteered to give a sample of their blood. Each person provided informed consent, affirming that they agreed to the study of their DNA. No names were attached to the blood samples, and ultimately scientists used only a few of them. These measures ensured that the DNA sequences remained anonymous. Not even the donors knew whether their samples were actually used.

Although the sequencing of the human genome was declared complete in 2003, a number of regions of the human genome remain unfinished. Small gaps that cannot be recovered with any current sequencing method remain, amounting to about 1 percent of the gene-containing portion of the genome.
These include:
• The central regions of each chromosome, known as centromeres, which consist of highly repetitive DNA sequences that are difficult to sequence.
• The ends of the chromosomes, called telomeres, which are also highly repetitive.
• Several loci in each individual’s genome that contain repetitive sequences that are difficult to sequence.

The strategy originally established by the publicly funded Human Genome Project was based on a method called hierarchical shotgun sequencing. In this approach, genomic DNA is cut into pieces of about 150 kilobases, which are inserted into a bacterial artificial chromosome (BAC) vector and transformed into E. coli, where they are replicated and stored. The BAC inserts are then isolated and mapped to determine the order of each cloned 150-kilobase fragment. The resulting ordered set of clones is referred to as the Golden Tiling Path. Each BAC fragment in the Golden Tiling Path is fragmented randomly into smaller pieces, typically 1 to 2 kilobases long, and each piece is cloned into a plasmid vector. The resulting transformed bacterial colonies are picked at random and sequenced on both strands. These sequences are aligned so that identical sequences overlap. The contiguous pieces are then assembled into finished sequence once each strand has been sequenced about 4 times. The term contig refers to a known DNA sequence that is contiguous and lacks gaps.

In 1998, a parallel project was launched by the private company Celera Genomics. Celera used a riskier technique called whole genome shotgun sequencing, which had previously been used to sequence bacterial genomes that are smaller in size and contain less repetitive DNA. Whole genome shotgun sequencing randomly shears genomic DNA into small pieces, which are cloned into plasmids and sequenced on both strands, thus eliminating the BAC step of the publicly funded Human Genome Project approach.
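The overlap-and-merge idea behind assembling shotgun reads into contigs can be illustrated with a toy sketch. This is not the assembler used by either project. It is a minimal greedy example that assumes error-free reads from a single strand of a short, repeat-free sequence. The function names `overlap` and `greedy_assemble`, the `min_len` parameter, and the example sequence are all made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0  # no overlap of at least min_len bases


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        # Discard any sequence that is fully contained in another one.
        contigs = [r for r in contigs
                   if not any(r != o and r in o for o in contigs)]
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical 25-base "genome", cut into 10-base reads that overlap by 5 bases.
reference = "ATGGCGTACGTTAGCCTAGGATCCA"
reads = [reference[i:i + 10] for i in range(0, len(reference) - 9, 5)]
contigs = greedy_assemble(reads)
print(contigs)
```

On this input the four reads merge back into a single contig equal to the original sequence. Real genomes contain long repeats, which is exactly where a greedy merge like this one can join the wrong pieces, and why knowing the chromosomal location of each clone makes assembly safer.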
Once the sequences are obtained, they are aligned and assembled into finished sequence.

The advantage of the hierarchical approach is that mistakes are less likely when assembling the shotgun fragments into contigs, even contigs as long as full chromosomes. The reason is that the chromosomal location of each bacterial artificial chromosome is known, and there are fewer random pieces to assemble. The disadvantage of this method is time and expense. The whole genome shotgun method used by Celera Genomics is faster and less expensive, but it is more prone to errors caused by incorrect assembly of the finished sequence. It also required tremendous computational power.

Which method is better? It depends on the size and complexity of the genome. For the human genome, each group believed its approach to be superior to the other. It is worth mentioning that Celera had access to the data of the publicly funded Human Genome Consortium, but the Human Genome Consortium did not have access to the Celera data. Indeed, without the Golden Tiling Path released by the Human Genome Consortium, the whole genome shotgun method would not have been able to determine where each fragment of sequenced DNA belongs. Because both approaches have advantages and disadvantages, the strategies evolved into hybrids toward the end of the project: the Human Genome Consortium selected more clones arbitrarily, and Celera made use of the bacterial artificial chromosome maps and sequence generated by the Human Genome Project. In that way, both organizations were able to reach their goals in less time than expected.

The total estimated size of the human genome is 3.2 gigabases. Most DNA in the human genome is non-coding DNA, including intergenic regions, introns, repetitive sequences and so forth. About 28% of human DNA is transcribed into RNA, and a mere 1.25% actually codes for proteins. On average, introns are longer in human DNA than in the other organisms sequenced so far.
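To make these percentages concrete, the short sketch below converts them into numbers of bases. It uses only the figures quoted above (3.2 gigabases, 28% and 1.25%). The names `GENOME_SIZE_BP`, `SHARES` and `bases_for` are chosen here purely for illustration.

```python
from fractions import Fraction

GENOME_SIZE_BP = 3_200_000_000  # 3.2 gigabases, as quoted in the lecture

# Shares of the genome quoted in the lecture, kept exact with Fraction.
SHARES = {
    "transcribed into RNA": Fraction("0.28"),
    "protein-coding": Fraction("0.0125"),
}


def bases_for(share: Fraction, genome_size: int = GENOME_SIZE_BP) -> int:
    """Number of bases corresponding to a share of the genome."""
    return int(share * genome_size)


base_counts = {label: bases_for(share) for label, share in SHARES.items()}
for label, count in base_counts.items():
    print(f"{label}: about {count / 1e6:,.0f} megabases")
```

Under these figures, roughly 896 megabases are transcribed, but only about 40 megabases of the 3.2 gigabases code for protein.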
Over half the human genome consists of repeated sequences. Some 45% is transposons. About 3% of the human genome consists of repeats of just a few bases, such as microsatellites and variable number tandem repeats. Another 5% is made up of duplications of large genome segments. There are both AT-rich regions and GC-rich regions in the human genome. Curiously, the zones of GC-rich sequence have a higher density of genes, and their introns are shorter. The significance of this is still unknown.

One of the biggest surprises in the human genome is the revelation that humans have only around 25,000 genes rather than the previously estimated 100,000. The nematode worm Caenorhabditis elegans, with approximately 18,000 genes, has nearly three quarters as many genes as humans. The mouse has essentially the same number of genes as humans. Thus, the number of genes in a genome is not directly related to biological complexity. Another surprise is how little of the genome codes for protein: as noted earlier, less than a third is even transcribed into RNA, and only about 1.25% is protein-coding. The function of most of the remaining DNA is still unknown. These findings have significant implications for how we understand the basic science of how life works, and also for how those sequences that are regulatory can affect health and disease.

Biological complexity may instead reside in the expressed proteins. First, alternative splicing of a pre-mRNA can yield multiple functional mRNAs corresponding to a particular gene. In humans, about 60% of transcripts are subject to alternative splicing. Second, variations in the post-translational modification of some proteins may produce functional differences. For example, 80 to 90% of human proteins are acetylated. Reversible phosphorylation of proteins is another important regulatory mechanism. It results in a conformational change in the structure of many enzymes and receptors, causing them to become activated or deactivated.
Finally, qualitative differences in the interactions between proteins and their integration into pathways may contribute significantly to the differences in biological complexity among organisms. For example, signals from the exterior of a cell are relayed to the inside of that cell by protein-protein interactions of the signaling molecules. This process, called signal transduction, plays a fundamental role in many biological processes and in many diseases. Proteins may interact for a long time to form part of a protein complex, one protein may carry another, or a protein may interact only briefly with another protein to modify it. Protein-protein interactions are therefore of central importance for virtually every process in a living cell.

Although the Human Genome Project is considered complete, great challenges remain in understanding what the sequence means. About 1% of the genome, located in heterochromatic regions, is still unsequenced and may or may not harbor genes. This challenge will only be overcome by the development of new sequencing techniques. Also, the regulatory signals for most genes remain uncharacterized. It is currently unclear how epigenetic modifications, such as cytosine methylation, and gene silencing operate on a genome-wide scale and what their biological consequences are.

The sequencing effort also identified more than 1.4 million single nucleotide polymorphisms. Any two unrelated people share about 99.9% of their DNA sequence, but at a particular site on a chromosome some people may have an A while others have a G. Further work is currently being conducted to find genetic variants affecting health, disease, and the response to drugs and environmental factors.
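A rough sense of scale for the 99.9% figure can be had with a short sketch. It assumes the 3.2-gigabase genome size quoted earlier in the lecture, and it makes the simplifying assumption that all differences are single-base substitutions at aligned positions. The two short sequences and the function name `differing_sites` are invented for illustration.

```python
def differing_sites(seq_a: str, seq_b: str) -> list[int]:
    """Return the 0-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Two made-up aligned sequences that differ at one site (A versus G).
person_1 = "ACGTTAGCAGGCTA"
person_2 = "ACGTTAGCGGGCTA"
sites = differing_sites(person_1, person_2)
print(sites, person_1[sites[0]], person_2[sites[0]])

# 99.9% identity means about 1 site in 1,000 differs.
GENOME_SIZE_BP = 3_200_000_000  # 3.2 gigabases, as quoted in the lecture
expected_differences = GENOME_SIZE_BP // 1000
print(f"about {expected_differences:,} differing sites between two people")
```

Under these assumptions, two unrelated people would differ at roughly 3.2 million positions, which shows how a 0.1% difference still leaves a very large number of variants to study.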
The scientific challenges outlined above focus on how the genome sequence can be mined for biological information. Ultimately, it is the function of the gene product that plays the integral role in the biological system. Tremendous effort has therefore also been devoted to deciphering gene function. Using techniques that can measure the expression of thousands of genes at a time, scientists are now beginning to look globally for differences in gene expression that are associated with, for example, the ability to respond to different drugs or with pathological states such as cancer.

In summary, the availability of a complete genome sequence will enormously facilitate the solution of a more difficult problem: identifying the genetic components of the more complex and more common disorders, such as many forms of diabetes, asthma, cancer, and mental illness, in which multiple genetic and environmental factors interact.