Transcript for:
Insights into the Human Genome Project

In April 2003, the Human Genome Project was declared complete, putting an end to the 13-year-long scientific endeavor. What is the Human Genome Project? What is inside the human genome? And how was the human genome sequenced? These are some of the questions that we will try to address in this lecture.

The Human Genome Project is an international research effort to sequence and map all of the genes of Homo sapiens. The project began in 1990, initially headed by James Watson at the U.S. National Institutes of Health. Largely due to his disagreement with his boss, Bernadine Healy, over the issue of patenting genes, he was forced to resign in 1992. He was replaced by Francis Collins in April 1993, and the name of the Center was changed to the National Human Genome Research Institute in 1997. A working draft of the genome was released in 2000 and a complete one in 2003, with further analysis still being published. In May 2006, another milestone was passed on the way to completion of the project, when the sequence of the last chromosome was published in the journal Nature.

Completed in April 2003, the Human Genome Project gave us a complete representative sequence of the human genome. The representative sequence is a composite from several people who donated blood samples. Originally, close to 100 people volunteered to give a sample of their blood. Each person provided informed consent, affirming that they agreed to the study of their DNA. No names were attached to the blood samples, and ultimately scientists used only a few of them. These measures ensured that the DNA sequences remained anonymous. Not even the donors knew whether their samples were actually used or not.

Although sequencing of the human genome was declared complete in 2003, a number of regions of the human genome remain unfinished.
Small gaps that cannot be recovered by any current sequencing method remain, accounting for about 1 percent of the gene-containing portion of the genome. These include:

• The central regions of each chromosome, known as centromeres, which consist of highly repetitive DNA sequences that are difficult to sequence.
• The ends of the chromosomes, called telomeres, which are also highly repetitive.
• Several loci in each individual's genome that contain repetitive sequences that are difficult to sequence.

The strategy originally established by the publicly funded Human Genome Project was based on a method called hierarchical shotgun sequencing. In this approach, genomic DNA is cut into pieces of about 150 kilobases, inserted into a bacterial artificial chromosome vector, and transformed into E. coli, where the pieces are replicated and stored. The bacterial artificial chromosome inserts are isolated and mapped to determine the order of each cloned 150-kilobase fragment. This is referred to as the Golden Tiling Path. Each bacterial artificial chromosome fragment in the Golden Path is fragmented randomly into smaller pieces, typically 1 to 2 kilobases long, and each piece is cloned into a plasmid vector. The resulting transformed bacterial colonies are picked at random and sequenced on both strands. These sequences are aligned so that identical sequences overlap. These contiguous pieces are then assembled into finished sequence once each strand has been sequenced about 4 times. The term contig refers to a known DNA sequence that is contiguous and lacks gaps.

In 1998, a parallel project was launched by the private company Celera Genomics. Celera used a riskier technique called whole genome shotgun sequencing, which had been used to sequence bacterial genomes that are smaller in size and contain less repetitive DNA. Shotgun sequencing randomly shears genomic DNA into small pieces, which are cloned into plasmids and sequenced on both strands, thus eliminating the bacterial artificial chromosome step from the publicly funded Human Genome Project approach.
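The overlap-and-merge idea behind shotgun assembly, in which reads that share identical sequence at their ends are joined into contigs, can be illustrated with a toy greedy assembler. This is a minimal sketch, not the software used by either project: the function names overlap and greedy_assemble, the min_len threshold, and the example reads are all made up for illustration. It ignores sequencing errors, reverse-complement strands, and repeats, which is where real assembly becomes difficult.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no pair overlaps by at least `min_len` bases and returns
    the remaining contigs. Any break between contigs is a gap.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]


# Hypothetical reads sampled from the toy sequence ATGCGTACGTTAGC.
toy_reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
assembled = greedy_assemble(toy_reads)

# Reads with no shared ends cannot be joined and stay as separate contigs.
gapped = greedy_assemble(["ATGCGTAC", "TTTGGGCC"])
```

In this toy case the three reads collapse into a single contig, while the two unrelated reads remain apart. A greedy merge like this is easily misled by repeated sequence, which is one reason a mapped set of large clones reduces assembly mistakes.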
Once the sequences are obtained, they are aligned and assembled into finished sequence.

The advantage of the hierarchical approach is that sequencers are less likely to make mistakes when assembling the shotgun fragments into contigs as long as full chromosomes. The reason is that the chromosomal location of each bacterial artificial chromosome is known, and there are fewer random pieces to assemble. The disadvantage of this method is time and expense. The whole genome shotgun method used by Celera Genomics is faster and less expensive, but it is more prone to errors due to incorrect assembly of finished sequence. It also required tremendous computational power.

Which method is better? It depends on the size and complexity of the genome. With the human genome, each group believes its approach to be superior to the other. It is worth mentioning that Celera had access to the publicly funded Human Genome Consortium data, but the Human Genome Consortium did not have access to the Celera data. Indeed, without the Golden Tiling Path released by the Human Genome Consortium, the whole genome shotgun method would not have been able to determine where each fragment of sequenced DNA belongs. Because of the advantages and disadvantages of both approaches, the strategies evolved into hybrids toward the end of the project, in which the Human Genome Consortium selected more clones arbitrarily and Celera made use of the bacterial artificial chromosome maps and sequence generated by the HGP. In that way, both organizations were able to reach their goals in less than the expected time frame.

The total estimated size of the human genome is 3.2 gigabases. Most DNA in the human genome is non-coding DNA, including intergenic regions, introns, repetitive sequences and so forth. About 28% of human DNA is transcribed into RNA, and a mere 1.25% actually codes for proteins.
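To give a sense of scale, the percentages just quoted can be converted into base counts. The short calculation below is only an illustration using the lecture's own figures (3.2 gigabases, 28%, 1.25%). The constant and function names are hypothetical.

```python
# Figures quoted in the lecture; treat them as approximate estimates.
GENOME_SIZE_BP = 3_200_000_000  # 3.2 gigabases


def bases_for_percent(percent: float, genome_size: int = GENOME_SIZE_BP) -> int:
    """Number of bases corresponding to a percentage of the genome."""
    return round(genome_size * percent / 100)


transcribed_bp = bases_for_percent(28)      # portion transcribed into RNA
coding_bp = bases_for_percent(1.25)         # portion that codes for protein
non_coding_bp = GENOME_SIZE_BP - coding_bp  # everything else

print(f"Transcribed:    {transcribed_bp / 1e6:,.0f} megabases")
print(f"Protein-coding: {coding_bp / 1e6:,.0f} megabases")
print(f"Non-coding:     {non_coding_bp / 1e6:,.0f} megabases")
```

On these figures, protein-coding sequence amounts to about 40 megabases out of 3,200.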
On average, introns are longer in human DNA than in other organisms sequenced so far. Over half the human genome consists of repeated sequences. Some 45% is transposons. About 3% of the human genome consists of repeats of just a few bases, such as microsatellites and variable number tandem repeats. And 5% of the human genome is made up of duplications of large genome segments. There are both AT-rich regions and GC-rich regions in the human genome. Curiously, the zones of GC-rich sequence have a higher density of genes, and their introns are shorter. The significance of this is still unknown.

One of the biggest surprises in the human genome is the revelation that humans have only around 25,000 genes rather than the previously estimated 100,000 genes. The nematode worm Caenorhabditis, with approximately 18,000 genes, has half as much genetic information as humans. The mouse has essentially the same number of genes as humans. Thus, the gene count of a genome is not directly related to its biological complexity.

Another surprise of the human genome is that only a small fraction of the genome, the 1.25% noted earlier, codes for protein. The function of much of the remaining DNA is yet to be determined. These findings have significant implications for how we understand the basic science of how life works, and also for how those sequences that are regulatory can affect health and disease.

However, biological complexity may reside in the expressed proteins. First, alternative splicing of a pre-mRNA can yield multiple functional mRNAs corresponding to a particular gene. In humans, about 60% of transcripts are subject to alternative splicing. Second, variations in the post-translational modification of some proteins may produce functional differences. For example, 80 to 90% of human proteins are acetylated. Reversible phosphorylation of proteins is another important regulatory mechanism.
Reversible phosphorylation results in a conformational change in the structure of many enzymes and receptors, causing them to become activated or deactivated. Finally, qualitative differences in the interactions between proteins and their integration into pathways may contribute significantly to the differences in biological complexity among organisms. For example, signals from the exterior of a cell are relayed to the inside of that cell by protein-protein interactions of the signaling molecules. This process, called signal transduction, plays a fundamental role in many biological processes and in many diseases. Proteins might interact for a long time to form part of a protein complex, a protein may be carrying another protein, or a protein may interact briefly with another protein just to modify it. Therefore, protein-protein interactions are of central importance for virtually every process in a living cell.

Although the Human Genome Project is considered complete, great challenges remain in understanding what the sequence means.

About 1% of the genome, located in heterochromatic regions, is still unsequenced and may or may not harbor genes. This challenge will only be overcome by the development of new sequencing techniques.

Also, the regulatory signals for most genes remain uncharacterized. It is currently unclear how epigenetic modifications, such as cytosine methylation and gene silencing, act on a genome-wide scale to produce their biological consequences.

The human genome sequence also identified more than 1.4 million single nucleotide polymorphisms. Although any two unrelated people share about 99.9% of their DNA sequence, some people may have an A at a particular site on a chromosome while others have a G instead. Further work is currently being conducted to find genetic variants affecting health, disease, and response to drugs and environmental factors.
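The idea of a single nucleotide polymorphism, one site where two otherwise identical sequences carry different bases, can be shown with a small comparison of two aligned sequences. This is a sketch under simplifying assumptions: the sequences are already aligned and of equal length, the two 20-base examples are invented, and the function names snp_sites and percent_identity are hypothetical. Note that 99.9% identity across 3.2 gigabases still corresponds to roughly 3.2 million differing sites.

```python
def snp_sites(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Positions (0-based) where two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, base_a, base_b)
        for pos, (base_a, base_b) in enumerate(zip(seq_a, seq_b))
        if base_a != base_b
    ]


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which the two sequences agree."""
    differences = len(snp_sites(seq_a, seq_b))
    return 100 * (len(seq_a) - differences) / len(seq_a)


# Two invented sequences that differ at a single site (A versus G).
person_1 = "ATGCGTACGTTAGCATGCAA"
person_2 = "ATGCGTACGTTGGCATGCAA"

sites = snp_sites(person_1, person_2)
identity = percent_identity(person_1, person_2)
```

Real variant discovery works on many individuals and must account for alignment gaps and sequencing errors, so this comparison only conveys the basic concept.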

The scientific challenges outlined above focus on how the genome sequence can be mined for biological information. Ultimately, it is the function of the gene product that plays the integral role in the biological system. Therefore, tremendous effort has also been devoted to deciphering gene function.

In summary, the availability of a complete genome sequence will enormously facilitate the solution of the more difficult problem of identifying the genetic components of the more complex and more common disorders, such as many forms of diabetes, asthma, cancer, and mental illness, in which multiple genetic and environmental factors interact. Using techniques that can measure the expression of thousands of genes at a time, scientists are now beginning to look globally for differences in gene expression that are associated with, for example, the ability to respond to different drugs or pathological states such as cancer.