Transcript for:
Exploring the Human Genome and Its Applications

Hello everybody, this is lecture 9 of our course, and we will talk about the human genome and its application in research and clinical practice. The learning objectives of this lecture are to understand the principles of human genome assembly, to examine the variability and similarity between human genomes, and to get familiar with the functional and regulatory elements of the human genome. We will start with an overview of the human genome and its complexity. We will discuss the ways the human genome is built, annotated, and released. The Personal Genome Project is one of the initiatives that would not be possible without knowledge of the human genome, and we will talk about it in general terms. You may be familiar with the mitochondrial genome, and we will mention some of its particularities in comparison to the nuclear genome. And finally, there will be some examples of the application of human genome knowledge in clinical practice and research. This slide is familiar to you, and I would just like to remind you of the amazing progress made during the last 50-60 years in the field of molecular biology, genomics, and its application in medicine. The field of sequencing, and actually of genomics and genetics, started with the prediction of the double-helix structure of DNA by Watson and Crick. Several milestone discoveries brought us to the current technological level and allow us to sequence the entire genome. Among the biggest milestones, we could mention the recombinant DNA technique and the development of an automated sequencer by Applied Biosystems, based on the Sanger sequencing method, which made it possible to sequence certain pathogens and the first free-living organism, and which culminated in the sequencing of the human genome, the first draft of which was published in 2001. The human genome was sequenced about 15 years ago for a steep price of about $3 billion, while today several companies are offering to sequence your entire genome for $5,000 or less.
The progress in sequencing technologies is very impressive. It took only 6 years, from 1995 to 2001, to achieve an almost 2,000% increase in the number of nucleotides that were sequenced. As of now, the number of genomes sequenced every year is in the range of 10,000, and a tenfold increase in this number is expected every two years. Such a high throughput wouldn't be possible without the advances in sequencing technologies in the past decade. Now that we can decipher our genomes, how will this information help us understand the molecular processes that happen in living organisms? This is an excerpt from Donald Rumsfeld's interview in which he talks about the weapons of mass destruction in Iraq. There are known knowns. There are things we know we know. We also know there are known unknowns. That is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know. So there are known knowns, known unknowns, and unknown unknowns. What is the proportion of each of the categories in genomics? What are the knowns in genomics? How many genes do we have? How many protein-coding genes do we have? What are the functions of the non-coding genes? At some point, we may think that we already have the answers to the majority of these questions. However, from time to time, we obtain new and unexpected results that convince us that we are still at the first steps in understanding the complexity of the human genome. One of the general questions is how old we are as a species. Until relatively recently, we believed that the first Homo sapiens appeared about 40-50,000 years ago in Europe and some other parts of the world. However, the latest discoveries made us three times older, with an age of over 120,000 years since the first human beings appeared.
Considering that the lifespan of a human being is around 70 years now and was much shorter in ancient times, the number of generations that lived through all this time is probably in the thousands (at roughly 20-25 years per generation, 120,000 years is about 5,000-6,000 generations), and it is very impressive that we still carry the genes that were part of the first human beings who lived on this planet. The information from ancient times was preserved and passed on to us via the human genome, adding to its diversity with each generation while at the same time maintaining the genes most crucial for our survival. As you remember from our previous lectures, we don't differ very much from each other, and yet it is difficult to define a standard genome. From our NGS lecture, you also remember that the highest authority in the maintenance and annotation of our genome is the Genome Reference Consortium, or GRC. It is a collaborative body consisting of such advanced institutions as the Wellcome Trust Sanger Institute in the UK, the Genome Institute at Washington University, the European Bioinformatics Institute, and certainly the NCBI, which is the main repository of information and the main curator of all sequences. The Genome Reference Consortium is an international cooperative body responsible for the production of a single consensus representing the genome or multiple genomes. This chart represents the components of the genome build cycle. As you can see, it starts with the submission of sequences by researchers and includes some preparatory bioinformatics activities such as filtering, alignment of sequences, and mapping them to specific locations. The build itself starts with the assembly of genomic contigs into an unannotated collection of contig sequences and their arrangement along each chromosome. When we have the draft assembly, the next step is to model our genes and annotate their features and structures, such as SNPs, STSs, and coding and non-coding sequences.
Finally, the annotated data are loaded into Map Viewer, BLAST, FTP, and other relevant databases, and the databases are made publicly available. So all databases that use or require a new build of the genome are accessing the same information at the same time. As you remember from the sequencing timeline, the draft human genome was published in 2001, and specifically on July 3rd. Its total length at that time was a little more than 2.2 billion base pairs sequenced, which is a little over 75% of the current length of the human genome. While it was the first release, it was not actually the first build, because the previous 23 builds or so were done during the 10 years of sequencing that it took to get to the first draft of the entire genome. The latest GRC build, GRCh38, was released on December 24, 2013, and it already contains over 3 billion base pairs. So, how complete is it? Well, the answer is: not quite. We are still under 100% completion according to our current estimates. As you can see from this table, there are more than 11 million scaffolds that are not placed or mapped within the chromosomes from which they were derived at the sequencing stage. These orphan scaffolds, in the majority of cases, are transposons and repeating elements, for which we don't know for sure where they should be placed within the genome. Now, getting back to what we think we know. We know that the total number of genes is under 25,000, which is a much lower number than was predicted a decade ago. All genes together make up less than 1% of our entire genome. The majority of these genes have specific names and presumed functions, while the exact roles of about half of the genes are still unknown. The intergenic segments and introns contain various genomic structures, such as non-coding RNA and microRNA, to name a few, which in the not-so-distant past were considered junk DNA.
Today, while we appreciate much more the importance of such structures in the regulation of gene expression, we still don't know the functions of many genomic layers. As you may remember from one of our previous lectures, some long non-coding RNAs may code for proteins, which is an entirely novel discovery. From the total of about 23,000 genes, our current understanding of their function at the molecular level is limited to just a little over 50% of them, which are annotated and curated. In other words, we may understand the functions of about 12,000 genes, and often we only presume the functions of the rest. Again, our knowledge is at the very initial stage of understanding the complex molecular interactions that define the functions of our cells and the processes that regulate gene expression. There is a lot of room for new discovery. Since we have the same genomic content in each of our cells, the functional outcome is achieved by activating some genes and silencing others based on the instructions from the genomic code. The regulation of these processes is very complex, precisely concerted, and can be performed at various levels. For example, genes may be activated or silenced by enhancing or repressing their transcription, by modulation of transcription factors, and by changing the stability of messenger RNA. The splicing processes also generate a huge diversity of proteins in living organisms. We have very high hopes for next-generation sequencing techniques, which should enable scientists to uncover the dark territory in the human genome. This is probably one of the most accurate interpretations of the NGS process in general terms: think of a sheet of paper, or better, a huge pile of paper, shredded into small pieces. The actual problem is then to put it back together, each piece of paper on the sheet and line where it belongs, so that the initial content doesn't change. In NGS, this process is called assembly.
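To make the shredded-paper picture a bit more concrete, here is a minimal sketch in Python of a greedy overlap assembly on a toy sequence. It is only an illustration under simplifying assumptions (error-free reads, one strand, no repeats); the function names `overlap` and `greedy_assemble` and the toy reads are invented for this example, and real assemblers use far more elaborate methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads with the largest overlap.

    Stops when no pair overlaps by at least `min_len` characters and
    returns the remaining contigs.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return reads
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]


# Toy example: three overlapping "shreds" of a 14-letter sequence.
toy_reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
contigs = greedy_assemble(toy_reads)
print(contigs)
```

With these three toy reads the pieces merge back into a single contig. If the reads are made shorter than the required minimum overlap, or if the sequence contains repeats, the same procedure leaves several contigs or joins the wrong pieces, which is the difficulty with short fragments described here.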
As you can imagine, the shorter the pieces of shredded paper, the more difficult it is to find exactly where each one belongs in the entire manuscript or even on a particular page. Hence, errors in assemblies are inevitable. The only question is how big an error we are willing to accept and how much our final text will differ from the initial content. By now we have learned that the GRC build is based on a small number of genomes assembled into one reference. We have some alternatives annotated within the genome, while others are just indicated. The Personal Genome Project, in addition to providing 100,000 people with knowledge about their own genomes, will certainly help clarify the reference genome, and it may be reasonable to assume that in the end we may have several reference genomes, depending on ethnicity or other factors and traits that make us different from each other. If you paid attention to the SNP annotations, they are usually accompanied by a clarification of the population in which such SNPs were identified. The PGP will add a lot of information to our understanding of the relationship between our genes and phenotype, the prevalence of microbial sequences in our genome, and other extraneous genomic material that we may have. A separate and most studied part of the human molecular system is the mitochondrial genome. We have identified 37 genes in it, of which about one third (13) code for proteins of the oxidative phosphorylation system of mitochondria, and most of the rest (22) code for tRNAs and participate in the protein synthesis process. While the genetic code is nearly universal, it does not entirely apply to mitochondrial DNA. In comparison with nuclear, or genomic, DNA, the stop codons are coded for by different triplets of nucleotides. In nuclear DNA, UGA, for example, is a stop codon, while in mitochondrial DNA it codes for tryptophan. AGA and AGG in nuclear DNA code for arginine, while the same triplets define the stop codon in mitochondrial DNA.
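As a small illustration of these codon differences, the following Python sketch reads the same RNA sequence with two partial codon tables. The tables include only the codons needed for this example, not the full genetic code, and the names `NUCLEAR`, `MITO`, and `translate`, as well as the test sequence, are made up for the example.

```python
# Partial codon tables: only the codons used in this illustration.
NUCLEAR = {
    "UGG": "Trp",
    "UGA": "Stop",
    "AGA": "Arg",
    "AGG": "Arg",
}

# The mitochondrial table differs from the nuclear one in three codons.
MITO = {**NUCLEAR, "UGA": "Trp", "AGA": "Stop", "AGG": "Stop"}


def translate(rna: str, table: dict[str, str]) -> list[str]:
    """Read `rna` codon by codon and stop at the first stop codon.

    Codons missing from the (partial) table are reported as "?".
    """
    peptide = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = table.get(rna[i:i + 3], "?")
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide


message = "UGGUGAAGAUGG"  # codons: UGG UGA AGA UGG
print("nuclear:      ", translate(message, NUCLEAR))
print("mitochondrial:", translate(message, MITO))
```

Read with the nuclear table, translation ends at UGA after the first tryptophan. Read with the mitochondrial table, UGA adds a second tryptophan and translation ends at AGA instead.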
The transcription machinery is the same, but the language is somewhat different. The link to the mitochondrial code, also called mitocode, is provided at the bottom of this slide. The mitochondrial genome consists of 16,569 base pairs. It has a double-stranded circular shape, at least in our understanding and interpretation, although its natural 3D structure is probably not as well defined as a real circle. The external ring is called the heavy strand and the internal one is called the light strand. The mitochondrial genome is characterized by a high mutation rate, which means high clinical variability. However, recombination does not occur between strands. Maternal inheritance in humans is provided almost exclusively via mitochondrial DNA, and mutations in the mitochondrial genome are responsible for a number of hereditary diseases, primarily related to neuronal functions. MELAS is an acronym for mitochondrial encephalomyopathy, lactic acidosis, and stroke-like episodes, and it occurs as a result of pathological mutations in the tRNA responsible for leucine transport. As you can see from this image, there are many mutations in just one tRNA, and all of them may be responsible for a particular pathology; more than one mutation may be responsible for the same pathology, or the same mutation may be responsible for more than one pathology. The most reasonable explanation is that more than one mutation needs to be present for a pathology to be manifested completely, or in full phenotypic presentation. The known transfer RNA and ribosomal RNA mutations are collected within the MITOMAP database, which is also part of the GenBank collection but is maintained by a separate group of people. The database also contains a wiki section with research-related information posted by individuals. The majority of data submitted to MITOMAP is hand-curated, meaning that each sequence is individually evaluated by scientists for completeness and accuracy.
Such a thorough process creates a backlog of sequences that have been submitted but not yet curated. Mitochondrial DNA also helps us to follow population migration over time and the time frame during which this migration happened. These maps define the proximities between different ethnic groups from the evolutionary point of view. An animated version of this map is available on the Bradshaw Foundation website, the link to which is provided on this slide. Several clarification points on maternal inheritance. As mentioned before, mitochondrial DNA comes almost exclusively from the mother. Hence, mutations in mitochondrial DNA are transmitted through the maternal line, but can affect both male and female offspring. One of the most common confusions about maternal inheritance comes from the existence of the maternal effect on genes and of mutations in genomic DNA transmitted from the mother. In addition to maternal transmission, the genomic imprinting phenomenon, discovered in the early 1980s, states that the same gene may function differently depending on whether it came from the mother or from the father. Genomic imprinting is quite rare in humans and other mammals, with less than 1% of genes known to be imprinted. In a simple explanation, it means that the imprinted gene is silenced, or inactive, in the genome, and this is an epigenetic mechanism due to methylation or other types of modifications. For example, if the paternal gene is imprinted, it means that only the maternal gene is functional, and vice versa. Mammalian embryos require both maternal and paternal DNA for normal development. However, in some species, the offspring may be generated from maternal or paternal lines alone, albeit not from the same individual. In cases when the source of the genome is maternal DNA from two cells of different origin, that is, two eggs, the process is called parthenogenesis or gynogenesis.
When the sources are two paternal genomes from two sperm cells, the process is called androgenesis. Neither process occurs naturally in mammals, but a number of experiments have been done to try them. In the rare instances when they develop to post-implantation stages, gynogenetic embryos show better embryonic development relative to placental development, while for androgenetic embryos the reverse is true. So far, there has been only one case of experimental parthenogenesis reported in mice. The mouse was born in 2004 from two parents of the same sex and got the name Kaguya after a Japanese folk tale, in which the moon-born princess Kaguya is found as a baby inside a bamboo stalk. It took 460 attempts to grow the parthenogenetic mice to the birth stage. This process should not be confused with cloning, where the new offspring are produced from replicated genomes containing both male and female genes. So now that we know, or can map, the human genes, what would be the practical application of this knowledge? Well, the most important application of the discovery is in the primary diagnosis of diseases, including the differential diagnosis of stages and types of diseases. Another important application is the identification of biomarkers that could help us in the early diagnosis of diseases, particularly the preventable types. While we still cannot affect many diseases at the genomic level, and the majority of our genomic discoveries are still unactionable, considering the reciprocal influence of our environment on gene expression patterns, we could still modify our lifestyles, change our diet, and make other attempts to correct or alleviate the dysfunction. For a while, the pharmaceutical industry almost ran out of new chemical entities in the pipeline, which previously were developed based solely on their physicochemical properties.
The human genome discovery boosted the drug discovery process, and we have a large number of new prospective treatments coming from well-established big pharma and also from new startups that are acquired by big pharma as soon as their products show some promise. Personalized medicine, also called precision therapy, is in its infancy, with only a handful of drugs approved by the FDA or other authorities around the world for use in humans. However, even with such a small number of molecules, we can already see significant progress in the treatment of some diseases, the best example being the extended life expectancy in patients with cancer due to targeted therapy against identified receptors and signaling molecules. Now that we know more about the human genome, additional target genes have been added to population screening tests, including those for newborns. We have also started to better understand the evolution process and heredity. Last but not least, we have some examples of unexpected use of human genome knowledge, although I'm not sure if such applications have any scientific grounds. The longevity of human lives increased significantly in the last century, to over 80 years in women and a little less in men in developed countries. You may remember that aging was one of the hottest topics in research during the last decades. It is still a very promising field of study, but we didn't have the knowledge and the technology that we have now to be able to decipher the real causes of aging. The maximum lifespan of 160 years was calculated based on the cell division processes, the rate of replication, and the viability of cells. We are halfway to this threshold, but we are starting to learn about the factors that could extend human lives thanks to our knowledge of the genome.
If you remember the aging theories prior to the human genome, the main hypothesis was that the length of telomeres defines the number of passages a cell can survive, after which the cell may die or become metabolically inactive. Well, the paper shown on this slide reported that the length of telomeres may not be as important as we believed, and that mutations may also be beneficial in some circumstances. Based on deep whole-genome sequencing, the group of authors estimated that approximately 450 somatic mutations accumulated in the non-repetitive genome within the healthy blood compartment of a 115-year-old woman. The detected mutations appear to have been harmless passenger mutations. They were enriched in non-coding, AT-rich regions that are not evolutionarily conserved. These regions were depleted of actively transcribed genes. The distribution of variant allele frequencies of these mutations suggested that the majority of the peripheral white blood cells were offspring of two related hematopoietic stem cell clones. Moreover, the telomere lengths of the white blood cells were significantly shorter than telomere lengths from other tissues. Together, this suggests that the finite lifespan of hematopoietic stem cells, rather than the effects of somatic mutations, may lead to hematopoietic clonal evolution at extreme ages. Rare diseases. As the name implies, rare diseases are those with low frequency or prevalence in the total population. The official definition of rare diseases, also sometimes called orphan diseases, differs from country to country or between geographical regions. In the US, the definition considers a disease rare if fewer than 200,000 people are affected by it. That is approximately 1 in 1,500 people, depending on the total population. In Europe, the threshold is 1 in 2,000 people, and in Japan, it is 1 in 2,500. While the terms rare and orphan diseases in many cases are used as synonyms, legally, these are two distinct categories in the U.S.
and EU. In the U.S., in addition to the actual rare diseases, orphan drug status may be given to a treatment option that, based on the ratio of cost versus number of patients, has no chance of recovering the research and development investments. In the EU, orphan diseases also include the neglected diseases, in addition to those that are rare by definition. Both the US and the EU give financial incentives to companies that develop treatments for such diseases. Out of over 7,000 officially recognized rare diseases, only about 5% have at least one form of treatment, and 80% of such diseases have genetic factors involved in their pathogenesis. One of the reasons that half of the rare diseases are diagnosed in children is that many of them are lethal at the embryonic stage, and probably only some combinations of genomic profiles result in the survival of the small number of individuals that get to be diagnosed with rare diseases. In fact, about one-third of children with rare diseases die before the age of five. The mutability rate of genes, as we already know, also differs, and some rarely mutated genes may have a larger phenotypic expression in the form of rare diseases. Whatever the reasons are, rare diseases have received increased attention in the past decade, predominantly because we are getting more options for the diagnosis and treatment of such diseases. Overall, the definition of health is somewhat conventional. While the majority of the population is considered healthy, at some point we realize that diseases, except for some trauma and infections, do not occur overnight. In many cases, chronic diseases also don't start at the moment when they are first diagnosed or when the treatment fails. So the category of polygenic diseases, which we are starting to understand better with the progress in genomics, may explain a number of conditions that develop at some stage in our lives. For example, atherosclerosis is considered a disease associated with aging.
Aside from the phenotypic manifestations, such as evident changes in circulation that affect our movements or other activities, the process of atherosclerosis, or deposition of fatty molecules in our vessels, is believed to start at embryonic stages and to progress faster or slower during our active lives. We still don't know the exact causes of the atherosclerotic process, but considering that it can be found at early stages of our lives, it is probably a polygenic disease with exacerbation caused by our lifestyles. We define a polygenic disease as a pathology caused by genomic changes in more than one gene or locus. From a scientific point of view, the more causative factors a process has, the more difficult it is to identify these factors. The table shows a number of polygenic diseases with at least two genomic factors associated with the disease. However, with some small exceptions, I think that these are not the only factors responsible for the identified conditions. As you can see from this table, the focus of the drug discovery process has shifted from macro-organic targets to molecular factors. The majority of these targets were identified due to the progress in genomics. In addition to the old traditional symptomatic and macromolecular treatment, we are already at the stage when a drug is developed using knowledge of the molecular pathway rather than merely by screening compounds in affected cells or animals. A number of synthetic molecular drugs are at early stages of development, and nanomolecular platforms are used more frequently as targeted delivery vehicles for active compounds. The vast majority of SNPs and their associations with diseases are identified in genome-wide association studies, or GWAS. Such studies enroll tens of thousands of people, and their sequences are screened for common SNPs, which are then associated with known phenotypes, hence the name of the method.
In addition to diseases and exterior phenotypic features such as height and the color of eyes and hair, GWAS helps us identify the SNPs for specific traits, such as modified responses to drugs, by interrogating the sequences of known cytochromes. These results are probably the most useful findings from GWAS at this time, as they allow for the selection of drugs that would be effective in a person or would not lead to an adverse event. Basically, it is a form of personalized medicine. Among the most prominent examples is the selection of chemotherapy cocktails based on the genomic profiles or mutations in tumors and on individual SNPs. The number of publications on GWAS has increased significantly in the past decade, with over 1,300 studies reported in 2012 alone. Multiplying the number of studies by the number of people participating in them, we would get a minimum of 10 million people enrolled in 2012. In reality, some of these data are collected from the same subjects but with different approaches, and each may be considered a separate study and reported as such. The map of SNPs is quite confusing, but you can imagine that interrogating tens of thousands of people for thousands of SNPs would generate quite a complex picture. The GWAS map is interactive, and you can find many details on a particular SNP and its association with traits or diseases. So how do we read a GWAS report? This graphic shows a magnification of the chromosome 19 loci around the LDLR, or low-density lipoprotein receptor, gene. Each dot on the graphic is a SNP. The y-axis represents the negative log 10 of the p-value. We know that the lower the p-value, the higher the confidence that the event did not occur by chance. With the negative log 10, it is the other way around: the higher the numbers, the better the results, or the more confidence we may have in them.
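The relationship between a p-value and its position on the y-axis can be checked with a few lines of Python. This is only a sketch of the transformation itself; the SNP identifiers and p-values below are invented for illustration and do not come from any study.

```python
import math


def neg_log10(p_value: float) -> float:
    """Convert a p-value into the -log10 scale used on the y-axis."""
    return -math.log10(p_value)


# Hypothetical SNPs with made-up p-values (not real study results).
example_snps = {
    "snp_a": 0.05,
    "snp_b": 1e-4,
    "snp_c": 5e-8,
}

# Sort so that the strongest association (smallest p-value) comes first.
ranked = sorted(example_snps, key=lambda s: neg_log10(example_snps[s]), reverse=True)

for snp in ranked:
    p = example_snps[snp]
    print(f"{snp}: p = {p:g}, -log10(p) = {neg_log10(p):.2f}")
```

A p-value of 0.0001 lands at 4 on the y-axis, and every further tenfold decrease in the p-value moves the dot up by one unit, which is why the tallest dots on such a plot are the strongest associations.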
In addition to the p-value, which we can estimate based on the y-axis position of the dots, the colors in the legend in the upper left corner show the level of correlation with the trait or disease. Red is for a correlation of 0.8 or more, the highest value being 1, meaning that almost all cases with such a SNP have the trait or disease. Simplistically, the higher the correlation, the more likely it is that the association of the SNP with the trait or disease is real. This is a combined view of chromosomes grouped by conditions. The spikes on the image show the SNPs that are both correlated and have a significant p-value for association with a trait or disease. Some of these spikes hit areas within a gene, and we could regard the found association as being causative of some pathologies based on the known molecular pathways. However, some SNPs hit intergenic regions that are quite far from any genes or their regulatory elements. For example, several SNPs at the 9p21 locus have been associated with an increased incidence of coronary artery disease. Based on these findings, specific genotyping tests could be developed to predict the odds of developing a disease. In addition, the evaluation of a genotyping test will also depend on whether the SNP is located on one or both alleles, that is, on a hetero- or homozygous pattern. As with the majority of bioinformatics tools, their development and maintenance are based on the enthusiasm of scientists. SNPs make up a special situation in which the confidence of the conclusion is directly related to the number of people enrolled in the GWAS. The more people participate in a study, the higher the confidence when a correlation has been found, provided the statistical evaluation has been properly designed. This table shows several links to the sites that assess the reliability of SNPs.
There are different methods of p-value calculation, and while it is primarily a theoretical approach, many of the presented strategies are applied in GWAS for the interpretation of results based on the p-value obtained in a study. Usually, the standard statistical methods deal with many samples and just a couple of conditions. In GWAS, we have many samples and many conditions, because the number of SNPs is quite large. As a result, different formulas are used to calculate the statistical confidence in reported findings. Additional resources include databases maintained by US and international institutions that collect data on the relationships between genotypes and phenotypes. The GEN2PHEN portal compares genotype-phenotype relationships in humans and model organisms and uses a holistic view of these relationships. It is financed by the EU, and the database is linked to the Ensembl genome browser. The Genetic Association Database was developed and maintained in collaboration between the NIH and the CDC, and last year the database was retired. The data is still available for download. The Human Genome Navigator, or HuGE, is a database maintained by the CDC, and it is also called a navigator for human genome epidemiology. The main goal of HuGE is to monitor and estimate the impact of genomic variation and genomic intervention on epidemiological population data and population health. The database has been curated automatically starting this year, meaning that no people are involved in the process. Instead, the developed software mines the data from the literature and submitted samples, and the curation is based on algorithms developed for accuracy of data. The human gene co-expression database allows you to identify the genes that are correlated in expression with your gene of interest. According to the website, it contains 8.9 million correlations between 4,238 genes that are expressed in immortalized B cells derived from 295 unrelated individuals.
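The two numbers quoted for the co-expression database fit together: 4,238 genes form just under 9 million distinct gene pairs, so the database appears to hold one correlation per pair. The Python sketch below checks that arithmetic and shows what a single co-expression value could look like, assuming a plain Pearson correlation across individuals. The gene names and expression values are invented, and the database's actual method is not stated in the lecture.

```python
import math
from itertools import combinations


def n_pairs(n_genes: int) -> int:
    """Number of distinct unordered gene pairs among n_genes genes."""
    return n_genes * (n_genes - 1) // 2


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


print(n_pairs(4238))  # gene pairs for the 4,238 genes mentioned above

# Hypothetical expression levels of three genes in four individuals.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],   # rises together with GENE_A
    "GENE_C": [8.0, 6.0, 4.0, 2.0],   # falls as GENE_A rises
}

# One correlation value per unordered pair of genes.
correlations = {
    (g1, g2): pearson(expression[g1], expression[g2])
    for g1, g2 in combinations(expression, 2)
}
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

For the gene of interest, one would then simply list the partner genes with the highest (or most negative) correlation values.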
The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings. The HapMap portal helps researchers find genes that affect health, disease, and individual responses to medications and environmental factors. It is a collaboration among scientists and funding agencies from Japan, the United Kingdom, Canada, China, Nigeria, and the United States. The PhenX Toolkit is a collection of tools developed and maintained by NHGRI and NIDA; the tools can be used to mine the submitted samples by phenotypes such as age, gender, and ethnicity, and the toolkit can be customized to an individual's needs and interests. The Disease Phenotype Web Portal is a tool for the comparison of microarray gene expression data against the phenotype. Since its inception in 2011, the GW Genomic Medicine Division within the Department of Medicine has performed a number of genomic studies as part of its main goal of developing genomic biomarkers of diseases. This list of projects is partial, and except for two of them, all have been closed or completed with data reported in peer-reviewed journals. This year, the division patented two sets of biomarkers for the diagnosis of coronary artery disease and appendicitis, with the intention of having blood-based tests available for the diagnosis of these diseases. The aspirin resistance project was among the first studies in which we used NGS as the main platform for diagnosis. As you can see in this table, the fold changes are quite indicative of the sensitivity to acetylsalicylic acid, ASA, or aspirin. ASA is the most widely used drug in the world. Since aspirin resistance is believed to be present in up to 10% of the population, it is crucial to note the level of response to ASA when using it for the prevention of cardiac diseases or other coagulation disorders.
Adriamycin is one of the components of standard cancer therapy, and we have looked at its cardiotoxic effect in women being treated for breast cancer. The study was done using microarrays, and we have identified several candidate genes that could be responsible for such an adverse event. This project had a number of students involved in the experimental part as researchers, and several presentations and publications resulted from it. The CAD-DX study is mentioned as one that generated a patent on TRAX, or transcripts associated with coronary artery disease. Ideally, TRAX could be used during a routine annual checkup as an additional blood test in people over 40-50 years old to evaluate the level of coronary artery disease or the potential for developing an advanced degree of vessel narrowing. It could also help cardiologists manage patients with suspected CAD in the differential diagnosis of causative factors. The LANG-DS project has been mentioned in our NGS lecture, and again it offers a different level of accuracy of results and diversity of findings compared to existing microbiological standards. Among the scientifically unsound or scientifically doubtful projects that are based on the human genome, one is DNA Song, where the site creates music based on your submitted sequence, which to me sounds like a variation of classical music. It's nice, it's interesting, but I don't think that it is in any way related to the particular sequences of the genes. And the diversity in sounds is probably not as high as the diversity in genomes. Another is Prosapia Genetics, the geographic population structure time machine, which, based on your genes, shows where you come from and the progress during the migration steps. This concludes the Lecture 9 presentation. If you have any questions, please post them on the discussion board or send me an email. Thank you.