Transcript for:
Current Topics in Genome Analysis: Final Lecture Summary

Female Speaker 1 So, good morning, everyone. Thank you all for coming out this morning to what is the final lecture in our series, Current Topics in Genome Analysis, which is organized by Dr. Andy Baxevanis and Dr. Tyra Wolfsberg. So, it's a tremendous honor for us to have as our final speaker today Dr. Elaine Mardis, who comes to us from Washington University in St. Louis, where she is currently the Robert E. and Louise F. Dunn Distinguished Professor of Medicine, Co-Director of the Genome Institute, and Director of Technology Development, also at the Genome Institute.

So, Dr. Mardis earned her bachelor's degree in zoology and her PhD in chemistry and biochemistry, both from the University of Oklahoma. In her early career, she worked as a senior research scientist at Bio-Rad Laboratories in California before joining WashU in 1993. Both the breadth and the depth of Dr. Mardis's accomplishments are truly outstanding, and she should really serve as an inspiration to all of us. As Director of Technology Development at the Genome Institute at WashU, Dr. Mardis was a key player and a thought leader in creating sequencing methods and automation pipelines that really operationalized the Human Genome Project. She then went on to orchestrate the institute's efforts to explore next-generation and third-generation sequencing technologies and translate them into production sequencing.

Dr. Mardis, as many of you know, has been a major leader in several large federally funded research initiatives, including the Cancer Genome Atlas Project, the Human Microbiome Project, and the 1000 Genomes Project. She's also been a driving force in sequencing the genomes of the mouse, chicken, platypus, rhesus macaque, orangutan, and the zebra finch. In recognition of her seminal contributions to science, Dr. Mardis has received numerous research awards. In 2011, she was named a distinguished alumna of the University of Oklahoma College of Arts and Sciences. For her seminal contributions to cancer research, Dr. Mardis received the 2010 Scripps Translational Research Award.

And, just earlier this year, in Thomson Reuters' report on the world's most influential scientific minds, Dr. Mardis was named as one of the most highly cited, or what they refer to as the hottest, researchers in the world, which they describe as the thought leaders of today, individuals whose research is blazing new frontiers and shaping tomorrow's world. And, so, we're incredibly privileged to have Dr. Mardis here with us this morning. And please join me in giving her a very warm welcome.

Female Speaker 2 Thanks so much, Daphne, for the kind introduction, and thanks to everyone for coming. I'm honored to be here again and to be asked to speak about next-generation sequencing technologies, which I'll do for the majority of the time. And then I thought it would be of particular interest to just take a deep dive right at the end of the talk on how we're applying these technologies that I'll tell you about to the pursuit of cancer genomics translation. So really, I'm going to start with beginning to move away from discovery into making an impact on patients' lives today. So I have no relevant financial relationships with commercial interests to disclose, just to get rid of the perfunctory announcements here and move on to the hot topic of next-generation and third-generation sequencing.

So what I'll do today is sort of take you through the basics of next-gen, followed by the basics of third-gen sequencing, and then, as I said, we'll take a deep dive on cancer genomics. I really want to start at the common core of next-generation DNA sequencing instrumentation, because there are a lot of common aspects to how this all shakes out in the laboratory setting in real time, in terms of getting DNA prepared for sequencing and then generating the sequence data itself. So all of the next-generation platforms that I'll talk about today really require a fairly simple, compared to old-style Sanger sequencing, library construction event that occurs at the very beginning of this process.

And this is really characterized by just some simple molecular biology steps that involve amplification or ligation with custom linkers or adapters. So synthetic DNAs that correspond to the platform on which the sequencing will take place are the first step in making a true library for next-generation sequencing. And I'll show the details of this in subsequent slides to really illustrate the point. The second step that follows library construction is really a true amplification process that occurs on a solid surface.

This surface can be either a bead or a flat silicon-derived surface. But in every case, regardless of the type of surface, the fundamental principle is that those same synthetic DNAs that went into constructing the library are also covalently linked to the surface on which the amplification will take place. So one question you might be asking yourself is, why do we need amplification on the surface?

And the easy answer is, we need plenty of signal to get an accurate read on the DNA sequencing to follow. And amplification is a quick and easy way to represent the same sequence multiple times so that as the sequencing takes place on each individual DNA in that library, you generate plenty of signal and you get an accurate DNA sequence that comes out the other end of the sequencing process, if you will. So, that's the reason for amplification.

And we'll talk more about that as well in subsequent slides. The next step is really the one that differentiates this from old-style Sanger sequencing, where you truly had a decoupling: first, the library was constructed and then sequenced, and the sequencing reactions were then subsequently separated and analyzed.

So, there were two distinct steps in Sanger sequencing: the sequencing itself, the molecular biology, followed by the readout of the data. In next-gen sequencing, everything happens in a step-by-step manner, so it's a truly integrated data production and data detection that all happens in a stepwise fashion. And I'll describe this for the different sequencing platforms, but they basically all use the same premise, which is where the nucleotide base incorporation on each amplified library fragment is determined, and then subsequent steps occur to determine each sequential base as you go along. And really, this differentiating factor, that you sequence and detect in lockstep on a next-gen sequencing platform, is conducive to the other description for next-generation sequencing, which is massively parallel.

What this means is that since you're coupling together, or integrating, the data production and the data detection, you can actually do this times hundreds of thousands to hundreds of millions of reactions all at the same time, which basically reduces down to the x,y coordinate that represents that particular library fragment, times all of the library fragments that are being sequenced together. So this is why people commonly refer to it this way, and I actually prefer the term massively parallel sequencing, because it really is an accurate reflection of what's going on inside the sequencer.

One other aspect of next-gen sequencing comes into play often when we want to quantitate DNA; for example, copy number can be very accurately quantitated by sequencing DNA, especially from whole genome data. RNA-seq, for example, where the RNA is first converted to a cDNA, then turned into a library, and all these steps follow, can also be very accurately quantitated with respect to individual genes and the level to which they're expressed, mainly because this is a truly digital read type.

The digital nature of the data comes from the fact that each amplified fragment originally was one fragment in that library. And so you have this one-to-one correspondence. So if I have two copies of a particular part of a chromosome, let's say the HER2 locus, I get the equivalent read depth when I sequence that genome for diploidy. But if, in the case of a HER2-amplified breast cancer, I have three, four, or five copies, I actually have a digital equivalent of that depth at the HER2 locus, and I can quantitate the extent to which the copy number is amplified. So, just an example of that.
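
To make that read-depth arithmetic concrete, here is a minimal sketch in Python; the 40x diploid baseline and the window depths are made-up illustrative numbers, not values from the talk.

```python
# Illustrative sketch (not from the lecture): estimating copy number from
# read depth, using made-up depth values for a hypothetical HER2 window.

def copy_number_estimate(observed_depth, diploid_baseline_depth, ploidy=2):
    """Scale observed depth by the genome-wide depth expected for two copies."""
    return ploidy * observed_depth / diploid_baseline_depth

# Assume whole genome sequencing gave ~40x average depth where the genome
# is present in two copies (the diploid baseline).
baseline = 40.0

# Hypothetical mean depths over a window spanning the HER2 locus.
normal_sample_depth = 41.0     # about 2 copies expected
amplified_tumor_depth = 102.0  # an amplified locus reads proportionally deeper

print(copy_number_estimate(normal_sample_depth, baseline))    # ~2.05 copies
print(copy_number_estimate(amplified_tumor_depth, baseline))  # ~5.1 copies
```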

Now, the last part I'll get into as we talk about data analysis, and that's really the fact that most next-gen sequencing devices provide much shorter read lengths than those Sanger capillary reads of old, which were in the neighborhood of 600 to 800 bases for Sanger. Whereas, most next-gen boxes deliver somewhere in the neighborhood of 100 to 400 base pairs of data. So, this really now confounds analysis by quite a bit. The data sets are quite large, and we're in a position of needing to map those back to a reference genome for the purposes of interpretation, rather than assembling them, which is what we used to do with Sanger data.

But more on that in a minute. So, let's look now at the detailed library construction steps. They're sort of shown up here, and I've already walked you through them. But what I want to point out in this particular slide is that in many places I've denoted that there are PCR steps that are occurring at each one of these particular steps in the process.

So we start with high molecular weight genomic DNA that gets sheared by sound waves down to, let's say, 200 to 500 base pair pieces. The ends are polished using some molecular biology, and this is the first step that I just talked about, where you're ligating on these synthetic DNA adapters and then amplifying the library fragments just a bit with a few PCR cycles.

There's also a size fractionation that takes place so that we can get very precise size fractions. This is often important in whole genome studies where we also want to predict structural variation, like translocations or other inversions of chromosomes, where you really need very precise size fractions. And this also requires some amplification just to bring up the amount of DNA after you've selected the fraction or fractions of interest. We quantitate these libraries very precisely before they go on for the amplification process, and that's mainly to get the right amount of sequence data coming out of the amplification process itself.

Now amplification is sort of simply shown here. I'll have a more detailed figure in a minute. But this is also a form of PCR.

So the amplification is enzymatic and is the source of some biases, as I'll describe on the next slide. This then shows, at the last bit, the sequencing approach, where in this particular example, which is akin to what's done on the Illumina sequencer, you release, by chemical breakage of a covalent bond, and denature away the companion strands. You have single strands that then get primed, first with one sequencing primer.

You can see it here, and it's sequencing down towards the surface of the chip. And then in a second step, you can regenerate these clusters by just another amplification process, release the other end through a different chemistry, and prime with the second primer to sequence the other end.

And so this is now delivering what we typically refer to as paired-end reads. Namely, you've got a fragment of about 500 base pairs, you're generating about 100 or so bases from each end, and now you can move forward to accurately placing that onto the genome reference of the organism you're sequencing.

So, just to follow on a little bit: you may be wondering why I noted all those PCR steps. I want to talk about that just for a minute. So while PCR, polymerase chain reaction, if you're not familiar with the acronym, is an effective vehicle for amplifying DNA, there are lots of problems that creep in, also listed here, that are a consequence of using PCR enzymatic amplification. For example, often you get preferential amplification, or what's often referred to as jackpotting.

This means some of the library fragments preferentially amplify through the PCR, and they'll be overrepresented when you do the alignment back to the genome. This turns out to be fairly easy to find, if you will, using a bioinformatic filter that goes on after the alignment occurs.

And these are typically referred to as duplicate reads, because they really and truly are. They have the exact same start and stop alignments, and there are algorithms that can go through and effectively deduplicate, or remove, all but one representative of that sequence from the library of sequenced fragments. And this is particularly a challenge, as we'll talk about at the end of the talk, for low-input DNA, which is common in a clinical setting because you have very, very little tissue from which to derive DNA and then to do your sequencing.

And what that means is that you have a lack of complexity in the DNA molecules that are represented, just because there are very few, and this absolutely favors this jackpotting event. So when we sequence DNA from clinical samples, we're often very concerned about duplicate reads, and we try to minimize PCR as much as possible.
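
As a rough illustration of the deduplication idea, here is a minimal sketch that keeps one representative read per identical alignment start and stop; real duplicate-marking tools apply the same basic criterion in a more sophisticated way. The coordinates and read names are invented.

```python
# Minimal sketch (illustrative, not a production tool): collapse PCR
# duplicates by keeping one read per identical alignment start/end pair.

def deduplicate(reads):
    """reads: iterable of (chrom, start, end, name) alignment tuples."""
    seen = set()
    kept = []
    for chrom, start, end, name in reads:
        key = (chrom, start, end)    # identical start and stop alignments
        if key not in seen:          # the first occurrence is the representative
            seen.add(key)
            kept.append((chrom, start, end, name))
    return kept

aligned = [
    ("chr17", 41_196_312, 41_196_412, "read1"),
    ("chr17", 41_196_312, 41_196_412, "read2"),  # PCR duplicate of read1
    ("chr17", 41_196_350, 41_196_450, "read3"),
]
print(len(deduplicate(aligned)))  # 2 reads survive; the duplicate is dropped
```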

Another problem with PCR is that you can get false positive artifacts. If these happen early in those PCR cycles that I showed in the library construction phase, that can be a problem, because once an artifact is in the population of fragments, it amplifies over and over again, and it begins to look real as opposed to being a false positive, which is what it is. If it occurs in later cycles, then typically it's drowned out by the other normal, correctly copied fragments, and it's not a problem.

And then cluster formation, as I mentioned already, is a type of PCR. This is often referred to as bridge amplification, but it does introduce biases in that it amplifies high and low G plus C content fragments more poorly than fragments with a uniform distribution of A, C, G, and T. Reduced coverage at these loci is often the result.

This was a problem that we pointed out in one of the first whole genome sequencing papers where we resequenced the C. elegans genome way back when. I think it was published in early 2008. And it's something that's improved a bit over time, but it's still a problem in terms of the bias on the Illumina sequencer.

One other brief word here about a subgenome approach. So early on in next-gen sequencing, all we could do were really sort of two things. One was to amplify a bunch of PCR products, combine them together, and sequence them. Or you could go to the other extreme, which would be whole genome sequencing.

So we really needed a way to partition the genome so that we could sequence less and really focus on the parts of the genome, the human genome in particular, that we understood the best.

That would be the exome. And if you're not familiar with that term, you could define the exome as the exons of every known coding gene in the genome. So this is about 1.5 percent of the human genome's three billion base pairs.

And around about the end of 2008, early 2009, several methods were introduced that showed us the way to pull out the exome selectively from whole genome libraries, and that's through a process described here called hybrid capture. So hybrid capture is a very straightforward thing to do. You essentially design synthetic probes that correspond to all of the exons that you're interested in. It can be the whole exome, or it can be a subset of the exome, for example, all kinases.

That's commonly referred to as targeted capture. And in hybrid capture, what you do when you design the probes is that these probes actually have biotin moieties on them. These are the little blue dots that are shown in this figure over here to your right. By combining the probes that are biotinylated with an already prepared whole genome library, and then hybridizing under specific conditions over about a three-day time period, you can get hybrids to form between the whole genome library fragments and the specific probes when those whole genome library fragments contain part of that exon that you're interested in capturing. A subsequent step then takes advantage of the biotinylated probes by combining with streptavidin magnetic beads.

You apply a magnetic force, illustrated by the horseshoe magnet here. It doesn't look like that in real life. And you pull down these hybrid fragments selectively, thereby washing away the remainder of the genome that you don't care to sequence. And then these can be released quite readily, just by a denaturation reaction, off of the magnetic beads and the probes, which are still attached to those beads by virtue of the strength of the streptavidin-biotin bond.

So, you can go directly to sequencing at this point, because these were originally, as you can see by the little red tips here on the DNA fragments, already a sequence-ready library. It's just now been reduced in complexity by that hybrid capture phenomenon. And we often use both exome sequencing as well as custom capture reagents, these targeted capture approaches, in our work to subset the human genome and other genomes that we're interested in.

So, one question you might be asking is, is there a lower limit? How low can you go below the exome and still get a reasonable yield, because you're really now beginning to reduce down below that 1.5 percent? And the answer is, there is a price to pay the lower you go.

That price is typically referred to as off-target effects, which means that you start to get more and more of the sequences that you're not interested in, because you're trying to subset the genome down really, really low. And so, the lower threshold for targeted capture is probably somewhere in the neighborhood of about 200 kb, below which you're really going to pay a price in terms of off-target effects, therefore spending a lot of your sequencing dollars on parts of the genome that you don't care about, just by virtue of spurious hybridization. So, one way to get around this that's been used for really small gene sets is multiplex PCR. So, this is, again, just sort of getting a bunch of primers to amplify out the regions of the genome you care most about that also behave well together, if you will, in terms of similar Tms for hybridization. Commonly, you will subset multiplex PCR primer sets according to the GC content of the regions that you're after, so that all the high-GC regions get amplified under specific conditions together, et cetera.

So there is a little bit of optimization that is required for this type of an approach, but there are now commercial multiplex PCR sets out there that can help you not have to go through that pain and suffering and just get straight to generating data. And this is just the idea behind multiplex PCR. You choose your genes of interest.

They go into a tube with a small amount of DNA. So clinically, this is a very attractive approach because you can use about five nanograms or even less of DNA. And you can amplify out the regions that you want, create the library, as I've said earlier, by a specific ligation, and off you go to sequencing.
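
To sketch the pooling idea just mentioned, here is a toy example that groups multiplex PCR primers by GC content, a rough proxy for similar annealing behavior; the primer names, sequences, and GC thresholds are all invented for illustration.

```python
# Toy sketch (invented primers and thresholds): pool multiplex PCR primers by
# GC content so that amplicons needing similar conditions are cycled together.

def gc_fraction(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def pool_by_gc(primers, low=0.40, high=0.60):
    """Split primers into low-, mid-, and high-GC pools (assumed cutoffs)."""
    pools = {"low_gc": [], "mid_gc": [], "high_gc": []}
    for name, seq in primers.items():
        gc = gc_fraction(seq)
        if gc < low:
            pools["low_gc"].append(name)
        elif gc > high:
            pools["high_gc"].append(name)
        else:
            pools["mid_gc"].append(name)
    return pools

primers = {  # hypothetical primer names with made-up sequences
    "geneA_F": "ATGCCGTATTAGCCATTGAA",
    "geneB_F": "GGCCGCGCTCCGGATCCGCG",
    "geneC_F": "AATTCAGTTGAGCAATTCAT",
}
print(pool_by_gc(primers))  # each primer lands in the pool matching its GC content
```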

So let's talk a little bit now about the specifics of massively parallel sequencing. I'll first talk about Illumina, and then follow up with the second platform, the Ion Torrent. So this is now a more illustrative diagram, if you will.

It's straight off of the Illumina website and shows how the cluster amplification process occurs. And I won't dwell on it other than to say that this is how you could imagine the cluster sort of looks after the amplification cycles are all done and before that end is released up for hybridization to the sequencing primer. And during the sequencing process, a single cluster might look like this as it's being scanned by the optics of the instrument. But in reality, what you get is a very closely packed, almost star field of clusters.

And really the key to the increasing capacity on the Illumina sequencers has been two things. One, the ability to group those clusters tighter and tighter together, yet still get single-cluster resolution at the point of deciding which sequence is coming from which cluster. And secondly, if you think about the flow cell that you're putting these DNAs onto and amplifying, it's a three-dimensional flow cell.

So there's a top surface and a bottom surface, both of which are decorated by the oligos, and both of which can have clusters amplified on them. So one of the tricks in the Illumina sequencer is to actually scan, when you're doing the scanning cycle, which I'll talk about in a minute, the bottom surface of the flow cell, shift the focus of the optics just a little, and then scan the top surface of the flow cell. So you essentially double your capacity by doing that simple, probably not simple, but that simple refocusing of the optics itself.

So how does the sequencing actually work? It's really fairly straightforward. It's shown here, where these individual circles are the different labeled nucleotides that are supplied by the fluidics of the instrument into the flow cell.

So this is when the sequencing starts. You've got your sequencing primer shown here in purple, and the very first nucleotide that's been incorporated is this T against the A in the template. And it's really a series, as I said earlier, of stepwise events: incorporate the nucleotide, detect it with the optics of the sequencing instrument, and then go through a series of steps to essentially regenerate the 3-prime end.

So you can see here that the nucleotides that are supplied have this 3-prime chemical block in place, and that prevents a second nucleotide from getting incorporated after the first nucleotide is incorporated, until you prepare it by going through this deblocking step. The other step that's very necessary is cleavage of the fluor.

So you can see here the sun shape in purple is a fluor, and there's a cleavage site here that, in this subsequent step, will go through and remove the fluorescent groups. They get washed away by the fluidics. And, of course, the reason for that is you don't want fluors sort of hanging around from the previous incorporation step, because they'll interfere with the fluorescence wavelength that's being detected in the subsequent steps.

So it's really a series of steps where the nucleotide that's incorporated is excited by the optics of the instrument, the emission wavelength is recorded, and that's specific for A, C, G, or T. And on you go with the rest of this. Now, one thing you might be wondering about is, why is this sequence finite?

So why do I have to stop at 100 or 150 bases? And if you're wondering that, it's a really good question. And the answer is simple and complex.

The simple answer is, it's all about signal to noise. Well, where is the source of noise coming from? Because we've got all these hundreds of thousands of fragments. They're all reporting the fluor, right, that they incorporated.

It's not just this one. This is obviously oversimplified. So, we've got lots of signal.

Where is the noise coming from? Well, the noise comes from a phrase that you should always remember, which is chemistry is never 100%. So, let's talk about that for just a second.

Chemistry is never 100%. So, these nucleotides that get added in, right, should look like this, but a small proportion of them might not. So, where can things go wrong?

Well, one thing that can go wrong is that you don't actually have a blocking group here on the 3-prime end, because chemistry is never 100 percent. And so, in those cases, when that nucleotide gets incorporated into this fragment, another nucleotide can come right in, because the polymerase is very good at its job. Now, chances are that that nucleotide will have the blocking group, and so then things stop, but that strand is now out of phase with the rest of the strands in the cluster, and therefore, when the next incorporation cycle comes along, it's one ahead of everybody else.

And it's not just going to be that one strand, because chemistry is never 100 percent. So you can see that in all the clusters on the flow cell, there's a proportional probability of incorporating a nucleotide that's not properly blocked here at the 3-prime end. Or, another possibility is that because the chemistry of the cleavage won't always work, you either might not get the fluorescent group removed, so it'll continue to interfere by providing noise in subsequent cycles, or you might not actually get this 3-prime block removed. So that fragment now falls out of the running, if you will. It can't be extended any further.

It's not going to contribute signal anymore. It also won't contribute noise, to be clear. But these are some of the sources of noise that ultimately limit the point at which you're no longer getting sufficient signal to accurately represent the nucleotide that is properly incorporated.
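
To put a rough number on "chemistry is never 100 percent," here is a small back-of-the-envelope sketch with an assumed per-cycle efficiency, showing how the fraction of strands in a cluster that stay perfectly in phase shrinks as the cycles accumulate, which is what ultimately caps the usable read length.

```python
# Back-of-the-envelope sketch (assumed numbers): if a strand stays perfectly
# in phase with probability p each cycle, only p**n of the cluster is still
# in phase after n cycles, so the clean signal shrinks as the read grows.

def in_phase_fraction(per_cycle_efficiency, cycles):
    return per_cycle_efficiency ** cycles

p = 0.995  # assume 99.5% of strands block, extend, and deblock correctly per cycle
for n in (50, 100, 150, 250):
    print(f"cycle {n:3d}: {in_phase_fraction(p, n):.1%} of strands still in phase")

# The in-phase fraction falls from roughly three quarters at cycle 50 to under
# a third by cycle 250; the out-of-phase strands contribute noise, and at some
# point the base call can no longer be made reliably.
```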

So, just to finish with the Illumina platforms, here's just a figure from their website that sort of shows what's illustrated below by the remarks. And these are not my remarks, but rather I sort of took a poll across the cognoscenti, the sequencing technology folks, for their impressions of the Illumina platforms. And in general, we see that this is a platform with high accuracy. The predominant error type is substitution.

Typically, you're in the range of about 0.1 to 0.2 percent error rate on a per-read basis. So each read, the forward and reverse read. There is a range of capacity and throughput, as illustrated across the series of boxes up here. So the MiSeq is sort of the desktop sequencer. The HiSeq X is this sort of titan, $1,000-genome box that's been recently announced and is now starting to populate large-scale sequencing providers.

And there are longer read lengths available on some platforms like the MiSeq, which will do 2 by 300 base pair reads, but in general, most of the Illumina sequencers are still in the 100 to 150 base per-end read length range, for the reasons that we just talked about. And because of the challenges of data analysis, which has already been mentioned and which I'll talk about more in detail in a minute, these providers have improved their software pipeline, the downstream analytical capabilities, and are now offering some cloud computing options for users that don't have the desire to put together large compute farms to analyze the data.

Okay, so let's switch gears now to a different type of sequencing, which is the Ion Torrent platform, and it's illustrated by these figures shown here, which again are off of the company's website. So I'll encourage you, if you're interested in any of these technologies in particular, really to go to the company websites, because they have fancy animation and things that I can't do on slides or don't want to. So it's much more explanatory perhaps than even I can provide. This is a unique approach to sequencing because it's truly without labels. So this is using native nucleotides for sequencing and a very unique form of sequence detection, which is shown by the chemistry here, which illustrates that you're putting in a nucleotide using the polymerase, of course, relative to the template, so the C is now going in against this G.

One of the byproducts of nucleotide synthesis, sorry, of chain growth, rather, is the release of hydrogen ions. And it's a proportional release, so that if there were, for example, three Gs here in a row, I would generate proportionally more hydrogen ions, because the native nucleotide is going to get added in in triplicate, not just once. Okay?

And how do we know that that hydrogen is being produced? Well, we have sort of an old-fashioned device here on the silicon wafer part of the sequencing chip which is a pH meter. That's unique to each one of the wells in which this bead might be sitting that's going through the sequencing process. So this approach uses a bead-based amplification where the surface of the bead is covalently decorated with the same adapters or primers that we've used for the library.

So you can see here now amplified fragments that are primed and ready for sequencing. The series of steps is very similar to what I just described, with the exception that, because we're not using fluorescent labels, or labels of any kind here, the nucleotide flows are one at a time.

So it's A, let's say, followed by C, followed by G, followed by T. So this native nucleotide gets washed across the surface of the chip, which has many wells like this, most of which are occupied with a bead. The diffusion process brings in the nucleotide.

If that's the right nucleotide to add at that cycle, according to the template sequence, of course, it will get added in, hydrogen will be released, and the amount of hydrogen, or pH change, will be detected and turned into an electronic signal that registers with the software, which knows which nucleotide is being washed over. So, you go through a series of four steps for incorporation. The pH is monitored, of course, individually in each one of these wells and recorded according to the x-y coordinate for that particular well.

And then at the end of this, you get a readout that looks something like this, if you like to look at data, where you can see that, for example, this peak right here is quite a bit higher than the others, and so on and so forth. And I should point out that there is a way of registering the height of these peaks. There's a sequence at the beginning of the adapter that's single-nucleotide only, so you get the representative height for a single G, A, T, and C.

And the software cues off of that for quantitating the peaks thereafter from the sequence that you're trying to obtain. And so, this baselines everything for you. So let's look at the platforms for this approach.

There are two. The personal genome machine has been around the longest. We have one of these in our laboratory.

There are three different-size sequencing chips available, depending upon how much sequence data you want to generate. The runs are quite rapid, and the read lengths can be as high as 400 base pairs. These are not paired-end reads, so this is just a single priming event followed by extension and data collection up to a stopping point. And then the larger-throughput device is the Proton.

This is currently doing exomes, I think, and aiming for whole genomes. And there are preparatory modules that are associated with both of these instruments that take care of some of the initial amplification steps on that bead, which occur through a process that's not a bridge amplification like I showed for the Illumina, but rather requires encapsulating the bead, the library fragments that are going to be amplified on each bead, and PCR reagents, including enzymes, into single micelles in an oil emulsion, which then allows everything to be PCR cycled en masse, and that's where you get the amplification step that's required for the signal strength, as I talked about at the beginning. And so, just the characteristics of this platform: because you're supplying one nucleotide at a time, this has an inherently low substitution rate.

You don't detect something that's not there because only a single nucleotide is being added in. Insertion and deletion is really the key error type in this sequencing, and that's because there's a proportionality that exists only for a certain number of nucleotides. Typically, up to five to six nucleotides of the same sequence, five or six Gs in a row, can be accurately detected, and then above that, the proportionality is lost.

So you do end up getting insertion-deletion errors as a result around what are called homopolymer runs, those runs of the same nucleotide. I already talked about paired-end reads. This is relatively inexpensive sequencing, mainly because it's using native nucleotides. And the data production turnaround is relatively fast. And again, they're improving their computational workflows for data analysis of different types.
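
As a cartoon of that flow-space logic, here is a small sketch with invented signal values: the single-nucleotide key at the start of the read sets the one-base peak height, and each later flow is rounded to an integer homopolymer length, which is exactly where long homopolymers start to blur.

```python
# Cartoon sketch (invented signal values): turn Ion Torrent-style flow signals
# into bases by normalizing to the key's one-nucleotide peak heights and
# rounding each flow to an integer homopolymer length.

FLOW_ORDER = "TACG"  # assumed repeating nucleotide flow order

def call_bases(key_signals, read_signals):
    """key_signals: one-mer peak height per nucleotide, e.g. {'T': 1.02, ...}."""
    bases = []
    for i, signal in enumerate(read_signals):
        nuc = FLOW_ORDER[i % len(FLOW_ORDER)]
        n_incorporated = round(signal / key_signals[nuc])  # proportional signal
        bases.append(nuc * n_incorporated)                 # 0, 1, 2, ... copies
    return "".join(bases)

key = {"T": 1.02, "A": 0.98, "C": 1.01, "G": 0.99}
# Flows cycle T, A, C, G, T, A, ...; a value near 3 means a 3-base homopolymer.
signals = [1.05, 0.0, 2.10, 0.0, 0.0, 1.01, 0.0, 3.05]
print(call_bases(key, signals))  # -> "TCCAGGG"
```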

Okay, so let's talk a little bit about data analysis, because this is, as I mentioned in the beginning, one of the more challenging aspects. I'm not going to take a deep dive on this, but just sort of roughly reflect what the challenges are, especially when you're dealing with a genome that's as large as the human genome, which will be my exemplar. So, you know, the goal of doing science, including in genomics, is that if you could just have your sequencer and you could generate all of these data, then the next step would be to have this beautiful figure, part C, you know, for the publication that's going to the high-impact journal of your choice. Of course, it's not that easy. And sequence data alignment is really the crucial first step, which allows me to put in a plug for, you know, genome references, many of which we've generated in our own laboratory through NHGRI funding, because these are really critical pieces in the data analysis of next-gen sequencing.

So just to give a pictorial example, you know, if this is sort of the human genome, the cover on the box of the puzzle you're trying to put together so you can generate that beautiful figure for your paper, these are all the short-read data that you actually have to try and make sense out of. And, of course, the challenge here is that you can easily find the pieces with unique features. Those are probably many of the genes in the human genome, but figuring out where everything else goes is really the harder part of the equation.

And of course, the genome being about 48 percent repetitive, this turns out to be reasonably difficult. And part of the problem, of course, is that because there's so much repetition in the genome, you can get reads that look like they probably belong in multiple places, where the real challenge is mapping back accurately to where that read came from so you can properly assign any mutations that you might identify in that sequence. And one of the ways that we've gotten around this from a bioinformatics standpoint is that we have sort of quality scores that illustrate the quality, or certainty, of mapping that read to that particular spot in the genome.

So where you have a multiple map, as illustrated here, for a given sequence read, you can go with the highest quality score to sort of assure that you've gotten that read in the right place. The other aspect that can save us in terms of certainty of placement, of course, is paired-end reads, because oftentimes, while you'll have one read that sits in a repetitive sequence, the opposite read, or the companion read, may actually properly align in a unique sequence, and therefore you can give a higher certainty to the placement of that read using the paired-end read mapping approach.
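
Here is a simplified sketch of that placement logic, not the actual aligner: among a read's candidate placements, prefer the highest mapping quality, and when the read itself is ambiguous, let a uniquely mapped mate plus the expected insert size break the tie. The coordinates and scores are invented.

```python
# Simplified sketch (not a real aligner): pick a read's placement using
# mapping quality, and fall back on a uniquely mapped mate when the read
# lands in a repeat and maps equally well to several places.

def choose_placement(candidates, mate_position=None, expected_insert=500):
    """candidates: list of (chrom, pos, mapping_quality) for one read."""
    best = max(candidates, key=lambda c: c[2])
    ties = [c for c in candidates if c[2] == best[2]]
    if len(ties) == 1 or mate_position is None:
        return best
    # Ambiguous read: prefer the candidate whose distance from the uniquely
    # mapped mate is closest to the library's expected insert size.
    mate_chrom, mate_pos = mate_position
    same_chrom = [c for c in ties if c[0] == mate_chrom] or ties
    return min(same_chrom, key=lambda c: abs(abs(c[1] - mate_pos) - expected_insert))

candidates = [("chr1", 1_000_000, 3), ("chr1", 5_250_400, 3)]  # repeat: equal scores
mate = ("chr1", 5_250_900)  # mate aligned uniquely, ~500 bp from one copy
print(choose_placement(candidates, mate_position=mate))  # -> ('chr1', 5250400, 3)
```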

So once you have your reads aligned properly to the genome, what do you need to do to get a good, accurate sequence evaluation? Well, there are a series of steps here. I won't dwell on them for long, but first of all, you have to identify where your duplicates are. We talked about those as a result of PCR. We correct any local misalignments. This is particularly for identifying small insertion-deletion events, a few bases that are added or deleted.

Those are the hardest things to find. And then we recalculate the quality scores and call SNPs, single nucleotide polymorphisms, for the first pass. Why do we do this? Well, it allows us to do what we call evaluating coverage.

And coverage is the name of the game here. So if you don't have adequate coverage, you don't have enough oversampling of the genome, essentially, to prove to yourself that any variant you identify is actually correct.

So the more coverage, the better. But of course, more coverage costs money. So you have to find a balance between those two where you have high confidence, but you haven't sort of, you know, killed your budget.

And then there are various ways of evaluating coverage. For example, if we have SNPs that are called from a SNP array, where we took the same genomic DNA and applied it to a SNP array and called SNPs, we can actually do a cross-comparison. What are the SNP calls from next-gen sequencing?

What are the SNP calls in those same loci from the array? And to what percent are they concordant with one another? The higher the concordance, the better your coverage is, and the more certain you can be as you go on to downstream analytical steps with the notion that you've got the right coverage to be confident about anything that follows.
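
A minimal sketch of that concordance check, with toy genotype calls at made-up loci:

```python
# Toy sketch: concordance between sequencing genotype calls and SNP array
# calls at the same loci, as a quick proxy for adequate coverage.

def concordance(seq_calls, array_calls):
    """Both inputs: dict mapping locus -> genotype string, e.g. 'A/G'."""
    shared = [locus for locus in array_calls if locus in seq_calls]
    if not shared:
        return 0.0
    agree = sum(seq_calls[locus] == array_calls[locus] for locus in shared)
    return agree / len(shared)

seq_calls   = {"rs1": "A/G", "rs2": "C/C", "rs3": "T/T", "rs4": "G/G"}
array_calls = {"rs1": "A/G", "rs2": "C/C", "rs3": "T/C", "rs4": "G/G"}
print(f"{concordance(seq_calls, array_calls):.0%}")  # 75% concordant here;
# the higher this number, the more confidence the coverage supports.
```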

Another thing you can do is look at the data. So for people who have been sequencing as long as I have, there's a real comfort in that visual examination of the data. And even though next-gen is a fundamentally different data type than Sanger, which has these beautiful colored peaks, or, as old as I am, we used to go back to calling up the autoradiogram and slapping it back on the light box, we won't go there. There are tools now, like IGV, to actually look at your data and at the quality of the data and so on.

I'll illustrate these in a minute. And then, because we tend to do things in very large numbers at the Genome Institute, and other large-scale centers are the same, we also have bulk tools that will allow us, for a huge data set, to just sort of say, how did we do across the spectrum of coverage? So this is, for us, a program called RefCov that I'll show you. And then once this is all sort of said and done, you go off to analyze the data in a multitude of ways. I'm not going to spend time on that today because it's, of course, a topic for more than just a single lecture.

So here's IGV. This is a program that's available from the Broad Institute at this URL, and we use it a lot in examining sequence data by eye. So you can get, you know, sort of whole-chromosome views, zoom down to see more detail in the region you're interested in, and even, in this illustration, look at the single-nucleotide, single-read level to really see the depth of coverage that you have, the quality that's ascribed to that nucleotide, et cetera, where low-quality base calls are faint or semi-transparent, just to illustrate that they have lower confidence, whereas these Cs are very high-confidence calls, as you can see from the close-up. Here's another look now at one of the things that we do, which is compare the tumor to the normal from an individual, where you can see there's great evidence here that this is a mutation that's somatic in nature. So it's truly unique to the tumor DNA and not to the normal DNA for this individual.

And then this is just a look at RefCov. Here's the site for this on our website. This is now looking across a bulk number of samples, many of which, if you are looking carefully at the notations here, are from formalin-fixed, paraffin-embedded tumors. And this is showing sort of the percent coverage at different coverage levels according to the key, where what we want to see is everything green or better, at or above 80 percent. This is how much data you've generated, with a look at uniqueness versus duplicates. And then this is just how much you've actually enriched across the regions of the genome that you're actually interested in sequencing.

And so these bulk tools can give us a quick look at just the quality of a data set, again, before we move on to analysis. One of the things that we've spent a lot of time doing is really then putting together a somatic variant discovery pipeline. And I'm just using this as an example of how you can daisy chain together different analytical programs to take you from the original read set to ultimately what you care most about, which is the analyzed data.

So, just for the purposes of illustration: when we sequence a tumor, we always sequence the matched normal, as I illustrated earlier. So that's the input.

The alignment initially is to the human reference genome. We align all the tumor reads as a separate build, as we call it. All of the normal reads as a separate build.

And then the comparison begins. So we have a variety of algorithms, first for read alignment, then for discovering truly tumor-unique somatic point mutations, as well as indels, where these are now single nucleotide variants as opposed to single nucleotide polymorphisms, which would be in the germline or the constitutional DNA. We can detect structural variants, as I alluded to earlier, and often in cancer, translocations and inversions fuse genes together that are known to be drivers in oncogenesis. So we absolutely want to detect these from whole genome data. And then, as I alluded to earlier with my HER2 example, we can get very precise quantitation and boundaries on copy number alterations in the genome.

We do apply filters to these. They're sometimes very sophisticated statistical filters where we remove sources of known false positivity. For example, I detect a variant, but every read that's showing me that variant shows it at the end of the read, where, as we talked about earlier, the quality of the data gets poorer because of signal to noise. I can easily throw that out as being a false positive, because if they're all at the ends of the reads, they're likely not true positives, just based on experience and validation exercises that we've gone through.
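
Here is a minimal sketch of that kind of read-position filter, an illustration rather than the production statistical filter; the window size, cutoff fraction, and supporting-read positions are assumed values.

```python
# Illustrative sketch (not the production filter): flag a candidate variant
# as a likely false positive when the supporting reads only see it near the
# read ends, where signal to noise is worst.

def likely_end_artifact(support_positions, read_length=100, end_window=10,
                        max_end_fraction=0.9):
    """support_positions: position of the variant within each supporting read."""
    if not support_positions:
        return True
    near_end = sum(
        pos < end_window or pos >= read_length - end_window
        for pos in support_positions
    )
    return near_end / len(support_positions) >= max_end_fraction

good_variant = [12, 47, 63, 88, 30, 55]      # seen throughout the reads
suspect_variant = [95, 96, 93, 97, 94, 98]   # only ever in the last few bases
print(likely_end_artifact(good_variant))     # False -> keep for review
print(likely_end_artifact(suspect_variant))  # True  -> filter as false positive
```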

And then for these structural variants, really the best way, again, is to sort of look at the data. And we have tools to look at the read support for a translocation, for example, where one end of the reads is mapping to one chromosome and the opposite end to another chromosome, and that really gives us good support, visually, that there's actually a translocation that's occurred there, and so on and so forth. We then use the annotation of the human genome reference to annotate the variants and really tell us, is this an amino acid-changing mutation?

Is this something that's going to alter splicing, for example? Is there a fusion gene that I'm predicting from this translocation, et cetera? And then we finally get to that desired result that I talked about earlier, which is the beautiful representation of our tumor genome in all of its glory, with the chromosomes here as colored blocks.

This is a Circos plot, as we call them. All of the words written to the outside are the known genes in the genome that are altered by mutation. Copy number is this gray area in the middle. And then all of these arcs across the center, for example, are translocations that are involving two chromosomes.

The little blips are typically inversions or deletions and so on. So this is really the Cliff Notes version of a cancer genome, if you will, and it takes an enormous amount of work to really get to that point.

Now just to finish up, you know, a lot of what I'll talk about here at the end is the transition of cancer genomics into the clinic. And one of the things that we've come up against in terms of this translation is that we really need to understand our sources of both false negativity and false positivity, where in the research setting we actually care more about false positives, quite frankly, because we want to accurately represent out to the world those true mutations that we've identified in the course of sequencing many cancer genomes, for example. So we have lots and lots of knowledge about what causes false positives. And I've already alluded to one of these, which is the variant being only called at the end of the read.

There are others. And we can design statistical filters to eliminate these, as I've already told you. But most false negatives are actually due to a lack of coverage.

In the clinical setting, you actually worry more about false negatives because you don't want to miss something, which makes total sense, right? It's just that we have to now build in new filters to examine where our coverage is actually too low and note those regions, so that we understand the areas where we're going to be getting false negatives and try to understand that.

And really, the notion here that has allowed us to come up with these statistical filters is that next-gen sequencing has been changing so much over the past six years, in terms of improvements, obviously, but changing nonetheless, that we've had to go back constantly and validate. So each time we get a set of mutations, we design new probes. Hybrid capture is used to pull those regions back out of the genome and sequence them again to really verify what's a false positive and what's a true positive.

That's allowed us to come up with these notions of where to remove false positivity. Okay, I'm going to shift gears now to third-gen sequencers. And this is really a variation on a theme, because unlike all the things that I've already told you about, the third-gen sequencer shown here, the PacBio instrument, which has been with us now for about four years, commercially available, is a completely different paradigm than what I just talked about. So, this is true single-molecule sequencing, as opposed to sequencing a cluster of amplified molecules, which is what next-gen sequencing really does if you want to put a fine point on it.

In the PacBio system, the library preparation actually looks quite similar, if you recall, to those steps that I already talked about for next-gen sequencers. So we shear the DNA, we polish the ends, we add on these specific adapters called SMRTbells in the PacBio system, and then we anneal the sequencing primer to the portion of the SMRTbell where it's complementary. Now, unlike next-gen sequencing, the next series of steps is highly unique. We first bind all of these library fragments to a specific DNA polymerase.

So we incubate them together, get the DNA polymerase to bind onto a single molecule, and then we load this entire mixture onto the surface of this little device here, which is called a SMRT Cell. This is the sequencing mechanism in the PacBio instrument, and it's about as big in diameter as the small tip of your little finger, but hidden in this SMRT Cell, which the eye can't see, are 150,000 zero-mode waveguides. What's a zero-mode waveguide?

It's basically a sequencing well that this DNA polymerase, complexed to your library fragment, can nestle down into for the sequencing reaction itself. And that's really illustrated here. Hopefully, these aren't too dark to see, but again, these are just shots from the PacBio website. So if you can't see them here, you can look at them there.

What we want to do is, in each one of these 150,000 zero-mode waveguides, or as many as possible, because this is never 100 percent loading, have a DNA polymerase complex come down and attach to the bottom of the zero-mode waveguide. Well, what is the function of that zero-mode waveguide? It's actually to precisely pinpoint the active site of that DNA polymerase for the machine optics that are going to detect the sequencing reaction happening in real time in the active site of that polymerase.

So what happens is you provide in, and the instrument does this, these labeled nucleotides, and they sample in and out, much as they do in the cell, in and out of the polymerase active site. When they get detected is when they've dwelled long enough in that active site to be incorporated, and they're detected according to the fluorescence wavelength that's emitted: A, C, G, or T. So there's a specific label for each nucleotide. It gets detected by the optics of the instrument as it enters the active site of the polymerase and dwells there for a sufficient period of time.

And so this is the sort of real-time sequencing, because you don't take specific periodic snapshots. Rather, the optics, the camera, the instrument watches each one of these 150,000 zero-mode waveguides for a continuous period of time, which is called a movie, and it essentially collects data from what's going on in the active site of each one of the polymerases all the time during the duration of that movie. So any nucleotide that samples in and stays for long enough to get incorporated gets its fluorescence detected.

What's happening to the fluorescence? It's actually on the polyphosphate. So when the nucleotide's incorporated, the label diffuses away and doesn't stay around to interfere with the subsequent cycles of incorporation.

And of course, that's critically important. Why?

Because keep in mind, we're looking at a single DNA polymerase operating on a single strand of DNA in real time. So you've got to be exquisitely sensitive to detect that fluorescence and pick up the information. So one of the things that's unique about this type of sequencing, as you might have guessed, is that the sequencing read lengths now are extraordinarily long. So, as opposed to the next-gen sequencers, we're now looking at improvements to the chemistry and improvements to the library prep, some of the details of which are shown here.

When we're isolating longer and longer fragments in our preparatory library construction process, we're actually now able to extend the time of the movie generation and collect quite long reads. And so, here's just some real data comparing sort of the previous chemistry to the new chemistry. I won't go into the details. This is all available on their website. And what we're doing here now is looking at these read lengths.

We're looking now at reads that are extending out to 25,000 to 30,000 nucleotides at a time. Now, this is not all, you know, perfect, right? As I mentioned earlier and alluded to, single molecule detection is really hard.

Hopefully you got that point. So there is a high error rate associated with this type of sequencing, somewhere around a 15 percent error rate. So, 15 bases out of 100, a totally random error rate.

There's no rhyme or reason to it, and the sources of error are pretty easy to pinpoint. But the bottom line is that, again, coverage is your friend.

So if you cover enough of the genome with these long reads, you can actually correct the random errors to the point of ending up with about a 0.01 percent error rate. That's very, very low in the aggregate, but not from the single reads. And so that's really the trick to using these data.
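
To see why coverage rescues a 15 percent random error rate, here is a small back-of-the-envelope calculation that assumes independent errors and a simple per-position majority vote; real long-read error correction works on aligned reads and handles indels, but the coverage intuition is the same.

```python
# Back-of-the-envelope sketch (assumed independence, simple majority vote):
# probability that the consensus base is wrong at a position, given a 15%
# random per-read error rate and n-fold coverage.
from math import comb

def consensus_error(per_read_error, coverage):
    """P(more than half of the reads are wrong at this position)."""
    wrong_needed = coverage // 2 + 1
    return sum(
        comb(coverage, k) * per_read_error**k * (1 - per_read_error)**(coverage - k)
        for k in range(wrong_needed, coverage + 1)
    )

for depth in (5, 15, 30, 60):
    print(f"{depth:2d}x coverage: consensus error ~ {consensus_error(0.15, depth):.2e}")
# Even with noisy single reads, a few tens of fold coverage drives the
# consensus error rate down by orders of magnitude.
```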

We are using them increasingly more as these read lengths go out, because they're really good for connecting small bits of chromosomes, for example, that we haven't been able to orient before. So the chicken genome is a perfect example of this. If you're not familiar with chickens, and most people aren't, they have mini chromosomes: some large chromosomes akin to human size, but actually lots of these mini chromosomes that were very, very hard to stitch together until this technology came along. So again, this is just exemplary data that we've generated for the chicken, showing that when you really predominate in large fragments on this Blue Pippin device, which is used for the fragment separation, you can really kick out the read length quite a bit.

And I don't have data for this, but I have colleagues in the business who have reported read lengths now, with some of these new approaches, in excess of 50,000 nucleotides. So you're really getting quite long, even to the point where DNA may be unstable above these lengths without some specific care and feeding. One of the things that we're also using this technology for, just to finish up, is improving the human reference genome.

So because you can generate sequence reads now of 50,000 or so bases, you can actually take entire human BACs that are representing difficult regions of the genome, and rather than breaking them into 2 kb subclones as we used to do and trying to put them all back together again, you can sequence from end to end, on fosmids, for example, 30 to 50 kb inserts, and major portions of human BACs, and then do assembly with the PacBio reads. So this is just a lot of data about a bunch of BACs that we're sequencing across difficult regions of the human genome that aren't properly finished. And then we're doing comparative assemblies with just random human genome data to try and improve overall the assembly of the genome.

And one of the ways that we're doing this is actually sequencing from a unique anomaly that's often identified in obstetrics, which is the hydatidiform mole. So if you're not familiar with this terminology, this is a rare event where an enucleate egg, an egg without a nucleus, is released from the ovary and gets fertilized by a sperm and develops to a certain stage in the uterus before it's removed surgically.

So this hydatidiform mole represents a true haploid human genome, one copy, the sperm that fertilized the egg. And so there are some, not many, but a few cell lines that have been produced from these hydatidiform moles, and we're sequencing them actively now on the PacBio sequencer, trying to achieve about 60-fold coverage across the human genome to really begin to understand, without the complications of diploidy, which you have with most human genomes, how to stitch these difficult regions of the genome together.

So, this is work that's ongoing in our laboratory right now as we're sequencing a new genome, which is CHM13, another hydatidiform mole cell line. So, more on that to come. And then, just to finish up, this is just a comparison of the tiling path versus a long-read assembly that we were able to obtain using PacBio for a specific segment of the genome with these approaches on the hydatidiform mole.

So just one last word about sequencing, and then I'll finish with my little vignette about cancer genomics, and that's nanopore sequencing. So this is the next type of sequencing that's on the horizon. It sounds a little weird to say that it's on the horizon, because actually, if you look in PubMed, the earliest report for nanopore sequencing, namely pulling a DNA strand through a nanopore, is about 19 years old. So this is an idea that's been around for a while, which ought to give you an idea that it's really, really hard to make it actually work.

So, this is just one example, or an idea of the way that this could work, which is that you have your nanopore here. You have maybe, for example, an exonuclease perched at the top of the nanopore, and when it grabs a strand of DNA, like it does, cutting off one base at a time, then you may be able to pull those nucleotides through and somehow detect A from C from G from T. So, that's one possible approach. The other approach is just having an enzyme here that maybe works to separate the double strands and translocate the single strand through the pore. Okay, so that's another approach that could be used.

And again, here the challenge is twofold. One, uniform pores, because ideally you want to run multiple pores, not just one at a time, so that you have throughput. And the less uniform these pores are, the more differential readout you get at each one.

And then there's the readout itself: so what's the signal? How are you detecting these? And typically, that's sort of a charge differential.

So if you have a differential on either side of this pseudomembrane shown here, when the DNA translocates through, there should be some abrupt change in the charge ratio across the two halves of the membrane, and that should correspond to the identity of a nucleotide, for example.

So in practice, there is a device, not yet commercially available but in testing, called the Oxford Nanopore. This uses the latter approach that I showed you earlier, and the idea here is that you have this little thumb drive. This is obviously a prototype because it's got some lab tape on it. The idea is to put your DNA fragments in through here, and then they essentially get pulled through the pores, and the readout comes in through the USB 3 port of your laptop, and you can read out the sequence based on that.

So these are out in testing in certain laboratories. There are just some early reports where it looks like the error rate is, I would say, quite high at this point in time, north of 30 percent. I think 15 percent alone is pretty hard to deal with from an algorithmic standpoint, so I think that this needs some refinement before it really sees the light of day. But it's an interesting and new approach, and it is truly reagentless sequencing.

So this is just DNA in, sequence out. There's no reagent here other than the device in which the sequencing is happening.

Okay, so I just want to spend the last few minutes talking about how this all coalesces down to really change science and ultimately maybe the practice of medicine. So it's a little bit of a forward look, but these are things that we're actively working on now. We've known for some time now, since the early 1900s really, before we knew that there was a genome, that there were hints that there was something fundamentally different about the chromosomes in cancer cells.

And this is one of my scientific heroes, Janet Rowley, who really sat down with a microscope in the early 1970s and started looking at cancer chromosomes. She devised several preparatory methods that made these much clearer to look at. And this is one of the figures from one of her early papers showing the t(15;17) translocation that is now diagnostic for a specific subtype of acute myeloid leukemia known as APL, or acute promyelocytic leukemia. And so really, her studies, as well as several others, began to lay the foundation: when cancer occurs, by looking at the chromosomes in the cells, you can see physical differences, and now we can sequence whole genomes of cancer.

We can actually begin to really understand these translocations at the A, C, G, and T level, whereas with her microscope, she could see just the gross result of the translocation. So I often say that the next-gen sequencers are really just the new form of microscope, if you will, where we have resolution down to the single nucleotide.

And if you're aware of the cancer genomics field, there has been a lot of work in cancer genomics just over the past five to six years. This is reflective of the work that's gone on in our laboratory, but it is now an international effort to characterize the cancer genome across multiple tumor types.

You can see in this particular display, which is a few months out of date, that we've now sequenced in excess of 2,700 whole genomes from over 1,000 cancer patients across different subtypes, AML, breast cancer, et cetera.

And now a very large fraction of our work has been in the pediatric cancer setting as well, where a collaboration with St. Jude Children's Research Hospital has sequenced over 750 pediatric cancer cases to date. So this is really a scalable enterprise. And this is true discovery: what are the genomic roots of cancer, and how can we tease them out using next-gen approaches such as whole genome sequencing and exome sequencing as well?

And again, this is an international exercise. Within the U.S., we have the Cancer Genome Atlas, which has been jointly funded by the National Cancer Institute and the National Human Genome Research Institute. It'll wrap up right about next year. But because we've now sequenced through almost 20 adult cancer types across multiple different types of assays, so not just DNA mutation and copy number but also RNA, methylation, et cetera, and protein data, we're now beginning to coalesce around the commonalities and differences across cancer types. So this is just a recent publication of this so-called pan-cancer approach, which really tells us that cancer is a disease of the 'omes.

So the genome is important, to be sure, but there are things that we can detect only at the RNA level, only at the methylation level. And combined together, they really begin to tell us about the biology of human cancer as opposed to human health. And so this has been, I think, a really foundational set of data, if you will, that now sets the stage for translation and making a difference in cancer patients' lives.

So let me just talk briefly about what we're doing at our center, because I think it's maybe a bit different than most places, which tend to pick known cancer genes, put together a targeted hybrid capture set, and just look at those genes in particular. That can be very informative, but ultimately it is not as comprehensive as I think we need. So what we're taking is a combined and integrated approach that uses whole genome sequencing of the tumor and the normal for each patient. This really gives us, as I've already talked about, the full breadth of alterations that are unique to the cancer genome, and it will also tell us about any known constitutional predisposition in specific genes for these patients.

Exome sequencing is important for two reasons. One is a standalone analysis of tumor versus normal exomes. We can get most of the sites that we've already detected in the whole genome and really have that interplay that says, hey, you really got it right because you detected it in both data sets. And so that's an important sort of validation.

The other thing this gives us is great depth, because typically exomes are sequenced at about 100-fold or higher. Combining the whole genome coverage with the exome coverage now gives us great depth at these sites and tells us a lot about something we know is true in cancer, which is that not all cancer cells are created equal in terms of their mutational profile. So there's so-called heterogeneity in the cancer genome, even within a single tumor mass, and this can really be identified through deep coverage analysis, which you get out of the combined exome and whole genome data.
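As an aside on the heterogeneity point just made, here is a minimal sketch of why depth matters: with enough reads at a mutated site you can estimate a variant allele fraction (VAF), and clusters of VAFs hint at a founder clone versus subclones. The read counts and bin width below are invented for illustration; real analyses also model tumor purity and copy number.

```python
from collections import Counter

def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele fraction at one site: alt-supporting reads / total reads."""
    return alt_reads / total_reads if total_reads else 0.0

def rough_clusters(vafs, bin_width=0.1):
    """Crudely bin VAFs; distinct occupied bins suggest distinct subclones.

    Real heterogeneity analysis also accounts for purity, copy number, and
    sampling noise -- this only conveys the intuition.
    """
    return Counter(round(v / bin_width) * bin_width for v in vafs)

# Hypothetical deep-coverage counts at five somatic sites: (alt reads, total reads)
sites = [(48, 100), (52, 110), (22, 95), (18, 102), (51, 99)]
vafs = [vaf(a, t) for a, t in sites]
print(rough_clusters(vafs))   # two occupied bins, ~0.5 and ~0.2 -> founder clone plus a subclone
```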

Lastly, and perhaps I could argue most importantly, doing the transcriptome of the tumor cells is fundamentally important. Why? Well, first of all, it tells us about genes that are overexpressed that we might not detect from just sequencing DNA. So, for example, maybe there's a new transcription factor binding site, but we haven't detected it.

Or there's a change in methylation, but we haven't detected it. But the downstream consequence is that the gene is overexpressed, and that may be pathogenic. I'll show you an example of that in a minute. We also know that even though we can often detect lots of mutations in cancer genomes, only about 40 to 50 percent of the genes that carry mutations are actually expressing those mutations.

So if you're going to drug a mutation, you really want to know that it's being expressed at the level of RNA, and RNA sequencing alone will tell you that. And then lastly, I've already alluded to gene fusions. The t(15;17) that I showed you earlier from Janet Rowley's work fuses two genes together, PML with RAR-alpha, and this gene fusion is sufficient to cause acute promyelocytic leukemia. If we're detecting that at the level of structural variant analysis in whole genomes, which is the only place it can be done, that still has a high false positive rate.

So, if we can identify the fusion gene in the RNA-seq data, that really gives us a nice validation that the structural variant fusion we're predicting is actually being expressed. So, there's a huge interplay and integration of these data that needs to take place. And at the end of the day, what we really want to do is identify gene-drug interactions that may be indicative, for that particular patient, of a key drug they should be taking to help alleviate their tumor burden.
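To illustrate the "is this mutation actually expressed?" check described above, here is a minimal sketch using pysam; the BAM file name, gene, and coordinates in the usage comment are hypothetical, and real pipelines also handle duplicates, mapping quality, splicing, and strand bias.

```python
import pysam

def rna_allele_counts(rna_bam_path, chrom, pos0, ref_base, alt_base,
                      min_base_qual=20):
    """Count RNA-seq reads supporting the reference vs. alternate allele
    at a somatic SNV position (0-based). A healthy alt count suggests the
    DNA-level mutation is actually transcribed."""
    counts = {"ref": 0, "alt": 0, "other": 0}
    with pysam.AlignmentFile(rna_bam_path, "rb") as bam:
        for col in bam.pileup(chrom, pos0, pos0 + 1, truncate=True,
                              min_base_quality=min_base_qual):
            for read in col.pileups:
                if read.is_del or read.is_refskip:
                    continue  # deletions and spliced-over positions carry no base
                base = read.alignment.query_sequence[read.query_position]
                if base == alt_base:
                    counts["alt"] += 1
                elif base == ref_base:
                    counts["ref"] += 1
                else:
                    counts["other"] += 1
    return counts

# Hypothetical usage: a missense call from tumor DNA, checked in tumor RNA.
# counts = rna_allele_counts("tumor_rna.bam", "chr13", 28_000_000, "C", "T")
# print(counts)  # e.g. {'ref': 140, 'alt': 95, 'other': 1} -> the mutation is expressed
```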

Now, the analysis, as I've already alluded to, is very complicated. It really takes a team to put together an analytical approach and all of the downstream decision support tools to really make this fly in the clinical setting. So this is my dream team, Obi and Malachi Griffith, so they have the same last name.

They kind of have the same faces. One's hairier than the other. Same eyes. Yes, they're identical twins. So it's nice to have a dream team that works this closely together, lives together, et cetera, et cetera.

These guys have a personal commitment to cancer treatment because their mother died of breast cancer when they were 18 years old. So they've developed this sort of system. It's complicated.

I'm not going to delve into the details, but suffice it to say that all of these green boxes are the things that you can find through the approach I just walked you through. So there's a lot of information. How do we make sense out of all of it?

Well, there are a variety of key steps that have to be followed; they're here in the center. First of all, for all the altered genes, whether it's from DNA or RNA or both, we have to really understand functionally what the consequence is, and that's a hard step. So we're really putting together some decision support tools to help with that.

We're also putting together decision support tools for this next part, which is, what are the activating mutations? Because you can really only apply a drug therapy to things that are activating. If the gene's been knocked out, which happens in tumor suppressors, for example, it's not really a good drug target.

And then layering this information onto pathways within the cell is critically important. Why? Because cancer is not a disease of genes, it's a disease of pathways that are activated and aberrant and cause a disruption in the normal division and growth cycles of the cell.
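As a toy illustration of what layering mutations onto pathways can look like computationally, here is a minimal sketch that ranks some hypothetical pathway gene sets by how many of a patient's mutated genes they contain; real tools also weight mutation type, recurrence, and pathway topology.

```python
# Hypothetical pathway definitions and a hypothetical patient's altered genes.
PATHWAYS = {
    "RTK/RAS signaling": {"EGFR", "KRAS", "NRAS", "BRAF", "FLT3"},
    "PI3K/AKT signaling": {"PIK3CA", "PTEN", "AKT1"},
    "Cell cycle":         {"TP53", "RB1", "CDKN2A", "CCND1"},
}

def rank_pathways(altered_genes, pathways=PATHWAYS):
    """Return pathways sorted by how many of the altered genes they contain."""
    altered = set(altered_genes)
    hits = {name: sorted(altered & genes) for name, genes in pathways.items()}
    return sorted(((name, g) for name, g in hits.items() if g),
                  key=lambda item: len(item[1]), reverse=True)

patient_mutations = ["FLT3", "NRAS", "TP53"]     # illustrative only
for name, genes in rank_pathways(patient_mutations):
    print(f"{name}: {', '.join(genes)}")
# RTK/RAS signaling: FLT3, NRAS
# Cell cycle: TP53
```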

So by layering these mutations onto pathways, we not only understand the pathway that's activated, but we can also strategically identify the best place to drug that pathway. And coming to those strategic decisions is not easy. We've generated this drug-gene interaction database, which I'll show you in just a second.

It combines information from a lot of different sources that are essentially not meant to talk to each other, and that's part of the complicating factor when it comes to making decisions. So, the drug-gene interaction database helps to interpret where best to drug, what drugs are available, what clinical trials may be available for that patient to go on, negative indications, et cetera, and rolls this all up into a clinically actionable events list that we often refer to as the report. So, here are, briefly, some of the decision support tools that we're curating so that anyone in this enterprise can use them, and there are many people entering into this. This is the database of canonical mutations, which is not yet ready for release, but close.

It's just going to give a curated database of mutations that have a demonstrated association with cancer. So, that should be coming out soon. And this is DGIdb, which I already alluded to. Here's the URL.

It's been released. It's been published. And we just recently updated it, so it's a bit more sophisticated.

All you do at DGIdb is type in the genes that you're interested in, set a few parameters that are shown here in terms of the databases you want to search, whether you want antineoplastic drugs only, et cetera, and push the green button. And then this is the search interface that results for this particular query, where all I'm showing is the ABL1 kinase that's involved in CML. You can see there are multiple drugs available, all inhibitors in this particular screenshot.

And here's a link to the database source; in this case, these are all from My Cancer Genome. So, by clicking on any one of these drug links, you go to My Cancer Genome and get more information from that data source about that drug, about clinical trials, et cetera. So, really, DGIdb is a clearinghouse for information that helps to link drugs to mutations and genes, and it's really just meant to simplify the search for information that a clinician might come up against.
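For readers who want to try this programmatically, here is a hedged sketch of querying DGIdb from Python. It assumes the REST endpoint and response fields that were documented for the DGIdb 2.0 release (interactions.json); later DGIdb versions have changed the API (newer releases expose a GraphQL interface), so treat the URL, parameter names, and field names as assumptions and check the current documentation at dgidb.org.

```python
import requests

def dgidb_interactions(genes, antineoplastic_only=True):
    """Query DGIdb for drug-gene interactions for a list of gene symbols.

    NOTE: endpoint, parameter names, and response structure follow the old
    DGIdb 2.0 REST API and may have changed -- verify against current docs.
    """
    params = {"genes": ",".join(genes)}
    if antineoplastic_only:
        params["drug_types"] = "antineoplastic"   # assumed filter name
    resp = requests.get("https://dgidb.org/api/v2/interactions.json",
                        params=params, timeout=30)
    resp.raise_for_status()
    results = []
    for matched in resp.json().get("matchedTerms", []):
        for interaction in matched.get("interactions", []):
            results.append((matched.get("geneName"),
                            interaction.get("drugName"),
                            interaction.get("interactionTypes")))
    return results

# Example (subject to the caveats above):
# for gene, drug, itypes in dgidb_interactions(["ABL1"]):
#     print(gene, drug, itypes)
```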

So, how have we used this type of approach? Well, we had an early success that was reported here in the Journal of the American Medical Association. I won't take too long to talk about it because it's already in the peer-reviewed literature, but this is just an example where whole genome sequencing was able to solve a diagnostic dilemma for a patient whose leukemia cells, on pathology examination, appeared to be acute promyelocytic leukemia, that form of leukemia I talked about earlier. But upon cytogenetic examination of her chromosomes, there was no evidence for the t(15;17) translocation. However, by whole genome sequencing, what we were able to show is that a large piece of chromosome 15 had physically inserted into chromosome 17, recapitulating the PML-RAR-alpha fusion, much like the t(15;17) does in 90-plus percent of APL patients, and resulting in the disease that she had.

So, based on this analysis and subsequent CLIA lab verification, because this was done in the research setting, this patient was able to go on to consolidation therapy with all-trans retinoic acid, the standard of care for acute promyelocytic leukemia, and she's alive and well to this day, because most patients with APL experience about a 94 percent cure rate with chemotherapy and all-trans retinoic acid consolidation. Another success that we had is here, in the New York Times, not my favorite peer-reviewed journal, and we're still in the process of fully investigating this example, which is based around the acute lymphocytic leukemia second relapse of my colleague and friend, Lukas Wartman, who's shown here in this cancer story from Gina Kolata in the New York Times a couple of years ago.

In Lukas's case, RNA was really the bellwether for the driver in his disease. So by sequencing RNA, what we were able to show, unlike the nothing that we got from whole genome sequencing of Lukas's cancer and normal, was that FLT3 was extraordinarily overexpressed in his tumor cells. We didn't really understand the significance of this until we looked in the literature.

This reference from Blood shows that even in patients who are moving towards B-cell ALL, there's an increase in FLT3 expression, so this looked like a putative driver. More importantly, a search of DGIdb showed that there was a good possible inhibitor, which is sunitinib. At the time, and still to this day, it is not approved by the FDA for acute lymphocytic leukemia, but he was able to get the drug through compassionate appeal to Pfizer and took the drug.

He was put into full remission by taking the drug, was able to get an unrelated stem cell transplant, and is alive and well to this day because of this intervention. So, that's some of what we're doing. We continue to sequence cancer patients' genomes now through a coordinated effort at our institute, sponsored by our medical school, that also tries to bring in more clinicians for the purposes of education, which is a critically important aspect of understanding genomics that I haven't talked about. I just want to finish with one other little, I think, interesting approach that we're using, and then I'll open it up for questions.

So, in the targeted therapy arena, much like the sunitinib that you just heard about, patients are receiving great relief from their tumor burdens, but often patients come back with what's called acquired resistance, which means that the cancer cells basically invent their way around the blockade. And one of the ways that we're now looking at cancer differently is, for those patients who have passed through acquired resistance, to perhaps take a different look at their tumor genomes and design a specific and highly personalized vaccine-based approach that might invoke their immune system into helping them battle their progressive disease. So this is just the paradigm that we're following.

We're starting with melanoma in this setting that I'll describe. In melanoma, you have multiple cutaneous lesions in the metastatic setting that can be readily sampled through the skin. Here, we're studying tumor versus germline DNA by exome sequencing.

We identify somatic mutations, but we don't worry about all those parameters that I showed you in the earlier diagram. Here, what we do first is check RNA to verify which of these mutations are being expressed. With the high mutational load in melanoma because of UV, you have to do this. But, more importantly, we also want to identify the highly expressed RNAs, because those are likely to also be highly expressed proteins.

Why do we care about that? Because these are the targets that we want to examine in terms of their immunogenicity for that patient. To do that, we need another piece of data, which is the HLA class I type for the patient. This is a readily obtained clinical-grade assay, although you can also derive it from whole genome sequencing.

And then we put all this information through an algorithm called NetMHC. We translate the peptides for the mutated genes to give the wild-type peptide and the mutated peptide, we put in the information about the class I type for the patient, and then NetMHC returns to us a prioritized list of the most immunogenic peptides that are highly unique to that patient's tumor.
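To give a feel for the peptide step just described, here is a minimal sketch that generates candidate wild-type and mutant 9-mers tiling across a missense change. The actual binding predictions come from running the external NetMHC tool on these peptides together with the patient's HLA class I alleles, which this sketch does not reproduce; the protein sequence and mutation below are purely illustrative.

```python
def peptide_windows(protein_seq, mut_pos0, mut_aa, length=9):
    """Yield (wild_type_peptide, mutant_peptide) pairs of the given length
    that contain the mutated residue (mut_pos0 is 0-based).

    These are the kinds of peptide pairs that get scored for HLA class I
    binding by a tool such as NetMHC; this function only builds the windows.
    """
    mutant_seq = protein_seq[:mut_pos0] + mut_aa + protein_seq[mut_pos0 + 1:]
    pairs = []
    for start in range(max(0, mut_pos0 - length + 1),
                       min(mut_pos0, len(protein_seq) - length) + 1):
        pairs.append((protein_seq[start:start + length],
                      mutant_seq[start:start + length]))
    return pairs

# Invented example: a G -> V substitution at (0-based) position 12.
toy_protein = "MTEYKLVVVGAGGVGKSALTIQL"
for wt, mut in peptide_windows(toy_protein, 12, "V"):
    print(wt, "->", mut)
```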

So, unlike broad-spectrum immunotherapy, here we're really going after the tumor-specific molecules, and we hope, by virtue of that, to have fewer side effects, because we're not impacting the normal immune system, as it were. Now, would we ever trust an algorithm to give us exactly the right answer? No.

So we also do a series of downstream tests using an apheresis sample from that patient to look, for example, at existing T cell memory and any T cell lysis that we might be able to demonstrate in a dish. This helps us to reprioritize that list, and then we take the top 8 to 10 mutant peptides and move them into the vaccine setting. For melanoma, that's shown here. What we're doing is removing dendritic cells from the patient and conditioning them with GMP-quality peptides that correspond to the 8 to 10 at the top of our list.

And then the dendritic cell vaccine is infused back into the patient. So if you're not familiar with the immune system, dendritic cells, once conditioned with these peptides, will present them to T cells in the body, and it's only dendritic cells that can evoke T cell memory. So, if there's existing T cell memory to those tumor-specific epitopes, it will be elicited by the dendritic cell vaccine, in principle anyway. This is all, of course, in testing. And so, this is the paradigm that we're using.

And just to be clear, this is really happening. We have a five-patient FDA-approved IND that's ongoing. As you can see from the progress here so far, patients one through three have already received their vaccines.

They're being monitored in two ways: imaging, which is conventional, and blood draws, where we're looking at whether we're eliciting T cell memory to any of those eight to ten peptides we've used for the vaccine. And then patients four and five are just getting ready to go on. I'll have a meeting when I get back next week to work on the transition to patients four and five.

So it's too early to say whether we're being successful with this approach, but I think it's a really exciting new way of using genomics to inform vaccine development. I'll be clear, dendritic cells aren't the only vaccine platform; it's just what we're using here, and there are other groups that are now pursuing a similar approach. So I think this is beginning to introduce a new set of potential answers for cancer patients, all stemming from the work that we've been doing in discovery genomics for the past few years.

So I'll just finish here by thanking the group back at the Genome Institute; they're all listed. Also special thanks to my clinical collaborators across multiple oncology targets. Jerry Linette and Beatriz Carreno in particular are the two we're working with on that last trial in melanoma therapy. And I also want to thank a couple of my buddies from the genome analysis side who provided the slides for the analytical aspects of the human genome.

So thanks for your attention. There's about 10 minutes left, and I'm happy to answer any questions. Or if you're shy, feel free to come up afterwards, and I can answer questions one-on-one as well. Thanks so much.

Questions, questions, questions. Thank you. No, it's all abundantly clear. Daphne.

Yeah. Yeah. So it's a good question. Daphne is soft-spoken, so I'll repeat the question, which was regarding my comment about the 40 to 50 percent of genes that are mutated but not expressed in the tumor cell.

Could those have perhaps been important early in the progression of the cancer but then switched off for other reasons? And it's a great question. The answer is probably or maybe possibly.

But part of the problem is, of course, that we have little to no ability to capture that moment in time. So it's a downside of the way that we're doing things nowadays: we're really limited to whenever that patient was biopsied, and that's the only look we get at the tumor, in isolation. Getting progression events, which would be precancerous lesions, for example, early tumor samples, and advanced tumor samples, is incredibly difficult. In the leukemia setting it's probably a little bit easier than in the solid tumor setting, to be clear, but it's a very difficult thing to do.

So mostly people have tried to do this with mouse models, often reproducing the mutations that we know are drivers and then resecting tumors at different stages as they develop in the mouse. So you can get some insights there, I think, but the comprehensiveness hasn't been there. And RNA has always been this hard target that we've somewhat waited to pursue, because the analytical spectrum that you can get out of RNA-seq data is myriad and complicated. So I think those studies will come, but they really haven't been done at that level of detail at this point in time. But it's entirely possible that those mutations were important early on.

Yeah. Yeah. Yeah. Sure.

Sure. So the question is: in the clinical setting, pathology often, or really always, puts biopsy samples into formalin and then into paraffin, and this is for preservation of cell structure and proteins, but, as I may have alluded to, it does pretty horrible things to DNA and RNA.

So how do we deal with that in the setting that I described at the end of the talk? I think there are two things to keep in mind. In the clinical setting, we're often wanting to get the most recent example of that patient's tumor.

So, sequencing their primary tumor when they're metastatic is sometimes not a terribly good idea, mainly because, depending upon the treatments that they've gone through, broad-spectrum chemotherapy that damages DNA, radiation which we know damages DNA, what comes out in that metastatic setting may be fundamentally changed and different from the initial primary tumor. So that's one thing. The second thing is that formalin, while it does introduce DNA backbone breaks, typically does that over the age of the block. So the older the block is, the more difficult it is to get high-quality sequence, and there are some artifacts that are characteristic of formalin damage that we can actually subtract away. Those are well known now. I think the biggest challenge of formalin-fixed material is just how intact the DNA still is.
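On the "subtract away" point: the classic formalin artifact is cytosine deamination, which tends to show up as low-allele-fraction C>T (and complementary G>A) calls. Here is a crude, heuristic sketch of flagging such calls; the cutoffs are invented for illustration, and production pipelines use far more sophisticated filters (read orientation, duplicate structure, and so on).

```python
def looks_like_ffpe_artifact(ref, alt, alt_reads, total_reads,
                             max_vaf=0.05, min_depth=50):
    """Crude heuristic: flag low-VAF C>T / G>A calls as possible
    formalin-induced deamination artifacts rather than true somatic SNVs."""
    if total_reads < min_depth:
        return False                     # too little data to judge
    vaf = alt_reads / total_reads
    is_deamination_change = (ref, alt) in {("C", "T"), ("G", "A")}
    return is_deamination_change and vaf <= max_vaf

# Illustrative calls: (ref, alt, alt_reads, total_reads)
calls = [("C", "T", 4, 120), ("A", "G", 30, 100), ("G", "A", 55, 110)]
for call in calls:
    print(call, "possible FFPE artifact" if looks_like_ffpe_artifact(*call)
                 else "keep")
# ('C', 'T', 4, 120) possible FFPE artifact
# ('A', 'G', 30, 100) keep
# ('G', 'A', 55, 110) keep
```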

And if it's a recent biopsy, even if it's gone into FFPE, typically the DNA is still very much intact. So we're not as afraid of that as we were. I think the bigger challenge is that we're in a bit of a struggle with conventional pathology: we want to add in the value of genomics, which I think is ultimately quite valuable for most patients, but if there's precious little material, then who gets what?

And it's always going to go to pathology because of the standard of care, which I understand. But then we're often left with precious little, or none, to do genomics.

So that's really the challenge there, even more than formalin fixation and paraffin embedding. There's another question here. Yeah.

Yeah. Yeah, so that would be a nice to have. So the question is, how long are the peptides that we're using in our vaccine-based approach?

Typically they're on the order of 9 to 11 amino acids, so fairly short. There are, as I alluded to, other vaccine platforms. There is a camp in the cancer vaccine community that likes the idea of long peptides as a vaccine, so a cocktail of peptides that are somewhere in the neighborhood of 20-plus amino acids. Those get very expensive very quickly, especially keeping in mind that these are going into a human being, so they have to be GMP grade and go through rigorous QC and that sort of thing.

But yeah, our vaccines are in the 9 to 11 amino acid range. Yes, at the mic. Yeah, I don't think the mic is live, but in any case, great talk, Elaine. Thank you.

I wanted to know what next-gen sequencing is doing in regard to modifications of DNA, such as methylation, which we now know, of course, is important for pathophysiological processes as well as normal development. Yeah, so there are sort of two camps.

This is regarding methylation of DNA, which is a very common chemical modification and actually takes many forms and flavors. So methylation has some nuances to it. But methylation writ large, we're now beginning to approach genome-wide by doing a bisulfite conversion, where C residues that carry a methyl group are protected from bisulfite conversion, non-methylated Cs are converted to T, and then that readout comes with the alignment back to the genome.
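The readout described here boils down to counting, at each reference cytosine, how many aligned read bases stayed C (protected, i.e., methylated) versus converted to T. Here is a minimal sketch of that arithmetic on made-up pileup counts; real whole-genome bisulfite callers also handle strand, SNPs, and conversion efficiency.

```python
def methylation_fraction(c_count, t_count):
    """At a reference C: reads still reporting C were protected (methylated),
    reads reporting T were bisulfite-converted (unmethylated)."""
    covered = c_count + t_count
    return c_count / covered if covered else None

# Made-up per-CpG pileup counts: (chrom, pos, C reads, T reads)
cpg_pileups = [("chr1", 10468, 18, 2),
               ("chr1", 10471, 3, 17),
               ("chr1", 10484, 0, 0)]

for chrom, pos, c, t in cpg_pileups:
    frac = methylation_fraction(c, t)
    status = "no coverage" if frac is None else f"{frac:.0%} methylated ({c + t}x)"
    print(f"{chrom}:{pos}  {status}")
# chr1:10468  90% methylated (20x)
# chr1:10471  15% methylated (20x)
# chr1:10484  no coverage
```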

Fundamental questions for whole-genome bisulfite are: what's the right coverage? If we want to get down to single-nucleotide CpGs and whether they're methylated, even if they're in an island, then we're going to have to get sufficient coverage to get good granularity at single-base resolution.

So we're still struggling a little bit with that aspect. To date, most studies are reduced representation bisulfite sequencing, so you're only looking at a subset of the genome. That keeps it cheap and allows you to do higher coverage, but it's kind of like the exome versus whole genome argument.

What are you missing? So we're really going for the whole genome approach and trying to sort out that coverage aspect. 5-methyl-C, or sorry, 5-hydroxymethyl-C, is also an interesting target because it has the opposite effect from generic methylation, but the number of those residues is precious few compared to plain methyl-C. There are kits to actually convert and detect it, but there the coverage requirement looks even higher. So there we might be doing a reduced representation built around an exome-capture-type approach, where we're targeting only the known methylated islands in the genome and sequencing just those with the hydroxymethyl conversion. And now there are, I think, kits for formyl-C and maybe other flavors of modified cytosine coming out. So it's becoming very sophisticated.

Personally, I feel like while methylation is probably interesting, ultimately what I'm getting at, maybe as a surrogate for that, is the RNA-seq alteration. But it's not a perfect or complete picture, so we'd like to do everything, of course, where "everything" is limited by your budget more than anything, rather than your imagination.

I should also point out, although it's not at human scale just yet, that there are aspects of methylation that can be evaluated from the PacBio data, the single-molecule sequencing data, where the dwell time for methylated residues, again speaking generically about methylated residues, is different from that for unmethylated ones, and over multiple samplings that can actually be teased out by different algorithmic treatment of the data. So that's mostly been looked at for bacterial genomes, and there they have even more, and wilder, types of methyl modifications than humans do.
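As a toy illustration of the dwell-time idea: kinetic detection is often summarized as an inter-pulse duration (IPD) ratio, the average dwell observed at a position divided by what an unmodified control predicts, with ratios well above 1 suggesting a modified base. The numbers and cutoff below are invented; real tools aggregate many molecules and use proper statistics.

```python
from statistics import mean

def ipd_ratio(observed_ipds, control_ipds):
    """Mean inter-pulse duration at a position, relative to an unmodified
    control at the same sequence context."""
    return mean(observed_ipds) / mean(control_ipds)

def flag_modified_positions(per_position_ipds, control_ipds, min_ratio=2.0):
    """Return positions whose IPD ratio exceeds a (made-up) threshold."""
    return [pos for pos, ipds in per_position_ipds.items()
            if ipd_ratio(ipds, control_ipds) >= min_ratio]

# Invented kinetics (seconds) pooled across molecules at three positions.
control = [0.10, 0.12, 0.11, 0.09, 0.10]
per_position = {
    101: [0.11, 0.10, 0.12, 0.09],        # ~1x   -> looks unmodified
    102: [0.31, 0.28, 0.35, 0.30],        # ~3x   -> candidate modified base
    103: [0.13, 0.12, 0.14, 0.11],        # ~1.2x
}
print(flag_modified_positions(per_position, control))   # -> [102]
```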

So it's probably very interesting in that setting for bacteriologists, but it hasn't really scaled to humans just yet. So that's kind of what's going on in the subset of epigenetics that is methylation.

It's fascinating but complicated. Yeah, anybody else? Yes.

Oh, yeah. We have done a lot of that. I neglected to mention it; I apologize. Those fusion transcripts that we're very keenly interested in in cancer, we've done a lot of evaluation of them, full length, on PacBio.

And they're actually extraordinarily easy to pick up. Yeah. So we've done a lot of that, for example, in prostate cancer, where fusions are the drivers, as near as we can tell.

And the TMPRSS2-ERG and other ERG fusions are key to discover. So yeah, it's a great platform for doing that.

You had a question, ma'am. Sorry. Yeah. Right. It's both.

So it really just depends on what's in the template. What I was trying to illustrate there is that if you have a single G, you get X amount of hydrogen ions released according to how many fragments actually incorporate that G, because keep in mind it's a population of fragments, not just a single one.

If you have three or four Gs in a row, because you have native nucleotides, so no blocking groups or anything like I described for Illumina, all of those Gs will get incorporated just like that, and three or four times the amount of hydrogen ions will be produced as a result. So it's an immediacy, really, of incorporation. Polymerases work quite quickly to incorporate nucleotides, and however many there are in the template, that's how many the polymerase will incorporate in that system, and the resulting outflow of hydrogen ions is correlated to how many nucleotides got incorporated.

So that's really the secret in the sauce, if you will, for determining what the nucleotide sequence is: gauging that level of hydrogen ion release, where there is an upper limit, as I pointed out, that will limit your accuracy in those homopolymer regions. Does that make sense? Okay.
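To put the proportional-signal idea in code: if each flow's measured signal is expressed in units of a single incorporation, the homopolymer length is roughly the rounded signal, and the call gets shakier as the run gets longer because the relative spacing between signal levels shrinks. This is a toy sketch with invented signal values, not the vendor's base caller.

```python
# One nucleotide is flowed at a time; the signal is pre-normalized so that a
# single incorporation is ~1.0. Rounding the signal estimates the run length.
def call_flows(flow_order, signals):
    seq = []
    for base, signal in zip(flow_order, signals):
        n = round(signal)                 # estimated homopolymer length
        seq.append(base * n)              # zero -> no incorporation this flow
    return "".join(seq)

# Invented normalized signals for the flow order T, A, C, G, T, A, C, G:
flow_order = "TACGTACG"
signals    = [1.05, 0.02, 2.10, 0.96, 0.03, 1.02, 0.04, 3.85]
print(call_flows(flow_order, signals))    # -> "TCCGAGGGG" (the ~4x signal becomes a 4-G run)
```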

Anybody else? Yes. I have a question about nanopores.

Okay. Yes, yes, and yes. So the question is where nanopore sequencing fits best among the different types of sequencing. All things being equal, and assuming that it will improve over time as other types of sequencing have, I think there are broad applications for nanopore, basically because if you can take a laptop and a thumb-drive-looking device, you could even be sequencing in the depths of the jungle, if that's what interests you, assuming you can find a place to plug your laptop in sooner or later. But seriously, my impression of nanopore is that, much like PacBio, the read lengths should be long, all other things being equal. So if you have the right preparatory method to get that DNA ready for the chip, then you load it in and off it goes.

So PacBio... Because of the long reads, it's being used a lot. I didn't mention it because it's not my area.

It's being used a lot in plant sequencing where the genomes are large, sometimes even larger than human, but highly repetitive and repetitive over very long stretches. So human repeats tend to be sort of choppy unless they're big segmental duplications. So it's really important, I think, for ag.

And you see a lot of ag places really interested in it, but already using PacBio. And there have been some nice reports about wheat, for example, which has a large genome being sequenced by PacBio, et cetera, et cetera. So I think across the scale. And then as I just mentioned in response to the other question regarding methylation, obviously in bacterial systems this can be very informative data.

So clinical microbiology may be interested in this as well. And there are some early indications that even in the nanopore setting, you may be able to detect different methylation residues on the DNA as it's translocating through the pore. So, it would be similar in that regard to what I described for PacBio.

But for those of you who have seen the PacBio instruments, they are physically quite large, so it's not something that at this day and age reduces down to sequencing in the depths of the jungle. So maybe there are some portability aspects to the nanopore that could be really important for that type of remote sequencing, as well as other applications. This isn't something I think about a lot, but you think about, for example, forensic sites and that sort of thing, where the immediacy of preparation and sequencing might be incredibly important for the output. So, yeah, open up your imagination, because I think it's going to be sort of all available, given enough electricity and enough compute cycles to actually make sense out of the data.

Okay, we should probably wrap up, I think, but I really want to thank people for coming, and hopefully it's been informative. And again, if you're shy, come up and I'll answer some questions here. Thanks so much. Thank you.