Methods for Computational Microbiome Analysis

Hello, my name is Michael Chimenti. I'm a research scientist in the Iowa Institute of Human Genetics at the University of Iowa, and I specialize in bioinformatics. My talk today is called "Methods for Computational Microbiome Analysis." While this talk title is a little bit dry, I want to tell you a story that better illustrates the human aspect of why we should care about the microbiome.

This story is about a researcher here at the University of Iowa, a clinician named Dr. Terry Wahls. In the year 2000, Dr. Wahls was diagnosed with multiple sclerosis, and as you can see in the picture on the left, by the year 2007 she had declined to where she was wheelchair-bound. She recognized that the progression of her disease would leave her unable to work; eventually she would be confined to a bed. So she set about using her medical training and her scientific training to look in the literature, look at animal studies, look at pharmaceutical trial results, and she developed a strategy that she decided was going to help slow the decline of her disease. What she did was she started taking supplements that she felt would provide nutrients that would improve her brain's ability to resist the disease, to generate new neurons, and to slow her decline. Eventually she transitioned from taking supplements to just adopting a diet that was rich in nutrient-dense foods, a diet that can be described as sort of a paleo-type diet.

Now you can see in the right-hand photo that just one year later, after adopting this strict diet and making a few other lifestyle changes, she was able to get out of her wheelchair and start riding bikes and walking. So not only did she arrest the decline, but she actually reversed her disease. If you want to learn more about Dr.
Wahls, I can recommend visiting her website at terrywahls.com. She wrote a book about her experience and her diet.

Now, this is related to the microbiome because there is another person in Iowa who is doing work on the connection between the microbiome and the disease of MS. I want to postulate here that perhaps there's a link not only with the food and the amount and types of nutrients in this diet that she adopted, but maybe she was also able to modulate her gut microbiome in response to this change in diet, and that may have been at least partly responsible for her improvement. That researcher, also here at the University of Iowa, is Dr. Ashutosh Mangalam, and he has done some very interesting work looking at the relationship of the gut commensal bacteria to protection against MS in a mouse model. He wrote on his website that they isolated Prevotella histicola from a healthy individual, a bacterium linked with phytoestrogen metabolism, and observed that it can induce regulatory CD4 T cells as well as suppress disease in an experimental mouse model of MS. So here is a link, potentially, between the microbiome, the prevalence or abundance of particular species, what they're doing functionally in that environment, and the onset or continuance of disease, or the suppression of symptoms of that disease, in this case MS. This goes to show that we really should care about the microbiome. I think we're just scratching the surface of how the human microbiome interacts with disease and health, and there is a lot more to understand.

So when we talk about a microbiome, what we're really talking about is a community of bacteria, viruses, archaea, fungi, and protozoa that are found in a particular environment. Thus you can have microbiomes from thousands of different environmental samples. It could be a surgical implement, it could be a sample of soil, it could be seawater, it could be a countertop, it could be your skin, et cetera. Those are all
microbiomes. You can extend the definition of microbiome to include the genes and gene products that are found as part of that environment. For example, we know that bacteria trade antibiotic resistance genes with each other on pieces of DNA, and those genes can be considered part of a given microbiome. To have a microbiome you have to have more than one species. If you just have one species, you have a monoculture, but a microbial community or microbiome is two, three, four, up to thousands and thousands of species.

Many of the slides that I'll be using in today's talk I adapted from a talk given by Curtis Huttenhower at Harvard. He is a top researcher in the field of metagenomics, and he's written some very popular methods for computational analysis of these data.

So I've already told you about the potential causal link between MS and the gut microbiome, and the abundance of particular species that appear to be important for modulating those symptoms. But dysbiosis, the unhealthy state of the microbiome, has been implicated in other diseases as well, including metabolic disorders such as obesity, syndrome X, and diabetes; autoimmune diseases like IBD and Crohn's; neurological conditions; even colorectal cancer has been associated with dysbiosis of the microbiome; and possibly autism as well. So there are many reasons to care about the role of the microbiome in human health, and to learn about how we explore and interrogate these microbiomes and analyze and understand the data that they generate.

Microbiomes help us maintain a healthy state. They can cause disease when they become unbalanced, and it's even been found that transferring healthy microbiomes to sick patients can improve and cure their diseases. This is the basis for so-called fecal transplant therapy, where fecal matter from a healthy donor is put inside a pill, which is swallowed by an unhealthy patient, and
that material then recolonizes the gut of that patient, hopefully crowding out the bad or virulent bacteria to restore health. This has been used successfully, for example, in recurrent C. diff infections, and it's also being studied for inflammatory bowel disease and ulcerative colitis. The overuse of broad-spectrum antibiotics, which are routinely prescribed for ear infections in children, for example, may be contributing to increasing rates of asthma and allergies in children. This could be modulated through effects on the child's microbiome that are secondary to the primary effect, which is to cure the ear infection. We just don't know enough about what kind of impact that's having and what downstream effects that will have for the child.

So when we perform a microbiome study, there are three main questions that we're asking. First, we want to know who is there. Then we want to know what they are doing there, and how they are doing it.

Looking at who is there, that breaks down into two questions. What is the relative proportion of the community members in that microbiome? What percent of that microbiome is bacteria, virus, archaea, and then, at more specific levels, what percent is from genus A, genus B, genus C, et cetera? The other part of point one is how rich this community is and how evenly distributed. Richness is a measure of how many taxa are present in the community, so a microbiome with a hundred different bacterial species is richer than a microbiome with five, in the simplest way of thinking about it. And then how evenly distributed: if you have a microbiome with 100 species, are they all present at the same abundance, or are two or three highly abundant and the rest extremely low in abundance? That would be a very uneven distribution.

Once we know about the abundances and the distributions, we can ask what they are doing there. That is to say, what is the functional chemistry that's being carried out by these bacteria or viruses, et cetera? How are they supporting themselves? How
are they interacting with the environment? How are they collaborating and cooperating to carry out chemistry? How are they carrying out chemical warfare with each other, or with fungi or viruses? Basically, what is going on in that environment? What environmental products are consumed, and what is excreted? And third, once we know what they're doing, we want to know how they're doing it. What enzymatic pathways are being activated or overexpressed in these bacteria? Are those pathways all contained in one species, or are they present across several different species? This is the idea of stratifying pathway activity across several microbes.

To begin to answer these questions, we turn to metagenomic sequencing and computational analysis of those sequencing data. In metagenomic sequencing you start with a microbial community, here on the left, and you can take two main approaches. One is called amplicon sequencing and the other is called shotgun sequencing.

In amplicon sequencing we focus on one particular ancestral ribosomal gene that is known to be contained within all of the bacteria and archaea in our community, and then we amplify and extract that particular gene, in this case the 16S ribosomal RNA gene. We sequence it using next-gen sequencing methods, produce short reads as a result, and then go through a quality control and clustering process whereby those clusters are compared to a reference. The result of that operation is an abundance table of samples by taxonomic features. The features are called operational taxonomic units, generally inferred to the genus level, and you can then produce this heat map and do things like clustering on it.

On the other side, you again start with your microbial community, but now you're extracting all of the DNA in the community, not one specific gene, purifying it, and sequencing it,
again using a short-read sequencing approach. Now you have reads from across the genome, from thousands of different genomes in your sample. You do QC on those as well, you can try to assemble them, which we'll talk about in a little bit, and again you're aligning to a reference database, typically in this case of thousands of genomes, and once again producing this abundance table of samples by features.

To talk about amplicon sequencing in a little more depth: first, an amplicon marker gene is a single ancient gene. Because it is ancient, it is shared by many modern species, but it has subtle evolutionary differences, and those differences can be used to tag species. The gene encoding the microbial 16S ribosomal RNA is a common target for amplicon sequencing; the 18S gene is another one. The 16S gene has alternating conserved and variable regions. You can see in this phylogenetic tree that our gene X here was present in an ancestor and then was passed into the archaea branch and into the bacterial branches, and the bacteria have some version of it, the gram-positives have a different version, et cetera. It's these differences that we're going to use to identify different taxa in our samples.

This is a graph of the base position along the 16S ribosomal RNA gene and the variable regions V1 through V9. We don't sequence all of these; we just sequence a handful of them, like V4 through V6 or something like that, and then align those sequences. And this is an image of the 16S ribosomal RNA itself. So you have microbial genomes containing gene X, you amplify out just that gene, and then you have these sequences all containing gene X. You sequence that, you get back the sequencing results, and you can align all of those with each other and see that the different species have, in this case, single-base-pair differences that uniquely identify that genus or that species.

So what do you do computationally with all these short reads that you
get from amplicon sequencing? There are a couple of different approaches. In this graphic each dot represents a 16S read, and this is describing what's called the closed-reference approach, where you attempt to create clusters out of your sequencing reads. Once the clustering is done, you identify one representative member at the center of each cluster, and you map that to a database like Greengenes or SILVA to try to identify, within some tolerance, the most closely matching genus or species for that read. That is closed-reference picking.

With sequences that do not map, you can take a different approach called de novo. You still group sequences by similarity, but because they do not map to a particular known organism in the Greengenes or SILVA databases, what you do is pick a representative sequence in that cluster of similar sequences and try to approximate the taxonomy by getting as close as you can. Typically what's done is a combination or hybrid approach, open-reference picking, that starts with closed-reference picking and then takes the leftover sequences and does de novo, so you get both closed-reference groups and de novo groups assigned.

Okay, so now talking about whole-metagenome shotgun sequencing, which you'll recall is the alternative way to interrogate a microbiome sample. Here you have microbial genomes in your sample; we've got two from red, one from blue, and two from gray. We shatter these into short nucleotide fragments of relatively equal length. This is a random process; we do not introduce any PCR amplification. This is your DNA metagenome right here, and we sequence it to generate millions of short reads from these short pieces of DNA.

Now when you have these millions of short reads, you have to do something with them. One approach is called mapping-based analysis, where we're taking a DNA read or fragment and attempting to find a sequence in a database
of either genomes or genes or proteins that could be the source of that particular read. This is done by what's called sequence alignment. You may be familiar with BLAST; that's one very old sequence alignment tool. There are others, like Bowtie 2 and DIAMOND, that attempt to improve this alignment process and speed it up, but the idea is the same: we're looking for either exact or inexact matches in a sequence database to be able to assign that read to a particular organism or, if it maps within a gene, to a particular function.

The way this analysis typically proceeds is through several tiers or steps. Starting with your input metagenome, you do some quality control to make sure that all the reads are of good quality. Here you can see this metagenome is made up of species 1 in red, species 2 in blue, some ambiguous reads that maybe multi-map, and then some novel reads that come from an organism that hasn't been seen before and is not present in the databases.

The first step is called a pre-screen, and the idea here is to map to a handful of genes that are known to uniquely identify species. These are called clade-specific marker genes, and the Huttenhower group at Harvard has a method called MetaPhlAn2 that works very well for this. The idea is to reduce your search space down to a handful of genes to speed up the process of identifying what is present in your sample. As you can see here, gene A is mapped and gene B is mapped, so we know we have species 1 and 2 present, but gene C is not detected, so that species is not present.

Then in the second tier we can do a pangenome search. We can go to our database, get rid of the species that were not detected, include only those that were detected, and then search across the entire genome for those species. So now we're mapping reads to all of the genomes that were detected in the pre-screen. This allows mapping to all of the genes in species 1 and all of the genes in species 2. At
this point we'll have some reads left over that do not map. Some of those are ambiguous and some of those are from a novel organism, and so they go on to our third tier, where we do a translated search. We take the unclassified reads and align them to a comprehensive and non-redundant protein database. This is a search that translates DNA to protein, aligns to protein sequences, and attempts to assign function to these novel reads.

Once you get mappings across genes and species, you can use those mappings to estimate species abundance. For example, here we're showing the mapping to species 1, and we can see we have six reads mapping across three genes, so a rough estimate of twofold coverage, whereas for species 2 we have three reads mapping across three genes, so species 2 is present at only onefold. That's conceptually the way that we begin to assign relative abundances to species in this analysis. You can do a similar thing with functional modules and metabolic pathways. Each step in the pathway is governed by an enzyme; you can look at reads that map to those enzymes, and you can say something about the relative abundance of pathways after averaging across all of the genes in that pathway.

Mapping works best for genes and species that have already been characterized and deposited in public databases. This is an example of a bacterial genome that's 4.6 million base pairs. It's been completely sequenced, assembled, and deposited, so it is a good candidate for mapping against. But we have to keep in mind that there are millions, maybe billions or even trillions, of species that have not been characterized and are not present in the databases, and this presents a challenge. You can see in this graph that going back to 1995 we only had a couple of microbial genomes, and then as of 2016 we had about a hundred thousand total microbial genomes and about ten thousand unique species. So that's another problem, in that the number of unique species is about
tenfold less than the number of sequenced microbial genomes. If you work on projects that are related to human health and medicine, you're in luck, because a lot of genomes have been done for human commensal organisms and also disease-causing bacteria. For example, Staph aureus has over six thousand genomes; that was current as of three years ago, and there are probably many more now. So if you're studying a human-associated microbiome, you're in pretty good shape for mapping, but if you're studying microbiome samples from unusual environments that do not have well-characterized databases, then mapping can become tricky and you're going to miss some things.

The alternative approach to mapping is to assemble de novo the genomes, or pieces of the genomes, of organisms in your sample. We're trying to stitch together sequencing reads to reproduce the original genomes from which they were derived. This is a simplified example, in the English language, of how we would assemble a sentence from its pieces. Think of this sentence as the reference genome, and these pieces as the short reads. The assembly proceeds by finding overlaps between the short reads, so "once upon a," "on a midnight," et cetera. We overlap these short reads and start to build up this reference. You can have problems where you have ambiguity, so you don't know at this point whether it goes "while I pondered" or "while I nodded." If you're not sure which way this should go, perhaps because you have two or more competing genomes in your sample, which is usually the case, you can get challenging situations where complete assembly is not possible. You can also get gaps in coverage, where you simply cannot find reads that connect this bit to this bit, and so you have to stop there and then pick up again and continue assembling.

In practice, assembly is typically done with what's called a de Bruijn graph, which is a technique that's implemented computationally to overlap short
reads and derive a parsimonious sequence from their overlaps, and that is the assembly process. In real life, if you have a simple community, you might be able to get full genomes reconstructed this way, but more commonly we end up with variably sized chunks called contigs, and generally the longer the contig, the better the assembly. Examples of tools that do this are MetaVelvet, MEGAHIT, metaSPAdes, et cetera. This is now the middle of 2019; I'm sure there are even newer tools now, and they are changing all the time.

To wrap up what I've told you in this section of the talk, we're going to compare some of these microbiome sequencing methods to each other, the pros and the cons. Starting with amplicon sequencing over here, the pros are that you don't need very much material for it; it can handle contamination pretty well, because you're amplifying out a particular gene; it gives you approximate taxonomy even for things that you haven't seen before, because you can relate them to other known species in the 16S database; and it's very cheap. The cons are that you have lower taxonomic resolution. The idea of the OTU is that it's not necessarily a species, but a clustering or grouping of highly similar sequences that represents a taxonomic level or feature in your data, so it's sometimes difficult to say that you're looking at a particular species. With 16S you're often limited to bacteria or archaea; if you want to look at fungi, you might look at 18S, for example. You can't see within genomes the way you can with whole-metagenome shotgun, and you can introduce biases from PCR amplification and amplicon copy number variation.

On the whole-metagenome shotgun side, some of the pros of the mapping-based and assembly-based approaches are that you can get very high resolution, down to the subspecies level. You can look at individual genes, you can look at single-nucleotide variants, and you can see everything that's present in the community, including
virus and fungus. Mapping-based approaches can be fast, they scale well to large data sets, and they allow you to find rare entities. Assembly-based approaches can identify novel genes, genomes, and synteny patterns. However, the cons of both of these approaches are that they do not work well if you do not have much input material, and they are expensive. On the mapping side, you're limited to identifying sequences that you have seen before. On the assembly side, a con is that you can miss rare entities, and it is very computationally intensive.

Now I want to talk about some of the properties of microbiome data, because it is quite different from other types of data. Microbiome data is unique in a couple of ways: its sparsity and its dynamic range. If we're looking at a typical microbiome data table, we'll have samples in columns and taxonomic features in rows, along with some sort of metadata. That could be something like body site, for different samples from different body sites, gender of donors, things like this. These rows are different taxonomic features, so this could be genus A1, or family A1, genus B1, et cetera. Then you'll have some counts, or counts per million, or relative abundance in percent, or something like that. Here I'm showing counts per million, and these are normalized read counts. Across this feature it looks pretty evenly distributed, but if you look at this feature you have very high dynamic range, where you have thousands of counts over here in, let's say, the males, and then very few counts in the females. That is high dynamic range. The other problem is sparsity, where a feature might drop out altogether, so you have all zeros over here, then some moderate counts, and then very high counts. Those two properties make this kind of data a little more difficult to deal with.

It's also important to understand that microbiome data are compositional, so we're not looking at
absolute abundances of cells in the experiment. We are mapping with reads, and we're using a fixed sequencing depth per sample, such that the number of reads is much less than the number of cells in the experiment. So we are sampling proportions, not counting absolute values. If we use four reads per sample here, and we look at the same sequencing depth for microbial community 1 and community 2, you can see that even though community 1 is a lot bigger, the proportion of reads will be 50/50 in both, and thus both communities look the same by metagenomic methods.

There are a number of ways that microbial ecologists describe microbiome data, and I think it's important to understand the basics of these as you encounter them in the literature. This is increasingly important for human disease, and you'll probably be seeing more microbiome results in the clinical literature in the future.

The first and simplest way to talk about results from these studies is abundance and prevalence. A microbe that is abundant is highly enriched in one sample, but if it's not prevalent, then it's only found in that sample and not in other samples in the experiment. A microbe that is prevalent but not abundant might be found at some level across all samples. And abundant and prevalent is something that is both found at a high level and also found at that same level across all samples in the experiment.

Now, getting on to this idea of the richness of samples and their distribution of taxa, we have what we call alpha diversity analysis. This looks at both species richness and sample evenness, and it is a within-sample measure of that sample's richness and evenness. The simplest metric for it is simply the number of taxa that are present, so a sample with 100 different taxa is intuitively richer than a sample with three different taxa. There are other types of metrics that are more sophisticated than just counting the number of taxa, but the idea is similar: it's
a way to describe the diversity of each sample. We can see from these real-world data that a soil sample has higher alpha diversity than a freshwater sample, for example, so there's more richness in soil than there is in freshwater, and there's more richness in ocean than there is in freshwater. I'm not showing an evenness metric here, but evenness is a measure of the distribution of organisms in the sample, the extent to which the abundances are evenly distributed throughout the sample. If you have 100 species and they're all equally abundant, that's a very even sample. If you have 100 species and three or four dominate the sample, that is an uneven distribution. There are ways to measure that, and it is another typical metric that microbial community ecologists use to describe samples.

Here's a real-world example of using alpha diversity to infer something about a sample. This is from a paper that Ellen Black wrote on the effect of freshwater mussels on anaerobic ammonia oxidizers and other nitrogen-transforming bacteria in river sediment. She showed that in mussel communities the overall alpha diversity was lower than in no-mussel communities. So those communities where mussels were present had fewer species in them, and they were enriched for certain species, so the evenness was lower as well.

So that's alpha diversity. The other approach is called beta diversity. If alpha diversity is concerned with within-sample richness, beta diversity is concerned with between-sample distance. Now we're looking between samples in the experiment and applying a distance metric to calculate how similar they are to each other. This process starts with the abundance tables that we talked about earlier in the talk, and then you apply a distance metric, like Bray-Curtis, for example. There are many different distance metrics; Bray-Curtis is a very commonly used one, and that gives you a
value that is the distance, or the dissimilarity, between those two samples in the matrix. You can see here that the difference between sample 1 and sample 2 is [unclear] versus [unclear] between sample 1 and [unclear], so they're quite a bit less similar to each other. Then you can transform this table to do dimensionality reduction on it, and you can use principal components analysis. This shows that in this hypothetical case you have three samples that are quite similar to each other and one that is somewhat different. That is the way that we begin to interrogate how different samples are from each other.

Here's another real-world example of an application of this. This is again from the same paper by Dr. Black, and she showed here that samples from the no-mussel communities and mussel-bed communities segregated by the presence of mussels and not by sediment depth. You can see in this ordination plot that there is a pretty clear separation by the presence of mussels, and that tells you something about the effect of mussels on those communities.

Here's another example of an ordination plot on Bray-Curtis distance. This is a t-SNE plot instead of a PCA plot, but the idea is the same: we are looking at clusters of samples that are from the same or similar regions, and the distance between them gives you some sense of how similar they are to each other.

Here's another use of Bray-Curtis distance in the literature. This is a paper that I did with Matthew Nonnenmann, and he was looking at the metagenomic composition of microbial communities in chicken production facilities. This heat map is clustered by Bray-Curtis distance. Each column is a sample, and each row is a species that was detected in the experiment; we're taking the top 20 or 25 most abundant that were detected across all of the samples. Then we are performing a Bray-Curtis distance measure on the samples and clustering on that. The
colors represent abundance, so red is highly abundant, as in here, and black is low abundance. You can see that this distance-based clustering actually segregates the samples by type: all of the litter samples clustered together, the mortality collection samples clustered together, and the settled dust samples clustered together as well.

Finally, I want to talk a little bit about stratification. This is important to understand: if you're prescribing an intervention that affects a particular bacterium in a patient, different patients may react differently to that intervention, because their biochemistry may be carried out by different organisms. For example, if we're looking at adenosine ribonucleotide synthesis (this is real data, again from the HMP project), you can see that this is a case where you have one dominant bacterium per person. These patients are all dominated by Lactobacillus crispatus; others are dominated by other types of Lactobacillus. So if you were to give a drug that hits Lactobacillus crispatus but not these other lactobacilli (I don't know if such a drug exists, but if you were hypothetically to do this), you might see some response from these patients but not from these patients. So it's important to understand how the human microbiome is stratified in terms of performing various biochemical functions or disease-related processes. Something that hits one bug may not be enough to show an effect across all patients, because different people have different bugs doing the same thing.

Here's an example of a mixture of bugs doing the same thing, but the proportion is different for different patients. In this case, purine ribonucleoside degradation, the mixture is present in relatively the same proportion across all the patients. And here you have the same dominant bug in each person basically producing this biochemistry, so here's a case where one drug
should affect all the patients the same way.

So with that, I will summarize what I've told you. I've told you about Dr. Terry Wahls and her miraculous recovery from MS by changing her diet, and the potential link between MS and the human gut microbiome and the presence of various species there that are being studied here at the University of Iowa. I've told you about metagenomic sequencing and how it's typically performed along two lines: amplicon sequencing, done by 16S or 18S, and whole-metagenome shotgun sequencing, where you look at all the DNA in the sample. I've told you about the pros and cons of both approaches: amplicon sequencing is inexpensive and you don't need much material, but you're limited in resolution and it doesn't reveal any functional pathways, whereas whole-metagenome shotgun sequencing surveys all the bacteria, viruses, fungi, et cetera in your sample and reveals functional pathway enrichment and novel chemistry, but it is expensive, and it's computationally and time intensive to analyze. I've told you about the unique qualities of microbiome data: it's sparse, highly dynamic, and stratified. We talked about alpha diversity measures for evenness and richness within samples, and beta diversity analysis for between-sample dissimilarity. We talked about ordination plots for reducing that higher-dimensional data down into a two-dimensional space to visualize, and also heat maps for looking at species abundance. And we talked a little bit about stratification for planning interventions for disease. So with that, I thank you for your attention, and that concludes my talk.
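
As a rough illustration of the alpha- and beta-diversity measures summarized above, here is a minimal Python sketch. It is not taken from the talk or from any particular package: the function names and the toy count vectors are hypothetical, and the Shannon index and Pielou evenness are assumed here as examples of the "more sophisticated" alpha-diversity metrics that the talk mentions without naming. Bray-Curtis is the distance the talk does name.

```python
import math

def richness(counts):
    """Simplest alpha-diversity metric: number of taxa with a nonzero count."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index (assumed example metric): -sum(p * ln p) over observed taxa."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou evenness (assumed example metric): Shannon index / ln(richness)."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples over the same taxa.

    0.0 means identical counts; 1.0 means the samples share no taxa.
    """
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0

# Hypothetical toy samples: counts for the same four taxa in each sample.
even_sample = [25, 25, 25, 25]    # all taxa equally abundant
uneven_sample = [97, 1, 1, 1]     # one taxon dominates

print(richness(even_sample), richness(uneven_sample))
print(pielou_evenness(even_sample), pielou_evenness(uneven_sample))
print(bray_curtis(even_sample, uneven_sample))
```

Both toy samples have the same richness, but the dominated sample scores much lower on evenness, which is the distinction the talk draws between 100 equally abundant species and 100 species where three or four dominate.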

