Transcript for:
Microbiome and Genomic Analysis Lecture

Okay, so from the introductions it sounds like everyone has some experience with microbiome and metagenomic analysis, so this lecture might sound a bit like preaching to the choir. So what I thought I'd do is go over it fairly quickly, but we'll use this opportunity as a discussion, so if there are any points that you want us to go into in more depth, or where you want to bring up your perspective, please feel free to speak up. I also want to say that these lecture slides were mostly taken from Rob Beiko's previous lectures. One year he couldn't make it, so I took over the slides and I got stuck with them, which begs the question of why Rob isn't giving this lecture. So I'll do my best.

Okay, so as you know, you're here for a three-day intensive workshop in microbiome analysis. We broke the workshop into eight different modules, and I'll briefly go over them. The first module is this one right here, where we'll introduce the basic concepts, definitions, approaches and some of the resources available. In Module 2 we'll go into marker gene analysis, mainly based on 16S analysis, and use that to demonstrate how you can measure community diversity or sample diversity, in other words alpha diversity and beta diversity, for the different samples and communities. For Module 3, we will go into PICRUSt, which is a tool that Morgan developed to link marker-based analysis, in other words taxonomic markers, to functional genes, and to infer functions from marker genes. In Module 4 we'll go into shotgun metagenomic analysis, talking about both the taxonomic classification and the functional classification of the samples that you get from metagenomic shotgun sequencing. This used to be two separate modules, but we condensed it into one module to make room for additional topics, and also because, as the tools have improved over the
last few years, it has also streamlined the process quite a bit. Module 5 is new, and Laura will be talking about how you can take metagenomic samples and assemble them (sometimes you have to pre-bin the metagenomic reads before you can assemble them) and how you extract genome sequences, sometimes full genomes, from metagenomic data. Module 6 will be on metatranscriptomics, and John will be covering how you can do RNA-seq analysis. Module 7 is also new, and Rob will be covering it, to give you some more advanced statistical analysis that you can apply, I think, mainly to marker gene data. Are you going to cover any shotgun data as well? Oh yeah, so yes. So it will be an extension of Module 2 and Module 4, to give you some more background on statistical analysis of the datasets that we'll see in this workshop. And Module 8 will be a lecture delivered by Fiona: once you've carried out the sequence analysis, once you've done your statistical analysis, the abundance analysis, differentiating the different microbiome samples, how can you use the results to select biomarkers that can be associated with different conditions, such as diseases or different environmental conditions? Any questions so far? Anything that we missed, or that you think we should have covered?

Okay, so the general learning objective for this entire workshop is to be able to define the different types of metagenomic projects and process the data. There will be a lot of opportunities for hands-on use of the different tools, and we will show you how you can run some standard pipelines for marker gene, metagenomic and metatranscriptomic datasets. We'll also be making these tools available to you so you can replicate the analysis when you go home with your own dataset, and if you want to, you can also have the opportunity to try out your own dataset during this
workshop, if you feel more advanced. Also important: throughout the workshop we will bring up some technical, and sometimes philosophical, limitations of metagenomic studies, so that you're aware of some of the important limitations and do not make predictions or estimations that overestimate the power of metagenomic studies.

For this specific module, you will apply key terms in metagenomics; for example, you will understand what a microbial community is. So how many people have heard of, or been exposed to, the OTU versus ASV debate? How many of you have heard of OTU? Almost everyone. How many have heard of ASV or ESV, or amplicon sequence variant? Much fewer people. So I guess this is something to bring up for discussion in the next session. We'll also show you a few types of main objectives for why we carry out metagenomic studies and, in the remaining modules, how to interpret the content of sequence data. And lastly I'll cover some of the common resources for reference databases and so on.

So the term microbiome has been attributed to Joshua Lederberg by Lora Hooper and Jeffrey Gordon. He defined the microbiome as the collective genome of our indigenous microbes, which used to be called microflora. But as you should all know, bacteria and archaea are not plants, so the term microflora sort of fell out of favor, and it's not really commonly used, not in the microbiome community anyway. The idea is that it takes a comprehensive genetic view of the human as a life form and its microbiome, so it takes a holistic view of the human as an ecosystem. The term microbiota is the actual set of microorganisms that are found in a particular setting. There's a bit of historical confusion about microbiome and microbiota, because some people interpret microbiome as the microbial biome, so they use it to refer to the
organisms. In our case we make the distinction, but sometimes the terms are used interchangeably.

Metagenomics, on the other hand, is quite different from the terms microbiome or microbiota. Jo Handelsman in 1998 actually used it to describe the functional aspect of the microbiome. She was referring more to the metagenome, and wrote about the advances in molecular biology and eukaryotic genomics, which have laid the groundwork for cloning and functional analysis of collective genomes: again, a community-based approach to the soil microflora, and she termed that the metagenome of the soil. So we make the distinction that metagenomics refers more to the functional aspect and takes a shotgun approach to identify the functional genes, rather than the marker-gene-based approach, which typically doesn't give you a functional view of the community but gives you a taxonomic view of the community.

The goal of microbiome studies is to understand the relationships between the microbes and their habitats, including humans, and their effect on our health. To accomplish this we use different molecular biology techniques and computational techniques to make inferences about the community. In this workshop we'll be talking about how you use marker genes to characterize a community, and how you then use PICRUSt to go from marker genes to function. We will also show you how you can do metagenomic analysis using shotgun metagenomics data, and we'll talk about RNA-seq datasets using metatranscriptomics. We will not be covering proteomics or metabolomics types of studies in this workshop. But the point here is that there are many terms now ending with "omics" or "omes" that refer to a community-based or systems-based approach to understanding a community holistically. There's also culturomics, which is about how you culture different organisms
and so on.

So why do we take the metagenomics approach? As you know, most organisms don't live in isolation; they live in a community. The traditional culture-based approach, where you try to isolate a single organism, which is still the dominant practice in diagnostic labs and many other medical microbiology labs, is still very much a clonal, pure-culture view of pathogenesis and disease, and it often misses the intricacy of the interactions between different organisms living in the same community. A while back there was a paper published about the great plate count anomaly, comparing the number of organisms that can be successfully cultivated on a plate versus what's observed under the microscope or observed molecularly. It was estimated that less than 1% of the organisms across habitats can be cultivated. This, of course, is now a big controversy, because people have been systematically trying to cultivate organisms, especially ones found in environments that we care about, such as the human gut or other parts of the human body, and the percentage that can be cultured can actually be higher. More importantly, if you can culture them in a bioreactor, or in some way that allows them to interact with each other, it greatly increases the number of culturable organisms that you can grow in the lab. But in any event, it's not possible to culture all organisms, at least at the present time. So taking an alternative approach, where you can interrogate the community without culturing it, is why microbiome and metagenomic analysis became so popular. As of last year, when I last checked, there were about 25,000 microbiome papers published in the last 10 years, and I'm sure some of you have contributed to that count.

When I was preparing the lecture for some other course, it was during Thanksgiving, so I was looking around for examples and I found this one, which is still
appropriate, I think, given this context. Anyway, there are microbiomes of many different things, and I think throughout this workshop you'll hear a lot more about other studies, but I want to highlight this one because it's a good example where the microbiome of a food product, or a bird, can actually be intricately linked to human health. We care about the turkey microbiome because we know the gut microbiome can drive the development of, and prime, the host immune system, and studies have shown that wild turkeys and domestic turkeys actually have very different microbiomes due to the interference of the agricultural process. Moreover, many of the organisms are unknown and unculturable. Yet we apply low doses of antibiotics as growth promoters, which is still in practice in certain settings, even though I believe it's not allowed in the EU and I'm pretty sure it's not allowed in Canada; if anyone knows, let me know whether that's the case or not. But antibiotics, as we'll see in the Module 2 lab session, can actually affect the diversity of the native microbiome and give pathogens an opportunity to live in an environment where they would otherwise be outcompeted. So opportunistic pathogens, due to the use of antibiotics as growth promoters, can actually affect our food safety, and the study of the gut microbiome of turkeys may actually lead to better ways of enhancing growth without the use of antibiotics.

And just by searching for "turkey", I also came across a study that looks at different Turkish fermented drinks. It's not surprising that a lot of lactic acid bacteria are found in these fermented drinks. I was at an Oxford Nanopore... how many people have heard of MinION or Oxford Nanopore? Okay, so it's a
small device for sequencing. So I was at a short one-day workshop, and the instructors went to a store, bought some kefir, extracted DNA from the kefir and sequenced it in the workshop. So this type of study, which I think was published in 2013 and took months, if not years (well, maybe not years; months) to prepare, can now be done with current sequencing technology in a single day, as a demonstration in a workshop, to show you what kinds of microbes can be found in your kefir, or in your kombucha, which was the other type of drink used to extract DNA for sequencing in that workshop. It really brings home the idea that we now have the tools, both the sequencing tools and, as you'll learn in this workshop, the bioinformatics tools, to really study what types of organisms are in our food and in and on our bodies.

Most of you will have heard this phrase, "most of you is not you". It stems from the observation that most of the cells found on your body are actually non-human cells; the ratio is approximately four to two to one [unclear] in the latest estimation. The microbes in and on your body encode 500 times more genes than the human genome, and they make up about two kilos of your body weight. So how many human genes do we have? Roughly 20,000? Yes, that's about right. So imagine you have about 20,000 or 25,000 genes; the number of microbial genes, collectively, is about two to three million different types of genes.

Okay, so I was referring to the sequencing technology available, which has vastly sped up our ability to generate sequence data and therefore to use sequencing-based or molecular approaches to interrogate the microbiome. Roche 454 is actually rarely in use these days and is sort of a discontinued product with minimal support, and Sanger, of
course, is the traditional sequencing platform. The most dominant sequencing platform, as you may know, is the Illumina series of sequencers, ranging from the desktop MiSeq all the way to large-scale sequencers such as the different versions of the HiSeq. More recently, the so-called third-generation or single-molecule sequencing has been made available publicly. The two dominant ones are the Pacific Biosciences, or PacBio, platforms, which occupy an entire room and require a reinforced concrete floor (I think this thing weighs about a ton or so), compared to the Oxford Nanopore device, which, as you can see from the scale here, is a thumb-drive-sized device that you just plug into a USB port on your laptop to generate sequence data. So the different devices have drastically increased our sequencing capacity in the last few years.

Okay, any questions so far? Any observations so far? So how many people are using Illumina sequencers to generate their data? Almost everyone. Okay, anyone using Oxford Nanopore MinION? Okay, a few. So if you're interested, you might want to talk to each other to share your experience with these platforms, especially the newer MinION platform, and how it goes when you try to run metagenomic samples, or even marker gene or amplicon samples, on these devices.

Okay, so what can we answer with microbiome studies? Roughly speaking, there are four different general questions. The first is just who's there, what's in the microbiome. This can typically be achieved using a marker-gene-based study, but of course you can also do metagenomic shotgun sequencing and infer taxonomic information from the shotgun data, and we'll talk a little bit about that in Module 4. The other general question is what functions are present in the microbiome. The study here is an ongoing study in
Rob's group, looking at the different antimicrobial resistance genes in an elderly population. I'll let you add any comments, but he was pointing out essentially that along the x-axis are the different classes of antibiotic resistance genes and their proportions in the samples. So the different types of resistance genes are present at different levels, but also, as you can see from the height of the error bars, it shows the variation across subjects. Some genes are present in low abundance and are highly variable across different subjects in whether their microbiome encodes these genes, and some other genes seem to be found in all subjects in high abundance, such as metallo-beta-lactamase resistance genes, and some are in between. For some, everyone seems to consistently have the gene and there's low variation across different subjects. Anything you want to add? All right.

So the next question is asking what the functions or the taxonomic profile of the microbiome correlate with. This is looking at the different characteristics of your samples, or of the conditions that you want to study, and correlating your microbiome with those indicators. This is the topic of our statistical analysis module, and it will be brought up in some of the other modules as well. In this particular study, for example, they looked at the correlation between the microbiome, in terms of its diversity, and the pH level of the soil, and you can see there's a nonlinear relationship between the two variables. Another study looked at correlating saliva microbiome similarity with the kissing frequency of, presumably, couples. Who knows.

Okay, so it's not just finding the relationship between microbes and
their environment; it's also possible to use time series to look at how microbiomes respond over time to different treatments. This is a study, in a mouse model, looking at C. diff infection. The mice that are healthy and not infected by C. diff are in this cluster here. We'll talk about this type of display, which is called a principal component analysis, essentially projecting high-dimensional data into a two-dimensional structure so that you're looking at the maximum separation between different groups of organisms. In this corner here are the healthy individuals, and in this group here are the ones that have been treated with antibiotics. What's interesting is that this is the group that is persistently shedding C. diff and has some clinical signs of infection. The researchers then introduced fecal samples from the healthy mice into these C. diff-infected mice, and, as you can see, the small numbers here indicate the time points in the study. You can see that over time, as the number increases, the C. diff-infected mice gradually become more and more similar to the truly healthy ones. So by day fourteen they have a similar microbiome profile to the healthy individuals, showing that the fecal transplant was able to improve their health, and they no longer look like the persistent shedders.

Okay, so as I mentioned in the introduction, I want to give a bit of historical perspective on how metagenomic studies came about. It really started with the development of different sequencing technologies, allowing us to look at DNA, the genetic material, as a proxy for phenotypic studies of these organisms. In the 70s, the Sanger sequencing technology was developed, along with some alternative sequencing technologies, and shortly after that it
was applied to different communities, using a marker gene to identify the different organisms in the community. Towards the end of the 1970s, one of the first (at that time it was not called bioinformatics) sequence analysis toolkits, or software packages, was developed, called Staden. People were quite optimistic about the technology development, and Staden, who developed the package, made this observation: DNA sequencing is now a fast procedure, and the availability of computers gives the possibility of a more efficient overall strategy for sequence determination. Of course, compared to what we can do nowadays, this is considered a low-throughput technology. I'm just encouraging you to think that five, ten, or maybe twenty years from now, with technology improvements, we will be looking back at our current technical challenges and thinking we could have achieved a lot more. But the field and the technologies are moving fast, so problems that you might not be able to solve today may be solvable tomorrow. So don't get discouraged; focus on what you can solve and what you can accurately interpret with the data, given the current technology limitations, namely the short reads, the inaccuracy of reads, and so on.

Okay, so in the 80s, as I mentioned, people were looking at different communities and finding marker genes as a way to characterize those communities. This was primarily an effort out of Norman Pace's group, where his group looked at different low-complexity communities and was able to extract enough DNA, clone it and sequence it. You see sequences like this, comparing a known sequence that's in the database, with a given name, to an unknown sequence, a query sequence that's in your sample, and through this type of similarity search comparison you can then infer what's found in the community of interest. By the 90s, Sanger sequencing had been improved, so now you
have capillary sequencing, and you're able to do 96 or 384 sequences in a single run. This development led to, for example, different studies using 16S as a marker to look at different communities. It's also the era of whole-genome sequencing: you have enough throughput now that, instead of sequencing single marker genes, you can take a shotgun sequencing approach and sample the entire genome. So 1995 is when the first bacterial genome was sequenced, assembled and published. Metagenomics as a term was defined towards the end of the 90s, and this is also when Illumina was founded.

The 2000s are arguably the early age of microbiome studies. People were applying early next-gen sequencing, and also Sanger sequencing, to different communities. One of the very well-known ones is the Sargasso Sea expedition led by Craig Venter, essentially going around, I believe, the Gulf Coast at that time, extracting DNA from seawater and sequencing it to identify what kinds of bacteria and archaea are found in the seawater. The acid mine drainage is another interesting study; they looked at a low-complexity community, and I think this is one of the first papers, if not the first paper, to show that you can actually assemble a complete genome from metagenomics data if the community complexity is low enough. And with sequencing becoming increasingly less expensive, there are commercial offerings, and even citizen-science, nonprofit offerings, that let you sequence your own gut, or sequence your cat, or sequence your dog. There's even something called a second genome project that looks at your microbial community, and for a low fee you can actually pay these companies to sequence it.

Okay, so for the last bit I'll move into some of the major concerns with metagenomic analysis. So, at the top of the list,
this is data quality, which was alluded to. Sequencing is not error-free, and depending on the sequencing platform you can have very accurate sequencing, such as that generated on Illumina, with less than 0.1 percent error rate due to substitutions. As a side note here, in Illumina reads the quality drops as the read gets longer, so towards the end of your read the error rate is significantly over 0.1%; this figure is the average error rate across sequences. The single-molecule sequencing platforms such as PacBio and MinION, however, have significantly higher error rates, ranging from 10% to 15%. So imagine one out of ten bases in your sequence is incorrect; in those cases you need to learn how to interpret those results and how to correct for the errors. For amplicon studies, chimeras can be an issue. There's about a 1% chance of getting chimeric reads, and this is when, during the PCR reaction, two or more templates are combined artificially into a single amplicon. There are tools that will help you detect chimeric reads.

The other data quality issue is associated with the metadata, or contextual information, about the sequence data that you're generating. How many of you have gone into a public database trying to find a study similar to yours, or one that you're interested in, or read a paper and said, okay, this looks like interesting data, I want to download it and try it, and when you go to, say, NCBI, you realize that the metadata found in the paper and the metadata found in the public archive are essentially not matching? So you either have to contact the authors or just give up on that dataset. How many of you have tried that and failed? Okay, so yeah, right. So I just want to highlight the importance of metadata. Don't fear that people might scoop your work or scoop your data. I think there are enough samples out there for
everyone to sample. It's important to be able to reuse the data that has been generated to help in your own study or your own interpretation. So think that way, and when you deposit your own data into the public repositories, make sure it's easy for other people to reuse.

There are community standards that can help you make the metadata more consistent. I won't go into the details here, but roughly speaking they consist of a minimum checklist asking you to specify some key information about your study. In addition to that, depending on the environment that you're trying to study, or the sample types, there are also specific environmental packages with additional data fields that would be good to specify, so other people can reuse your data without having to recompile the metadata themselves. If you go to this website, it will give you a spreadsheet of all the data fields. What these metadata standards don't really enforce is the values you put into the fields. Some fields are easier to enforce, such as date formats or specific measurements in specific units and so on, but there is still a lot of free text. So some of the work that I've done in my group is trying to improve the terminology used in these metadata standards, to ensure that you describe the same things consistently through the use of controlled vocabularies and ontologies. Recently there was a paper published by Mark Wilkinson et al. on what are called the FAIR principles, which is getting a lot of notice, defining how a dataset should be stored and curated to ensure that it's findable, accessible, interoperable and reusable by others. More and more funding agencies are actually looking at the FAIR principles as an indicator of how data generators should behave, or what they should try to achieve with their datasets.

Okay, so another
major concern of metagenomic analysis is the comparability or reproducibility of the data. As I mentioned already, often the public datasets that you want to use for your own comparisons are essentially not usable. In many cases, even if you want to reproduce an experiment using similar sample types and a similar process or similar SOPs, it is still difficult to reproduce. Some of the factors affecting this are the use of different marker genes or different marker regions, which can affect the result of your microbiome study. The different sequencing platforms and sampling conditions can also give different results, and we'll talk a little bit about that later. And lastly, the workflows are often ad hoc, so a lot of the time the details of a workflow, such as the parameters used and so on, are not kept. In this workshop we'll actually show you some ways to help you publish the workflows as well as the results, so you can keep track of the analysis you did, the datasets you used, and so on.

Okay, so another concern is the linkage and resolution issue. The current technology, especially marker-gene-based analysis, only looks at a small region of the 16S gene, and even if you look at the entire 16S gene, it often doesn't have the resolution to differentiate different strains within a species. So the strain-level diversity in metagenomes will often be missed, due to the difficulty in either interpreting 16S genes or, when it comes to shotgun sequencing, the inability to assemble or reconstitute your genomes at the strain level. I think Laura will touch on this a bit more in her lecture, and we will talk about the pros and cons of assembling your metagenomic reads and how to interpret the quality of your assembly.

Okay, so another issue concerns taxonomy and OTUs. Taxonomy is the names you give
to an organism or group of organisms, and as mentioned already, a lot of the organisms in your samples are unknown; in other words, they don't have a name. So the approach that's taken to deal with that is essentially to give them an OTU, or operational taxonomic unit, as a placeholder for a proper name. The issue with OTUs is that they're not correlated with the function or the phenotypes of the organism, and the threshold is often arbitrary: often 97% sequence similarity is used as the threshold for grouping organisms. We'll talk in Module 2 about why that's an issue.

Okay, so the last concern is with functional annotation. Again, there are many genes of unknown function, hypothetical genes, in your dataset. Some of the studies shown here indicate that on average, even with some detailed annotation in terms of molecular functions, there is still a large proportion of genes that have unknown function. In this case roughly 60% of genes have an annotation but the rest don't, and when it comes to biological processes the proportion is even lower. So when you do a metagenomic study, especially with environmental samples, more often than not you're dealing with genes that simply don't have an equivalent in the database, and therefore you will not be able to use similarity search to identify the function of such genes. You might need to look at the correlation of that gene with your sample context to try to understand its possible functions, or you might need to do some pathway studies to try to infer the functions of genes of unknown function.

Okay, so given the time, I'll go through the resources very quickly. This is really just to highlight some of the common reference databases that you can use for your analysis. For 16S, the most common ones out there right now are the SILVA and Greengenes datasets. These are curated 16S sequences and
other marker gene sequences that you can compare your samples to. Often you might also be interested in whole genomes as reference genomes for your metagenomics dataset. NCBI GenBank has a list of curated genomes, and over the last few years the microbiome research community has been systematically trying to sequence genomes from common metagenomic samples in order to improve the reference datasets available in these genomic databases. PATRIC, in addition to being a repository, also provides some tools that allow you to study genomes.

For metagenomics, again there are several repositories. For the Human Microbiome Project, the data is archived at what's called the DACC, the data coordination center for the Human Microbiome Project. It has a wealth of information, ranging from the SOPs to the different datasets available, and you can also request access to some metadata through this portal. EBI and NCBI also have metagenomic read archives that you can access. MG-RAST is another resource, but one cautionary note is that the tools provided by MG-RAST often over-predict, so be careful when you use the tools provided within it.

For functional studies, I'm just highlighting a few. For metabolic pathways, you can use KEGG to annotate your own genes. For protein family analysis, you can go to UniProt for reference proteins. CARD is an antimicrobial resistance database that Justin here actually was helping to build, and it allows you to curate the different antibiotic resistance genes found in your samples. As Rob pointed out, it can also have some annotation issues, for example if you make the wrong prediction, but overall it is a high-quality, manually curated dataset and database for antimicrobial resistance genes. And Gene Ontology provides consistent naming schemes for the different
functional genes and functional proteins. Okay, any questions? If not, we can have coffee.
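
To make the 97% similarity threshold for OTUs mentioned in this lecture a little more concrete, here is a minimal Python sketch of greedy, seed-based grouping. It is a toy illustration only and is not the method of any particular tool covered in the workshop: the function names `identity` and `greedy_otus` and the example sequences are made up for illustration, and it assumes the sequences have already been aligned to the same length, whereas real pipelines handle alignment and clustering in more sophisticated ways.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Assign each sequence to the first OTU whose seed it matches at >= threshold.

    Returns (seeds, assignment): one seed sequence per OTU, and the OTU index
    of each input sequence in input order. Hypothetical helper for illustration.
    """
    seeds = []
    assignment = []
    for s in seqs:
        for i, seed in enumerate(seeds):
            if identity(s, seed) >= threshold:
                assignment.append(i)
                break
        else:
            # No existing seed is similar enough, so this sequence starts a new OTU.
            seeds.append(s)
            assignment.append(len(seeds) - 1)
    return seeds, assignment


if __name__ == "__main__":
    ref = "ACGT" * 25                # 100-base toy sequence
    two_diff = "NN" + ref[2:]        # 2 mismatches against ref (98% identity)
    five_diff = "NNNNN" + ref[5:]    # 5 mismatches against ref (95% identity)
    seeds, assignment = greedy_otus([ref, two_diff, five_diff], threshold=0.97)
    print(assignment)
```

Rerunning the same toy input with a lower threshold (for example 0.95) changes which sequences are grouped together, which is one way to see why a fixed cutoff is somewhat arbitrary.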