Phylogenetic Tree Building in MEGA X

Good afternoon. My name is Claudia Russo, and I am an associate professor at the Federal University of Rio de Janeiro in Brazil. I'm going to talk to you about building trees with MEGA X, and a little bit about alignment as well, so I'm going to pick up from where Bryan left us.

So let's build phylogenies in MEGA. Well, the most commonly used method to build a phylogeny is to press the Enter key as many times as you need until a tree pops up on the screen. When it does, you just print it out and then submit the paper. But there is an alternative. Define your problem very well, select taxa and markers, and then download the sequences. Perform a multiple alignment, which will be useful to establish homology among parts of different sequences. Then comes tree building, so build the best possible tree, and then topological tests to verify the consistency of clades. And then the actual intellectual part of your paper starts: you start interpreting your tree and extracting biological information from it. So that's where it all begins.

So, how to do an alignment in MEGA. First of all, we need to find out which part of one sequence is homologous to which parts of all the other sequences. Homology is not equal to similarity, although many people confuse these two concepts. Homology is similarity due to common ancestry. So two sequences, or two parts of an organism, are homologous if their similar parts have been there since the common ancestor. For sequences, we assume that the higher the similarity between sequences, the higher their chance of being homologous, such as in this alignment down here.

An important note is that performing a multiple alignment does not test for homology; it assumes homology. So when you select your sequences, you need to be able to assume this homology before you run the multiple alignment. In this case here, this alignment was performed on purpose: for these six birds I selected different mitochondrial
genes for each species on purpose and ran the alignment to check whether my alignment program would protest or warn me that these are not homologous. But it doesn't. It does perform the alignment, the best alignment possible for these non-homologous sequences. If you look at it from a distance and you run a tree with it, it will build the tree, and of course the tree is meaningless, because there is no homology there, so there is no historical information in comparing these sequences. So it's important that you know that your sequences are homologous before you run the alignment.

In this case, this is a bad alignment. This is the env, the envelope gene of HIV, so here I have a protein-coding sequence. My alignment program put a gap here at this position for seven of the sequences, and for the eighth sequence it put the gap over here. That means that my alignment is perfect on this side and perfect on that side, but here we have a problem. One solution to this problem is to always, always use codon-based alignments. That means that you take your nucleotide sequences, translate them into amino acids, align these protein sequences, and translate back to the nucleotide sequences. Then you avoid the problems that I mentioned before. So here it would be something like how the program will do this.

In MEGA you would do it like this: start building your alignment, as Bryan told us, and retrieve sequences from a file. Here MEGA gives you four options for the alignment: align by ClustalW or by MUSCLE as a regular alignment, or align by ClustalW using the codon-based alignment, or align by MUSCLE also using the codon-based alignment. If you run this using the default parameters for the alignment, you'll have the final alignment here. This is for an example file from MEGA. About these parameters: we usually don't modify those, and
people say that if you modify the alignment parameters and your alignment changes, that illustrates a poor alignment and a poor homology hypothesis for your sequences. If the homology is straightforward, which means that your sequences are very similar, then modifying these alignment parameters wouldn't modify the final alignment anyway. So we should just use the defaults.

Now, after the alignment, we should find out how these sequences evolve. We have different models that may illustrate how your sequences evolved. The simplest model is the Poisson model, also known as the Jukes-Cantor or one-parameter model. On this axis is the actual divergence between two sequences, so this genetic distance starts from zero and goes all the way to infinity here, and this is the observed difference, the percentage of differences that you observe. See that it flatlines here at 75%, because of what Bryan was telling us: we have four different nucleotides, so two independently produced sequences with four different nucleotides will on average be similar at 25% of the residues, or dissimilar at 75% of the residues. So estimating these observed differences is actually useful, to determine how far into this flat line your sequences are, or how divergent these sequences are. This is the distance formula that is used for distance-based methods, but this model may be used for likelihood and other methods also.

A more complex model would be Hasegawa-Kishino-Yano. Actually there are many; I just selected a few here. This model is more complex. It has more parameters that are estimated from the data, such as the proportions of A, T, C and G and different rates for transitions and transversions, and it should be used if your sequences are more divergent than those that
you would use for the one-parameter model. And then the favorite model for everybody is the GTR, which uses many parameters, the A, T, C and G frequencies plus rates for all nucleotide changes, and it would be used for the highest divergence between sequences.

On top of those you may or may not use the gamma parameter, which measures how similar evolutionary rates are when we compare different sites of your molecule or of your alignment. This would be a graph for different values of the gamma parameter. If you have just about one rate, all the rates would fall in a sort of normal distribution around some kind of mean. But if rates among different sites of your sequences are very divergent, then you'd have a shape such as this one, the green one. So the smaller the gamma parameter, the more you need to correct for that. Gamma is not another model; it's something that you place on top of your models. So you can use GTR plus gamma, you can use Poisson plus gamma, et cetera.

So, selecting models in MEGA X. You can do it manually. We will use Explore Active Data, which is a very useful tool, and compute pairwise distances first, in the Distance menu. I was telling you that calculating the estimated distances is very useful. Here, these are distances for these Drosophila flies and for the Adh gene. We can see that the distances between my sequences vary from 0.06 to maybe 0.21, 0.24, 0.25 here. So these are relatively not-so-divergent sequences. That would put me at a point on my curve at which the observed difference, which is this distance matrix, would be more or less what actually happened if you consider multiple hits, that is, multiple, backward and convergent substitutions. So I advise you, even if you're not using a distance-based method, to estimate distances to
avoid placing a sequence that is too divergent in your data set. For instance, one sequence that has a distance of 0.7 or 0.65, you would spot it right here.

Then, going back to the Sequence Data Explorer will bring you to this window, and in this window you can do a lot of things. You can unselect taxa, for instance if you have noticed that a sequence is very divergent from the others, and if you unselect it here it will be excluded from all the subsequent analyses that you make. You can edit and select taxa groups. For instance, here I have created a taxa group which I call Hawaiian Drosophila, and I placed [unclear], mimica, adiastola, [unclear], crassifemur and albovittata into this group. Now MEGA will see those taxa as members of this Hawaiian Drosophila group, and I can use this to estimate distances between groups and distances within members of a single group, and compare those averages, for instance.

I can edit and select genes and domains here. I can add a domain: for instance, if my sequences are protein-coding sequences with introns and exons, I can tell MEGA that some portions of my sequences are non-coding and other portions are coding, and I can eliminate some of those when I'm doing the phylogeny. By pressing this button I can see all my sequences without the identical symbol. I can translate to amino acids, so I can check that the translation is perfect, as it should be, before I run an analysis. I can compute the number of conserved sites; here is the number of conserved sites out of the set, so I have 421 conserved sites out of 762. Then the number of variable sites and parsimony-informative sites, and I have the different types of degenerate sites, so 0-fold degenerate sites, 2-fold degenerate sites and 4-fold degenerate sites; here is the 4-fold. And I can compute some statistics, for instance nucleotide compositions, which for a protein
coding sequence I'll have species by species on the different lines here, and I'll have them overall and separately for all three codon positions. You should notice that the third codon positions are the ones that vary most, so you have to be more careful about this one. You can use this to select a proper model: if these numbers vary a lot, you should select an evolutionary model that takes the different base compositions into account.

I can estimate nucleotide pair frequencies, which will also be useful for selecting among different transition/transversion ratios. Here it's the same: the average for all positions together, and separately for the first, second and third positions. And here are the pairs. This is, when I have a T in one sequence, how many times I have another T in the other sequence, comparing pairwise. And here, when I have a T in one and a C in the other, I have this number; then when I have a C in one and a T in the other, and so on and so forth. Then I can use this data: if transitions are more frequent than transversions, I can use this to select a model that takes this into account also.

And the codon bias also. This statistic will give me the relative synonymous codon usage. We would expect, if things were random, that if you had four different synonymous codons, each one of those would be used around 25% of the time. But this is often not how it happens, so this statistic gives you a portrait of how different the usage is from the expected average. MEGA also does that for you automatically.

So, finding the best DNA and protein models by ML: you do this by selecting this menu, the Models menu, and then you just press it and you have the result. The top result would be the best choice, not necessarily the model with the most parameters, but usually one of the most parameter-rich models. So in this case, for this
Drosophila data set, Tamura-Nei 93 plus gamma would be our model of choice.

So, the tree. You can construct and build the tree in MEGA, you can open a previously saved tree or a user tree in MEGA and do analyses with it, or you can use your data to generate, for instance, likelihood-based trees, in which you have a given tree and a model, a substitution model, and you estimate the probability of observing the alignment based on these two things, the tree and the evolutionary model. What the algorithm does, if you have the model selected previously, is run through the possible trees using heuristic algorithms and varying branch lengths, to determine the tree that would fit your alignment best.

You would do this using the Phylogeny menu, Construct/Test Maximum Likelihood Tree. You may use the Jukes-Cantor model, which was the one that I explained previously; the Kimura 2-parameter, which corrects for transition and transversion ratios; the Tamura 3-parameter; Hasegawa-Kishino-Yano, which I also talked about; Tamura-Nei; and the GTR model. You can select each one of those here, and all these boxes are selectable items in MEGA. You can select the substitution model, you can select rates among sites, you can use gamma-distributed rates, and you can add invariant sites to it or not, which would be another category. Here you can include first, second and third codon positions. Sometimes, when third codon positions are saturated, you might want to exclude the third codon position and then rerun your tree, to check whether homoplasy is interfering in your final tree, sometimes lowering bootstrap values. You can use your non-coding sites, which you had already selected as such in the Data Explorer.

And you can select the heuristic method for the maximum likelihood tree. The maximum likelihood method is not exhaustive, so it's always heuristic. You may select less exhaustive methods such as the NNI,
nearest-neighbor interchange, or more exhaustive and time-consuming methods such as SPR level 3 and SPR level 5, which would be the most exhaustive here. You can also select how you're going to input the initial tree for the maximum likelihood. That means the maximum likelihood algorithm will start from a tree and start rearranging branches on that tree, computing likelihood values and selecting those trees with the highest values. Here you can select the type of the initial tree for the maximum likelihood. The default is NJ and BioNJ, but you can select maximum parsimony only, neighbor-joining only, BioNJ only, or you can input your own tree or use the topology editor.

When you start computing the ML tree, if you don't use the bootstrap, it doesn't take that long to run. But if you use the bootstrap, it might take a long time to compute the maximum likelihood tree. And this is how MEGA presents the tree. Here MEGA automatically generates the tree caption, with all the methods that were used to generate it, and gives you the references for the methods as well.

You can also use distance-based methods, in which you take your alignment and transform it into a pairwise distance matrix and use this to build the tree. For instance, the neighbor-joining algorithm: you start with a star tree, isolate two neighbors, and compute the total sum of branch lengths in this tree with 1 and 2 separated as neighbors. You compute this as S12, and you calculate S for all possible neighbors. That means I'll compare S12 with S13, S14 and so on, until all possible pairs are compared, and I'll select the pair of neighbors that minimizes S. Then I'll group the pair into a single compound OTU and repeat all the steps until the tree is fully resolved. So here, in the first step I have 1 and 2 joined; in the second step
I have the 4... the 5 and 6 joined, and then in the third step I have 1 and 2 joined with 3, and so forth, until the whole tree is completely resolved.

So MEGA can estimate the neighbor-joining tree also. You can perform a test of phylogeny, and I'll talk about this more the day after tomorrow, on Wednesday. We can use the bootstrap or the interior-branch test. We can also select different substitution models or the p-distance. We can include only transitions, only transversions or both kinds of substitutions. We can also include the gamma parameter. We can select complete deletion, pairwise deletion and partial deletion, which are different ways of treating gaps and missing data in the alignment. We can also select first, second and third codon positions and non-coding sites. And voilà, the neighbor-joining tree. Here are the tree caption and the references.

You can press this button here to root the tree. In this case, Scaptodrosophila lebanonensis would be the outgroup, so if you press this button when this branch is selected, it will automatically root the tree on this branch. You can also start looking a little bit more into your tree and presenting it to the reader in a way that passes on more information than only the topology and the names. For instance, if I select this interior branch and press this button, this window will pop up on the screen, and I can place the name Hawaiian Drosophila here. This will automatically give me a tree with this group labeled, as shown here, which would be useful for the discussion in my paper, et cetera. MEGA has many different kinds of tools to present your tree, so you can input images, you can color-code different branches, and a little bit more about this.

[Applause]

... sequences have a base. So then you can say my cutoff is 50%. [unclear] If you say about 95%, then only the sites that have maybe 5% or less missing data are kept, and you are close to
complete deletion, where basically at any position, so that's a hundred percent, every sequence at that site has to have a base. So as you generate these big phylogenomic data sets, which contain a huge amount of missing data in different parts of the alignment, it's good to maybe select 70%, I mean 50% minimum I would say, so that 70% of the sequences have a base at each position. You will be surprised that for a lot of the data sets that are published in big journals, if you apply a 50% cutoff, not many sites will be common sites. They look big, but they don't really have enough information to resolve big things. So it is very, very good to check the robustness of your results at different cutoff levels, so that the missing data does not have a big impact. It's also useful because when you have gaps, sometimes you have more variability there.

So if you ask overall how good the alignment is, there's no easy way to answer that. What GUIDANCE does is take your alignment and evaluate it. It goes through it and colors various regions according to the quality of the alignment of that region. It also colors sequences, so you may go through it and it says you just don't want this sequence in there, it is just mismatching many things. So often, okay, if I take that sequence out, the quality of my alignment improves. Or you may say, you know, this part of it is just terrible, it just doesn't align very well. Depending upon what your purposes are, you may be able to just cut that out, and GUIDANCE gives you nice tools to be able to say: take this alignment and cut out this badly aligning region. So it's not a substitute for [unclear], but it's an additional tool that you can use. Just Google "guidance alignment" and you get there, G-U-I-D-A-N-C-E, and use that to get some idea of the quality of the alignment and where the good parts are. Often you don't get anything better or more robust; it may even make it worse by deleting regions. You have many different ways of doing it. [unclear] estimation,
then you really need to do many things. If an approximate tree is okay for you, and you just need some information about the data, then in that case you may not actually spend so much time. There are many tools, GUIDANCE, or [unclear], and Gblocks in the past, and some others that you can use to evaluate the quality of the alignment. Anyway, if you have any questions, or you want to see how this works, we have a computer here and just, you know...
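
As a side note to the talk, two of the quantities discussed above can be sketched in a few lines of Python: the site coverage cutoff used for partial deletion, and the Jukes-Cantor correction of the observed difference (the curve that flatlines at 75%). This is a minimal illustration under stated assumptions, not how MEGA implements either step. The function names, the toy alignment and the set of symbols treated as gaps or missing data are all hypothetical.

```python
import math

# Assumption: these symbols mark gaps or missing data in the alignment.
GAP_CHARS = set("-?N")


def site_coverage_filter(alignment, cutoff=0.5):
    """Keep columns where at least `cutoff` of the sequences have a base.

    cutoff=1.0 behaves like complete deletion; lower values keep more sites.
    `alignment` is a dict mapping sequence name to an aligned string.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = []
    for i in range(length):
        with_base = sum(alignment[n][i].upper() not in GAP_CHARS for n in names)
        if with_base / len(names) >= cutoff:
            keep.append(i)
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, skipping sites without a base."""
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3), undefined for p >= 0.75."""
    if p >= 0.75:
        raise ValueError("observed difference is at or beyond saturation (0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical toy alignment, only to show the calls.
toy = {
    "seq1": "ACGTACGTAC",
    "seq2": "ACGTACGTTC",
    "seq3": "ACG-ACGAT-",
}
strict = site_coverage_filter(toy, cutoff=1.0)   # like complete deletion
relaxed = site_coverage_filter(toy, cutoff=0.5)  # partial deletion at 50%
p = p_distance(toy["seq1"], toy["seq2"])
print(len(strict["seq1"]), len(relaxed["seq1"]), p, jukes_cantor(p))
```

Running the filter at several cutoffs and comparing the resulting trees is one way to do the robustness check recommended in the discussion. The `ValueError` at 0.75 mirrors the flat line on the curve: once the observed difference reaches saturation, the corrected distance can no longer be estimated.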

Transcript for: Phylogenetic Tree Building in MEGA X