StatQuest! It's coming at you! StatQuest! It's gonna find you! StatQuest! Watch out! Hello, and welcome to StatQuest. StatQuest is brought to you by the friendly folks in the genetics department at the University of North Carolina at Chapel Hill.

Today we're going to be talking about linear discriminant analysis, which, let's be honest, sounds really fancy. And it kind of is, but not really; I think we can understand it. Let's see what it does, and then we'll work it out. That is, let's look at some examples of why we might need linear discriminant analysis, and then we'll talk about the details of how it works.

Imagine that we have a cancer drug, and that cancer drug works great for some people, but for other people it just makes them feel worse (wah-wah). We want to figure out who to give the drug to: we want to give it to people it's going to help, but we don't want to give it to people it might harm. And since I'm a geneticist and I work in a genetics department, the way I answer all my questions is to look at gene expression. Maybe gene expression can help us decide.

Here's an example using one gene to decide who gets the drug and who doesn't. We've got a number line; on the left side we've got fewer transcripts, and on the right side we've got more transcripts. The dots represent individual people: the green dots are people who the drug works for, and the red dots represent people whom the drug just makes feel worse. We can see that, for the most part, the drug works for people with low transcription of gene X, and, for the most part, the drug does not work for people with high transcription of gene X. In the middle we see that there's overlap, and that there's no obvious cutoff for who to give the drug to. In summary, gene X does an okay job of telling us who should take the drug and who shouldn't. Can we do better?

What if we used more than one gene to make a decision? Here's an example of using two genes to decide who gets the drug and who doesn't. On the x-axis we have gene X, and on the y-axis we have gene Y. Now that we have two genes, we can draw a line that separates the two categories: the green, where the drug works, and the red, where the drug doesn't work. And we can see that using two genes does a better job separating the two categories than just using one gene. However, it's not perfect. Would using three genes be even better?

Here I've got an example where we're trying to use three genes to decide who gets the drug and who doesn't. Gene Z is on the z-axis, which represents depth, so imagine a line going through your computer screen and into the wall behind it. The big circles are the samples that are closer to you, and the smaller circles are the samples that are further away along the z-axis. When we have three dimensions, we use a plane to try to separate the two categories. Now, I'll be honest: I drew this picture, but even for me it's hard to tell if this plane separates the two categories correctly. It's hard for us to visualize three dimensions on a flat computer screen; we'd need to be able to rotate the figure and look at it from different angles to really know, and that's tedious.

What if we need four or more genes to separate two categories? Well, the first problem is that we can't draw a four-dimensional graph, or a 10,000-dimensional graph. We just can't draw it. That's a bummer (wah-wah). We ran into the same problem when we talked about principal component analysis, or PCA, and if you don't know about principal component analysis, be sure to check out the StatQuest on that subject; it's got a lot of likes, and it's helped a lot of people understand how it works and what it does.
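To make that one-gene-versus-two-genes idea concrete, here's a minimal Python sketch (the expression values are made up, and none of this code is from the video): a single gene leaves the two groups overlapping on the number line, while a simple line that combines both genes separates them better.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up groups of patients, each measured on gene X and gene Y.
works     = rng.normal(loc=2.0, scale=1.0, size=(50, 2))  # drug works
no_effect = rng.normal(loc=4.0, scale=1.0, size=(50, 2))  # drug makes things worse

def overlap(a, b):
    """Fraction of 'works' values that land past the lowest 'no_effect' value."""
    return float(np.mean(a > b.min()))

# Gene X alone: how much do the groups overlap on the number line?
print("overlap using gene X only:    ", overlap(works[:, 0], no_effect[:, 0]))

# Both genes: score each sample with a simple line through the plane
# (here x + y = constant), which uses the information from both genes.
print("overlap using gene X + gene Y:", overlap(works.sum(axis=1), no_effect.sum(axis=1)))
```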
PCA, if you remember, reduces dimensions by focusing on the genes with the most variation. This is incredibly useful when you're plotting data with a lot of dimensions (a lot of genes) onto a simple x-y plot. However, in this case we're not super interested in the genes with the most variation; instead, we're interested in maximizing the separability between the two groups so that we can make the best decisions.

Linear discriminant analysis (LDA) is like PCA: it reduces dimensions. However, it focuses on maximizing the separability among the categories. Let's repeat that to emphasize the point: linear discriminant analysis (LDA) is like PCA, but it focuses on maximizing the separability among the known categories.

Here we're going to start with a super simple example: we're just going to try to reduce a two-dimensional graph to a one-dimensional graph. That is to say, we want to take this two-dimensional graph (a.k.a. an x-y graph) and reduce it to a one-dimensional graph (a.k.a. a number line) in a way that maximizes the separability of the two categories. What's the best way to reduce the dimensions? Well, to answer that, let's start by looking at a bad way and understanding what its flaws are. One bad option would be to ignore gene Y; if we did that, we would just project the data down onto the x-axis. This is bad because it ignores the useful information that gene Y provides. Projecting the data onto the y-axis (i.e., ignoring gene X) isn't any better.

LDA provides a better way. Here we're going to reduce this two-dimensional graph to a one-dimensional graph using LDA. LDA uses the information from both genes to create a new axis, and it projects the data onto this new axis in a way that maximizes the separation of the two categories. So the general concept here is that LDA creates a new axis, and it projects the data onto that new axis in a way that maximizes the separation of the two categories. Now let's look at the nitty-gritty details and figure out how LDA does that.

How does LDA create the new axis? The new axis is created according to two criteria that are considered simultaneously. The first criterion is that, once the data is projected onto the new axis, we want to maximize the distance between the two means. Here we have a green mu, the Greek character representing the mean for the green category, and a red mu, representing the mean for the red category. The second criterion is that we want to minimize the variation, which LDA calls "scatter" and represents with s², within each category. On the left side we see the scatter around the green dots; on the right side we see the scatter around the red dots.

And this is how we consider those two criteria simultaneously: we take the ratio of the difference between the two means, squared, over the sum of the scatter:

(μ_green − μ_red)² / (s²_green + s²_red)

The numerator is squared because we don't know whether the green mu is going to be larger than the red mu or the red mu is going to be larger than the green mu, and we don't want that number to be negative; we want it to be positive. So whatever it is, negative or positive to begin with, we square it and it becomes a positive number. Ideally, the numerator would be very large (a big distance between the two means), and ideally the denominator would be very small (the scatter, the variation of the data around each mean in each category, would be small). Now, I know this isn't a very complicated equation, but to make things simple later on in this discussion, let's call the difference between the two means d, for distance, so we can replace the difference between the two means with d.
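Here's a hedged Python sketch of that criterion (the data and the brute-force angle search are just my illustration; real LDA solves for the axis analytically rather than searching): project the points onto a candidate axis, score the axis with d² / (s²_green + s²_red), and keep the axis with the biggest ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
green = rng.normal(loc=[2.0, 3.0], scale=0.5, size=(50, 2))  # category 1
red   = rng.normal(loc=[4.0, 5.0], scale=0.5, size=(50, 2))  # category 2

def lda_ratio(theta):
    """Score one candidate axis: d squared over the summed scatter."""
    axis = np.array([np.cos(theta), np.sin(theta)])  # unit vector for the new axis
    g, r = green @ axis, red @ axis                  # project every point onto it
    d_squared = (g.mean() - r.mean()) ** 2           # distance between means, squared
    scatter = g.var() + r.var()                      # s^2 within each category, summed
    return d_squared / scatter

# Brute-force search over candidate angles for illustration only.
angles = np.linspace(0.0, np.pi, 1000)
best = angles[np.argmax([lda_ratio(t) for t in angles])]
print(f"best new axis is at {np.degrees(best):.1f} degrees from the x-axis")
```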
Now I want to show you an example of why both the distance between the two means and the scatter are important. Here's a new data set. We still have just two categories, green and red; in this case there's a little bit of overlap on the y-axis but lots of spread along the x-axis. If we only maximize the distance between the means, then we'll get something like this, and the result is a lot of overlap in the middle. This isn't great separation. However, if we optimize both the distance between the means and the scatter, then we get nice separation. Here the means are a little closer to each other than they were in the graph on the top, but the scatter is much less. So if we optimize both criteria at the same time, we can get good separation.

So what if we have more than two genes, that is to say, what if we have more than two dimensions? The good news is that the process is the same: we create a new axis that maximizes the distance between the means for the two categories while minimizing the scatter. So here's an example of trying to do LDA with three genes; we've got that three-dimensional graph that I showed you earlier. Here we've created a new axis, and the data are projected onto the new axis. This new axis was chosen to maximize the distance between the two means, that is, between the two categories, while minimizing the scatter.

What if we have three categories? In this case two things change, but just barely. Here's a plot that has two genes, but now we have three categories. The first difference between having three categories, as opposed to just two categories like we had before, is how we measure the distances among the means. Instead of just measuring the distance between the two means, we first find a point that is central to all of the data. Then we measure the distances between the point that is central in each category and that main central point. Now we want to maximize the distance between each category's central point and the main central point while minimizing the scatter for each category. And here's the equation that we want to optimize: it's the same equation as before, but now there are terms for the blue category.

The second difference is that LDA creates two axes to separate the data. This is because the three central points, one for each category, define a plane. Remember from high school: two points define a line, and three points define a plane. That is to say, we create new x and y axes, but these are now optimized to separate the categories. When we only use two genes, this is no big deal: the data started out on an x-y plot, and plotting them on a new x-y plot doesn't change all that much. But what if we used data from 10,000 genes? That would mean we'd need 10,000 dimensions to draw the data. Suddenly, being able to create two axes that maximize the separation of the three categories is super cool. It's way better than drawing a 10,000-dimensional figure that we can't even imagine what it would look like.

Here's an example using real data. I'm trying to separate three categories, and I've got 10,000 genes. Plotting the raw data would require 10,000 axes. We used LDA to reduce that number to two, and although the separation isn't perfect, it is still easy to see three separate categories.
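If you want to try this three-category reduction yourself, here's a minimal sketch assuming scikit-learn's LinearDiscriminantAnalysis (the "genes", the category shifts, and the labels are simulated, and I've used 100 genes instead of 10,000 to keep it quick):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
n_genes = 100  # stand-in for the 10,000 genes in the example

# Three simulated categories, each shifted a little from the others.
X = np.vstack([rng.normal(loc=shift, scale=1.0, size=(30, n_genes))
               for shift in (0.0, 1.0, 2.0)])
y = np.repeat(["black", "blue", "orange"], 30)  # the known category labels

# With 3 categories, LDA can create at most 3 - 1 = 2 new axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)  # project every sample onto LD1 and LD2
print(X_2d.shape)               # (90, 2): two new axes instead of 100 genes
```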
Now let's use that same data set to compare LDA to PCA. Here's the LDA plot that we saw before, and now we've applied PCA to the exact same set of genes. PCA doesn't separate the categories nearly as well; we can see lots of overlap between the black and the blue points. However, PCA wasn't even trying to separate those categories; it was just looking for the genes with the most variation.

So we've seen the differences between LDA and PCA, but now let's talk about some of the similarities. The first similarity is that both methods rank the new axes that they create in order of importance. PC1, the first new axis that PCA creates, accounts for the most variation in the data; likewise, PC2, the second new axis, does the second-best job, and this goes on and on for the number of axes that are created from the data. LD1, the first new axis that LDA creates, accounts for the most variation between the categories; LD2, the second new axis, does the second-best job, etc., etc., etc. Also, both methods let you dig in and see which genes are driving the new axes. In PCA, this means looking at the loading scores; in LDA, one thing you can do is look and see which genes (or which variables) correlate with the new axes (see the sketch below).

So, in summary: LDA is like PCA in that both try to reduce dimensions. PCA does this by looking at the genes with the most variation; in contrast, LDA tries to maximize the separation of known categories. And that's it! Tune in next time for another exciting StatQuest!
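Here's a hedged sketch of those two similarities, continuing with the X, y, and lda objects from the previous snippet (scikit-learn assumed; scalings_ holds the per-gene weights that define each LD axis, which is one way to see which genes drive it):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X)

# Similarity 1: both methods rank their new axes, so the first
# entry (PC1 / LD1) always accounts for the most variation.
print("PCA variance ratios:", pca.explained_variance_ratio_)
print("LDA variance ratios:", lda.explained_variance_ratio_)

# Similarity 2: both let you see which genes drive each axis.
# PCA exposes loading scores directly; for LDA, scalings_ gives the
# weights defining each LD axis (you could also correlate each gene
# with the new axes, as mentioned above).
print("PC1 loading scores (first 5 genes):", pca.components_[0, :5])
print("LD1 axis weights   (first 5 genes):", lda.scalings_[:5, 0])
```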