so here we see the lift chart for the validation data set. For the lift chart, what we're going to comment on is that the point (10, 5) on the blue curve means that if the 10 observations with the largest estimated probabilities of being class 1 were selected from that table of values (which is cut off), 5 of these observations correspond to actual class 1 members. In contrast, the point (10, 2.2) on the red curve means that if 10 observations were randomly selected, an average of only (11/50) × 10 = 2.2 of these observations would be class 1 members.

Now the decile-wise lift chart has the interpretation that the first decile group corresponds to the 0.1 × 50 = 5 observations most likely to be in class 1, the second decile group corresponds to the 6th through 10th observations most likely to be in class 1, and so on. For each of these decile groups, the decile-wise lift chart compares the number of actual class 1 observations to the number of class 1 responders in a randomly selected group of 0.1 × 50 = 5 observations. In the first decile group, the top 10% of observations most likely to be in class 1 (five observations) include three actual class 1 observations, whereas a random sample of five observations would be expected to contain 5 × (11/50) = 1.1 class 1 observations. Thus the first decile lift of this classification is 3 / 1.1 ≈ 2.73, which corresponds to the height of the first bar in the chart in the panel. Visually, the taller the bar in a decile-wise lift chart, the better the classifier is at identifying responders in the respective decile group.

Now the ability to correctly predict class 1 observations is commonly expressed as sensitivity, or recall, and is calculated as 1 minus the class 1 error rate, whereas the ability to correctly predict class 0 observations is commonly expressed as specificity and is calculated as 1 minus the class 0 error rate. So for instance, if we take the class 0 error rate right over here, that leaves us with 1 − 61.54% = 38.46% specificity, and then 1 minus the class 1 error rate, which for this particular example gives a 63.8% sensitivity. Now precision is a measure that corresponds to the proportion of observations predicted to be class 1 by a classifier that are actually in class 1.
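To make the decile-lift arithmetic concrete, here is a minimal Python sketch; the function name decile_lift is just illustrative, and the counts passed in are the ones from the example above (50 validation observations, 11 class 1 members, 3 actual class 1 members among the 5 first-decile observations).

```python
def decile_lift(actual_in_decile, decile_size, total_class1, total_obs):
    """Lift for one decile group: actual class 1 responders in the decile
    divided by the number expected in a random sample of the same size."""
    expected_random = decile_size * total_class1 / total_obs
    return actual_in_decile / expected_random

# Numbers from the example above: 50 validation observations, 11 of them class 1,
# and 3 actual class 1 members among the 5 observations in the first decile.
print(decile_lift(actual_in_decile=3, decile_size=5, total_class1=11, total_obs=50))
# -> 2.727..., the height of the first bar in the decile-wise lift chart
```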
so precision follows the formula shown, and this again comes from our table of values, which is of course our confusion matrix. The F1 score combines precision and sensitivity into a single measure and is defined as F1 = 2·n11 / (2·n11 + n01 + n10).

Now the receiver operating characteristic, or ROC, curve is an alternative graphical approach for displaying the trade-off between a classifier's ability to correctly identify class 1 observations and its class 0 error rate. In general, we can evaluate the quality of a classifier by computing the area under the ROC curve, often referred to as the AUC (area under the ROC curve, or area under the curve), and in general the greater the area under the ROC curve, the larger the AUC, and therefore the better the classifier performs. So we can look at an example ROC curve right here: the red line represents random classification, and the area underneath that red line is exactly 0.5. We can see that this example ROC curve performs better than, and therefore provides value over, a random classification, as its area is above 0.5.

So we then come to evaluating the estimation of continuous outcomes. This leads us to two common measures: the average error, which is (1/n) Σ e_i for i = 1 to n, and the root mean squared error, sometimes denoted RMSE, which is the square root of (1/n) Σ e_i², where e_i is the error in estimating the outcome of observation i. The average error estimates the bias in a model's predictions: if the average error is negative, the model tends to overestimate the value of the outcome variable, and if the average error is positive, the model tends to underestimate it. So for instance, we can look at the following performance measures for 10 values: the actual average balance, the estimated average balance, the error associated with each, and the squared error. Because the average error is negative for these calculations, we observe that the model overestimates the actual balance of these 10 customers. Furthermore, if the performance of the model on these 10 observations is indicative of its performance on a larger set of observations, we should investigate improvements to the estimation model, as the root mean squared error of 774 is 43% of the average actual balance, which is quite high.

So we now come to the topic of logistic regression, and logistic regression is a very lovely topic that attempts to classify a binary categorical outcome, where y is equal to 0 or 1, as a linear function of explanatory variables. A linear regression model fails to appropriately explain a categorical outcome variable. So what do we mean by that? Well, suppose we consider the following data, where we have a certain number of nominations and then whether or not there was a win for an award; the idea is that you can receive a lot of different nominations and never win. So let's build a scatter plot here, and let's also remember to first select the data, then choose the scatter plot button. Okay, so it's fair to say that we want to
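To make these formulas concrete, here is a minimal Python sketch of the confusion-matrix measures and the two estimation measures; it assumes the indexing n_ij = count of observations with actual class i and predicted class j, and the counts and errors passed in at the bottom are hypothetical, not the values from the slides.

```python
import math

def classification_measures(n11, n10, n01, n00):
    """Measures from a confusion matrix, assuming n_ij = actual class i, predicted class j."""
    sensitivity = n11 / (n11 + n10)        # 1 - class 1 error rate (recall)
    specificity = n00 / (n00 + n01)        # 1 - class 0 error rate
    precision = n11 / (n11 + n01)          # predicted class 1 that are actually class 1
    f1 = 2 * n11 / (2 * n11 + n01 + n10)   # combines precision and sensitivity
    return sensitivity, specificity, precision, f1

def estimation_measures(errors):
    """Average error (bias) and root mean squared error for errors e_i."""
    n = len(errors)
    avg_error = sum(errors) / n
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    return avg_error, rmse

# Hypothetical counts and errors, just to show the calculations
print(classification_measures(n11=30, n10=17, n01=8, n00=45))
print(estimation_measures([-120.0, 85.5, -40.2, 10.0, -5.3]))
```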
add a trendline, that is a regression line, like so, and display the equation and the R² value so they're nicely viewable. Okay, so for this data there seems to be something occurring, and that something is that we don't really have a good fit for this model. Were we to do a simple linear regression on this, let's go to Data > Data Analysis, choose Regression, set the y range to the wins and the x range to the number of nominations, and request residual plots. There we go, residual plots, bam, okay, great. You can see from this chart of the residuals that an unmistakable pattern of systematic misprediction suggests the simple linear regression model is not appropriate; this is not a linear model.

So the odds is a measure related to probability: if an estimate of the probability of an event is p̂, then the equivalent odds measure is given by p̂ / (1 − p̂), and the odds metric ranges between zero and positive infinity. We eliminate the fit problem by using the logit function, like so, and estimating the log odds with a linear function results in the estimated logistic regression equation, or rather the logistic regression model, which more specifically looks like the following, the quote-unquote logistic function. So given a set of explanatory variables, a logistic regression algorithm determines values for b0, b1, up through bq that best estimate the log odds.

So let's apply a logistic regression algorithm to the data we just looked at and see what we have. Okay, we're going to create our model here, and in doing so we're eventually going to figure out what values to use as our b0 and b1, following along with the model we commented on; let me go back to here, and what we commented on would look like this model. Okay, so in doing this, x is going to be equal to the logit value, so we need = $B$29 (I want to continually reference that cell, which holds b0) plus $B$30 (which holds b1) times the number of nominations, which is right here in A2. Press Enter and fill that down, so now we have the logit values; we now have the values for x. Now we need the exponential, e to the x, which means we need = EXP() of that cell, filled down. That's good. Now we calculate the probability value p(x), which is equal to that exponential divided by 1 plus that exponential. There we go. Now we need the log likelihood, which is where we compute some natural logs of what we currently have: the log likelihood is the win value times LN(probability) plus (1 minus the win value) times LN(1 minus the probability), which you'll notice is negative; that's okay, it's supposed to be negative. Let's move this chart a little out of the way. There we go. Now we need the sum of the log likelihoods here. Great. So now the only thing left to do is to use the Solver tool to find what our actual values of b0 and b1 are. We go to Data > Solver, and on the Solver tool we set the objective cell we're solving for to be the total
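The same logit, probability, and log-likelihood columns we just built in the worksheet can be sketched in Python, with scipy's minimizer playing the role of the Solver tool; the nominations and win values below are hypothetical stand-ins, not the actual worksheet data.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in data: number of nominations (x) and win indicator (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 10, 12], dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

def neg_log_likelihood(b):
    b0, b1 = b
    logit = b0 + b1 * x                     # the "x = b0 + b1 * nominations" column
    p = 1.0 / (1.0 + np.exp(-logit))        # p(x) = e^x / (1 + e^x), the probability column
    # log-likelihood column: y*ln(p) + (1-y)*ln(1-p); we minimize its negative
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])  # plays the role of Solver
b0, b1 = result.x
print(b0, b1)
```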
that we're trying to maximize. Let's delete what's there and actually enter that specific cell. There we go. By changing those two variable cells, uncheck the non-negative option, and let's see, we are all good. There we go, keep the Solver solution. So we now have the fitted logistic function, which is the solution.

If we would like to figure out what this graph looks like in particular, we can come up with the following equation, which will graph it for us; let's see, here we go, there's just the right window. So we're going to create the graph that matches our data, and as we said, the equation we're specifically graphing is this one, the logistic function: p = 1 / (1 + e^−(b0 + b1·x)), where b0 and b1 are the values we just computed. I actually need an additional set of parentheses there so that it computes correctly. There we go. It's this sort of S-looking function, and I am missing something... ah yes, it's the negative of that exponent. There we go, now we have it. So it's this S-shaped curve, which you can see right here, quite a nice picture. We can actually include that in here with the specific data, and there we go, that's the equation we end up with at the end. This is quite the interesting little equation.

So the S-shaped curve appears to better explain the relationship between the probability of winning and the number of nominations: instead of extending off to positive and negative infinity, the S-shaped curve flattens and never goes above one or below zero. We achieve this S-shaped curve by estimating an appropriate function of the probability p of winning best picture with a linear function, rather than directly estimating p with a linear function. So logistic regression classifies an observation by using the logistic function to compute the probability of that observation belonging to class 1 and then comparing this probability to a cutoff value: if the probability exceeds the cutoff value, the observation is classified as class 1, and otherwise it is classified as class 0.
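Once b0 and b1 are in hand, the cutoff rule just described is a one-liner on top of the logistic function; this is a minimal sketch with hypothetical coefficient values and the default 0.5 cutoff.

```python
import math

def logistic_probability(x, b0, b1):
    """S-shaped logistic curve: p = 1 / (1 + e^-(b0 + b1*x)), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def classify(x, b0, b1, cutoff=0.5):
    """Class 1 if the estimated probability exceeds the cutoff, otherwise class 0."""
    return 1 if logistic_probability(x, b0, b1) > cutoff else 0

# Hypothetical coefficients, just to show the S-shape and the cutoff rule
b0, b1 = -4.0, 0.8
for nominations in range(0, 13, 2):
    p = logistic_probability(nominations, b0, b1)
    print(nominations, round(p, 3), classify(nominations, b0, b1))
```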
While a logistic regression model used for prediction should ultimately be judged on its classification accuracy on validation and test results, Mallows' Cp statistic is a measure commonly computed by statistical software that can be used to identify models with promising sets of variables. Based upon this, we can then make predictions about winning results from the numbers of nominations, the likelihood of a win, the likelihood of a loss, and so on, along with the predicted classes that go with them.

So one of the next things we come to is what's referred to as the k-nearest neighbors method. This method can be used either to classify a categorical outcome or to estimate a continuous outcome. k-nearest neighbors uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance. Euclidean distance comes from the Pythagorean theorem, which says the square of the hypotenuse is equal to the sum of the squares of the remaining sides; that's the general result people are taught. A k-nearest neighbor classifier is a quote-unquote lazy learner that directly uses the entire training set to classify observations in the validation and test sets. The value of k can range from 1 to n, where n is the number of observations in the training set. If k = 1, the classification of a new observation is set equal to the class of the single most similar observation from the training set; if k = n, the new observation's class is naively assigned to the most common class in the training set. When k-nearest neighbors is used as a classification method, a new observation is classified as class 1 if the percentage of its k nearest neighbors in class 1 is greater than or equal to a specified cutoff value, where the default value is 0.5. When k-nearest neighbors is used as a prediction method, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors.

So we can in fact look at an example here, k-nearest neighbors one. Here we have an average balance, an age, and a loan default indicator, and then we have the observation, the average, and the standard deviation for these calculations. From this we could make further observations, including looking at how these average balance values are placed in terms of z-scores. As we look at these observations, we can also specify the percentage of class 1 neighbors, and when we're looking at this particular observation that we want to classify, its nearest neighbor in terms of Euclidean distance is observation two when we want the one nearest neighbor. We would then classify the associated points based upon their closeness to that observation, depending on what k is: the two nearest neighbors would be this point and this point, three would be that point, that point, and that point, and so on further out as we go in Euclidean distance. So next time we're going to take more of a look at classification and regression trees; more on this later.
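Here is a minimal sketch, on hypothetical balance and age values, of the k-nearest-neighbors classification step just described: z-score standardize the features, compute Euclidean distances to the training observations, and compare the share of class 1 members among the k nearest neighbors to the 0.5 cutoff.

```python
import math

# Hypothetical training data: (average balance, age, loan default: 1 = class 1, 0 = class 0)
train = [
    (1200.0, 25, 1),
    (4300.0, 41, 0),
    ( 900.0, 33, 1),
    (5100.0, 52, 0),
    (2600.0, 29, 0),
]

def column_stats(rows):
    """Mean and sample standard deviation of each feature column (for z-scores)."""
    stats = []
    for col in zip(*[(r[0], r[1]) for r in rows]):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        stats.append((mean, sd))
    return stats

def standardize(point, stats):
    return [(v - m) / s for v, (m, s) in zip(point, stats)]

def knn_classify(new_point, rows, k, cutoff=0.5):
    stats = column_stats(rows)
    z_new = standardize(new_point, stats)
    scored = []
    for balance, age, label in rows:
        z = standardize((balance, age), stats)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_new)))  # Euclidean distance
        scored.append((dist, label))
    scored.sort()
    share_class1 = sum(label for _, label in scored[:k]) / k
    return 1 if share_class1 >= cutoff else 0

# Classify a hypothetical new observation using its 3 nearest neighbors
print(knn_classify((2000.0, 30), train, k=3))
```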