Transcript for:
Logit Model (Logistic Regression) Lecture Notes

Hello everyone, and welcome again to NEDL, the best platform around for distance learning in business, finance, economics, and much, much more. Please don't forget to subscribe to our channel and click the bell notification button below so that you never miss fresh videos and tutorials you might be interested in. Many thanks to our current Patreon supporters for making this video possible; we would also greatly appreciate it if you considered supporting us as well, so please check the link in the description for more details.

My name is Savva, and we are going to investigate the logit model or, as it is also called, the logistic regression. It is a go-to technique for estimating regression models when your response variable, your dependent variable, your y, is categorical or binary. Often in finance and economics your dependent variable is not a real number but a categorical one: it can only be 0 or 1, or can only take some limited number of outcomes. One of the most important applications of the logit model is credit scoring: predicting whether your borrower will default on their debt, with zero being non-default (everything goes according to plan) and one being default. You are interested in figuring out which characteristics of your borrower predict defaults, so that as a bank you can allocate lending more efficiently and minimise your credit risk. Alternatively, in economics you might be interested in predicting recessions, zero being no recession and one being recession, and figuring out which macroeconomic or other indicators can forecast recessions, quite a legitimate research task. Or in education we might be interested in what predicts or determines success or failure in an exam: how does the number of hours studied in preparation for an exam contribute to the success rate? All of these questions are also legitimate and can be answered using the logit model.

So, without further ado, let's try and estimate the logit model on a very textbook case for its application, namely credit scoring and predicting retail loan default. Here we have got a sample, a real-world testing data sample, of 500 applicants that either defaulted or did not default on their consumer credit. Zeros denote non-defaults, meaning the individual repaid their debt, and ones denote defaults. Among the 500 individuals that applied for a loan and were granted one, 127 have defaulted, and that is quite important: for binary choice models, logit included, to function properly, you need a sizeable chunk of your sample in each of the two outcomes. If the overwhelming majority, say 95 percent, of your sample were either zeros or ones, so if almost everyone had repaid or almost everyone had defaulted, then the logit model would not have been the best choice. Here, however, with roughly a quarter of our sample defaulting, the logit model is appropriate.
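If you would like to follow along outside the spreadsheet, here is a minimal Python sketch of how such a sample could be loaded and the class balance checked. The file name and the column names (default, homeowner, fulltime, income, expenses, assets, debt, loan) are hypothetical, chosen only for illustration; the video itself works entirely in Excel.

```python
# Minimal sketch, assuming a CSV with hypothetical column names:
# default (0/1), homeowner (0/1), fulltime (0/1), income, expenses,
# assets, debt, loan (monetary amounts in thousands of dollars).
import pandas as pd

df = pd.read_csv("loan_applicants.csv")   # hypothetical file with 500 rows

# Check the class balance: a sizeable share of both outcomes is needed for the
# logit model to be a sensible choice (127 defaults out of 500 in the video).
print(df["default"].value_counts())
print(df["default"].mean())               # roughly 0.25 here
```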
With that out of the way, we can study the other variables that we have got here. The two most important categorical variables that you could ask for when deciding whether to grant someone a loan are, first, whether they have got any retail property on their hands: whether they are a homeowner or not, whether they can pledge their property as collateral for their loan, or whether they have such property as an asset that could back them up if they encounter some difficulties in repaying; and second, whether they have a full-time job, so whether they are full-time employed. Those two variables are binary variables that are treated here as independent variables, as predictors, as explanatory variables for our y, which is default or non-default.

We have also got some real numbers as explanatory variables, and we will try and transform them into predictors that we can later use in our logit model. We have got the income and expenses of our borrower's household in thousands of dollars per year, so we can see that this particular household earns roughly 129,000 dollars a year and spends 73,000 out of that. We also record the assets and debt of this particular household as they stand right now, before the loan is granted, and the amount of the loan they are asking for. This data can then be coded into other explanatory variables that we can later use in our logit model.

First of all, let's consider three variables that transform our income, expenses, assets, debt and loan amount data into some more interpretable indicators. For example, we can take the natural logarithm of the ratio of expenses to income and figure out how thrifty, how likely to save their income, this household is. We expect that the more they save and the less they spend, the more likely they will be to repay their loan, as they will be more disciplined, they will have more spare cash to meet their monthly payment schedule, and so on and so forth, so this is a legitimate variable to consider. Then we can consider leverage, as in debt to assets, like we do for corporate analysis, but here we have got individuals with their outstanding assets and debt and the debt that they are planning to take on, and they plan to fund some asset purchases with that debt: perhaps they want to buy a new house or a new car or the like. So we can take the natural logarithm (this is just for scaling purposes, so that we have got no outliers at the top) of the leverage of our applicant after they have been granted the loan. In the numerator we have total debt, which is the outstanding debt before they were granted the loan plus the loan amount, divided by their assets after they were granted the loan, so assets plus the loan amount; here we assume that they will spend the loan amount to purchase some assets for themselves. We can also add one further variable that basically measures how long this individual would need to keep earning to repay their loan, and that can be calculated as the natural logarithm of the loan amount over their annual income. So now we have got five explanatory variables, which is more than enough for a typical logit model, especially with 500 observations, and we can fill these formulas all the way down.
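As a rough code counterpart to these spreadsheet columns, the three transformed variables could be computed as below. This is a sketch using the hypothetical column names from the earlier snippet, not the author's own code; the formulas themselves follow the transcript directly.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("loan_applicants.csv")   # hypothetical layout as above

# ln(expenses / income): how much of its income the household spends
df["ln_exp_inc"] = np.log(df["expenses"] / df["income"])

# ln of post-loan leverage: (existing debt + loan) / (assets + loan),
# assuming the loan proceeds are used to purchase assets
df["ln_leverage"] = np.log((df["debt"] + df["loan"]) / (df["assets"] + df["loan"]))

# ln(loan / income): roughly how many years of income the loan represents
df["ln_loan_inc"] = np.log(df["loan"] / df["income"])
```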
Now we can start figuring out how to calculate the optimal values of the parameters, which denote the constant term, as in all regression models, and the coefficients for our candidate explanatory variables. In the logit model you work with the odds, and you try and estimate the probability of default conditional on the values of the explanatory variables. As it is also called the logistic regression, it should not be a surprise that the equation that relates the predicted values of y, here the probability of default, to the explanatory variables uses the distribution function of the logistic distribution, and we have got a video on the logistic distribution on our channel already, so please check it out later if you are interested. We use the logistic function to relate these variables, and they can be anything: bounded or unbounded, categorical or real numbers, and so on and so forth. We basically rescale them so that the estimated probability always lies between zero and one, and this is what makes logistic regression, the logit, more applicable to categorical variable modelling than, for example, plain multiple linear regression. In multiple linear regression you can theoretically get estimated values of your dependent variable outside the zero-to-one bounds, and that can be tricky to interpret in the context of a probability of default, a probability of recession, a probability of failing or passing an exam, and the like. The logit transformation allows us to avoid that.

So here we can calculate the logit, which is just the exponent of the weighted sum of our explanatory variables and the coefficients. We can use SUMPRODUCT, referring to our coefficients over here and locking the row, as the coefficients stay the same for all observations, and referring to the explanatory variables corresponding to a particular observation, plus the constant added separately for simplicity. Here we calculate the value of the logit, and unsurprisingly, as all the coefficients are zeros for now, the value of the logit is one, as the exponent of zero is one. Then we can use the logistic transformation to convert this logit value into our estimated probability: we just divide our logit by one plus the logit and get a default estimated probability of 0.5, which is unsurprising; if you know nothing about an applicant, you can just assume it is a coin toss, fifty-fifty, either they default or not. Quite intuitive, isn't it? Then we can fill this all the way down and estimate it for all of our observations, all of our 500 loan applicants.

Here it is actually important to understand how one might optimise these coefficients to arrive at the best fit possible, and here is where the logit model differs from multiple regression in terms of the function that you try and optimise. For multiple regression you try and minimise the sum of squared residuals, while for the logit model the most robust approach possible is to maximise the log-likelihood. The log-likelihood is defined in terms of the actual dependent variables, the zeros and ones, the y_i over here, and the estimated values, the expected values of y_i, denoted here as y-bar. We can compute the log-likelihood for every single observation by just multiplying our default categorical variable by the natural logarithm of our estimated probability, plus one minus the default categorical variable times the natural logarithm of one minus the estimated probability. We can fill the likelihood formula all the way down and then calculate the total log-likelihood, which is the sum of the log-likelihoods across the whole sample.
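Before turning to Solver, here is a compact Python sketch of the same maximum-likelihood step: it builds the design matrix, computes the logistic probabilities, and maximises the log-likelihood numerically by minimising its negative with scipy. The column names continue the hypothetical layout from the earlier snippets; this illustrates the technique rather than reproducing the spreadsheet itself.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

df = pd.read_csv("loan_applicants.csv")   # hypothetical layout as above
df["ln_exp_inc"] = np.log(df["expenses"] / df["income"])
df["ln_leverage"] = np.log((df["debt"] + df["loan"]) / (df["assets"] + df["loan"]))
df["ln_loan_inc"] = np.log(df["loan"] / df["income"])

# Design matrix: constant plus the five explanatory variables (500 x 6)
X = np.column_stack([np.ones(len(df)),
                     df["homeowner"], df["fulltime"],
                     df["ln_exp_inc"], df["ln_leverage"], df["ln_loan_inc"]])
y = df["default"].to_numpy()

def neg_log_likelihood(b):
    odds = np.exp(X @ b)                # the "logit" column in the spreadsheet
    p = odds / (1.0 + odds)             # logistic transformation, always in (0, 1)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Starting from all-zero coefficients (log-likelihood of 500 * ln(0.5), about -347)
res = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), method="BFGS")
b_hat = res.x                           # counterpart of the Solver solution
print(-res.fun)                         # maximised log-likelihood (about -233 in the video)
```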
Now we can maximise our log-likelihood by varying the coefficient parameters. This can be done using Solver, so we go Data, Solver, and specify our optimisation task. We want to maximise the value of the log-likelihood, currently in cell L1, so the objective should be set to Max, and we need to change the variable cells that correspond to b0, b1, all the way to b5, the constant term and the coefficients for all five explanatory variables, so we select the array C2 to H2. We do not want to impose any constraints on our parameters, as theoretically any of these parameters could affect the probability of default either positively or negatively; we just have some reasonable suspicions about whether, for example, leverage would affect the default probability positively or negatively, but the whole purpose of logit estimation is to test these suspicions, these hypotheses. So we need to untick the box that makes unconstrained variables non-negative, so that each parameter can come out either positive or negative, whatever maximises the log-likelihood, click Solve, and wait until the algorithm converges to an optimal solution. We can see that we have converged to an optimal value of the log-likelihood: it increased from roughly minus 350 to minus 233. The negative value of the log-likelihood should not be of concern; it does not really matter whether it is positive or negative, it only matters in comparative terms whether it has increased or decreased, and we can see that it increased quite a lot.

Here we can already see what the optimal values of the coefficients are, the closest values to the true parameters that could possibly be estimated using maximum likelihood. Quite unsurprisingly, being a homeowner and being full-time employed reduce your probability of default, making it more likely that you repay the loan on time and in full, which is again quite intuitive, while being more careless in your spending habits, spending more in proportion to your income, makes you less likely to repay; being more leveraged makes you less likely to repay; and taking on a higher loan with respect to your income also makes you less likely to repay and more likely to default. All of those relationships are quite intuitive in terms of the logic, either psychological or economic, of the theory behind them. But now we need to figure out which of these relationships are statistically significant and reliable, and which are perhaps not significant and can be neglected in further modelling.

Here we need some procedure to estimate the variance of our estimator and come up with standard errors for our coefficients, just as we do with multiple regression. However, just as the log-likelihood replaced the minimisation of the sum of squared residuals, we have got a slight tweak on the conventional procedure for estimating the variance of our estimates, and we have got it covered over here. To estimate the covariance matrix, the variance of the estimator of b, we need to calculate the inverse of the matrix product that takes the transposed X, the transposed matrix of our explanatory variables including the constant, multiplied by the weight matrix (the weight matrix will be explained a little bit later, so stay tuned), and then, finally, multiplied by the matrix X, the explanatory variables, again.

And what is the weight matrix? The weight matrix corresponds to the variances of the individual probabilities, and here is actually another reason why the logit model is preferable to linear probability models, where you just regress your zeros and ones in a multiple linear regression. If you recall, in a Bernoulli distribution, where an event happens with some probability and does not happen otherwise, the variance can be written as the probability times one minus the probability. So in such a categorical variable estimation your data is heteroskedastic by definition: you have got different variances for different observations. If you were to regress this using the usual method, simple multiple linear regression, you would have assumed, as per the Gauss-Markov assumptions, that the variances for all these observations are the same, while in fact they can be massively different depending on the value of the estimated probability, and the highest variance of the probability is observed when the estimated y equals 0.5, as is always the case with the Bernoulli distribution.
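In matrix notation, the covariance estimator described here is the usual large-sample result for maximum-likelihood logit coefficients:

```latex
\widehat{\operatorname{Var}}(\hat{b}) \;=\; \left( X^{\top} W X \right)^{-1},
\qquad
W \;=\; \operatorname{diag}\!\left( \hat{p}_1 (1-\hat{p}_1),\; \dots,\; \hat{p}_n (1-\hat{p}_n) \right)
```

so the standard errors of the coefficients are the square roots of the diagonal elements of this matrix, which is exactly what the spreadsheet computes next.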
So here we need to calculate the weight matrix, the weighting matrix of our variance estimator, which has the estimated probability variances on its diagonal. Here we have got the template to calculate the 500-by-500 weight matrix, and we first need to figure out whether we are on the diagonal: if we are on the diagonal, we input this particular variance estimate, and if not, we input zero. So first we check whether the column indicator and the row indicator are the same; if they are, we input the product of the probability (locking the column reference here) and one minus the probability (locking the column here as well), and if not, if we are not on the diagonal, we just return zero. Then we can fill this to the right and all the way down to calculate the whole weighting matrix, and we can see that on the diagonal we have got the variance estimates, which are quite different from each other and are highest when the estimated probability is closest to 0.5, as expected.

Now we can finally calculate our covariance matrix, which we can then use to derive the standard errors of our coefficients. Here we first input MINVERSE, which is the inverse matrix function, and then the MMULT function, matrix multiplication, whose first component is the transposed array of explanatory variables X, starting from the constant all the way to here, so a 6-by-500 array. As the second component we input another matrix multiplication: first our weight matrix, from here all the way to here, a 500-by-500 matrix, and as the second component of this inner product our X, which we do not need to transpose at this point. We close all the parentheses, making sure that we close them all, and then enter the formula as an array formula using Ctrl+Shift+Enter, as it multiplies a bunch of matrices together, and we get our covariance matrix.

Now, on the diagonal of our covariance matrix we have our standard errors squared, so to derive the standard errors we just calculate the square roots of the diagonal elements. For the constant, the standard error is the square root of the very first element of the matrix, and if we simply drag this across we get the square roots of the first row, but we need the square roots of the diagonal, so we change the references to step along the diagonal, increasing the row reference by one each time, and that is how you get the standard errors for the coefficients. Now we can use the fact that in large samples the distribution of the coefficient estimates is approximately normal, so dividing the coefficients by their standard errors we get z-stats, our usual and well-behaved z-statistics, which can be tested for significance using a two-tailed z-test: two times one minus the standard normal distribution function, where as arguments we input the absolute value of our z-stat and one (TRUE) for cumulative. That gives us the p-values for every single coefficient.

Here we can see that the most significant predictor of default is being full-time employed: if you are employed full-time, you are much less likely to default on your consumer credit. The significant positive predictors of default are leverage and repayment time given income: the more leveraged you are, the less likely you are to repay, and the more years of income it would take you to repay your loan, the less likely you are to repay. The three other parameters, the constant and, most notably, homeownership and the expense-to-income ratio, have the expected signs but are not as significant as one could imagine, given their p-values are greater than ten percent.
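The same standard-error and significance calculations, continuing the Python sketch above (with X and b_hat as defined there), might look like this; it is again an illustrative sketch rather than the spreadsheet formulas themselves.

```python
import numpy as np
from scipy.stats import norm

# X (500 x 6 design matrix) and b_hat (fitted coefficients) come from the
# earlier estimation sketch.
odds = np.exp(X @ b_hat)
p_hat = odds / (1.0 + odds)

W = np.diag(p_hat * (1.0 - p_hat))          # 500 x 500 weight matrix, variances on the diagonal
cov = np.linalg.inv(X.T @ W @ X)            # MINVERSE(MMULT(TRANSPOSE(X), MMULT(W, X))) analogue
se = np.sqrt(np.diag(cov))                  # standard errors from the diagonal

z = b_hat / se                              # large-sample z-statistics
p_values = 2 * (1 - norm.cdf(np.abs(z)))    # two-tailed test against the standard normal
print(np.column_stack([b_hat, se, z, p_values]))
```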
Now we can use our logit model to actually predict whether a particular individual would default. Imagine we have got this model up and running in our bank, and someone approaches us and asks for a loan, so in our credit scoring procedure we ask them to provide some data about themselves. Obviously the constant is one, as it is always one, that is why it is a constant. Then we ask the applicant whether they own a home, and imagine they say yes, so we put one for yes. Then we ask if they have a full-time job, and they say yes and provide some evidence for that. Then, to calculate the three ratio variables, we need to ask them about their income, expenses, assets, debt, and the loan amount they want to apply for. Imagine that this particular person's household makes 150,000 dollars per year and spends roughly 120,000 of it; they own their home, which is currently valued at one million dollars; they have not got any debt taken on previously; and they want to take out one million dollars in loan, potentially guaranteed by their property, their house, as collateral. Now we can just copy these formulas across to calculate the ratios, copy this across to calculate our logit, and the probability can be calculated just like that. We can see that our probability, which is the logit divided by one plus the logit, is less than 20 percent, so we can be reasonably sure that such a person would not default on their loan, their default probability being 19.19 percent, which is quite good. It is also less than the average: over here we can see that roughly a quarter of our applicants defaulted, while the probability of this applicant defaulting is less than 20 percent, so we can determine that our applicant is actually creditworthy and feel quite good about providing them with a loan.

And that is all there is to the logit model and its application to consumer credit risk, credit scoring, and much more. Please leave a like on this video if you found it helpful, and let me know in the comments below any further suggestions for videos in business, economics, or finance that you would like me to record. Please don't forget to subscribe to our channel or consider supporting us on Patreon. Thank you very much, and stay tuned!