Transcript for:
Machine Learning: KNN and Association Rule Mining

Good evening, Professor, hope you are feeling better.

Yes, I'm well, thank you very much, and thanks, Shashank. I have been well, actually. I saw the message last week when the class was postponed and it said "emergency"; there was no emergency. I was under the impression my class was on Sunday. Normally I don't check, but that day both Saturday and Sunday were showing up on my calendar, so I checked with the other faculty member and he said he had sessions on both Saturday and Sunday. By the time I checked with the team it was quite late and I had not yet completed my preparation, so I said I would take it on Sunday. Then on Sunday, it seems, the system crashed at the last minute. I was not aware of that, otherwise I could have taught the class on Sunday as originally planned, but then I felt it gives everyone a nice break to work on the assignment. That is the reality.

Of course. I'm glad it was nothing serious, and it did give us a lot of time for the assignments.

Good morning, sir, a quick question: the group assignment, the other 50 marks, is that coming sometime next week?

Let's wait for the course to progress a bit more.

Sure, the later it comes the better, since I just got back from a trip. And thank you so much for extending the date to the 10th of July, it is very helpful.

Okay, we have 26 people. Why is it asking so many times? I'm clicking "allow recording": Samy's OtterPilot, Orland's OtterPilot. Many people are asking for it. Is the OtterPilot doing a good job? I've seen the AI tool that transcribes these meetings not do such a great job.

So it is 6:31 in India. Good morning, good afternoon, good evening, and late evening for some of my other friends here. Let's get started. The agenda today is primarily to complete K-nearest neighbors, not much is left there, and then get on to a very simple but very commonly used algorithm, association rule mining, variously known as market basket analysis among other names.

First, let's finish K-nearest neighbors. We started by looking at K-nearest neighbors as another classification technique. Of course it can do regression as well, just like decision trees, but its primary use is classification. We saw that the main conceptual difference between decision trees and K-nearest neighbors is the definition of "nearest". In decision trees we make axis-parallel splits of the data such that entropy is minimized, and "nearest" means the region in which the new data point falls; we then assign the majority class of that region. For K-nearest neighbors, "nearest" is defined purely by distance. A variety of distance metrics are available; we have only looked at Euclidean distance, which is the common choice, and it works for numeric data. I did mention that when we come to clustering we will spend a lot of time on distance metrics for categorical variables, text, and so on (how do I measure the distance between two words, for example?), and those same ideas will be applicable here as well. For now we are only looking at distances for numeric variables.
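To make "nearest by distance" concrete, here is a minimal from-scratch sketch of K-nearest-neighbor classification with Euclidean distance. It assumes purely numeric features; the points, labels, and the `knn_predict` helper are made up for illustration, not part of the course material.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every stored training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[25, 40_000], [32, 52_000], [47, 90_000], [51, 110_000]])
y_train = np.array(["default", "default", "no_default", "no_default"])
print(knn_predict(X_train, y_train, np.array([45, 95_000]), k=3))  # -> "no_default"
```

Note that no model is built here; all the work happens at query time, which is exactly the "lazy" behavior discussed later in the session.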
But the moment the distance concept came in, we noticed that scale and units matter a lot. Age, for example, is a two-digit number, whereas income or the loan amount is a five-, six-, or seven-digit number depending on the currency. The larger variable simply carries much more weight in the distance calculation, and the variables with smaller magnitudes have little impact. So we said that standardization, the z-score we talked about in the first couple of classes, brings everything to the same scale as a unitless quantity; then distances are easy to measure and no single variable dominates. That is essential.

Then we said there is little else to decide: the K in KNN can be 3, 5, 7, 9, generally an odd number, and we look at that many closest neighbors and assign the majority class among them. We also felt we could improve KNN by giving a higher weight to the data points that are closer to the point in question, which is distance-weighted KNN; beyond the weighting it remains the same idea. The usual practice is to weight by the inverse of the squared distance, and many people use the inverse of the distance; there are no hard and fast rules. We sum the weights for each class among the neighbors, and whichever class has the larger combined weight gets the new point.

We also saw that when K is small (1, 3, or 5), KNN is prone to overfitting: if there is an outlier, the decision boundary bends around each of those data points. One way to avoid overfitting is to increase K so that it averages over more points; that reduces variance and increases bias, in terms of the bias-variance trade-off. The other way is to get rid of class outliers: if a point belongs to a different class than its K nearest neighbors, throw it out. That is what we called the edited nearest neighbors method, or Wilson editing. People come up with various tweaks like this; nothing fancy, just removing class outliers so that the boundary becomes smoother. That was about avoiding overfitting.

Now we will talk about the last two issues with KNN and how we resolve them. One issue is having a lot of columns, a lot of variables; the other is having a lot of rows of data. We will look at both.
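Before leaving the KNN recap, here is a library-based sketch of the two fixes just described, z-score standardization and distance weighting, assuming scikit-learn is available. The columns and numbers are illustrative, not the course data set; note that scikit-learn's `weights="distance"` option uses the inverse of the distance, one of the two conventions mentioned above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# age, income, loan amount: wildly different scales before standardization
X = np.array([[25, 40_000, 5_000], [32, 52_000, 12_000],
              [47, 90_000, 30_000], [51, 110_000, 45_000]])
y = np.array([1, 1, 0, 0])              # 1 = default, 0 = no default

model = make_pipeline(
    StandardScaler(),                   # z-score: every column gets mean 0, sd 1
    KNeighborsClassifier(n_neighbors=3, weights="distance"),  # closer neighbors weigh more (1/d)
)
model.fit(X, y)                         # "fit" here mostly just stores the scaled data
print(model.predict([[45, 95_000, 28_000]]))
```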
Once we are done with that, we will go into association rule mining and look at a couple of algorithms, probably the most commonly used ones. They are not the only ones, but everything else gets built on top of these ideas. The idea first started in 1993, looking at transactions in retail stores, when Rakesh Agrawal, Arun Swami, and Tomasz Imieliński came up with it, and the Apriori algorithm is still highly popular. We will look at it and at variations around those ideas, because any time an algorithm is devised it has issues with certain data sets, whether because of large dimensions, large data, or something else, and then people come up with alternatives to fix them, which in turn have their own problems. It is a never-ending cycle, but we will look at two of the most popular algorithms; it is a very powerful technique.

So let's quickly look at the dimensionality issue and the computational complexity with large data sets. Last class we saw that KNN requires scaling; as I mentioned in the last five minutes, if variables are on different scales and you do not standardize, the distances get distorted. We also saw last class that with age and salary there was some data point closest to the point in question, but the moment we added another variable, I think it was the loan amount, a different data point became the closest, because in the new dimension some other point happened to be nearer. So the dimensions have a big impact on KNN, more so than they would on, say, linear regression, logistic regression, or even decision trees: here, every variable you add changes the distances.

Visually: if I have unscaled data and a point whose three nearest neighbors I check, I would say it gets the green "plus" class. But the moment I scale, and scaling is mandatory unless all variables are already on a similar scale, which rarely happens in any practical data set, there are suddenly a lot more red circles than green plus signs around it, and the classification changes. So scaling has an impact, and scaling is essential in KNN.

Now suppose I have two variables, one on each axis, and I only want distances with respect to one of them, say age. Essentially I am projecting all the points onto one axis, like projecting them onto the floor of the room: this point lands here on the X1 axis, that point lands there, and you can clearly see that the distances have again changed a lot. So, again highlighting the point: as you go from one variable to two, the distances grow and larger gaps appear. Let me talk a bit more about going from two variables to three, and the impact of adding many, many variables. We call that the curse of dimensionality, a phrase you will hear very often in machine learning: as more and more dimensions, more and more variables, are added, it becomes a problem for my machine learning. That is pretty much the definition of the curse of dimensionality.
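A tiny numeric check of the point about adding a variable: with two (already standardized) features, point A is the nearest neighbor of the query, but adding a third feature on which A happens to lie far away makes B the nearest instead. The coordinates are made up purely to illustrate this.

```python
import numpy as np

q2, a2, b2 = np.zeros(2), np.array([1.0, 0.0]), np.array([1.2, 0.0])
print(np.linalg.norm(a2 - q2), np.linalg.norm(b2 - q2))  # 1.0 vs 1.2 -> A is closer

q3, a3, b3 = np.zeros(3), np.array([1.0, 0.0, 2.0]), np.array([1.2, 0.0, 0.5])
print(np.linalg.norm(a3 - q3), np.linalg.norm(b3 - q3))  # ~2.24 vs 1.3 -> now B is closer
```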
As we saw, and as Jisha rightly writes, dimensionality reduction becomes the solution to this problem. There are multiple ways to reduce dimensions; we will not cover everything, but I will mention what they do. Of course, in linear and logistic regression we used the p-value: if a variable was not significant we dropped it, so we were reducing features, reducing dimensions, in some sense. I will talk about dimensionality reduction where we actually use the term, but the idea is still the same: drop some of the columns in the data set and things become easier. So yes, dimensionality reduction is the solution to the curse of dimensionality.

Let me justify that in this slide and the next. As the dimensions grow, the amount of data I need also grows. A small example: suppose I want the training data to cover 20% of each variable's range. If I have only one dimension, say income, I need 20% of the data range. But with two columns, covering 20% of the feature space actually requires about 45% of the range in each column, because 45% times 45% is roughly 20%; I will show this visually in the next slide. So with income and age I need about 45% along each axis to say my training data covers 20% of the two-dimensional feature space. With three dimensions I need about 58% along each axis, because 0.58 cubed is roughly 20%.

Visually: here are two dimensions with two variables. If I drop to one dimension, as in the first slide today, every point projects down onto the axis; there the coverage is not 20% but 80% (I could have used 80% in the previous slide as well, but 20% highlights the jump from 20% to roughly 60% per axis). In one dimension, practically 80% of the range is covered, but in two dimensions the same points do not cover nearly as much; there is a lot of empty space, and if a new data point lands in that space, the distances to its neighbors become much wider. Now increase this to three dimensions: think of one plane as the floor of the room seen from above, plus the wall in front of me and the wall to the side. Looking at the same three data points, two of which were very close in one dimension and far from the third, the distances are now very different. With four dimensions I cannot even visualize it any more, but the idea is that as you go higher and higher in dimension, the data becomes sparser and sparser, the distances become wider, and there is more and more empty space. That, essentially, is the curse of dimensionality. All of this was just to highlight, very quickly, that having more and more dimensions increases the need for more and more data.
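A quick check of the coverage numbers quoted above: to capture a fraction f of the feature space, each axis has to cover f^(1/d) of its range, which for f = 20% gives roughly 20%, 45%, and 58% for one, two, and three dimensions.

```python
# per-axis coverage needed so that the covered hyper-cube is 20% of the space
for d in (1, 2, 3, 10):
    print(d, round(0.2 ** (1 / d), 3))
# 1 0.2   2 0.447   3 0.585   10 0.851
```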
And distances start becoming less meaningful: when everything is far from everything else, saying which point is the closest becomes a very difficult thing.

Sorry, yes? "So what does it mean to us, Professor? Do we need to fix this problem, and if so, how?" How we fix it is what we will talk about next; right now I am only describing the problem. We won't spend much time, I'll just give an idea of how it can be done. As I mentioned, in linear and logistic regression we have seen how to trim the variables using p-values, for example, and that can be done here as well: run a linear or logistic regression, not to use it for classification, just to pick the most valuable variables, and then run KNN on those. All this mixing and matching is very common in the work we do in machine learning; I'll mention it again in a slide or two.

The problem is this: with irrelevant features, linear and logistic regression give me a p-value that tells me what to get rid of. Here there is no such thing. KNN is a lazy method; I do not build a model, all the data just sits in storage, and only when a new data point arrives and I ask "what class is this?" does it start measuring distances. So there is no built-in way to know whether a feature is relevant, and when the dimensionality increases this becomes very problematic.

So how do we solve it? That is the answer, Prateek, to what we are supposed to do when we face a lot of dimensions. One thing: use domain knowledge. You have a lot of experience and can say "these variables don't make sense, throw them out." The simplest cases: anything to do with an ID, anything to do with an address; I don't even need domain knowledge for those, I can just throw them out, because otherwise we would be measuring distances on them as well.

Then there is what Jisha was mentioning, dimensionality reduction. There are mathematical and statistical methods, like principal components analysis, that look at all the independent variables, find the ones that are highly correlated (the multicollinearity we talked about), and then do some mathematical work, a lot of mathematics which we will not get into, to find linear combinations of variables that create new, uncorrelated dimensions. For example, suppose age is a variable and number of years of work experience is a variable, and let's say there is a third variable, years since the last degree was earned, and they are all highly correlated. Obviously I would not go to principal components analysis with 2, 3, 5, or 10 variables, but for the sake of the example, PCA might find something like:
two times the age plus 0.5 times the years of work experience as one new dimension, and 0.7 times the age plus 0.5 times the years since I earned my last degree as another. The mathematics works out such that these two new dimensions are not correlated, so I now have combinations of the original variables in two uncorrelated columns, and the third original column can be thrown away. As I said, three variables is not a dimensionality problem, so we would not actually do it there. But when I have, say, a million variables in some DNA analysis and only 2,000 to 5,000 rows of data, no linear regression will work on that (more columns than rows), yet many of those variables will be highly correlated. A method like principal components analysis, or factor analysis, or SVD, various techniques are available, will do its mathematical work and combine highly correlated variables into new variables, which we then call dimensions rather than variables, though it is the same thing: new columns of data that are a mix of the original variables.

So it is different from the feature selection that a linear regression does, where I drop a variable completely. Here nothing is dropped; multiple variables are combined into fewer columns of data that still contain all, or at least 80 to 90%, of the variation in the data. The computer now sees only a few columns, while the variation in the data stays more or less intact. That is the mathematics at work, and that is how we reduce dimensions. Principal components analysis is one of those techniques, and with 400 or 500 variables it does a good job.

Dr. Anan Jaman was once telling me about his trading. He wanted to trade in the Indian market, and of course that depends on how the NASDAQ ended, whether it went up or down, the dollar-rupee rate, and various other things, so he does not just look at yesterday's stock price; he builds his own equations. He said: once the US markets close, I start downloading data, which takes a couple of hours, and then I run my simulations, and I use PCA because there are so many variables. The variables could be the percentage change in an index between yesterday and today, the change between last week and this week, and so on; he creates a lot of derived variables to see the overall patterns, not just the stock price, and that becomes a lot. By the time all the simulations finish on his system, 12 to 14 hours later, he has his results and is ready to trade just as the Indian stock market closes, so in the night he starts downloading the material again; he never actually gets to trade, he only gets to see the patterns. He uses principal components analysis so that he does not have those 400 or 500 variables; he brings them down to about 40 new dimensions that combine all of them, the computer sees only 40 columns, and it becomes much easier to work with. That is the purpose, and the same idea works here.
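Here is a minimal sketch of PCA used that way, assuming scikit-learn. The three correlated columns (age, work experience, years since last degree) are simulated just to mirror the example above, and the 90% variance threshold is an illustrative choice, not something from the lecture.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
age = rng.uniform(25, 60, size=500)
experience = age - 22 + rng.normal(0, 1.5, size=500)    # strongly tied to age
since_degree = age - 24 + rng.normal(0, 2.0, size=500)  # also tied to age
X = np.column_stack([age, experience, since_degree])

# keep as many uncorrelated components as needed to retain ~90% of the variance
pca = PCA(n_components=0.90, svd_solver="full")
Z = pca.fit_transform(X)   # the new "dimensions": linear combinations of the columns
print(Z.shape, pca.explained_variance_ratio_)
```

Because the three inputs here are almost perfectly correlated, a single component carries nearly all the variation, so `Z` comes out with one column; KNN (or any other model) would then be run on `Z` instead of the original columns.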
You use principal components analysis, you use your domain knowledge, you use your linear or logistic regression, and so on, and throw away variables; all of it is fair game. The goal is to minimize the number of columns, and whether you use statistical approaches, modeling approaches, domain knowledge, or a mix of all of them, everything is fair.

Yes, Goan? "Can applying penalties help solve this curse of dimensionality?" Yes, that is something that can also be done. There are regularization methods where you add a penalty factor to your cost function. In linear regression, for example, the sum of squared errors was the function you minimized; you then add a penalty for having a lot of variables, and that is what we call lasso regression. Ridge regression, unfortunately, does not remove variables, it only shrinks their effect, but lasso will do that. Sudna, what you are writing about ridge is exactly the same idea Goan mentioned: a penalty factor added to the cost function, the error you are trying to minimize (it is added in ridge as well), and if that leads to removing some of the variables, that is another method you can employ. Perfectly fine.

"Thanks for explaining that. Slightly off on a tangent, but do neural networks suffer from the same issue, or do they handle large dimensions better?" Neural networks can handle large dimensionality better by themselves, that is fair, but if you want to deal with computational complexity there is no harm in mixing these methods with neural networks as well.

"Professor, one question on the same note: if we use KNN, it would be more explainable than using neural networks, right?" Yes, exactly. With KNN the explanation is simply "these are the points that are closest", which is very intuitive; anybody can understand it. A neural network, as of now, is a black-box model, so if explainability is important we cannot go to them. Performance-wise they often do better, but not necessarily. It is always a balance between explainability, accuracy, and what kind of problem you are solving, whether I am going for overkill on a simpler problem, and so on; all those considerations come into play.

"Professor, you talk about explainability, but in today's world can you give an example where explainability would be more important than performance?" Well, more than half the problems we deal with will require it, and that is why there is a huge amount of research going on in explainable AI, explaining these black-box models. For example, in the finance sector you are giving loans to people, and if you cannot explain why you are rejecting someone, biases can creep in. Or take the safety side: in the airline industry I am trying to predict failures of components, and maybe my accuracy will be great with a neural network, but if I cannot explain when it is going to
fail, and I cannot explain what variables are contributing, then it is a disaster; I am simply placing my faith in a neural network model. So there are those places: where financial decisions are made that can ruin people, where health is involved, where safety is involved, explainability becomes very, very important. "Understood, I think that explains it, thank you."

So that is the curse of dimensionality. The idea is pretty simple: when there are too many variables, figure out some way of trimming them down, because KNN gets hit especially badly by irrelevant features as the space becomes larger and larger, and those are the solutions.

The next issue is the computation itself. Increasing K in KNN from 3 to 5 to 7 to 19 nearest neighbors means finding more neighbors per query; we don't go that high anyway, and even that is not the biggest problem, though it adds to the computational cost. The bigger issue is having to measure distances to a lot more data points: the larger the data set, the more expensive it gets. So how do I fix that?

Here I have some data, the original data, two classes with a nice boundary separating them. I am showing only a few data points, but what if I have a huge data set? Is there a way I can cut out a lot of these points and keep only a subset, while still ensuring that my classification accuracy does not change, that when a new point comes its classification is not affected? That is the idea: not dimensionality reduction, not reducing columns, but removing rows of data without impacting the classification of new points. Remember, I am not building any model anyway, I am only storing the data; if I can store a subset of it without affecting the overall classification, the problem is solved.

I am showing a picture here, which I will explain in the next slide, where some blue-class and some red-class points have been selected. The idea in this approach is to mark every data point in the data set as one of three things: a prototype, meaning I have to keep that point because it is needed for correct classification; an absorbed point, meaning that if I remove it the classification is not affected, its effect is absorbed by the prototypes, so it is redundant; or an outlier, the class outliers we talked about last class, which we simply throw away to make the boundary smoother. In this picture there is no outlier, but these seven points are the prototypes, and I will explain the algorithm that selects them; the rest are absorbed points. If, instead of all the data points, I can keep just these seven and achieve the same classification, that is great: computation time goes down and storage requirements go down. That can be achieved by something called Hart's algorithm.
This improvement on the traditional KNN is called condensed KNN. People used to write it as CNN, but CNN nowadays stands for convolutional neural network (you will learn about those in the next course, on AI techniques and neural nets), and that would confuse everyone, so let's not call it CNN even if someone else writes it that way; let's call it condensed nearest neighbors. I am condensing the size of the data by selecting prototypes that still allow the same classification.

How do I do this? Call the entire data set X. Can I select a subset U from X with the goal that one-nearest-neighbor performs equally well on the full data set and on the subset I pick, the prototype set? That is Hart's algorithm: it works with one nearest neighbor, which should perform similarly on U and on X.

First, randomly pick any data point. That choice has some effect on where the boundary ends up, but overall not a huge impact. So: initialize U with one randomly chosen element. (This picture came from a book which did not explain which path was taken, but working backwards from the algorithm I figured out which point must have been selected first.) That point is taken and put into the other set U, the prototype set; it could have been any point, it does not matter.

Then the algorithm goes through the data set, checking each point with one-nearest-neighbor against the prototype set. Take this data point, for example: its nearest neighbor in the prototype set is the randomly selected point, because that is the only prototype so far. It says "this is also red", and that is correct, so I don't care about it; it is absorbed. Next I go to this point; the prototype set still has only one member, which is red, so one-nearest-neighbor says red, the true class is red, correctly classified, so again I leave it alone. Then it reaches this point, whose true class is blue; the only prototype is red, so the prototype set classifies it as red, but in reality it is blue, so it has been misclassified. The moment the prototype set misclassifies a point, that point is added to the prototype set. Now take this other point: there are two prototypes, the closest is the blue one, it gets marked blue, the true class is also blue, so it stays out; I don't need it. I keep doing this iteratively. Say the next point selected is this one: it now has two prototypes to measure distances to, and it finds that
this one is closer, so based on the prototype set the point is classified as blue; that is wrong, misclassified, so I add it to the prototype set, and now I have three prototypes. Next I go to the fourth point: I have distances to prototypes one, two, and three, the closest is red, so I call it red, but the reality is blue; misclassified, so it is added. The next point I look at has prototype four as its closest, which is blue, and since it is correctly classified I do not need it; it is an absorbed point. The fifth prototype comes from a point whose closest prototype is number two, so it is called blue, which is wrong, so it is added. The sixth: I drew a line to check because prototype four also looked close, but this other one turned out to be closer, so based on the five prototypes already selected it would be called red, which is wrong, so it is added. The seventh point gets called blue because that prototype is closest, which is again wrong, so it is added too. Once that is done, when a new data point arrives, one-nearest-neighbor against the prototypes says blue, which is correct, and its classification does not change; I do not need all the other data points, and the boundary stays essentially the same. Depending on the random starting point the boundary can shift slightly here and there, but more or less you are able to condense, using this algorithm, to a much smaller data set that achieves the same performance with one nearest neighbor. With three nearest neighbors it might change a bit, but overall you have solved the computational complexity problem in a big way.

So that is what it is doing: randomly start with one data point from the original data set and call it a prototype; then keep looking for points that are misclassified when the prototypes are used instead of the full data set, and add anything misclassified, which means the prototype set ends up tracing a boundary between the two classes. This is only for understanding; the algorithm does its job, but it is worth seeing that the problem is solved by reducing the data set.

That pretty much covers KNN. Let's quickly, in three or four slides, see how it is used in anomaly detection, in this particular case credit card fraud, and then we will move on to association rules. This is not the only technique for anomaly detection, a variety of methods can be used, but KNN is useful here. Credit card fraud is a very big problem globally; a couple of years back it was around 33 billion dollars annually, and it is only increasing, even if the percentage change fluctuates. I will share the data along with its description: these were credit card transactions made in September 2013 in Europe. The data set, as expected, is highly imbalanced, but K-nearest neighbors works beautifully in such cases.
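Before getting into that data set, here is a compact sketch of Hart's condensed nearest neighbor procedure described above. It assumes numeric, already-scaled data; the two-blob toy data and the `condense` helper are made up for illustration, and real implementations add refinements, but the core loop is the same: keep passing over the data and promote any point the current prototypes misclassify.

```python
import numpy as np

def condense(X, y, rng=np.random.default_rng(0)):
    proto_idx = [int(rng.integers(len(X)))]      # start from one random prototype
    changed = True
    while changed:                               # repeat until a full pass adds nothing
        changed = False
        for i in range(len(X)):
            if i in proto_idx:
                continue
            P = X[proto_idx]                     # 1-NN prediction using only the prototypes
            nearest = np.argmin(((P - X[i]) ** 2).sum(axis=1))
            if y[proto_idx[nearest]] != y[i]:    # misclassified -> promote to prototype
                proto_idx.append(i)
                changed = True
    return np.array(proto_idx)

rng = np.random.default_rng(1)                   # toy data: two well-separated blobs
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
keep = condense(X, y)
print(f"kept {len(keep)} of {len(X)} points as prototypes")
```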
It is a highly imbalanced data set: roughly 284,000 transactions in total, of which the frauds are only 0.17%, fewer than 500; 99.83% of the rows are good transactions. There are 30 independent variables. These were in fact derived from a larger set of variables: PCA was applied to create 28 new dimensions, and those 28, plus time and amount (the only two named as such in the data set), make up the 30. The 28 PCA variables are not named, to protect confidentiality; everything is masked, so we do not know which original variables make up those dimensions. Let's just see how well we can do with that.

As with any data set, we split it into training and testing. However, "training" here has a different meaning, because we do not build any model; KNN does not build a model. When we say training, we are just splitting the data and storing it; that is all the training there is. The split is what I call a stratified split, to ensure that both the training and the testing data have the same proportion of legitimate and fraudulent transactions. If you randomly split 284,000 data points containing only 400 or 500 frauds, it can happen that nearly all the fraudulent transactions land in the testing data, or all of them in the training data, and then there is no point in doing any prediction because one class barely exists on one side. The software and the tools are capable of this; you select stratified and it does it. What I got was 20% testing data, about 56,962 rows, and of the roughly 500 fraudulent transactions about 100 ended up there, so the proportion was maintained. Now I want to classify these test points based on the training data.

What we do is standardize each variable in the training data set. I am showing variables 25 to 28 here (there are 24 more columns to the left, and the Excel file will be shared); these have all been standardized, the amount as well, and the class column is almost entirely 0; only 492 rows in the whole file are 1, meaning fraud. The standardization uses the mean and standard deviation of the training data. We do not touch the testing data for this, because testing has to behave like new data about which we know nothing; in this case we do know which test points are fraud, but that is only for our final check. The training data is the only data for my modelling, or for any calculations, so I compute the mean and standard deviation on that 80%, standardize it, and then standardize the testing points using those same training statistics, never recomputing them on the test set. That is how what we call data leakage is prevented: the testing data should not influence anything about our final model and results, so testing stays completely untouched.

Then you run K-nearest neighbors. I don't know what K to use, so I let the software decide.
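A sketch of that workflow, assuming scikit-learn and pandas. The file name `creditcard.csv` and the label column `Class` are assumptions based on the description of the data set, not something stated in the lecture; putting the scaler inside the pipeline guarantees it is fitted on training folds only, which is the leakage point just discussed. On the full data this search is slow, as noted in the session.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("creditcard.csv")               # assumed file/column names
X, y = df.drop(columns="Class"), df["Class"]     # y: 0 = genuine, 1 = fraud

# stratified 80/20 split so both sides keep the same ~0.17% fraud rate
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# scaler inside the pipeline -> fitted only on training data, no leakage
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

search = GridSearchCV(pipe,
                      {"knn__n_neighbors": list(range(3, 31, 2))},  # k = 3, 5, ..., 29
                      scoring="f1",   # accuracy is useless on data this imbalanced
                      cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```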
I said: pick K from 3 to 30, in steps of two, so 3, 5, 7, 9 and so on (that takes some time, computational complexity again), and use accuracy, precision, recall, and the other classification metrics to find the best K. So I plotted them. Accuracy is useless, almost 100% everywhere: remember, 99.83% of the data is not fraud, so if I classify 100% of the points as not fraud I am still correct 99.83% of the time. The minimum accuracy I get is 99.83%, whether K is 7, 9, 11, and so on. Precision and recall, on the other hand, seem to be highest at K equal to 3. That need not be true for every data set; sometimes at K equal to 3 they are lower, then go up, then start going down again, and here too you can see them move up and down.

My confusion matrix: there were 98 actual frauds in the test data, of which 81 were correctly predicted, so 81 out of 98, nearly 83% recall, with 17 missed. Likewise, about 56,856 genuine transactions were correctly classified; I missed those 17 fraudulent transactions, and I misclassified eight genuine transactions as fraudulent. Computing the metrics: precision 91%, recall 83%, F1 87%; accuracy we do not care about here.

I also plotted the ROC curve, which can be plotted for any classification technique even if it does not compute probabilities in a principled way. The way KNN gets probabilities is simple: if K is 3 and two of the three neighbors belong to one class and the third to another, it reports a probability of about 67%, so you only ever get values like 0.33, 0.67, and 1. It is not like logistic regression with a clean mathematical formula, it is a very simplistic way of doing it, but it still gives a pretty good picture of how the classification is happening. The AUC came out at 93%, which is excellent. So KNN has done a pretty nice job of finding the fraudulent transactions.

So that is KNN, K-nearest neighbors: an instance-based, lazy, model-free learner, where I build no model, there is no training beyond saving the data, and conceptually all I do is find distances to the stored points. Yet, as we saw with credit card fraud, it can solve real problems, and we also saw the improvements for the situations where it runs into trouble, which every data science technique does in some way, along with the corresponding solutions.

"Professor, one question, or rather an observation: is it true that KNN works best when the data is fairly lopsided, where the majority of the data belongs to one cluster and only some to the other? Is it more effective in that particular case?" I don't think we can say that, because it does equally well with balanced data sets. Think about it: all it does is find the nearest neighbors of a point and classify it accordingly, and that has nothing to do with whether the classes are lopsided; the classes are separate anyway, and the question is only whether my point is closer to this one or that one. What could cause trouble is if the classes are all mixed up
in a very complex manner, but even there it is finding the nearest two or three neighbors, and I think it is a pretty powerful method in most situations.

"Okay, the reason I ask, and I was trying to visualize it in my mind, is that if the classes are well separated it works, but if they are way too mixed into each other then perhaps the accuracy of this model, and the computation it takes to get there, may not justify the results. I'm coming from that perspective." But see, other data science techniques also face that problem. The idea is that if the separations are reasonable, all the algorithms will work well; if things are too complicated, the others will also struggle, and then there will be some solution. SVM, for example, does the reverse of the curse of dimensionality; we won't cover it here, but if the data points look all mixed up, it starts increasing the dimensions, because in higher dimensions the data points are so sparse that there may be a much more linear way of separating them, and then it does a beautiful job. So all these algorithms have certain pros and cons, and if the data is really badly mixed up, yes, the others will also have problems. "Thanks, Professor."

"One more question, Professor. Because this is a lazy learning algorithm, how is the performance when we actually put it into execution? When I am actually running the model to categorize a transaction as fraudulent or not, how would that perform?" Right, so we have seen that the accuracy is good, but your question goes further: can it do that in real time? "Exactly, because a credit card transaction has to be decided in real time, and this is a lazy learning algorithm." Exactly, and it will have a problem with that. In real time it is computationally complex: it classifies at the moment the data point arrives and then starts measuring distances to all the other points, so if the data set is huge, that hurts. There are things people do, for example using appropriate data structures, or forming clusters of similar points and only searching within the relevant cluster, which improves matters, but I would still say that in real time it will have issues, being a lazy learner. "That's what I was thinking: because it is categorizing while it is seeing the data, that can impact the performance."

Okay, so that is K-nearest neighbors. Let's get started with association rules. It is a pretty straightforward technique, with essentially nothing mathematical in it. We will understand the concept and some measures for checking how good the rules are. Those measures can be used for decision trees as well, because we saw that a decision tree can also be presented in terms of rules: if age is less than 40 and income is greater than 150,000, then loan default equals no,
or something like that. So whatever we discuss about judging how good a rule is will apply to decision tree rules too. The algorithms themselves are very straightforward, so we will talk about them towards the end.

Association rule mining is unsupervised. So far we have been talking about supervised techniques, linear regression, logistic regression, decision trees, even K-nearest neighbors, where a label was there.

(Reading the chat.) Shashank, on why to use KNN in real time: if it can solve the problem we can use it, and if it cannot, then let's not. "The only reason I ask is that KNN is a supervised, classifier-type model, and it requires human intervention to fine-tune it over time, so it is not advisable to use such models in critical scenarios where time is of the essence, where you need to decide something in a fraction of a second. Supervised models are efficient once trained, but they take a long time to learn." But see, the specific issue with KNN is that it is lazy. With the other kinds of techniques, once the parameters are learned you do not have to go through the data again; take logistic regression, I just plug the values into the equation and it gives me the answer in a fraction of a second. "Yes, my point was that there are so many algorithms for anomaly detection, so if this one is lazy we can avoid it." Sure.

Okay, association rules. Fun topic, and I have some questions for you. It all started from market baskets, so let's look at this market basket with a lot of stuff in it. First question: who is more likely to have purchased the above basket of groceries, this gentleman here, or this couple here? I know these people are Indians, so it is an Indian context, and if others are unable to figure it out I will answer it anyway. Richard says the couple. "Unfair question!" I know, Jisha, I was thinking the same and have not been able to come up with a better idea, but I'll clarify. Looking at the items in the basket, can you figure out who bought them? "Take out the wine bottle and then it's the couple." "The blue t-shirt person; I don't know, is that wine? If that is wine, then both." Well, people in this group have a lot of international exposure. "Maybe it's oil." Yes, I am assuming both of those are oils, this one olive oil and the other some oil as well, but if it is wine, then the gentleman. "The couple appears too poor for this basket." Jisha, hold on to that thought, you have a valid point
there. "Too many items for the gentleman to carry." "It depends on who wants to eat what." "The gentleman buys to donate to charity." What I like is that you are beginning to talk about the people and their behaviors just by looking at this basket; you are even answering the second part of my question, what can you tell about them, and of course it is always a matter of probability. "Previous buying patterns: the couple would never buy items in bulk." Good points. I thought only I knew about my parents, but now a lot of people in this group seem to know them too: that couple are my parents, and the gentleman in the blue shirt is a very good friend. "The couple are based in London, not Mumbai." Well, they are in Hyderabad, right next door to my apartment, probably in the other room right now since it is dinner time, in my brother's apartment opposite mine; my brother is in the US, though. "Healthy eating appeal." "A big extended family they live with." "A range of veggies." "I'd bet my last pound they are not from Hyderabad, they are based in South Kensington." They are very much in Hyderabad, Jisha. And this gentleman is a very close friend of mine. Samson now says vegetarian, and so on; you are beginning to infer a lot of things, which is exactly the point: all of you were pulling out insights just from looking at the basket. He is now a US citizen, currently in Bulgaria, with a PhD from MIT, a very down-to-earth, simple gentleman, and they are my parents, as I said. If it were wine, I don't think either of them would buy it; we don't drink and neither does he, so I am assuming it is oil. But the bread and the mushrooms and especially the pickles, these are not the kinds of pickles we eat in India, which is why I said Indians might be able to tell, unless they have had international exposure. Samson mentioned the buyer is a vegetarian; yes, all of them are vegetarians. And these are the kinds of pickles my father would happily experiment with, he will try anything you give him, whereas my mother wouldn't even try. Anyway, you got so many insights just from looking at some veggies, some oil, and some bread (we do eat bread, but not that kind of loaf). That is the whole point. So let's get into this for a few minutes and then we'll take a break.
Market basket analysis essentially asks which products people buy together. That depends a lot on individual behavior and habits, but in the larger context, when hundreds of thousands of people are buying things, which products generally go together, so that a marketing team can decide how to promote them?

One classic case that gets talked about, and I think the data was made up, but it is memorable: diapers and beer in the same basket. The finding was that men between 30 and 40 years of age, shopping between 5:00 p.m. and 7:00 p.m. on Fridays, who purchased diapers were most likely to also have beer in their carts. So now I have a global context, not limited to the Indians in that picture, and the key elements are: men, 30 to 40 years of age, 5 to 7 p.m., Fridays, and beer and diapers both in the cart. From these kinds of findings I can come up with rules. What, men who purchase anything in a Walmart are likely to have beer in their cart? "If I go, I wouldn't, I don't think." Just a joke, Rahul, I know, valid point. "If diapers, then beer; very famous." Yes, it is very famous, I think I have a link in the next couple of slides; it used to be believed it was real, and then someone said it was made up just to make this concept easier to understand, even on a Friday evening.

So "if diapers then beer" is one of the rules I can get. I can get more: if gender is male and product one is diapers, then product two is beer. They are if-then rules: if gender is male and age is 30 to 40 and day is Friday and product one is diapers, then product two is beer. Or: if diapers and beer, then male, as Rahul mentioned; or if male, then beer. There is no concept of dependent and independent variables here; we look for all kinds of associations, this way and that way, and all possible combinations are acceptable in association rules. That is also a problem, which we will talk about after we take a break, because it really becomes unwieldy.

The benefit of association rules, a very powerful technique using practically no math except counts, is that it is not just for market baskets. It works anywhere you can lay the data out as zeros and ones, transaction IDs and items: milk, diapers, beer, and so on for the first person, the second person, the third, and what each had in their basket. If there is continuous data, that can also be converted into this form; association rules normally work on 0/1 data, and people have come up with variations to handle continuous data by discretizing and other tricks, and still get rules out. So it works for many different domains. In a retail store, where it obviously started, you can plan store layouts, placing things that are bought together close to each other or far apart (that is a business decision, which we will talk about shortly, maybe after the break), cross-sell products, design promotions, design your catalog if you know certain items go together, and various
other things. In medical diagnosis (at the end of the class I will point out some research papers you can read on how people use it for unstructured data as well; it does a beautiful job with text, with images, and so on): co-occurrence of diseases, treatments and complications, symptoms and diseases, which symptoms go with which disease. In the credit card business: which kinds of products and services a high-net-worth customer buys. In IT: people visiting this page next go to that page, which we call sequential association rule mining, what steps people take, which page is associated with which. For fraud again: unusual combinations of claims can be a sign of fraud. And many more. It is a very nice technique, only about 30 or 31 years old, so pretty new, and as we will see shortly, the algorithms involve nothing more than counting.

The goal is to find the items that are purchased together, or co-occur, very frequently; we call that a frequent itemset. Some terminology: the set of all items is called an itemset, and from there I want to find the frequent itemsets, the ones that occur much more commonly than you would expect by chance if there were no association between them.

What we get is an if-then rule: if X and Y and Z, then A. The items on the left, X, Y, and Z, are called antecedents, and A is called the consequent. The length of the rule is the number of antecedents; we do not count the consequent, only how many things are on the left side. That is one of the ways people keep the number of rules down: by saying, for example, that the maximum length allowed is 10, so you cannot have too many conditions on the left side, otherwise the algorithm will find all kinds of rules; we will see that shortly. So that is the terminology, and the itemset here is just everything: X, Y, Z, and A.

Examples: "if diapers then beer": diapers is the antecedent, beer is the consequent, and the length of the rule is one, because only one thing is on the left. "If gender is male and product one is diapers, then product two is beer": the antecedents are male and diapers, beer is the consequent, and the length is two, the count of antecedents. Like that we can work out the other rules from the previous slide. Incidentally, the expert systems that were popular a few years back were built using exactly these if-then rules.

Association rule mining (ARM) looks for associations between the columns of the data. When we come to clustering next week, also unsupervised, we will be looking at relationships between the rows of the data; that is the big difference between the two. But ARM signifies neither causation nor the kind of strength of relationship that correlation can capture.
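A tiny sketch of the 0/1 transactions-by-items layout mentioned above, and of counting how often an itemset co-occurs; the baskets are made up, and in a real run "frequent" would be defined by a threshold rather than by eyeballing the count.

```python
import pandas as pd

# rows = baskets (transaction IDs), columns = items, 1 = item was in the basket
baskets = pd.DataFrame({"milk":    [1, 1, 0, 1, 1],
                        "diapers": [1, 1, 1, 1, 0],
                        "beer":    [1, 0, 1, 1, 0]})

# how often the itemset {diapers, beer} appears in a basket
together = baskets[["diapers", "beer"]].all(axis=1)
print(f"{together.sum()} of {len(baskets)} baskets contain both")   # 3 of 5
```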
items occurring together more commonly than if there had been no association between them at all; that's it, more or less commonly than expected. So there is some relationship, some association, but no correlation, no causation, nothing. We write it as 'if X then Y', where X is the antecedent and Y the consequent, and there is no causation implied; and 'if X then Y' can be very different from 'if Y then X': 'if men then beer' and 'if beer then men' can have very different values. Correlation, on the other hand, is symmetric: X and Y, Y and X, same correlation. Oh okay, I did not realize I had written that here; same example, good. So that was when we had beer and men and I had this rule. Now if I have two items, HDTV and voltage stabilizer, I can have: if HDTV then voltage stabilizer, and if voltage stabilizer then HDTV. Let's just see the problem, and then we will come back after the break to talk about the algorithms that solve it. Three items: let's now say we have an HDTV, a voltage stabilizer and a soundbar, and I want to look at all the rules that can come out of this. Well, I can have: if HDTV then voltage stabilizer, if HDTV then soundbar, if HDTV then voltage stabilizer and soundbar, if voltage stabilizer then HDTV, and so on; I have 12 such rules that I can generate. That was with three items; now what happens when the number starts increasing? These are the familiar displays: books frequently bought together, frequently bought together with Artificial Intelligence, third edition, customers who bought this item also bought this, customers who bought items in your cart also bought this, products related to this item. So customer behavior, product behavior, which products go together, how customers behave, what types of customers: all kinds of associations are there, and the number of rules goes up exponentially. Sorry, just to clarify, the length of the rule? The length of the rule is nothing but the count of how many items are on the left-hand side. Here the rule is 'if HDTV then voltage stabilizer', so how many antecedents? Only one; that is the length of the rule, and I don't care about the right side. If you buy an HDTV and a voltage stabilizer, then you are likely to buy a soundbar also: HDTV and voltage stabilizer are two items on the left, so the length of that rule is two. It is nothing more than the count of antecedents. So how many rules are possible? Well, the number of rules is given by the formula R = 3^d - 2^(d+1) + 1, where d is the number of items. If I have two items: 3 squared is 9, 2 cubed is 8, and 9 minus 8 plus 1 gives two rules; that's manageable. Three items was 12, four is 50, 10 items is about 57,000, 20 items is about 3.5 billion rules, and it keeps growing for 25 items, 500 items, and these are still small numbers. Look at the number of books Amazon stocks and it gets crazy; we can't deal with that. So the rules explode exponentially as the number of items grows, but many of those rules are useless. If you buy an HDTV you might very likely buy a voltage stabilizer, depending on which part of the world you are in and what fluctuations happen, because a voltage stabilizer ensures that high fluctuations are not passed through and protects the equipment, so that rule makes sense. But if I buy a voltage stabilizer it is not necessary that I'll buy an HDTV; I might be buying a refrigerator or something else. So there are many rules; association rule mining will create everything, all combinations of the items, but many of them are useless.
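A minimal sketch of that count, using the formula just mentioned, R = 3^d - 2^(d+1) + 1 for d items (the function name and the chosen values of d are just for illustration):

```python
# Each item can go into the antecedent, the consequent, or be absent (3**d ways);
# inclusion-exclusion then removes the assignments with an empty antecedent or an
# empty consequent, leaving 3**d - 2**(d + 1) + 1 possible rules.
def num_rules(d: int) -> int:
    return 3**d - 2**(d + 1) + 1

for d in (2, 3, 4, 10, 20):
    print(d, num_rules(d))
# 2 -> 2, 3 -> 12, 4 -> 50, 10 -> 57,002, 20 -> roughly 3.5 billion
```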
A question comes up: since there is no model, can we categorize the items into genres to reduce the rules? Well, we can do that, and it certainly reduces the number; that is one thing that can be done. But I might still want individual products, not just a label saying these are dairy products, and even after grouping it might still be a lot; combining things into something called 'cleaning supplies' may not give me much more insight than knowing whether it is a mop, a bathroom cleaner or a floor cleaner. And there is no loss function, because again no model is being built, but we still need to figure out how to cut out these kinds of rules. There are statistical methods and there are business-focused approaches, one of which Sudna also mentioned, that you can combine things; but how do I make sure I have a more reasonable set of rules to deal with? That is what we will cover after the break. It is 8:00 exactly on my watch here in India; 8:10, ten minutes, let's come back, and we will quickly see the measures, very simple calculations, and then move to those two algorithms, which are extremely straightforward. So let's take a break, come back in exactly ten minutes; I'll grab a glass of water as usual, and if there are questions, sure. Welcome back, let's get started. We saw that association rules essentially look at how frequently items appear together in a data set, and whether that is a random association or whether the items are really associated with each other in some way. We can create if-then rules where the items on the left-hand side, say X and Y and Z, become your antecedents, which gives the length of the rule, three in that case, and A, the thing that then happens, is the consequent. We also saw that the number of rules can go up exponentially as more items come in; pretty quickly, with only 20 items, it goes into billions of rules, because all combinations of two items, three items, four items, twenty items, on the left and on the right, get generated, and it just becomes too much. Many of those rules are meaningless, so how do I get only useful rules and not have all of those to deal with? Because we don't build a model, there are no error metrics like precision, recall or mean squared error; there is nothing to compare with, it is not a supervised method, there is no label to check. But there are statistical and business ways of reducing the number of rules, so let's start looking at them. 'Sorry sir, one quick question: can we understand this as the hyperparameter tuning of this algorithm?' There are no hyperparameters here, because there is no model; we are just looking at associations. So let's say I have this data, 13 data points, and the first measure of how good a rule is is called support: how much support does this rule have? The same idea can be applied to decision trees: for the classification you can check precision, recall and so on, but when you have the if-then rules that get generated from the tree, you can apply these measures as well.
So support is the first one. Let's say I am evaluating the rule 'if credit card average is medium, then loan is accept'. Support is: of all the transactions in my data set, how many does this rule cover, meaning the fraction of the overall transactions where both my antecedent and the consequent appear together. For 'if credit card average is medium then loan is accept': these are the acceptances, these are the rejections in the loan column, and these are the medium credit card averages, three out of 13. That is my support; the support of 'if X then Y' for this rule is 3 divided by 13, about 23%, which means 23% of the data points in my data set support this rule. As simple as that; that's support, done. But that is not sufficient, so let's continue. There is another measure called confidence. Confidence is: what fraction of the transactions where the credit card average is medium also have the loan accepted? 'Given' is the key word here: given that medium is the credit card average, now tell me what fraction of those transactions have accepted the loan. Given essentially means that I have filtered my data set based on my antecedents. I had 13 data points, but given the credit card average is medium, these are the three out of those 13, so I filter down to those three data points and check how many of them have loan equal to accept, and in this case all three of them do. So 3 divided by 3, that is the confidence: given that the credit card average is medium, I can confidently say that the loan will be accepted. The previous one was support: out of all the transactions, in how many do the antecedent and the consequent, all the items of the rule, appear together? That was medium and accept appearing together in 3 out of 13. Confidence: given the antecedent, given that I have only the people whose credit card average is medium, now tell me how many accept, and that is 100%. So 23% of the overall data set supports my rule, and I have 100% confidence that if the credit card average is medium the loan is accepted. This is another metric, but there is a challenge; there is always a 'but', and we will get to it. Confidence can also be written as the support of X and Y divided by the support of the antecedent X: support of both was 3 out of 13, and support of X, medium, is also 3 out of 13, so that way too we can calculate it. The idea is: if I filter the data to only the points where the antecedents appear, then out of those, how many have the consequent? 'Can I go back to the previous slide?' One second. Thank you. So in the first case I am looking at both the left-hand side and the right-hand side out of the total data set; in the second case I am filtering the data set and keeping only the transactions where the left-hand side appears. This is my entire data set; everything else is thrown out, filtered, I don't need it, and now I ask the same thing as before: in how many of these cases is the right-hand side there? That's it.
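A minimal sketch of those two counts in Python; the 13-row table below only mirrors the numbers read out here (3 medium rows, all accepted, 7 acceptances in total), and the split of the remaining rows into high and low categories is assumed purely for illustration:

```python
# Support = fraction of ALL transactions containing both sides of the rule;
# confidence = the same count divided by the transactions containing the antecedent.
records = (
    [{"cc_avg": "medium", "loan": "accept"}] * 3   # the three "medium" rows, all accepted
    + [{"cc_avg": "high", "loan": "accept"}] * 4   # assumed split for the other 4 accepts
    + [{"cc_avg": "low", "loan": "reject"}] * 6    # assumed split for the 6 rejects
)
N = len(records)                                   # 13 data points

both = sum(r["cc_avg"] == "medium" and r["loan"] == "accept" for r in records)
antecedent = sum(r["cc_avg"] == "medium" for r in records)

support = both / N              # 3 / 13  ~ 0.23
confidence = both / antecedent  # 3 / 3   = 1.00
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```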
Now let's reverse the rule and say: if loan is accept, then credit card average is medium. What will happen to the confidence? Anybody? 'Three out of seven.' Sorry, I heard three out of... you went on mute. 'Three out of seven.' Yes, three out of seven. 'Loan is accept' is these seven records; I have filtered the data with the left-hand side, which is these seven data points, and of them, how many are medium? Three out of seven. The support is still the same 23%, because support doesn't care about left-hand side versus right-hand side; it only asks whether all the items appear. So the support has not changed, but the confidence changes depending on which side is the left and which is the right of the rule. One of the ways we filter rules is by setting what we call minsup and minconf, minimum support and minimum confidence. Minsup numbers vary: in some data sets people take even 1%, but 5% to 10% is normal, and if you want to cut out even more you can take 20%; these are rough rules of thumb. Minimum confidence: once you have filtered the data, how many have the consequent; people take 40%, 50%, 60%. These are some ways I can cut out a lot of the rules, but they are still not sufficient in certain data sets, so let's see what more we can do statistically. Let's evaluate another rule in this data set: if pregnant, then female. The rule obviously makes sense, but at the same time it adds nothing; it is something we all would know. The support: pregnant and female together appear in seven out of the 13 records, so 54%. The confidence: if pregnant, then female; keep only the pregnant records, seven out of seven, 100%, a lot of confidence. But is this rule adding any value? It is common knowledge that if pregnant then female. When a lot of data points have the consequent, female, anyway, the rule is not really adding more value: if the fraction of records having female is large, then whether you tell me pregnant or not, the transactions containing female are still a lot. That is when we need another measure, called lift. A comment from the class: these rules need updating with time, since nowadays someone who is pregnant can identify as male. Fair enough, Mano; we could even say the father is 'pregnant' in a sense, but let's go with the traditional medical definition here, not the updated one. Point taken, but let's focus on the rule itself; you are right that if the data set reflects that, the rule may not have much meaning for male and female. So there is another measure we need, called lift: how much more is this rule adding, how much more knowledge is it giving? If the fraction of records with female by itself is very high, this rule may not be adding much value. Let me define lift, then we will come back to this, and then we will do another example that clarifies it a lot more. What I do is divide the confidence by the fraction of the records where the consequent, female, already exists. Here the confidence was seven out of seven: if I keep only the pregnant records,
then seven out of seven of them are female. And what is the support for female by itself, without filtering by pregnant at all? Taking the entire data set, I have 10 out of 13 records as female, and so I have a lift of 1.3 in this data set. That is the calculation; the same thing can be expanded using confidence equal to support of X and Y divided by support of X, but let's not worry about that. What is the interpretation? If the lift is equal to one, then what am I saying? That the confidence of 'if X then Y' is the same as the fraction of Y on its own. Support of Y is just how many transactions have Y, the fraction of Y; support of X and Y is where both of them appear together. So if the lift is equal to one, the proportion of Y given X is equal to the proportion of Y, and whether I give you X or not doesn't matter. I'm not talking about the 1.3 now; I'm talking about a situation where the lift equals one, which means the proportion of Y given X equals the proportion of Y, which means giving me X makes no difference, which means X and Y are really not associated. That is a lift of one. 'I think everything went bonkers; can you just go back and explain?' Sure, I'll show it with another example; I understand, that's why I'm taking it slowly and repeating it, and I will do another example. 'Professor, I think if you use female and pregnant instead of X and Y to explain it, it will be a lot easier to understand.' Okay, then let me do that here, and for that let me even change the data; I'll undo it later. Let me just make every record female for the time being: these females are not pregnant and these females are pregnant. In that case, for 'if pregnant then female', the confidence: 'if pregnant' means I am filtering down to these seven records, and all seven of them are female, so seven out of seven. Is the confidence okay? Yes. Good. Now the support for the consequent, female, is 13 out of 13; all the records are female. So 7 divided by 7, over 13 divided by 13, is 1. What does that lift of one mean? It means the proportion of females given pregnant, which is the confidence, is equal to the proportion of females, which means telling me pregnant or not is not adding any more value. Makes sense? 'So the gender doesn't play a role in this data set, is what we are trying to say, right?' Exactly: in this particular data set, the proportion of females given pregnant is equal to the proportion of females, so the information 'given pregnant' has no value for finding the proportion of females, and that means pregnant and female are not really associated in that data set. Agreed? 'I understood it now, thanks.' That is when the lift is equal to one; let me undo the change and go back. 'So Professor, can we say technically that when two variables are highly correlated their lift would be one, and when they are not correlated...' Let's use the word correlated loosely there: if the lift is one, then there is no correlation at all, no association at all.
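Written as a minimal sketch, lift is just the confidence divided by the support of the consequent, using the counts from the slide (13 records, 7 pregnant, all of them female, 10 females in the original table; setting the female count to 13 reproduces the modified all-female version):

```python
# lift(X -> Y) = confidence(X -> Y) / support(Y)
N = 13
pregnant = 7                  # all 7 pregnant records are female
pregnant_and_female = 7
female = 10                   # set this to 13 for the modified all-female table

confidence = pregnant_and_female / pregnant     # 7/7 = 1.0
lift = confidence / (female / N)                # 1.0 / (10/13) = 1.3
print(round(lift, 2))                           # 1.3; with female = 13 it drops to 1.0
```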
Again, with the example: what is lift telling me? The ratio of the proportion of females given pregnant to the proportion of females. If that is one, it tells me the proportion of females given pregnant is equal to the proportion of females, so what benefit am I getting? When I'm trying to find the proportion of females, whether you tell me they are pregnant or not adds no value if the lift is one. If the lift is one, the confidence in the rule is the same as the proportion of Y; and what is the confidence? The proportion of females given pregnant. I think the next example will be much easier, because with pregnant and female we go back into the reality of the world and think from that perspective, and it throws us off. But let's just focus on this part of the statement: the proportion of females given pregnant is equal to the proportion of females, so what value is the information 'given pregnant' adding to the rule 'if pregnant then female'? Everything is female anyway, so what is the point of giving me that information? If these were all female, the lift is one. Let's park that thought; if it is still confusing, the next example will make it very clear. Let me define what a lift of one, more than one, and less than one mean, then we will do that example, and if it is still not clear we will come back and discuss it in more detail. If the lift is greater than one, as we see here, then pregnant and female are occurring together more often than I would expect if there had been no association between them; X and Y are occurring together more often than expected under no association. Likewise, less than one means they are occurring together less often than expected if there had been no association: a negative association. Now let's go to the next example, which is much easier, tea and coffee, and work it out with some numbers for support, confidence and lift; that closes this statistical view of filtering the rules. There are more metrics as well, but these are the most common. 'If tea then coffee' is the rule we want to evaluate, and I have three different cases. Let's just look at the table and understand it: 100 records in total; 80 of the people bought coffee, 20 did not; of these 100, 25 bought tea, 75 did not. It's like your confusion matrix, except there is no prediction and reality here, just the co-occurrence of two products, coffee and tea. 20 people bought both coffee and tea; five people who bought tea did not buy coffee; 15 people bought neither; 60 people bought coffee but no tea. That is the data, and there is some other distribution in the second case and the third case. I want to evaluate the rule 'if tea then coffee', so the support will be tea and coffee together, which in the first case is 20. 'One second, sir, is there another drink also involved here? Out of the 100 people, 80 bought coffee, 20 bought tea and coffee; I'm trying to understand the matrix. Is there another drink involved, or is it just an option of tea or coffee?' Just tea and coffee here; if there were another one, we would have to draw a bigger matrix. 'So we can understand it like 20 people buy both items
whenever they come, and some only drink tea and don't prefer anything else, right?' Drink, maybe, maybe not, maybe someone in the family does; let's say they buy them, yes. So is the table clear, at least the first one? Is there any question on this table before we actually evaluate the rule? Anybody? In total, data on 100 people: 80 people bought coffee, 20 did not; out of these 100, 25 bought tea and 75 did not. Looking at the cells: 20 people bought both tea and coffee, 15 people bought neither, 60 people bought coffee but no tea, five people bought tea but no coffee, and that is what makes up this data. 'It's clear now, thank you.' Now I want to evaluate the rule 'if tea then coffee'; support, confidence and lift are what we will calculate. Tea and coffee: 20 people buy both out of the entire data set of 100, so the support is 20%. Confidence: if tea, then how many people are buying coffee? I only want the data where people have bought tea, and out of them 20 bought coffee, so 20 divided by 25 is my confidence. Lift is the confidence divided by the fraction of people who bought coffee overall, which is 80 divided by 100. Let me put that in the table and show the numbers. Support is 20 divided by 100; any question with that? Both tea and coffee, over the total number of transactions, is the support. Hope that is okay with everyone; I'm only showing the first case. 'Yes sir, understood.' 'Question: why would it not be 25 out of 100? Because that's the proportion of people who bought tea out of the 100 people.' Correct, but the support, as we defined it, is where both the antecedent and the consequent appear. 'Sorry, understood, thank you.' So it's 20, that is 20%. Now confidence: I filter my entire set of transactions and consider only the ones with the antecedent, the left-hand side, which means the 25 people who bought tea; those who did not buy tea are removed for this rule. Then I check how many of those transactions have coffee, and out of these 25, 20 had coffee, so 20 divided by 25 is my confidence. Everyone okay with that? Now lift, which is where we were all stuck. I am evaluating the rule 'if tea then coffee', so I want to see the association between tea and coffee, and I want to see what the support for coffee by itself is, because the question is: if I filter based on tea, how much more value am I getting in this rule? For 'if tea then coffee' I have the confidence; what is the proportion of coffee? 80 divided by 100. So if I divide the confidence of 'if tea then coffee', which is 20 out of 25, by the proportion of coffee itself, 80 out of 100, I get one. This case gives me a lift of one, and what it says is that the proportion of coffee given tea is the same as the proportion of coffee in the entire data set. By telling me about the tea you are not adding any value with respect to the purchase of coffee; tea and coffee have no association if this is the data set. It is random, there is no real association, because my rule gets no additional lift from the information about tea. Let's look at the second case, and then the third will be self-explanatory, and I will put it in English on the next slide. Now here, when
people buy tea, 20 of them are buying coffee, the same as before. But look at coffee: only 25 people in total have bought coffee here, not 80. Of the people who buy coffee, five did not buy tea and 20 bought tea; likewise, of the 25 people who bought tea, 20 bought both and five did not buy the other product; and 70 bought neither, no tea and no coffee, so those people don't drink either. When I do the same calculation, I get a lift of 3.2 for this rule, which is much more than one, which means tea and coffee are highly associated. How is that? What it says is that the proportion of people buying coffee given tea is about three times the proportion of people just buying coffee: the proportion of people buying coffee is 25 out of 100, 25%, but the proportion of people buying coffee given they have bought tea is 20 divided by 25, 80%. 'It says that if they buy tea, they are three times more likely to buy coffee.' Exactly, that is the perfect statement, thank you: if they buy tea, they are about three times more likely to buy coffee than if there had been no association between tea and coffee. With no association, people would be buying tea or coffee randomly; here, when they buy tea, they are much more likely to buy coffee as well. And when they don't buy tea, you see they are much less likely to buy coffee: when they don't buy tea, they mostly don't buy coffee either, and very few do, but when they buy tea, a lot of them buy coffee and very few do not. That shows it very well: buying tea is highly correlated, or associated, with buying coffee. The third case is the reverse, where the lift is less than one, which means that if they buy tea they are less likely to buy coffee compared to a situation with no association between them, so there is a negative association: if I buy tea, I am less likely to buy coffee than if there had been no relation at all. That is what the lift captures. 'Or we can say that if we buy no tea, we are more likely to buy coffee.' Yes, if they don't buy tea they are more likely to buy coffee, so 'no tea' has an association with coffee; but we are mainly looking at associations between products that are bought, so we look at tea and coffee rather than no-tea and coffee. You're right, though: in this third case, if I don't buy tea I'm more likely to buy coffee, and if I buy tea I'm less likely to buy coffee; there is a negative association here. 'Sorry, Professor, so technically from this we can analyze which rules really make sense, what the strength of a rule is?' Technically yes, that is exactly what support, confidence and lift are doing.
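A minimal sketch that recomputes the three tea-and-coffee cases from their 2x2 tables; the first two use the counts read out here, while the third table's exact counts were not read out, so those numbers are assumed purely to produce a lift below one:

```python
# Each table lists the four cells of the 2x2 matrix: (tea & coffee), (tea only),
# (coffee only), (neither). All three tables sum to 100 transactions.
cases = {
    "case 1 (no association)":           dict(both=20, tea_only=5,  coffee_only=60, neither=15),
    "case 2 (positive association)":     dict(both=20, tea_only=5,  coffee_only=5,  neither=70),
    "case 3 (assumed counts, negative)": dict(both=10, tea_only=15, coffee_only=60, neither=15),
}

for name, c in cases.items():
    total = sum(c.values())
    tea = c["both"] + c["tea_only"]                 # transactions containing tea
    coffee = c["both"] + c["coffee_only"]           # transactions containing coffee
    support = c["both"] / total                     # tea AND coffee
    confidence = c["both"] / tea                    # coffee given tea
    lift = confidence / (coffee / total)            # vs. buying coffee anyway
    print(f"{name}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# case 1 -> lift 1.0, case 2 -> lift 3.2, case 3 -> lift below 1
```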
Looking at all three of them together is what gives value, because we saw an example where the confidence was 100%, but if the right-hand side appears a lot in the data set anyway, that is pretty much known information and the rule adds little when you give me the left-hand side; for that reason we came up with another measure, lift. All three put together tell me how strong a rule is and what value it adds. What does support tell me? Of all the transactions I have, what proportion does this rule cover, meaning in how many transactions out of the total do all the items in the rule appear; pretty simple. Confidence says: fine, that is the support where both sides occur, but now I am saying 'if this, then this', so filter the data to only where the left-hand side is given; given the transactions of tea buyers, what proportion include coffee? If that is high, good. But what if a lot of people buy coffee anyway? Then the additional information about tea is not adding much: yes, the confidence calculation gives a large number, but maybe that is just because a lot of the data set is coffee buyers, and then, given tea or not given tea, how does it matter, since the coffee buyers are there anyway? To add to that knowledge of confidence, lift takes the confidence and divides it by the proportion of the consequent in the whole data set; if that proportion is similar to your confidence, the lift is going to be one. But if the lift is more, then what it is really saying, as Zah put it in nice words, is that if people have bought tea they are three times more likely to buy coffee than if there had been no association between them; with a lift of one the proportions would have been equal, while this is three times more. 'To summarize, what we are really looking at is the information gain we get from the lift: if it is one, then it is random, because there is no association; if it is more than one, they are positively associated and more likely to happen together; and if it is less than one, they are negatively associated and the opposite is true.' Absolutely, that is the summary, and that is exactly what is on the next slide. All the cases here have strong support, 20% is strong support, and 80% is huge confidence, but lift adds the information about whether there is a real association or not, and whether it is positive or negative. In the second case, when people buy tea there is a very good likelihood they will buy coffee; in the first case it is random, there is nothing associating them, even though the support and confidence are huge, and that is because a lot of the data already has coffee in it; the lift of 3.2 means that when people buy tea they are very likely to buy coffee as well; and in the third one, if they buy tea they are less likely to buy coffee than if there had been no association. That is what lift adds, and together these three help us evaluate rules, because there is no metric like precision, recall or accuracy as in supervised learning. They help us filter the rules: I can say I want to keep rules that have at least 10% or 15% support and 75% confidence, and once I have them I'll sort them in order of lift, since the higher the lift, the stronger the association, and then I can take a look at those and remove the ones with a lift of one or close to one, which don't add much.
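That recipe, generate the candidate rules, keep only those above minsup and minconf, then sort by lift, can be sketched in a few lines of plain Python. The baskets, items and thresholds below are made up, and the sketch deliberately enumerates every candidate rule, which is exactly the brute-force cost that the algorithms coming up next class are designed to avoid:

```python
from itertools import combinations

# Made-up baskets, one set of items per transaction.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"milk"},
]
items = sorted(set().union(*baskets))
N = len(baskets)

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / N

MINSUP, MINCONF = 0.10, 0.60
rules = []
for size in range(1, len(items)):                     # antecedent length
    for antecedent in combinations(items, size):
        for consequent in items:
            if consequent in antecedent:
                continue
            A = set(antecedent)
            sup = support(A | {consequent})           # support of the whole rule
            if sup < MINSUP:
                continue
            conf = sup / support(A)                   # confidence = sup(X and Y) / sup(X)
            lift = conf / support({consequent})       # lift = confidence / sup(Y)
            if conf >= MINCONF:
                rules.append((antecedent, consequent,
                              round(sup, 2), round(conf, 2), round(lift, 2)))

for rule in sorted(rules, key=lambda r: -r[-1]):      # strongest association first
    print(rule)
```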
There are other measures, statistical ones, p-value based such as chi-square tests, that also assess association, but these are the most commonly used. This next example, I think, will help us understand lift the best; the beer always helps work things out. I will talk about lift here in a more intuitive manner, and that should clear whatever doubt or question is still remaining. We can get some interesting discoveries. Say this is beer-and-diapers data: out of 600,000 transaction records, 60,000, or 10% of the transactions, contain beer and 540,000 do not; 7,500 contain diapers and 592,500 do not; and 6,000 contain both beer and diapers. So: total transactions 600,000; transactions containing diapers 7,500, which is 1.25%; transactions containing beer 10%; and transactions containing both beer and diapers 6,000 out of 600,000, which is 1%. Now, the transactions containing beer are 10%. If there were no association between diapers and beer, then only 10% of the transactions containing diapers should also contain beer: with no association, a random 10% of transactions have beer, so 10% of the 7,500 diaper transactions, which is 750, should have beer. But 6,000 of them have beer, and 6,000 divided by the expected 750 is 8. That is the lift, and if you do the full calculation with confidence and support, that is exactly what you get: eight. So 'if diapers then beer' has a lift of eight over just beer, where the transactions containing beer are 10%. That is what lift is doing: if there had been no association, then since 10% of all transactions have beer, it shouldn't matter whether there are diapers in the cart or not, and 10% of the diaper transactions, 750, should have had beer; but I see 6,000 having beer, and 6,000 divided by 750 is 8, which is the lift. That means there is a strong association between diapers and beer. Hopefully this way of looking at lift clarifies the strength of an association versus no association.
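A minimal sketch with the counts from the slide, showing that the observed-over-expected view and the confidence-over-support-of-the-consequent definition give the same lift of eight:

```python
total = 600_000
beer = 60_000        # 10% of all transactions contain beer
diapers = 7_500      # 1.25% contain diapers
both = 6_000         # observed together

expected_both = diapers * (beer / total)                 # 7,500 * 10% = 750 if independent
lift_observed_vs_expected = both / expected_both         # 6,000 / 750 = 8.0

lift_from_confidence = (both / diapers) / (beer / total) # 0.8 / 0.1 = 8.0
print(lift_observed_vs_expected, lift_from_confidence)
```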
Now, a lift of eight is generally not seen, but in 2004, in Florida, Walmart saw a lift of seven in the sales of strawberry Pop-Tarts over normal shopping days before hurricanes. Why do you think strawberry Pop-Tarts have that association with hurricanes? Before a hurricane we know people stock up on water and those essentials, so those sales going up makes sense; but why Pop-Tarts? This was real data, and based on it they started stocking more Pop-Tarts just before hurricanes. Any thoughts? 'Survival food.' Survival food, okay, great, and not just survival, though of course that helps: they do not require refrigeration, because the power will go off, so it doesn't matter; they need not be cooked, and there is no electricity and perhaps no cooking gas; they come in individually wrapped portions, so you take what you need and the rest stays safe, protected and hygienic; they have a long shelf life; they work as breakfast food and snack food; kids love them, adults love them. All of that is happening, and so people were just stocking up on them because they are so convenient for survival. Walmart learned this when they ran these kinds of algorithms; they did not know it before, and once they saw it they started stocking up, so business shoots up and people benefit: customers don't arrive to find things out of stock, demand and supply are maintained, they don't go to another store, and in hurricane season you can't keep driving around to stores anyway, so customer satisfaction improves as well. That is what Walmart learned in 2004, and similar good things can happen elsewhere. Many of us might relate to camcorders and VHS players, though the current generation may not, and there was an interesting learning there. Buying together is probably easy to track, but here people were buying three to four months apart: they bought a camcorder, and three to four months later they bought a VHS player. These are expensive items, so I can't afford them together, but three to four months later I have a lot of tapes and I want to watch them. So you might tune your marketing strategies to offer a discount coupon at the right time: a coupon for a VHS player given at the time of buying a camcorder doesn't help, but three to four months later, when people are thinking about buying, it does. Associations with a time lag are a bit more complex, but they were able to find things like that. 'I think the previous slide needs a little edit, actually: what you mentioned and what was written were the other way around; it said people bought VHS players and tended to return later to buy camcorders.' Okay, the other way around; I didn't read it, I was just going with the flow, but there might be a logical order this way as well: they already have a VHS player for watching movies, and now they want to make home videos, so they buy a camcorder. Good catch. So those are the calculations, and I think we will not get into the algorithms, Apriori and the others, today; they are simple, but we will leave them to the next class. Beyond those measures, you can also use your domain knowledge to filter rules. Some rules may appear interesting but may not really be: customers who rent cars also purchase travel insurance; that is probably logical, since when you rent a car you also buy the insurance unless your own policy covers it, so even if it appears interesting it may not really be. There could also be very trivial rules: people who buy shoes also buy socks; much of the time they would, the two go together anyway, so that may not be a very important
rule: it may have support, it may have confidence, but I'm not learning much from it. Then there are inexplicable rules: people who buy shirts also buy milk. It's like the spurious correlations we talked about earlier; maybe the data is showing it, and you can't throw it away blindly, you have to think about why, but despite all kinds of thinking it may not help with future marketing or promotions. If you wanted to act on it, you might say: you are buying a shirt, here is a coupon on milk, or if you buy three shirts a gallon of milk is free. I'm not going to buy shirts just to get a gallon of milk free, though maybe if I buy a gallon of milk and get three shirts free, who knows. So there are inexplicable rules, and actionability is also something to think about: if the rule ends in milk, what action would you take? Another example that used to be talked about: people who bought Dove soap also bought a Barbie doll. If that is a rule, it seems interesting, seems odd, and maybe I cannot explain it. 'Does that basically mean it is studying people's behavior? If it's behavior related then it's actionable and it's information gain; if it's something given, then it's not really information gain. Is that what it is trying to highlight?' That could be the case, yes, but behaviors are pretty complex. So suppose you really had this rule, you cannot quite explain it, but a lot of data shows it, with good support, confidence and lift. As a business owner, what kind of decision would you take if you know that Dove soaps and Barbie dolls go together? 'Professor, Dove could tie up with Barbie, or they could merchandise together.' Okay, Dove tying up with Barbie is for the manufacturers; but if I am a store? 'I'd keep them closer together.' 'I was about to say that; we can put it on the planogram, the application can use those calculations in the store, so the planogram can decide to keep them together.' Wonderful, that is one decision I could take. Any other thoughts? 'Maybe we can give discount coupons: if you buy one product, the other comes at a discount.' Yes, the Barbie doll being more expensive, maybe a discount on Dove soaps could work. And if I'm the manufacturer and it makes sense, maybe I can make a new mold and have a Dove soap in the shape of a Barbie doll, though that is much more complex. But one thing you said is that on the planogram you can put them together; the other thing I could do is say, I know people are buying Dove and Barbie dolls anyway, so let me keep them at the two extremes of the store. 'Absolutely, brilliant point.' They will buy both anyway, so now they have to walk through all the aisles, or at least a variety of them, and in the process they might end up buying some other things as well. I used to do that when I went to Walmart or other places: I have this habit, I just have to walk through everything, and the nice new flashy labels attract me, and then my wife shouts, this is our list, we need to buy this, and I say no, no, no. And the reason they used to have those nice yellow beautiful
bananas at various spots in the store was because of these kinds of associations: bananas are one of those things people just buy even if they are not on the list, impulse buying. I never understood it; I thought maybe it was just that they look so good in the store, all the same size, yellow and nice, and then I read somewhere later that it was based on these algorithms, where they saw bananas were one of the impulse-buy items. Even in the checkout line, at least back then, I don't know the status now, they used to have them: you are standing in line, you can't go anywhere, you see the bananas, they look good, and you just take a bunch and put it in the cart. That was based on some kind of data analysis and algorithms. So these are the kinds of ways we can filter the rules, but creating fewer rules in the first place is also important. The problem with all of this is that we still had to create all the rules and only then, based on support, lift, domain knowledge and so on, cut them down; with 500 items you have an unmanageable number of rules. So how do I avoid creating so many rules to begin with? That is where algorithms like Apriori, FP-growth, Eclat and a variety of others come into play, and we will see these in the next class; this is a nice point to stop. They are extremely simple algorithms: just counting the number of times something appears in the transactions, applying a minimum support criterion, chopping candidates off, then creating larger combinations of itemsets, again chopping off, and so on. It is actually a pretty short topic; in half an hour next class we should be able to cover it and finish. So this is a logical stop. Let me summarize what we have done today and then take questions. We started with the last part of KNN, where we were dealing with dimensionality: as the dimensions increase, the need for more and more data grows, so reducing the dimensions helps, and domain knowledge, statistical or mathematical methods, and feature selection can remove some features up front. For large numbers of data points, Hart's algorithm, the condensed k-nearest neighbors, is what we saw: we selected a set of prototypes such that the accuracy of the one-nearest-neighbor algorithm remains the same, so all the other, absorbed, points can be thrown away, because whether you keep them or not the classification does not change; Hart's algorithm picks the prototypes that are sufficient to do the classification, and the data gets reduced. Then association rule mining: we really only talked about finding items that go together in any kind of data set. It could be the market baskets we started with, but we will see it can be any kind of association in any kind of data set. The rules are created in an if-then manner, where if the antecedents occur, then the consequent occurs; and with more and more items the number of rules just becomes too many. So how do we filter them, how do we check how good a rule is? Support: the number of transactions that contain all the items in the rule, out of the overall
transaction data set; that is the support for the rule. Confidence: filter the data based on the left-hand side, the antecedents, and then see how many of those data points have the right-hand side we are looking for in the rule. And lift: given the antecedent, how much more am I getting compared to when it is not given, that is, compared to a situation where there is no association; how much of a lift is this rule giving me over no association between the antecedent and the consequent. Those are the measures, and we also said we can think of business approaches to filter out rules. But the problem with all of this is that, while we can measure the strength of the rules, which is good, we first have to create them. So how do we avoid creating unnecessary rules, using these minimum support and minimum confidence criteria up front? That is where Apriori and FP-growth come in, and that is what we will do next class, next week. Any questions on any of this? If not, you can drop off; I'll stay on for some time and wait for questions. 'Thank you, Professor.' 'Thank you, Professor.' 'Thank you so much, Professor.' So I'll stop sharing now. Okay Rohit, you had a question? 'Yes Professor, I just wanted to quickly go through... do you want to stop the recording?' Yeah, let me stop the recording if there are no other questions on this, or do you want me to dial in again or going