Transcript for:
Data Filtering and Merging Techniques

okay uh in this video we're going to cover the notebook that is named more on filtering data and probability calculations um as always you have um the what the notebook is about the beginning and a quick preview of of previous class as I have mentioned before uh the days may not match for uh with our summer class or any other semester that you guys may be watching this video uh but always just keep track based on the name and following the links for your your current class uh also remember that if you're in the main notebook um if you start working here and you don't make a copy you won't be able to save the the changes that you do or any line that you did practicing in my case sometimes I forget to do it and I am able to do changes and save them is because I own the notebooks but since you don't own the notebooks you won't be able to do it uh so always remember to do a copy by going to file save a copy and drive and in here you can do anything you like and save it because now you have a copy of my notebook but you are the owner of the copy okay so first we're going to prepare the data um in this case we're going to use a a data set based on e-commerce customer behavior uh and this data set provide uh some information about customer demographics purchase purchasing patterns satisfaction levels enabling detailed analysis of engagement retention and overall customer experience okay column descriptions as always is something that is really good to have so whoever is reading this notebook understand what each column is and for your midterms I I I advise for your midterm projects I advise that you do the same thing you introduce your data set and what the columns are if you have too many columns maybe introduce a few of them and like the the ones that you're going to be working on or learn uh we also going to learn how to uh uh actually we already did uh how to select the columns that you just want to focus so you can put just put the description for the columns that you will specifically focus on uh but if it's not too big I I suggest doing an uh a description for all of the columns and that's it so this data set it's uh save into my GitHub and this link uh so we're going to use this link and as the indication says we're going to uh save it into a variable called e-commerce data make sure that when you are dealing with a notebook uh like this following the indication soon you're going to have homework too uh well depending on the class might be named the the number might be different you're going to have a new homework and and in the instruction he will tell you name your variables as such and you have to follow the instruction correctly in order to be able to run it so here uh in the first cell uh you have you have to uh import the packages i only did two but if you need uh ML li or se you should add it okay so let's let's just actually add the uh M plot lift.pot I plot as plt and import seborn SNS right so these are the ones that we're going to those are the basic ones you might need others that you may add add those packages as you go but that's good so the first thing is we're going to import the the data and inspect it uh so you have some practice here to do i suggest that you post the video and you try it first before actually even checking the answers or following with the video and that way uh you can see how much are you uh retaining uh also uh even though you have the answer here I advise that if you have to take a look at something is you take a look at previous notebook so you use the previous notebooks as a reference uh which is what you should do for your midterm project uh which is and and it is like basing your project on sample codes that we have in the previous notebooks okay so so if you practice that idea and reusing the pieces of code that we we learned throughout all throughout the classes that we've been taking uh you you're going to be good for the midterm project so in this case let's just uh put the link that we're going to use let's let's just copy the link uh if you actually click the the cell like that by mistake you just have to click outside of the cell you're going to be fine so we're going to copy the link we're going to name our data frame e commerce data and we're going to read it pdr read CSV URL okay and it says inspected so to inspect it we can just um do the first rows okay so let's see and we have imported our data successfully and we can now see the first five rows uh the next part it says use the following cells to perform any checkins you may be interested in performing before proceeding so anything you like uh this is just for to like explore your curiosity maybe you want to see first of all if there's missing values that's good uh let's say the name of the data frame and then h it's no sum okay uh all the columns say zero meaning there's no missing values maybe you also want to check the data type so that's also a a good thing to do the types okay and we see the data type for each column and you may from this point you may decide to change the columns accordingly depending on the situation okay so so so far that's good then here we formally introduce a function that I have showed you before and is the unique function and this unique function allows you to find what are the unique values within uh within a column right so if your column have uh multiple repetitions for some categories this will tell you what are the unique categories if you're talking about numerical column it will tell you what are the num unique numbers that appear on your uh on your column so in this case for example we using uh the for the example we're using the column gender from our data set called e-commerce data so we just basically uh with the name of the data frame selecting the column and after that you put is unique in parenthesis open parenthesis close parenthesis and that basically going to tell you that you have two categories in this column it's male and female so h so that this is useful because sometimes uh if you have too many rows like let's say 5,000 rows and you you're not going you don't want to be scrolling to see if there's a typo somewhere where they wrote instead of writing mail with uppercase M they wrote it with lowerase M or they wrote it just M right h you want to make sure that your column has the categories that you think they're using so doing it this way it will tell it will tell you what are the unique words or number found in that column okay so now for you to try uh just finding the unique customer satisfaction uh based on the data frame right so e-commerce then in brackets going to find the column customer satisfaction okay and we're going to do that unique okay so remember you should give it a try first uh but if you run this we see that we have the unique values in these columns is 5 4 2 31 so it looks like we have customer satisfactions ranging from one to five and a discrete matter like one two three four five right here they will not show up in order because they only they what happens is they show you as they appear so if they find the first category to be filed they're going to five they're going to put that first okay so very useful function always keep it in mind to explore your your the entries on the column and now we're going to discuss logical operators in Python which we already introduced it on the first notebook but the first notebook was just to give you a brief overview of things in Python that you you can look forward to um like I say uh I going to keep uh reviewing things as as as we go especially because I understand that this classes is also serves as an introduction to Python but the good thing is that Python it's easier to uh to understand compared to other language like Java or C so so now uh if you remember uh so the the what we do uh for logical operators is we we use conditions and based on the conditions uh we can uh compare we use two conditions to come up with some true or false statement okay so as as a surfer a condition is an expression that evaluates to true or false something like uh it's raining outside right that's either yes no true false um so in some other situations that as long as you can answer those as yes or no those are the things that you use to make decisions um so that's a condition and we're going to use legal operator to uh check multiple conditions at a time so so first of all um what we could do is compare by using uh the equal operator so we can say uh and within the column whether the whe whether that entry for that for each of the entries for the column are equal to male right so when we do this if you run this right you're going to see that we're gonna get uh uh the column but with true or false values because what we are doing is uh running this uh this logical operator as in every row to find out where does this match and it's going to assign a true value there so this is like a new column but just with true or false values okay so now what can we do with this is first of all we can assign this to a variable this column right here and and the reason why we want to do this column of boolean values which is true or false is because we can use it as a filter and how do we use it as a filter is that when you pass this type of rows to a data frame what the data frame does is that it will return the only the rows where the value is true okay so you what you could do is we could do the condition and put it into a a variable in this case I decided to name it filter condition you can name it what whatever you want but just keep in mind that you uh you really want variables that are understandable and if someone else read your code they understand what is that line or that variable storing if you put like if you name your variables Pikachu and then another person tries to read it they don't they're going to have to to observe what what the code is doing to understand what is in the variable uh so it's better that you name it to something uh something according to what you're doing now that we have that filter condition if you see here what we are doing is in the instead of passing the name of the column we are passing the condition because when we pass the condition in brackets for the data frame right so uh think about this way when we do when we take a column out like this piece right here what you're doing is you're just telling the data frame give me the column that match that name that's why you just get that column Right but when you pass something like this the uh um a condition like this where you have like true or uh true or a series of true or false based on this comparison uh you're saying uh return the data frame for the rows where this where each entry on this on this new column is true so when you pass that filtering condition into the data frame you're going to see that first of all the index are going to be so uh they're not going to be in order because you are making sure that whatever uh you have in this new data frame it's only when the gender is equal to male so if you take a look here at the new resulting data frame everything in this column will say male because you're filtering the rows that match that criteria so that's the idea that's what we're going to use it uh that's what we're going to use it for now something uh it's something to to to remember is that when you do pro simple probabilities based on some event um what you do is you you find out what is the number of outcomes like that are favorable to that event for example in this case h if we're saying like what is the probability in our data frame that the person is a male okay how many males are in our data frame so right now we already did that we have 526 right so if you remember about uh the probability of a simple event uh then this is our favorable events for if we decide to say what is the probability of a person a random person being a female out of this sample we have 526 males according to this filtering and then you divide that by the total number of outcomes in the sample space which is just uh the whole amount of rows in our data frame all possible samples if we are doing the probabilities based on our data uh which in reality those uh technically you cannot call them probabilities those are called proportion but we just going to use the term interch interchangeably uh keep in mind again the correct way to call uh these type of probabilities that you do based on your observed data and this sample it's proportion right probabilities are more theoretically uh those are uh the actual chance of happening for the population okay so but for now we just we can call it like this so in our case uh like we say like for the sample that that I mentioned And uh and by the way here you can see the disclaimer about it's actually no proportion but uh you can read this more carefully but that's the idea so in our case let's suppose that we wanted to find what is the the proportion of males in our in our data frame or based on our data what is the probability of of selecting a random person and is a male well we filter the data right so we have a new data frame with only males and we can we can calculate the probability by just by just finding out how many males so how how many rows in our data frame are males divided by the total amount of rows and if we do if we if we do the length function remember from the first notebooks the length function allow you to find out how many rows so what we're doing here is how many rows the filter data data frame has which got which remember was 526 but when you do uh problems in Python or programming you don't want to type manually type numbers you want to find numbers dynamic and the reason why we want to do this is because if you're in a job maybe you're working with some data from a database right that keeps changing so you go today and you do your code and you manually assign the number to be 126 but then tomorrow you add five more customers and then there are three more males you're going to have to change that manually right but if instead you do you say the length of this data frame no it doesn't matter how much data is added into this database this is going to keep keeps giving you the to the total amount of males no matter what uh how it changes so you have to think about doing uh things dynamically and not hard coding which is the the how we call it when we assign a fixed value okay so same for this the total length uh we use the length function to find how many rows in this data and that if you divide it is the the the probability so if you run this right we see that um the probability of a random person being a male from our sample just keep in mind if you decide to call it probability you just have to specify is based on your sample not the actual probability because we don't we don't have the the the data for the population but based on our sample the probability based on our sample of the proportion in this case according to this is 0.53 uh just in case to mention uh you don't have to worry much about this but uh when you use the f strings you also have a chance to uh to uh format the amount of uh decimals so if you take a look here we don't we're not only using the variable and that's it we have this column that tof this means that we only want two decimals for this floating uh variable or double variable um I'm sorry not double I'm getting confused with Java uh yes for this uh floating floating point uh value so you can see here that the actual value is 0.526 but it's being rounded to two decimals uh you can always uh play with this and change it like to have one decimal three decimal but that's what this is doing in case you're wondering um okay so now it's just for you to practice pause this video give it a try find the probability based on this sample that the customer has a satisfaction rate of five okay so you should try that first and we could do it together so first we can do the filter condition uh you can use the same name of variable if you want to override it if you want to keep it you can say filter condition example all good it doesn't matter so we're going to do e-commerce data in this case we want customer satisfaction um if you get a red under your code means that you maybe have some you maybe not you have some error so you have to check your code uh so in this case we want to do the filtering of this the persons with a customer satisfaction rate five so we're going to do equals 5 now we're going to save this data frame uh to let's say e-commerce data customer service faction of five it's equal to ecommerce data and we're going to pass this filter okay so we're going to filter the the the the amount of persons that have a customer satisfaction rate of five maybe this is too long and you don't want to uh you don't want to write a variable that long so maybe you can do e-commerce data example to ex for example that's fine uh now we can calculate the pro the probability so let's say that is faction 5 probability it's equal to the length of the amount of people that have a satisfaction rate of five which in this case is this filter data set data frame I'm sorry divided by the total amount of rows in our original data frame right and that should give you the the probability pro print and if you want to do it uh with a f string you have to type to quotation and put f in the front and we say the probability of a random selected person from our sample having a customer satisaction [Music] rate of five is and then in brackets I mean curly braces I'm sorry you're going to put the the variable and like I mentioned before if you want to filter to if you want to format it to two decimal you can just do 2F okay if you run this we should find that we have a probability of 0.21% 21% okay maybe you also want to uh I'm sorry in here is in decimal but you want to convert it into uh percentage you can just multiply the the probability by 100 we can do it here actually right times 100 and we can also put a uh percentage sign just remember since the percentage is text it has to be outside the bra uh the curly braces so if you do that you find your probability okay you can you can round it to one decimal if you want right 20.9% uh good so that's how you do it uh here the answer might be slightly different but should be around the same area just different names uh but it doesn't matter the variable name you can assign it to whatever you want so if you got the same response as the answer that's fine uh okay so more logical operators uh for uh when you have multiple conditions so we have one called int uh and this is true only uh if both conditions are true so if you have condition one and condition two h as long as those two conditions are true then both conditions together are true uh then you also have or which is uh which is true if at least one of the conditions is true so if you have multiple conditions with or as long as one is true everything is true um and not is just uh reverse the the the boolean value so here it says true if the condition is false but uh it's also going to be false if the condition is true it's like it's just reversing you're going to reverse the logic for the the boolean bar h the relation to probability is that if you remember when we did our uh our intro to statistic class when we talk about intersection that you did the probability of a and b and then the probab and the probability of a or b uh you do those here using these operators in python okay which is uh you you probably saw this symbol when you did the stats uh but this is how you're going to write it in Python so here's just an example like if we have condition A true and condition B false if we do condition A and condition B since we're doing N and this is only true if both are true so this third part should be false for condition A or condition B since for or as long as one is true then both are true then this should be true and then for not condition A since we are flipping the boolean for condition A given that this is true it should be false okay so if you run this you're going to see that the first one is false the second one is true the last one is false and basically that's the idea and uh that's how you can do it uh so now one thing is that uh these logical operators uh directly in Python are different than for pandas so pandas have their own way to interpret the logical operators so if you have to do multiple conditions for filtering into pandas you need to uh uh use different symbols so the way that is going to work is if you want to do an N uh you need to use amperson one amperson if you if you're going to do or you need to use the this pipeline if I remember the name correctly and if you want to do not you need to use tilda okay so those are the the symbols that you're going to use if you are doing uh filtering of data frames in pandas with multiple conditions So common pitfalls uh when you combine conditions you need to separate the conditions in parenthesis example if you're saying like oh the column one is more than 30 and the column two is more than 80 you need to make sure that you enclose column one in in a set of parenthesis like this you see we have one opening parenthesis one close parenthesis and then you continue with the next condition you have to enclose each of those conditions in parenthesis uh we cannot simply write without the parenthesis because then otherwise you get the the compiler uh it gets confused on what is that you mean like do you mean 30 and the f or do you mean what like uh it's not uh in it becomes an issue for the computer to understand so we you must use uh these operators for dependent series so let's do a few examples here uh this is uh we can we can pass this filters directly into the data frame or we can separate the filters uh we're going to have both of the samples uh but either either either way what whichever you use it's correct it doesn't matter okay so first one is if as let's suppose that you want to uh so in this problem find the probability from a random person in their data frame that is male and have a customer satisfaction of five so we want to find the individuals that meets both criteria uh if you want to do that you can pass those conditions inside like this right put it in parenthesis each of those and put an ampers because that's how you say and if you want both at the same time to be true that's how you do it and then you have to be careful that you pass this inside the data frame if you do that you're gonna find the the the people that are actually male with a customer satisfaction of five right all of these filter rows are uh customers that meet that criteria so I'm just showing the first five but if you remove that head and you run it you're going to see how many actually you found okay but if you don't want to see all of these uh rows you can just keep it as that head to get only the first five okay anyways if you're curious and you want to find out without having to remove the without showing everything you want to find out the h how many rows you can do just that shape and that should give you the I'm sorry without parenthesis okay and that should tells you okay we have 114 rows comma 11 columns right so so that's just that's the amount of customers that are male with a customer satisfaction of of five uh after that you can just find the probability for those end problems like what is the probability of a random personal data frame being a male and having a customer satisfaction of five well it's the length of our new filter data which is 114 divided by the length of the all the everybody in the data frame okay so if we run this we see that we have 11% probability again you can you can convert this into percentage it doesn't matter uh so we have 0.11 which is 11% another second way is you separate the filters uh this is better in terms of being easier to read for example here for someone that is just starting in in Python it might be tricky to understand what's going on here but uh this is uh is more readable readable and clear because what you're doing is the condition one is that the gender is male the condition two is that the customer satisfaction is five and then we create a filter combining both saying that I want both to to to to be true in order for for uh for the everything overall be true so you do condition one and condition two and pass that into the data frame so this is just uh splitting each piece apart but it should work the same way okay if you if you find you check this you still get the same amount of frost right it works the same way and you get the same probability 11% so both ways work um and yes so that's how you can do it now I have a problem for you to try uh find the probability that a random person in our data frame is a male is a sorry is a female or have a customer satisfaction of three right so now it's or uh as always pause this video and give it a try and since I want you guys to try uh I'm just going to so in this case I'm not going to type the full problem together i'm just going to copy and paste the result remember try it first and then you go over the idea with this is similar to this part right here is you put both conditions which right now what are the both conditions first condition is that the the gender is female the second condition is that the customer satisfaction rate is three okay so you only have to change those so let me let me bring those in so let's actually bring the solution piece by piece so we separate the two conditions right gender is female customer satisfaction is three uh and by the way you don't necessarily need to do equals equals 3 uh maybe you have your question different and say what is the probability that the customer selected a random is a female or have a customer satisfaction more than three you can just change that to more and that's that's also something that you can use because this is just going to be true as long as the value is more than three right so it doesn't have to be equal equals always you can use any other uh type of comparison operator uh now that you have this the only thing you have to do is use your filter instead of with n you're going to use it with or so like this okay so here instead of doing n right because before we say where the prob of this condition and the other condition which was male and customer satisfaction five but now we have female or satisfaction rate of three so for or as filters in in pandas we use the pipeline and then it's the same thing as before you only have to filter your data frame and right away you uh you can answer your question so filter your data frame this line using the previous filters that you already created and with that you can find how many people meet our criteria right and how many people in total if you run this we see that there's 58% erh we have this proportion for the amount of females uh or I have an error here should be or okay the female or have customer satisfaction of three is 58% uh let me see if the answer have this issue right I also have to fix that okay let me actually do it really quick and the original notebook so that mistake doesn't go okay work good perfect so that's how you can find proportion slash probabilities based on your data uh so if you have any question let me know but just practice that part perspective can be very useful then I have a section here that is related to logical operators and is also very useful which is the idea of merging two data friends and of course here we say two but when you you are able to merge two you then you're going to be able to merge multiple of them uh so in here uh so the idea first is that you have two data frames that are related and they have something in common that you can use in order to put them together okay so uh if you take a a class on databases they expand more on that idea but things like things that are uniques like uh let's suppose I'm dealing with students and we have employed uh I can use those to merge two data maybe I have one data that correspond to the classes students are taking and another data to that corresponds with the overall performing and both datas have a hole ID's information i can merge both tables using the employee okay so that's something important to know because it helps you to to expand the possibilities for your data and specifically when you deal with databases that's how uh companies keep their datas organized when they work with tables they organize they have several tables for different purposes but all those tables are connected in some way okay they have an schema and they connect this those cables and these type of unique identifiers they have their their name uh but that will be the idea if you explore more about uh databases and and and you especially start uh learning SQL you're going to learn those things okay so here we're going to use uh uh this data set called uh e-commerce rewards program data set which is related to the previous one uh is from the same company but we see that here we have some other columns we have customer ID we have the name loyalty level like whether they are they are in some level for this reward program and the reward points how many points they have as uh for rewards so here uh I have the link for this and the name but you only have to uh run it okay so here we see we have customer ID name and this is just an extension of uh well let me not call it intention but this is related on the same company level to the other data frame that we were using at the beginning and we can use it to get more information for our customers and and have a and add them to the the data frame that we were using before okay so here I'm mentioning that uh the the the customers ids are unique identifiers for each of the customers so we can use that to put it together now before we put these two different together there's a few things to consider like how do we going to put them together do we want to only uh find the the the information that it's present in the first data frame and also presenting in the second data frame or do we want like give me the information that is either in the first one or the second one and whatever is not present just fill it up with a or give or or after you find the match only give me everything I except those there are many ways that we could actually merge these two data frames so there's something to consider first uh so let's for let's discuss about what function uh you're going to use in order to do the merge and in pandas we have a function called merge and this function allow you to merge two data frames so how you do it is first of all your first data frame which in here I call it left_d but it doesn't matter it's just whatever the first data frame is you're going to pass it as the first input and then you're going to pass the second data frame as the second input and the reason why it says left right is because usually we think about those in terms of circles uh of you maybe saw this in high school or previous classes called vend diagrams so the first data frame is just the left one second frame is the right one doesn't matter okay then you have to specify on what is the key so what column you're going to use as a as a reference that contains this unique identifier that we can use it to merge B's data and then you have to specify how you want like you want you want the columns where it where for the IDs that are following both data frames at the same time or you want you want everything it doesn't matter if you find it or not just put everything together and put NA and and like uh like like that is empty in other parts and so on so here is how you specify how so there's a breakdown here that you can read uh on your own and I also break down one of the most uh a few of the most usual type of merge um uh so for example inner only gives you row with matching keys and both that are friends this is like n then left only give you the the matching rows from the right data frame um and non batches will get na and this one right will give you all the right data frame are returned along with matches rows from the left data frame okay and the outer it give you all rows for both data frames are returned with any uh for missing matches so this is the visual example uh by using uh vend diagrams uh so this here when you take a look at inner join this is what what this is basically similar to when we do n because n is where both conditions are true and it's just this little piece maybe you guys remember from previous class but inner join is basically like doing n then full outer join is like or so full outer joint is or where you want everything not only the things that intersect but you want the things that do not intersect in the left side and the not intercept in the right side give me everything and just fill out missing information with NA then you have left uh which is just returning a and right left join and right join is just returning b okay so you're going to this is just an introduction you're going to learn more about those in databases classes um uh but something that is useful is when you want to get more information for the existing customers and like only find out for the ones that exist we can use the inner uh so we're going to merge both data frames that we have for the e-commerce data uh with the columns that are present in both data frames meaning we will perform an inner join okay so the way we're going to do this is we're going to pass our first data frame which is the e-commerce data the with the customer satisfaction all that then we're going to pass the new one that we just introduced which is the reward data and we're going to say that we're going to use the customer ID as the unique identifier column uh the primary key and how we're going to do as a inert like only the things in common if we run this we see that now we have a new data frame with the new columns um loyalty level reward point uh you see that it says name Y we're going to discuss now in a second and we see name X uh that just means that we found those two that are in columns that are in common and it kept both columns from both data frames but we can just drop one since we know they are the same right so a a little bit of inspecting this new merge data frame right so we have this merge one only have 200 rows and 14 columns because those were only the ones that have matching data uh because we only focus on the ones that h appears in both data frames so we have a full data frame with not much missing value uh missing values none because we only wants the ones that appear in both tables and the columns it's here you can explore but you're going to see that we have name underscorex name underscore y and it's because it's keeping uh the name column in both dataf frames even though it's repeated so what you can do is we can drop it we learned how to drop the columns right so you can drop the column by just doing uh the name of the data frame drop and saying the name of the column that you want to drop so we can do that in this case we decided to drop uh the name underscore y which is just repeated uh and now there's a function that is very useful is that if you don't like the name of a column there's no reason for you to keep working with that name you can rename the column and the way you do it is you have to select the data frame and do that rename and in that rename you're going to specify in columns uh uh a dictionary okay you're going to say column and in this dictionary so before you pass a list which is just with the name of the column but now in curly braces you're going to pass a dictionary which specifies the old name and the new name here is the the generic sample but you're going to have to say like okay the name the the column is named this right now but I want to change it to this and if you use this function that rename you're going to have the column and the correct name that you would like to work with so if we run this we see that now we don't have named_x anymore now we just have the column name and the rep repeated column that was brought from the other data frame was dropped so now we have something that we can work with in terms of having all this data combined and right now we just combine two but if you have a third one that have a customer ID and have more information you can also bring that one okay and and this way you can combine multiple data frames for this class i will not request any uh project that requires you to use more than one data frame because uh it's not it's not so easy to find data that contain primary keys and unique identifiers uh so but keep in mind it might be useful for some of you for some of you in the future or if you find one uh specific example where you have two data frame you're interested in combining them okay now this is summary in conclusion what we cover logical operators filtering data probability calculations and right now we just finished talking about merging data frames so if you have any question as always just send me a message in a slack you can post your your question in the general section of the Slack if it's something more um delicate you can find send me an email or if it's something that you would like to you can also send me a private message to Slack either way Slack email is a good way to communicate with me let me know make sure that you go over this notebook and that you're curious like to inspect what each part does okay see you guys in the next notebook