Transcript for:
Data Cleaning Techniques in Pandas

today we're going to be taking a look at how you can clean up data within pandas in fact we're going to be dropping both rows and columns we're also going to be creating brand new columns or series as manipulating some strings changing data types as well as debugging an issue now the data we're going to be using today is pulled directly from ESPN should be showing you how to import that and we're going to be taking a look specifically at some of the best qriket players with the highest batting averages in test matches and with that in mind I'm going to jump on my computer right now and let's start cleaning up some Cricut data so before we can clean up a spreadsheet have to grab the data now I'm grabbing info from ESPN cricutinfo.com so I'll show you guys what we can do so I'm just going to open up a new tab over here and load this up and show you what this specifically looks like uh so we have this over here and this is different data from test matches now I can't really grab this as a CSV some sites that have data you can but this one does not right there's no option to download it I only have the option to really share it but I'll show you a pretty cool trick in Excel to do this so just grab this link which I'm going to put in the description and then load up Excel so I have Excel right here and we're going to go over to this data Tab and we're going to click from web and I was going to take a second to load and I'll show you what to do once it does so you have two different options of basic and advanced I say just stick with basic and it says URL so I'm going to enter in this URL over here and click ok and this will also take a little bit okay so after a few minutes we have these two display options so you have document zero in table one you're gonna have to click on both of them to see uh what populated or not so this one just populates element HTML table and then text null this is not what we want but then if you go to table zero uh you can see player span matches all the way over to over here fours sixes hundred stocks all that type of stuff so this is what we want and then all you're gonna do is go down here to load and then this is going to load in to uh the spreadsheet so now we have 63 rows loaded which is really cool to see and this is how it is specifically built out and I already know there's going to be a few things that we're gonna have to work on um so you can select over here we have some stars and then over here um we have some dashes we have some plus but I'm gonna update a few things in here um just to make this video a little bit better alright so instead of these dashes here I'm just gonna leave these as blank uh that way they show up as null and we can transform this data show you a little bit later it's kind of interesting um that these have no specific data in here but I will figure that out the other thing I want to do is make a few duplicates that way I can show you how to drop duplicate data uh sometimes it does happen when you're either extracting data or you know you get a spreadsheet or other form of data sometimes records are duplicated and you're gonna have to be able to solve that so I'm just going to copy a few different names over here so we have the legendary Gary sobers I'm going to duplicate all his stuff over here I'm gonna put that over here Hobbs as well very legendary so we're gonna grab Hobbs that over here probably top five Cricket player of all time and then let's grab one more so let's just go over here and see who else is in this list um grab this over here and we have three that should be duplicated and I think that'll be fine uh for this example and uh let's do that oh we'll do a more Headly as well which I'm uh still looking for one of his cards run has two of his I do not so uh we have four examples of players that have a duplicated data uh so now you want to save this as a CSV so I'm just going to save as and then I'm gonna put over here and you're going to want to make sure to save this as a CSV so you see qriket test match data we see CSV dot there and I'm just going to label this dot CSV like that and we should be good on that side of things so next we're going to want to drag this into our Jupiter so I have this over here the main Jupiter screen and then I have this over here uh just in my desktop so I'm gonna drag this over here and then you can see it says Cricket test match data.csv and then click upload and now that has been uploaded so we can actually import this in here and the first thing that you're going to want to do is import pandas so I'm just going to say import pandas as DD always do that and now we can import this CSV so I'm gonna just call this DF so we're going to say DF equals PD dot read CSV and then we gotta grab the name of it uh so if I just go to last modified you can see it's like this so I'm just gonna copy that and then throw this in here and to make sure that this works properly I'm just going to do a new line and I'm just going to put DF in here and let's do that and you can see that this has been entered now if you don't know what I'm doing to run these instead of going over here each time I'm just clicking shift enter so that allows you to run the specific cell now you can see all this data is on here which is really nice but the first thing I want to do is re-label a lot of these different serieses or also known as columns on here um because if I'm an outsider and I don't know everything about qriket I might be a little bit confused on these like this just says no this is HS I'm average makes sense on there uh bfsr like what does this mean uh so let's rename multiple of these so what we're going to want to do first is Type in here DF dot rename and then what you're going to do is open this up in here and you're going to put columns equal and then you're going to just build out a list right and what we're going to have on this side of things is the First Column right so like no I'm just say like this put single quotes no I'm gonna put a colon and then we're going to put the actual name so if we go back over here right it says no means not out so let's put that and what I recommend is you just do like this don't keep a space in there because it can get pretty annoying um well pretty easy now if I run this on here we're gonna just see how this would change it but this isn't assigning it to the original data frame uh to do that we have to put DF equals and then renaming all this but I would like to rename multiple so we already have that the next one I want to rename is HS so I'm going to put HS over here and if we go back over here HS is highest endings score so we'll just put is score okay let's go to the next one so we have BF which I think is balls faced so we go back over here false faced so again BF over here Falls East okay and then the next one Sr and then we have Sr over here is batting strike rate so here yeah [Music] strike rate and that's not BF it's Sr that's always like what the heck strike rate so Sr like that so now if we run this over here let's just make sure match mat for matches I think that's uh we'll put that over here too right so let's say mat like that and we'll just put matches doesn't hurt to get a little bit more practice uh so we have all this now right and it's a little bit long but that's all right so I'm gonna just do a shift enter and since we assigned it back to original data frame DF we should now see uh this over here with the new information so if I just put DF over here or df.head a lot of people notice you just put the up.head right this will show the top five and you can see all this has been renamed and obviously it's gonna be a lot longer now than what it was over here but that's good that means our renaming has worked now we're going to check for null values uh so the way easiest way to do that is you can just put DF and then you can put is null like this and then we're going to put dot any and this is going to show us each of the series is if there's going to be any null value specifically in it and you can see we have two that are true right we have balls face and then batting uh strike rate so I'm going to open up a few more of these cells and let's kind of investigate that so the easiest way that we can do is we can do DF like this and then run it instead inside we're going to put another DF and we're gonna just put this first one balls face and we put that like this and I'm just going to say that this dot is an a which will which is very similar it is null right it's going to show true or false so I'm just going to say equals one and this is going to give me the entire row output uh when false faced is null uh so you can see that we have two of these right we have Ed weeks batters faced and batting strike and and Walcott just like our spreadsheet so let's fix these so the first one we're gonna do is like this so I'm just going to say DF balls face equals and we're going to say the same over here right up at this time we're going to say um dot fill and a and we're just going to put a zero is that probably the correct way like 50 50. I don't know why this data didn't populate but we'll put that in here right and now that's been entered and let's do the same thing for batting strike rate so I'm just gonna put that in here and then let's run that and now let's run our data frame on the side of things and for this data frame I'm going to grab the player weeks over here so what I want to do is put another DF in here we're going to say player and we're going to say that is equal to and then we're going to just grab this and we're hoping that these two values over here are now going to be zeros so we just put that over here and this should give us our full row and we have an error on here let's see what it is this player is wrong and that's because it's supposed to be a capital P so let's put that over here and now you can see it's zero for both of those and I just to play The Devil's Advocate let's put wall cut on here so grab that and put this on here and you can see that is also zero uh so we successfully replaced these n a's with zero which is good because now we can run some calculations on these columns if we really wanted to but now I'm going to show you guys how you can drop duplicates in here so I'm just gonna remove that one cell and we're gonna open up if you uh for the duplicate side of things so we can do first is put df.duplicated and if you run that you can see that most of these are going to be false but we do have truths and it's starting off with four um and that makes sense because we updated that spreadsheet to have a few uh duplicates so kind of like before where we try to figure out what's n a we're going to do the same thing with duplicated so all you have to do on this is I'm going to say DF open this up DF and I'm going to put player in here make sure it's a capital P so you don't get an error and then it's a DOT duplicated like this and we're going to say that's equal to one right like that and then you can see we have four duplicates Hedley sober's OBS and holy and if you wanted to show all eight results all we're going to do on this case is we're going to copy this over here instead of dot duplicated we're just going to put dot is in and then what we're gonna do is throw all these different players in here so watch all I have to do right keep going like this and like this so now when I run this you can see there's gonna be eight results so edley sobers Hobbs and Goldie so not good right or this is all been duplicated but we can drop these now so all you have to do is DF dot drop underscore duplicates duplicates this and this is going to drop the different rows so now you only have one of each of these so we run that you can see our data frame now is to 63 but we have to assign this back to our original data frame uh so just put DF equals df.drop duplicates and then what we're going to do is run this same one as above just to show you that there's only four this time and you can see it's been cut in half we have the one earlier which is eight and this time we have four uh so it is working properly so now I would like to start manipulating some of these different columns over here and creating new columns in series so the first thing that I would like to do is split up the span into like a start and end date um and eventually I want to do a career length as well but we're going to do that a little bit later so I just deleted that don't worry about that for now but I don't like this span because if we wanted to pretty much find players that started like the 1930s or 40s or before a date uh just not the most ideal way instead we can have two new columns so we're gonna have to figure out how we can split the span because if we just run this on here and just put over here like span right this is what our output is going to be but uh luckily there's a pretty easy way so we can do Str like this for string and we're going to do split that's another thing these are strings right now uh they're not integers which would be ideal uh for some calculations so what we can do is say pattern like p-a-t and then equals and we want to split on this over here so just put that right here like this on the dash and now you can see it's been split properly right uh the first part over here we have 1928 and 1948 for Donald Riven the goat of cricket and then 2015 2016 1963 1970s so we know that works now but how are we gonna assign you know our start your or like a rookie season and then our final year well let's think about that for a second so let's say we're going to have a DF and we're going to create a new column so let's say like we'll have rookie year right so rookie and we'll put rookie here so that's the first year uh that someone place and we're gonna say that is equal to this over here uh but to grab that first part all we have to do is put dot Str 0 uh so that'll grab this side and then if we wanted to grab the other side it's going to be one so before I run that command I'm just going to show you what that looks like just to make sure that also works too I don't want to have a bad data frame but you can see Zero grabs all these on here so it works properly and then if I wanted to grab one which is going to be on the right side right 1948 now so that's working properly let's say rookie year is that zero and um a value is trying to set on the copy slice don't worry about that too much right and now we're going to do the same thing uh but we're gonna say final season or final year doesn't really matter and we're gonna do string of one and run this again and if we go back into our data frame you can see now we have rookie year and also final year over here so this is working properly which is really good to see but we really have no need for this span anymore so let's get rid of that so to do that all you're going to do is DF dot drop and then what we're gonna do is inside over here we're gonna put span so span capital S right because it's not lowercase and then we have to determine what axis so since we want the column it's going to be axis equals one so axis equals one and then if we shift enter you can see span is now gone and then again go back to your original data frame so DF equals df.drop span and if we want two we can go back over here DF dot head and you can see span is now gone and we have rookie year and final year the very end so here's a question for you guys and there's probably a few different methods on how to fix this up but I want to update this player um and there's a few different reasons on this I don't like having the country code underneath on this side things at all if this was expanded be over here to the right but I just want the player name I don't want the country and said I want to have another column over here for Country so that way we can run some calculations a little bit later so if you want to pause the video do it right now and solve this uh so essentially I want to have player just have the player name and then another column or series that says country otherwise I'm going to show you a pretty similar way to like what I did over here uh to be able to accomplish this okay so what I'm gonna do first is I'm gonna be lazy and copy this code over here and show you exactly so this time we're looking at um this over here so I gotta change it like this right and I'm not going to do both sides we're going to run this twice so you can see what happens on there and I get an error because I meant to put player not span so you put player in here and then you can see great we have Donald Bradman that is working correctly on the left the right is not working only yet but that's okay um so what we're gonna do on this side of things is first we're gonna name the country so we're going to say DF country and that's going to be equal to and I'm just going to copy this here and like last time one is going to put over here to the right so I'm gonna just say Str one and if we run that and we just can do DF country like this right we have this going on and we can fix that a little bit better uh so all we need to do again is kind of like the same command as earlier so I'll show you how that works so we can just do DF country like this and we're going to dot string split and this time we're going to do the right side since we did the left last time and you can see now we just want to grab the left so we can just say that's string zero grab string zero and I'm just going to say that country equals dots and then if we run the F country great we have that working so it's good um but we still need to fix uh the player on here because if we run player it's still going to show the country code this one that and I'm just gonna put a few more over here right we still see the country but I kind of show you how to do it a little bit earlier if we go like this all you have to do now is assign it back to dfplayer so DF player like that and I'm just going to copy it instead of typing it all out equals this and then if we just run our data frame again so DF the head that's you can see we have player which the country code is gone but although at the very end we do have the country so working great right now this data is already looking way cleaner but next we need to fix up these data types because they're not all going to be correct so all you have to do to see the data types on here is put df.d types like this and then if we run that you can see player is an object matches is also an object which I think this should be an ins not that so we might have to fix that we have some integers over here which is good highest ending score is an object also so that should be an ins then balls face in batting strike rate so betting strike rate should probably be a float not an object uh Ball's face that's a singular number but that should be an inch but you can see that we have some issues with that plus over there all these are ants which is good rookie year final year should both be ants and then country it's fine like that so we have some work to do right we got to figure out why um it's not working properly so let's take a look at one example over here so the first one I want to take a look at since it's pretty obvious is this highest in score so that's this one over here it says it's an object so the reason why it's an object is if you see this star so the data for these Stars if you go back over here just to take a look we have a start right there multiple Stars over here and I'm trying to see why I mentioned that there's a star in there it doesn't say anything why that star is in there so I'm just gonna remove it so if we go back over here let's go back to what we did also like this string split So currently this does work as a string but we can fix that so copy your code on here and what we're going to do is grab this highest end score right and in a string split and this time we're just going to put a start right and put it like this now this doesn't look like it's working properly because it has that comma but if you remember from before the combat is just going to show you uh where it's splitting at in the second part is to the right and since we are splitting on that star it's just going to have a blank on that side of things so if we put dot str0 so like this it should show everything properly so just run that for a second and you can see uh that star is now gone so again kind of like earlier let's just assign it to itself let me just put that over here equals that and now this is going to be working properly we we can just copy and paste this here you can see it's gone right we had it originally at 269 but 269 is like that so it's cool now we need to specifically say that it's going to be an integer so the way you can do that is dot as type like this and just call an ins in here right so ins on here and you can see down here it says d-type is int 32 now instead of object Rose before so now we just have to assign it once again back to where it was so ending score equals that is going to be gone now just put df.d types and you can see highest ending scores now is an INT 32. okay so question for you guys let's see if you can fix either rookie year and final year so if you remember from earlier um we built out those new columns rookie are in first year so see if you can convert those I'm actually going to show you a pretty cool shortcut on how to do that um so all I have to do is DF equals DF dot as type like this and inside over here what we're going to do is fill out a list so I'm just gonna say rookie year and then I can say colon we're going to say this is INS and we can do the same thing uh with final year put that over here finally you're and then if we just run the data frame again for df.d types you can see that both of these are in so we have three more that we need to work on now so we still have matches uh we solve balls faced and then we have batting strike rate so if we just take a look real quick at batting strike rate so I'm just gonna put DF the head um batting strike rate needs to be a float so a little bit different um so again we can just kind of do copy this over here and grab the batting strike rates that's and then we're just gonna enable this as a float and does not work it says could not convert strain to float so we'll have to take a look at that a little bit later so let's still convert matches so like this right just copy this one over here I'll put matches matches as type ins and we have an issue 98 Stars so we haven't got rid of everything with the Stars yet or like kind of weird formatting but that shouldn't be too hard we can just copy this on here for highest innings and what we're going to do is just matches right notches that will grab the left side of it and then we build out a few other lines let's copy this over here and it should fix matches now so if we run the f.d types so object for player right that's fine of all the insta over here uh Ball's face is still an object batting strike rate is still an object and then we have integers over here up besides country so we still need to fix holes faced and also batting so for balls face Let's see we have this plus icon over here so let's split that out um so just grab that over here like this spaced and we're going to put a plus icon this time so that should solve that if we just copy this and we're having an issue so this could not convert float Nan to integer I thought and I thought we converted all the nans let's just go back over here and see why that's the case so if you remember from earlier with our code we have it right over here so let's see what's going on so balls face does have a um null still which might have happened when we're doing this string split so I'm just gonna copy this and honestly it might just be best that we drop those rows if we get some issues still we're just going to copy the same code we wrote earlier and see what's going on right uh because we're solving issues with Ball's face and it looks like false faces and Nan again so I thought we'd set that to a zero not sure why that's still happening right so could be from this but let's try this we're going to say again to fill n a is zero so I'll put another line and we're gonna just see if this still exists and there's none in there but we're still getting an error it says invalid literal uh so we're still having errors so I put a pd.set option display Max rows none I'm gonna take a look and see why we're getting an error um so we're gonna look at Ball's face so we have this on here and it looks like we have all integers um oh here we go so FS Jackson we have this going on um and then you can see also we have that for a batting strike rate did we miss that earlier or how did that happen you know what happened um I think when we're in an Excel sheet and we removed those we forgot FS Jackson so you can see remove that so let's remove that uh row and um that is why we've been having so many issues man so now we're gonna drop this FS Jackson so all you have to do is DF dot drop and then it is number 56 I'm gonna say axis equals zero like that and say DF equals that think that should have dropped it so you run DF again you can see 56 is gone so now the nightmare of getting these fixed should be done so let's just go back and see um we'll first check if we have any other is nulls right so first all false okay and we already see this over here which is good so now let's just fix these last two right and we had the the F Type which is right here run that again and um Paul's face in batting strike records so let's just copy our code from earlier man uh small mistake goes a long way right so balls face this needs to be an INT just grab that this should work perfect now and it does right and um we'll do the batting strike rate as float and check that out that worked too let's run this one more time df.d types and all of the fee types are working properly so now we can actually answer some of these questions before that though um we want to do one last thing so find the career length so all I'm going to do on this one is DF career length and that's going to be equal to DF final year if final year minus the f rookie year and if we just run this data frame again you can see our final year over here career length we have twenty one seven so we can start answering some of these questions down below so question one I want to see the cricketers in this data frame uh what is the average grower length and we just solve this over here so all we have to do is DF and then all you're going to do is dot mean right we run that the average career length is 13 years and and since this is some of the best cricketers out there it makes sense why 13 is the answer uh question two is average batting strike rate for cricketers who played over 10 years so it can do a few different ways um so the first one we can just do like this pretty easy so what we're going to do is play for 10 years so the F career length right throw that in the middle has to be greater than 10 so put 10 in here and then what we are grabbing specifically is the average batting strike rate so we go back into these columns we have batting strike rate so just copy that and this will first show us let me put single quotes around it all the batting strike rates for all these people right and um just put over here dot mean and it is 49. question three by number of cricketers who played before 1960 uh pretty easy so DF we're gonna put DF we're gonna say rookie year and we're gonna say that's before 1960 and we have an error and the reason we have an error is because that is not correct and you can see all these over here uh the cricketers but now if you wanted to find it like a specific count you could just throw a DOT count out then it would show you the counts of each of these columns um you could also just do something like this you could do like player that counts and you can see 22 show up so next we're going to do a group by so it says Max highest ending score by country so if you guys remember we built up uh this country over here so this is why we did that so DF dot Group by rate and then we're gonna Group by the country I think it was capital c just unchecked yeah Capital C right I seeing score so we have this over here so just grab it that's and then just put that Max so dot Max and you can see specifically how these rank but if you wanted to sort them I'm just going to say two frame first and we're going to say this is High ends country something like that I don't know doesn't really matter and then reset index that and then dot sort values and then just throw this in here all right and then you can see how that works but if you want to change the direction you can say ascending people's false and then there we go and lastly um I just want to see the 150s and Ducks averaged by country so we can actually copy a lot of this over here I'm just gonna copy this and then first we're going to change this into mean and then this needs to be changed so since we're grabbing multiple a series is now I'm gonna do it like that and first one's gonna be 100 next we're gonna have 50. next we're gonna have zero and then dot mean like that and then you can see the average uh for each of these different countries hope you found this video helpful and also you learned a few new things if you did make sure to subscribe to the channel I really recommend though that you check out this video over here where we solve some python interview questions uh because it mainly uses pandas