Transcript for:
Data Analysis Tutorial: U.S. Minimum Wage by State

What is going on everybody, and welcome to yet another installment of the data analysis and data science tutorials with Python and pandas. In this tutorial we're going to be working with a new data set: the U.S. minimum wage by state from 1968 to 2017. So this is the least amount of money an employer can pay an employee, organized by state, and it has a high and a low value, both in that year's terms and then also calculated out using the CPI, which is like the cost of living, let's say, for a really quick term. So they've got it in 2018 dollar values, and we'll probably be working with that. We'll work with the true low, so the actual lowest minimum wage by state, just to simplify things. So anyway, this is the data set we're gonna use. Go ahead and download that, extract that, and I'll have that in the datasets directory.

So coming over here, let's go ahead and read it in: import pandas as pd, and df = pd.read_csv("datasets/Minimum Wage Data.csv"). Okay, so let's go ahead and read that in and, boom, we get this nasty, nasty error. It's some sort of UnicodeDecodeError, and we get blah blah blah. Basically what we're seeing here is that pandas wants to use UTF-8 encoding by default, but for some reason we have an encoding issue. Now, I have not confirmed this, but I'm gonna guess it's the table data. If we come over here and look, where is it, yes: table data is the scraped, unclean data from the US Department of Labor, so that's where this data is coming from. So I'm going to guess that it has some sort of issue there, which is odd, because when you scrape you're getting data in UTF-8, right? So I don't know.

Anyway, we have an encoding issue. By default, encoding is almost certain to be UTF-8, but if you hit an issue like this, the next best thing to try is gonna be Latin encoding. Let's try that, and sure enough that works, so that's what we'll go with for now. But one thing we
can do is just go ahead and save this so we don't have to deal with that ever again: df.to_csv, and let's save that as datasets/minwage.csv with the encoding of utf-8. Okay, cool. So that's just a quick example of why we might use pandas to convert things.

Next, what we're gonna do is df = pd.read_csv, and we'll just read in this new CSV, and then df.head(). Cool. Okay, so this is the data that we're working with, and what we're interested in mostly is probably this low column. Now basically the objective here, or at least an interesting thing, would be to do almost the same thing that we did in the last tutorial, but I'd like to see some sort of correlation. So this is, by state, a value that's changing over time, much like we had with avocados, and in fact you could carry on this tutorial with the avocado data set, up to this point at least. So our objective is going to be kind of the same thing: we've got this column that contains a bunch of values, and actually we want to make those all columns along the top. Now, what we did in the last tutorial is totally fine, but I'm gonna show you guys a new way to do that, the pandas way, so let's see how that works.

I'm gonna say gb, for groupby, equals df.groupby, and we're gonna group by state, so I'm just gonna say "State" there. This is going to create this groupby object, and we can do a couple of things with it. One thing we can say is gb.get_group, and then we can get a very specific group, so let's just say Alabama, just the first state that we see here: gb.get_group("Alabama").set_index("Year"), and then we'll just print out the head. Okay, so that's one way we can just grab a group. But the other thing is we can actually just iterate over the groups, so for example we could do something very similar to what we did before, but do it in less code, basically. So
we're calling this act_min_wage = pd.DataFrame(), so that's going to be the same as before, and then we're going to say for name, group in df.groupby("State"). We could also have just used the gb we saved above, but I'll do this for the sake of clarity about exactly what's going on here. So that's how you iterate: you get name, which is the specific value of the thing we grouped by, and then you get group, and that will basically be your data frame. So, for name, group. Okay.

Then all we have to ask at this point is if act_min_wage.empty. We're gonna do something pretty similar to before, but we're just gonna say act_min_wage = group.set_index, and we'll set the index as "Year", and then again we want that to be a data frame, not a series, so we're gonna say [["Low.2018"]], and then we're gonna throw in a .rename. This is how you can rename columns: I'm gonna say rename, and then columns equals (I was hoping it would look a little better as we went off the screen, that's okay) a dictionary, and inside the dictionary you put the original name and then the new name you want it to have. The original name is Low.2018 (that's becoming a challenge to see, it's unfortunate), and what we want to change that name to is whatever name is, so it's just the state.

Okay, so now we've done the rename. Let me make sure: .rename(columns=...), so I need to close off rename. This doesn't run off the screen as well as one would hope. Can we see it? Yeah, okay, so I'll just zoom out for now: group, set_index, Year, Low.2018, rename. Cool. So that's if it's empty. Otherwise, I'm gonna copy that line, and then we're just gonna say else: act_min_wage = act_min_wage.join, and in goes that exact line. Okay, hopefully... should it have another... is it there? It is there. Okay, we still can't totally see it, but anyway, let
me zoom out even further, just to make sure we've fully closed all that off. Okay, so now we'll come on down here and do act_min_wage.head(). Okay, now let me clean that up. I'll zoom back in, and hopefully we don't have any super long lines anymore, because that makes things hard.

So here we have the actual low data organized by state and then year, which is pretty cool. So it's kind of the same thing we did with the avocados, just using groupby rather than using Python logic. And again, you can always use your own Python logic to do things, but chances are there's a built-in way in pandas, because doing something like this is a super common task. You're not the first one to need to do this, so it probably exists.

All right, so now what are some other things that we can do with this data? One of the cool things that we can do right out of the gate with a new data set is act_min_wage.describe(), and this just gives us a quick rundown of some basic stats of our data frame. So in this case we can see things like count, which is just how many rows of information we have; mean, which is the average, so this is the average minimum wage over the course of all these years, in 2018 terms; std, which is the standard deviation, so we can see which states have varied greatly; min, the minimum value; then your percentiles; and then max, the maximum value. So really quickly we get a lot of information with describe, and maybe later we could use this and graph it in some way. You could just graph the standard deviation for all the states and see immediately who's got the biggest standard deviation, or something like that. Anyway, that's one cool thing that we can do to get a lot of information from our data really quickly.

One of the other cool things that we have is correlation and covariance, just instantly built in for us. So we can do
act_min_wage.corr(), and I'm gonna do .head() just so we don't print out this massive table. With correlation, this displays all the states by all the states, right? So along the diagonal, Alaska with Alaska and so on, you'll get perfect correlation, obviously. We do have some not-a-number data, so we can look into that next and figure out what's going on there, but otherwise we get correlation in... whoa. Okay, probably gonna finish this tutorial shortly. I'm surprised I didn't lose power there, that was really close lightning. Anyway, where was I? So we got the instant correlation data here. What perfect timing for my Franken-mug, or Frankenstein's monster mug.

Anyways, the one thing that is immediately curious to us when we see data like this is: why are we getting these NaNs? Why are there these zeros? We want to see, did we make a mistake, or is that just something inherent about our data? So one thing that we can do is start to look through our data and just see what's going on here. We can say df.head(), and immediately we can see Alabama has nothing, it's got no values, and even the table data just has this "..." in it, so probably something's not working. I wonder if they put the link... yeah, they do put the link here, so you can literally just click on the link and come here and see, hmm, no, there's definitely dots. It's not like some JavaScript that we would have to click, it's just simply not there.

So coming back here, df.head(). One thing that we could do is just check. We say issue_df = df[df[...] == 0], and really a lot of these value columns could work, but we'll just say Low.2018 is equal to 0. So that's our issue_df, right? And then issue_df.head(), and we can see, okay, there's quite a few of these. And then we could even find out just how many states are problems by doing issue_df["State"], is it a
capital? Yes, capital, .unique(), and now we can see, okay, these are all the states that for whatever reason we just can't get data on. So that's okay. Like I said, I don't know why that's an ellipsis here, it just is. I don't know why we're not getting that data, but whatever. I think we'll just have to move on from that.

So what we could say is we could just get rid of that data. It's no good, we have no data. This is gonna be super common in a lot of data sets, where you've got data but then you've also got a lot of missing data for whatever reason. One thing we know for sure is that all of the Alabama data is no good, and that's true for all of these states, all the Florida data, Georgia, Illinois, so there's no reason for us to continue with those states in our data set. It just doesn't make any sense.

So what we can do instead is import numpy as np, because we're gonna use np.nan. You should already have numpy installed. If you get an error you can pip install numpy, but you should have gotten numpy when you installed pandas, so I don't think I need to say that, but just in case. Okay, so now what we want to do is act_min_wage.replace, and we're gonna replace all instances of 0 with np.nan, so we're just gonna say not a number. I know zero is a number, but what's actually happening is we don't really have any data at all, and whoever parsed this decided to call that zero rather than no data, I think. I don't actually see zeros being reported here. So I think we're better off just replacing that with not a number, because we actually don't know, and it definitely wasn't zero, I don't think. Surely those states actually have a minimum wage. Who knows, I could learn something new. I'm pretty sure... wasn't Texas in there? Yeah, okay, so Texas had a minimum wage. So anyway, moving along.

Okay, so we replace it with not a number, and then we can just say .dropna, and then we will say axis=1. If axis is 1, that will get rid of columns, so if a column contains a not a number, it'll get rid of that column. The default is axis=0, which actually means rows, so if any NaN is in a row it's going to get rid of the entire row, which obviously we don't want, because then we would lose all our state data. We just want to get rid of the columns that have NaNs, so axis=1. So then we could say .corr().head(), and now we just have the states that actually have minimum wage data.

Now the next thing that we could do is we could check... let me think. We don't actually know if all of these are all zeros. For example, maybe in the 1960s Texas had no minimum wage and then later it got one, so maybe that's why it had NaNs. We don't really fully know. So one thing we could say is for problem in issue_df["State"].unique(), and then we're just gonna ask if problem in min_wage_corr... oh, did we even define it? Okay, yeah, so we just printed that out. So what I'm gonna do instead is say min_wage_corr equals act_min_wage, replace, dropna, .corr(), and then check min_wage_corr.columns. If we find that problem in there, we'll just print "we're missing something here" or something like that. But we definitely shouldn't be seeing it, because we should be dropping the entire column if it had any NaNs.

So the other question you could ask is how many zeros are in any of those. You could count the number of zeros in the state column or something like that, and that would probably give you a better idea, because if it has any NaNs we're dropping it. But anyways, we've dropped them all. Anything that had a NaN, we've dropped. Maybe we should check and see, like, maybe later on they got a minimum wage, I don't know. Anyway, cool. In fact, we could just show that really quickly, so let's do that. We could say grouped_issues = issue_df.groupby, and we
could group by "State", and then we could say grouped_issues.get_group("Alabama") and then .head(), which will just print out a few rows. And we get, okay, so now we can see the footnote NaNs, and also these are all zeros, right, because we haven't replaced them with np.nan in this frame. So because it's been filled with zero, one option we have is to just sum that entire group. If we said grouped_issues.get_group("Alabama"), and then ["Low.2018"], and then .sum(), we get zero, right, because the entire column just adds up to zero. So we never get minimum wage data for Alabama.

Then we could do the exact same thing for all of them. We could iterate over this and say for state, data in grouped_issues: if data["Low.2018"].sum() does not equal 0.0, then we missed something. Okay, none of those hit, so literally all of them always had no data in them. So that's just a quick way that we could actually check that.

Okay, I think I'm going to stop it here, and in the next tutorial we're going to talk about visualizing this correlation in a big correlation graph, which is gonna end up sending us down another rabbit hole entirely. Oh, but it'll be fun. So anyways, that's all for now. Questions, comments, whatever, feel free to leave them below. As always, thanks everybody for your support, your subscriptions, donations, all that stuff, and I will see you guys in the next video.
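
As a recap of the steps in this video, here is a minimal sketch run on a tiny made-up data frame instead of the real CSV. The column names Year, State and Low.2018 are the ones the video appears to use; the sample values, the helper name build_wide, and the variable no_data_states are assumptions for illustration only. Note that the last step sums over the full data frame rather than over a frame already filtered down to zero rows, since a frame that only contains zeros would sum to zero no matter what.

```python
import numpy as np
import pandas as pd

# Made-up sample rows (assumption); the video reads the real data from a CSV.
df = pd.DataFrame({
    "Year": [1968, 1969, 1970] * 3,
    "State": ["Alabama"] * 3 + ["Alaska"] * 3 + ["Arizona"] * 3,
    "Low.2018": [0.0, 0.0, 0.0, 15.12, 14.33, 13.54, 3.37, 3.19, 3.02],
})


def build_wide(frame):
    """Hypothetical helper: one column per state, indexed by Year."""
    wide = pd.DataFrame()
    for name, group in frame.groupby("State"):
        # Double brackets keep a DataFrame (not a Series) so rename(columns=...) applies.
        col = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018": name})
        wide = col if wide.empty else wide.join(col)
    return wide


act_min_wage = build_wide(df)

# Zeros stand for "no data" here, so mark them as NaN and drop those columns.
min_wage_corr = act_min_wage.replace(0, np.nan).dropna(axis=1).corr()

# Double-check on the full frame: a state whose values sum to zero never had data.
sums = df.groupby("State")["Low.2018"].sum()
no_data_states = sorted(sums[sums == 0].index)

print(act_min_wage.describe())
print(min_wage_corr)
print(no_data_states)
```

On the real data set the same calls should apply once df is read from the CSV, though whether every dropped state is zero for every year is something to verify rather than assume.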