Transcript for:
MySQL Data Cleaning Project

hello everybody and welcome to the very first project in the my sequel Series today we're going to be focusing on data cleaning now if you don't know what data cleaning is it's basically where you get it in a more usable format so you fix a lot of the issues in the Raw data that when you start creating visualizations or start using it in your products that the data is actually useful and there aren't a lot of issues with it so that's really what data cleaning is now what we're about to do is create a database we're going to import a data set this is a real data set and what we're going to do is we're going to clean the data so I'm going to show you and walk you through all the steps in order to clean the data the data set that we're going to be working with will be in the GitHub so you can just go and download that I'll have a link somewhere in the description but let's get started first thing we're going to do is create a new database so we'll go right over here to create a new schema and we're just going to call this one we'll do this is World underscore layoffs so if you can't tell already we're going to do World layoffs that's the data set that we're going to be doing we'll just click apply and that creates our world layoffs right here now we're going to go into here there are no tables we're going to right click on tables and go to table data import wizard now we haven't done this yet in this series we have an import any data but that's what we're doing here we're going to show you how to import data so we'll go ahead and click browse and as you can see right here we have this layoffs data set let's open this up and we're going to click next and we're going to create a new table there's no existing table in this database you can drop it if it exists if you'd like to it doesn't matter if this is new we're going to go ahead and select next now right here is where you configure import settings now MySQL is going to automatically assign a data type based off of the data in these columns so we'll take a look at the data later now there is one thing that you can take a look at real quick we have this date column now in here it assigned it as a text that's because the format we are going to import this as the raw data we're not going to try to change anything in the import settings we're just going to assume this is how the data was in the table so I'm not going to change anything although this may be something that you would want to change to something like a date time and go and fix that but we're going to import this as the raw data let's go ahead and select next we're going to import it we just select next now this could take a little bit so while this is importing I'm just going to skip ahead this should take just a few minutes to import all right this just finished let's select next and we imported 2 361 records let's go ahead and select finish let me get rid of this and let's refresh this perfect we have our layoffs table so we'll select everything and I'm going to go and double click on the world layoffs because I don't want to write out the whole thing every time so we're going to say from layoffs and let's see what we get so let's take a look at the data that we're going to be working with in this data cleaning project so this data set is layoffs from around the world starting I think 2021 and we'll take a look at that in the state column later but it has the company so it has the company that did the layoffs it has the location of where they are what industry they are part of how many they laid off the percentage that they laid off so the percentage of their company the date the stage which refers to the stage that the company is in whether it's a series B post IPO they don't know then there's the country and then we have funds raised Millions so we have a lot of information here and in the next project we're going to be doing exploratory data analysis so we're cleaning all of this data and then in the next lesson we're going to actually dive into it and try to find Trends and patterns and all these other things so what we are going to do is we're going to go through multiple Steps step number one is we are going to try to remove duplicates if there are any that is the first thing I typically do especially if I know this data shouldn't have any duplicates or it'd be you know repetitive or unnecessary to have duplicates the second thing is going to be to standardize the data that just means that if there are issues with the data with spellings or things like that we just want to standardize it to where it's all the same as it should be number three is we'll look at the null values or blank values and there's a lot of null values in here there's even a blank value right here and we're going to see if we can populate that if we can and there are times where you should there are times where you shouldn't I'll kind of walk through that as well and lastly we want to remove any columns and rows that aren't necessary and there's a few different ways to do that this one is a little bit you know let me write this actually real quick remove any columns so I'm just going to say there are instances where you can do this there are instances where you shouldn't do this when you're working with massive data sets and you have a column that's you know completely irrelevant completely blank you don't have any ETL process that is required for it you can get rid of it it can save you time when you're querying your data now with that being said and we'll talk about this later in the real workplace oftentimes you have processes that automatically import data from different data sources if you remove a column from the raw data set that's a big big problem so what we're going to do is something I would actually do in my real work which is I would create some type of staging or raw data set let's say this one's our raw one and we could have even called this layoffs underscore raw we're going to create another one we're going to create a table so we'll say create table and let's call this one layoffs underscore staging and we literally just want to copy all of the data from the raw table into the staging table so we can do that really quickly by just saying like layoffs and if we run this and we refresh you'll see we have the staging database and let's copy this here we go we'll do layoffs underscore staging and so now we have all of the columns and all we have to do is insert the data so we're just going to say insert then we're going to say layoffs staging right here and we'll select everything from layoffs and let's run this and if we select the table we now have all the data over so super super easy and now we have these two different tables now again why do we do this is because we're about to change the staging database a lot if we make some type of mistake we want to have the raw data available this does happen this is something that you do in the real workplace because you're not going to work on the raw data it just you shouldn't do it it's not best practice so I'm going to show you what I would actually do in my you know like a real job so that's what we're going to do now we're only going to be working off the staging database and we can copy this and make different databases for different things as long as we have our raw data we can really do anything we want going forward and that's what we're going to do so the number one thing we're going to look at is to make sure that we are removing duplicates we want to make sure we don't have any duplicate data in here and if so we're going to get rid of it now really quickly if you did my Microsoft SQL Server project we did something very similar but we had an extra column over here that gave the unique row ID which made it really easy to remove the duplicates here there is no identifying Factor that's going to be easy for that so I'm just going to tell you up front removing these duplicates is not going to be easy but we'll walk through it every step of the way so what we can do is try to do something like a row number and we'll match it against all of these columns and then we'll see if there are any duplicates now I'm just we're starting off strong okay we're jumping into kind of some of the more advanced things it does get actually easier as we go but this is the actual order that I follow so I'm going to keep it so let's try to identify duplicates so let's copy this let's pull this down do underscore staging there we go now what we can do is we can do row number and we'll do that Partition by basically we could do every single one of these columns that's kind of what we're doing so what we can do is we can say everything then we can do a comma and we'll say row underscore number and would be just like this and we're going to do this over and we want to Partition by all of these columns essentially we could just do a few for now to see if we get any hits and then we can look at that but they're going to be multiple companies that have layoffs in the same location in Industry although they're total laid off would probably be different the date would probably be different so if we do something like uh company let's do industry we will do total underscore laid underscore off I'm a percentage laid off and then let's do date now I'm doing date with the back ticks because date is a keyword in my SQL so if we do it like this it just really makes it easy so we're going to Partition by all of these things so let's do Partition by and let's bring this down real quick so I'm just going to say over Partition by and we're going to call this as row underscore num now let's try running this let's see if it works really quickly it's important and over here you can see that we have our row number now these mostly are unique and these all look unique I'm not going to scroll through all of them but we want to be able to filter on this so we can filter where the row number is greater than 2. if it has two or above that means there's duplicates that means there's an issue so let's go ahead and we're going to take this we'll put it into either a sub query or a CTE I'll create a CTE for this because it's really easy so say for or not four we'll say with then we'll do uh duplicate underscore CTE as then we'll just do our parentheses we'll paste this in here get rid of that right there and now we're going to say select everything from this duplicate CTE then we'll say where row underscore num is greater than 1. I also run this and a semicolon let's run this and you can see that these ones have duplicates so these are our duplicates actually and we want to get rid of these exact rows now just to confirm that these are the duplicates let's look at this one I've never heard of this company um but we'll take it really quick and let's select we'll say where company is equal to we'll call this Oda so let's run this and it looks like these oh no no no these aren't duplicates that's a good thing we checked okay because it looks like um these aren't the exact same although they're very very close these technically are not duplicates so I'm glad we checked this we need to do this Partition by over every single column that's what I'm realizing so we'll do company comma location I'm glad I'm genuinely glad we're you know it's good to make mistakes um and figure things out as you go it really is important so company location industry total Aid off percentage laid off date then we'll do stage and then we'll do entry and then funds underscore raised underscore Millions so we're changing the CTE to partition over everything so now let's run this okay Oda is not in there that's the only one we checked um but let's look at Casper I know this is these are the um aren't these the mattress people didn't know they had layoffs before guys um all right let's take a look it looks like this row and this row are duplicates these are our duplicates so we're going to want to remove only one of those we don't want to remove all of those so um just looking at this one example it looks like this uh query is working well so here's our duplicates now we need to identify these exact rows we don't want to delete both of them when we looked at Casper there's the real one that we want to keep then there's a duplicate that we want to remove we don't want to remove both that would be bad now in my sequel It's a little bit trickier to remove things than it is in something like Microsoft SQL Server post Grace SQL they have different ways that they can delete rows for example a Microsoft SQL Server we could literally identify these row numbers in the CTE and delete them from it and it would delete it from the actual table we can't do that in my SQL and I'll show you let's actually copy this do we'll go like this and we'll say let's say we want to delete these we'll say delete from or deleting this from where the row number is greater than one what am I writing right here delete there we go so delete from this duplicate CTE where the row number is greater than one that's all these duplicates we want to remove them let's try to do this let's run it let's go down if we look at the bottom it says the target table duplicate CTE of the delete is not updatable so you cannot update a CTE a delete statement is like an update statement essentially so what we are going to do is we're going to do something a little bit different because this is how I would love to do it that makes it super super easy to remove duplicates but that is not always the way that things happen in the real world I think what we should do is take this right here and let's run this we should take this right here and put this into let's say a staging 2 database and then we can delete it because we can filter on these row nums and we can delete those which are equal to two so it's essentially like you know creating some type of table and then just deleting the actual column so we're that's exactly what we're going to do so it's essentially just creating another table that has this extra row and then deleting it where that row is equal to two so you know somewhat fairly straightforward but um let's try it and let's see what happens so we're going to come down here and do is create our table let's try doing that with here Let's uh let's copy the clipboard a create statement let's see if this works perfect that's exactly what I wanted now all we're going to do is say we're creating the table layoff staging 2. now this is a create table statement and we're naming the columns and then we're also assigning the data type so we have all these things but we want one more and do a comma and we want to add row underscore num and I need to underscore num and that should be an integer data type so we'll just keep it just like this let's go ahead and copy this and let's run it see if it worked bring this up looks like it worked properly and let's say let's go back up I want to rewrite things that I don't have to let's run this so now we have this empty table so we want to insert this information right here so we're going to insert into so we'll insert into and then we'll do this right here so insert into staging two now let's try to run this see if it works and let's run it and let's select that table and now we have it so let's pull this back up and I'll walk through what we just did because I know I'm going quick but we have so much to cover um in this lesson so we just inserted basically a copy of all these columns but in this new table we added one more the row num so now we can filter we can say where when you spell that right where row I'm just going to num is equal to 2 or we should should say greater than one because some might have multiple duplicates and there you go here are our duplicates now we're going to delete these so all we have to do is come right back down Where'd I go copy this come right back down here I'm just going to say delete from we just did a select statement I always recommend doing that to identify what you're deleting then you change it to delete and now if we run this go and I'm actually going to keep this um let me see there we go and let's run it again and now they're gone and if we say um just the whole table this looks wonderful now this row num is going to be a column at the end that we probably don't need anymore right it's a redundant column it adds up extra space in memory and storage and all these other things and processing times we're just going to get rid of it that'll be at the very end I'm sure so it looks like we are good to go that's how we remove duplicates now um there are different ways to do it when you have different columns like if you have a unique column over here makes it so much easier so so much easier but we didn't have that so we had to kind of do a workaround uh Welcome to the Real World now let's look at standardizing data so standardizing data is finding issues in your data and then fixing it so I'm already noticing right here looks like we have a space at the beginning we could easily just do a trim on this column um unless I I don't even think I was um I did this when I wrote out all the the scripts for this let's just do from this table why am I writing it all out again we actually want to select the company and then the or actually we'll just do distinct company distinct company let's run this and if we do a trim around this let's run this again and that looks better so if we do a company company comma and then we'll just do the trim I don't wanna we don't need to do this thing right now we'll do the company this just looks better so we're going to update that uh it's super easy now if you ran to nichu just a second ago uh I may need to help you change that so if you couldn't update or delete those things earlier I should have told you this earlier I apologize all you need to go is to edit you just need to go to edit go to preferences at the very bottom go to SQL editor go all the way down to the bottom and right here we have safe updates on if you have this selected that means you can't update anything that's a problem so what you need to do is Select this or unselect it like I have it and save it you may have to even restart your MySQL potentially in order for those changes to take effect but then you should be able to update that now all we're going to do is update this table and we're going to set and now we need to come back here and we'll say we're going to set the company equal to trim now if you don't know what trim is or you haven't taken that lesson trim just takes off the white space off the end so it took the white space out of here or off the right hand side as well so we're going to update this and let's do a semicolon a semicolon let's run this let's select this again and it was updated properly so we're already off to a great start now the next thing that I want to take a look at is the actual industry so let's go back copy this and let's take a look at the industry so we'll do industry and we'll run it now if you look in here there's a ton of different Industries um and there's marketing and marketing oh because I haven't done distinct please ignore me let's do distinct and there's a ton of different Industries in here Transportation Healthcare consumer there's a blank one which we'll take a look at Aerospace there's a lot of really unique ones let's actually order this we'll do order by uh and let's just do one which is the first column we're just ordering our own self so we have null we have blank that's a problem we'll take a look at that later um but this is an issue crypto cryptocurrency and cryptocurrency these are all the same thing these should all be on or labeled the exact same thing the reason we need to change this is because when we start doing uh the exploratory data analysis visualizing it these would all be their own rows or own unique thing which we don't want we want them all to be grouped together so we can accurately look at the data let's take a look at any other ones fintech and finance that could be the same thing I'm not 100 sure I'm not a fintech person um I think for now the only one that I'm confident in changing is this one right here which is cryptocurrency so let's go ahead and update that so all we have to do and we need to actually let's select really quickly where it's like crypto so we'll say uh where industry and we want to select everything where the industry is like and we'll just do crypto because they all start with crypto right yeah we'll do crypto just like this and let's run this and let's just take a look a lot of layoffs in the crypto industry good all right let's find where it's cryptocurrency okay so even this one it's crypto and I know Gemini crypto crypto and then it says cryptocurrency so these should be all crypto you see how 95 of them are crypto so we're going to update these other ones oh this one is c-r-ypt how do you spell crypto geez I don't know anything all right so we want to update all of them to be crypto so what we're going to do is we're going to say update layoffs industry 2 we want to set the industry equal to crypto just like this where and we can do it a few different ways we can say industry I think we can do like let's try this real quick I I some of the stuff I don't have planned out I'm just kind of going with it as we go which I like better you know we kind of we work together on this we figured these things out together that's what I like um then we'll do like crypto just like this exactly like we had it up here so if it's like crypto it should be crypto let's try this let's see if it ranks it may not have I can't remember yeah it worked okay so it updated uh three rows and that looks correct now let's go back up and let's run this as we scroll down there all the exact same beautiful beautiful beautiful so if we do uh distinct industry again let's get rid of this if we run this query and we scroll down crypto is its own thing beautiful and it looks great we can look at those later on how we can update those um but let's keep going let's look at our whole table again and these blanks in these nulls are actually an issue we do need to deal with them but I I my instinct is telling me go fix it um but my you know tutorial side is saying okay stick with the tutorial the order that we agreed on um so let's go take a look we've looked at company we've looked at industry um let's just real quick look at uh distinct location now it's good to look at most of these things right there could be small tiny issues that you just never saw and we're just going to order by order by one just do it real quick just a scan to see if we find any issues um that could be an issue but that could just be another language if I'm being honest I don't know as I'm just scrolling through here because I want to make sure because this is not something I had in my pre-written script this looks pretty good to me um let's do everything we'll run this and now let's look at Country so we'll do distinct country let's run this and let's scroll down again this is sometimes just what I actually do all right we got an issue right here super common somebody put a period at the end some Dingus uh and we're not gonna judge that person I don't know who it was or who ruined this data set but um that's a problem so we're going to need to just update that it looks pretty simple but I'll just say where country is equal to or let's say like and then I'll say like United States there we go and oops I want to say select everything I just want to see um where it's at oh geez there's too many let me see if I can spot it I can't spot it it looks like they're supposed to be United States not United States dot that's the issue we can easily easily fix this and we can probably let's do um really quickly let's do select oops select distinct and then we'll do country comma and then we'll do a trim because we want to get rid of that um that one we'll do country now just doing the trim won't fix it let's go to the bottom so doing the trim doesn't fix it but here's what you can do it's a little trick of the trade here we're gonna do something called trailing which means coming at the end so what's trailing the period from country let's try running this scroll to the bottom and it fixed it so this is a little um a little Advanced little tidbit for the trim here we can do trailing from the country and we're looking for something that's not a white space we're specifying we're looking for a period so now we can do is we can say update we can set the country the update um this table oops and we'll set what am I doing what's going on here we'll set the country equal to and we'll do it just like this we're only going to do it for a country right uh so we'll say is equal to trim and we'll say where country is equal to or actually let's say like and let me see if I have this I don't let's let's just say like United States like we had it before just like this so let's go ahead and update this after I put my semicolon in let's run this and let's run this again it shouldn't need to fix it anymore it's just one row that's perfect that's exactly what we wanted now one thing that's really important uh and this is you know this is a longitudinal that's not the right word at all give me a second I can't I can't speak and write at the same time so sometimes I just say uh dumb things um uh if we want to do not longitudinal but um time series that's the word I'm looking for if we're trying to do time series um exploratory data analysis time series visualizations later on this needs to be changed right now it's text and we can look at that by going right actually let's refresh this I'm not looking at staging looking staging two if we look at the columns and we come down here to date it is a text column that's not good if we're trying to do uh time series stuff we want to change this to a date column now how can we do that let's take a look so let's do date and let's not actually do it like that let's do date backslash so we're just going to look at the date now let's change this because we want to format it how we want to format it with it which is month day year so how can we do this well there's something that's very very helpful works perfectly in this situation and is exactly what we're going to do it's called string to date so we're going to string underscore there it is right there underscore 2 underscore date it literally helps us go from a string which is a text that's the data type to a date so it's perfect now all we need to do is pass through two parameters we have to pass through the column which is the date column and then what format we want it in now if you haven't done date formats before I'm going to kind of walk you through it while we're looking at it in order to format this properly you use a percent sign so it's going to be a formatting for a month a lowercase m a capital m is something completely different I believe it's spelled out I need to I we can look at that in a second if we want to actually and then we can do this right here and then we'll do another one so we're formatting it in the way that we want it but also converting it to an actual date column so now we want month then we want day lowercase day we'll do a forward slash and then another percent sign and then a capital Y which stands for I believe the four um four number long year I have a let's just um let's look at this real quick so it worked perfect so we're it's taking in this format that it's in right over here and converting it into the date format so this is the standard date format that you're going to find in my SQL now let's see what happens really quickly just for fun uh let's see if we do capital M uh it looks like that's not going to work at all uh let's do lowercase Y and [Music] um just formatted it to 2020. I think it took the first two numbers it looks like I don't know why it's doing that if I'm being honest um but if we keep it with the capital Y as we should this looks perfect this looks exactly like what we're trying to do so you can mess around with it it depends on the how the data is formatted in your original column when it converts it to the string to date and there's a lot of different stuff you should just look up date formatting in my sequel really interesting stuff so we're going to update this date column to this which is our new date column let's go ahead and do that we're going to say update you guys should be getting used to this by now that's the whole point is getting used to doing these things so we're going to set date equal to then we're going to put in this right here the string to date go ahead and do this and let's run it make sure it worked 2355 rows it looked like it did every single one but let's go ahead and get rid of this and let's run it and it looks like it worked perfectly now there were some nulls it looks like and that'll be something we have to look at later when we talk about nulls but um overall I believe this looks proper now if we refresh this it's refresh let's come down to the date you'll notice it is still a text it's date it's called text but now it's in the date format now that's really important and maybe I should have done that earlier if I'm being honest um try to convert it to a date column it wouldn't work it would give us an error you just have to trust me on that one but now we can do it where we can change it to a date column so let's do alter table now only do this never ever do this on your raw table only this on things like a staging table because we're about to completely change the data type of the actual table so we want to change the layoff staging to and then we're going to come down here I'm going to say modify column now what column are we modifying it's this date column there we go and we want to change it to what data type a date and am I spelling this right yeah I just need a semicolon here remember I see an error I always yeah just look for the semicolons so let's go and run this and let's refresh see if it worked and the date was changed to a date which is perfect that's all we wanted to do uh just to make sure we were doing what uh or we'll set ourselves up later in the future really well let's look at our table all right this is very good so we fixed a few uh just issues with the company I believe something with the industry all the cryptocurrency we change the country I'm just going to go ahead and tell you right now this one uh we're not going to look at until we look at the um nulls and whatnot in just a second so we're not looking at that one yet and then we have this extra column that we've done so we've done a lot so far but the next thing in the process step one was remove duplicate step two with standardization step three is working with null and blank values now this is going to happen you're going to have nulls and you're going to have a blank values in here I it's somewhere um it's just going to happen and so we need to think about what we're going to do with that information whether we want to make them all nulls make them all blanks try to populate that data let's see what we're gonna do so let's start off with the total laid off we'll just do a where total underscore late underscore off is null so in order to look at the null we say is no let's try equal to null it's not going to give it to us we have to say where it is no so we have these values these are completely null uh there's quite a few of them but remember this is also useful information but if they have two nulls uh that probably is pretty useless to us um that's something I think we'll take a look at in a little bit actually we'll say and we and we may save this query percentage uh laid off is no so if they're both null like these these are all I believe fairly useless to us these might be ones that we remove so let's actually look at this um in step four we look at removing rows and columns well one thing we should take a look at I remember this industry let's do uh industry too distinct this industry had some missing values and let's take a look at that okay so we have a missing value and we have a null here so let's look at this query and let's say where industry is null or do industry is equal to a blank like this we'll select everything to run this all right so it looks like there are a few that are blank now what we can try to do is see if any of these have one that's populated let's take Airbnb for example let's search for this really quickly and this is 100 percent um you know it's just helpful it's really really helpful to be able to populate data that is popular populatable is that a word um let's try it so we'll say select everything I just want to do where spell that right where company is equal to and let's do Airbnb there we go let's run this and it looks like we have this one right here so for example um these whether they have them or not we're going to try to populate these if this Bally's or carvana or Joule had multiple layoffs these ones should if these ones aren't blank if they have one that's not blank we should be able to populate it for example um not the one I was trying to do if we look at Airbnb this one has travel so we know this is the travel industry so we can populate this with travel again we want this data to be uh the same so if we're trying to look at you know what industries were impacted the most this row isn't going to be affected or this row won't be in our output because it's blank we want that to be travel to represent the data properly so we want to update it so if this one has travel we should be able to update this row with this travel right here so let's see how we can write this and let me give myself some rows right here all right now what we're going to need to do is try to do a join here so let's try writing out in a select statement and then we'll just change it to an update if it works so we're going to select everything and we're going to do this from staging two from staging two and we'll call this st2 and then we'll join on itself because what we're going to do is we're going to check in this table does it have one that is blank and not blank if so update it with the non-blank one that's essentially uh in layman terms what we're trying to write but writing it out could be a little bit more difficult um so we're going to join on itself and we'll call this let's actually call this table one T1 and T2 because they're the exact same table and we'll do this on and we're going to say T1 dot company is equal to T2 dot company so the company has to be the same that's important and we probably should do the location is the same as well now we'll do and T1 dot location is equal to T2 dot location I'm imagining you know there's another Airbnb in like South America somewhere that's called Airbnb but you know I'm just imagining a scenario right where you have to think about different use cases rather than just large companies so those other ones they may have ones that are in different locations we don't want those um we don't want to change them if they're not the same so these are the same now what we want to find is we're going to say oops we want to say where and we'll do T1 dot industry is null and then we want to check that T2 dot industry is not null we'll say and T2 dot industry is not null and let's just run this let's see if we get anything so let's think this through because we got nothing in our output we're selecting everything we're joining on the company and the company and the location where T1 industry is null and T2 industry is not null let's just get rid of this for a second I just want to see if this changes anything it doesn't and it's possible actually that instead of doing is null we could do or in this I'm glad we're walking through this we can do or is equal to blank and let's try running this there we go okay so it looks like there's Jewel carvana and Airbnb these ones all have Industries um where it's null or blank and an industry is not null so that's really good now if we scroll over see the industry here this is our T1 this is our first table if we scroll over I bet we'll see the t2 industry where it's not null let's scroll over and here's our industry we have travel transportation and consumer so this worked exactly as we had hoped I can even um pull this up here just to show kind of show you a little bit easier what that's doing and we'll do t2. industry this is kind of like what we're trying to do so if it's blank this one is going to be populated into here if there is one that is not blank so that's essentially what we're going to do let's write the update statement and we're going to see if it works this we have to translate this to an update statement so we'll do update and we're going to update this right here so we'll say update T1 and then we'll do the join right there and now we have to do a set statement so we'll set the T1 dot industry equal to and I'll just copy this t2. I just don't like I don't like writing things out um then we say where so we do this like that and set a semicolon okay let's confirm so we're updating this table T1 we're joining on T2 or the company is the exact same we're setting T1 industry equal to T2 industry so the T1 should be the blank one so where the T1 industry is null or blank and T2 industry is not null let's go ahead and run this semicolon see if there were about three updated yep rematch zero was affected though let's go take a look um we have to run this query looks like those are still null let's run this uh that one is still blank now let me think here I'm trying to think of why this didn't work and I want to walk you through my thought process it is possible that because these are blanks and not nulls that it's not working I and I will say that is something I typically do where I set these blanks to nulls First so let's actually try that and see if that changes anything I'm just going to update on this I'm going to say set the industry equal to null we'll say where industry is equal to Blanks so we're just changing it to null where it's blank let's try this and let's go back down here to our select statement so these are all nulls okay I think I think this is now going to work because now you can see on this side it's gonna there's only one option for it to populate it before there were those blanks which I think was causing the issue um let's get rid of this part because now we have no nulls and now let's try running this we're workshopping this on the Fly guys uh let's see three rows affected hey oh all right let's go see if it worked um let's run this query and we have none that's perfect let's look at Airbnb hey all right all right ran into some issues but we worked through it we figured out the issue and now it's working properly and we can even come back up here to select everything and it looks like Bailey's is the only one that still has a null let's look up Bailey's real quick and we'll say where company is like uh Bailey let's run this yeah and there's only one so there wasn't another row all these other ones like carvana and um I can't remember the other uh Jewel and Airbnb those ones had an extra row they did multiple layoffs this one only did one layoff so we don't have another populated row where it's not null to actually populate the null row that's really all that happened uh that's why that worked that way so I'm really happy that worked awesome job guys uh I was starting to question myself do I even know how to use my sequel I mean I was really starting to question my abilities here um let's take a look uh I think that is all we're gonna do for populating null values now here's why things like total laid off percentage laid off um funds raised how are we going to populate that with the data that we have here I don't believe we can now we might be able to populate oops we might be able to populate some of this if we had the um company total like if we had the original Total before laid off because then we could do calculations like um oh these companies went completely out of business that's not good at one percent that means 100 was laid off um but if we had you know the total they had 50 employees and 100 were laid off we could populate the total Aid off whoops did it again we could populate the total Aid off by saying if this is 50 100 was laid off that's 50 people were laid off we don't have that data so we can't go and populate it I don't believe funds raised we might be able to scrape some data from the web and populate this but that's a totally different thing um not part of this project so I think the data cleaning for the null values and blank values I think that's going to be done um it's possible that the stage could be the same and if you want to go check you can but we're going to keep chugging along because we want to remove columns and rows that we need to now if you remember we were looking at this before did I save that query let's go look here we go bring this down to the bottom all right these rows let's let's really take a look at these and think about if this is going to be helped us um what we are trying to do with this data in the near future is we're not just trying to identify a company or a location that had layouts and maybe we are maybe that maybe we are trying to do that but these have no layoffs and no percentage laid off so in my opinion I don't know if these laid off any at all um I believe that we can get rid of these now deleting data is a very interesting thing to do you have to be confident am I 100 confident no not really but I'm confident enough to know that what we're about to look at in the next one we're going to be using these total laid off a lot essentially laid off a lot when we're looking at um you know actually querying the data and doing some exploratory data analysis so we're going to use these a lot I don't think uh these I'm not even sure if these are accurate I'm not even sure if they actually did have a layoff it's saying they did it doesn't show if they laid off any so um can we delete this yes should we delete this it's iffy uh I'm not 100 if I'm being completely honest and there's a lot of rows like that this is this could be like 100 or so I'm really not I mean I could run a query and run it but I don't want to I don't know it's not a big deal the point being I don't think we need this information so we're gonna get rid of it if nothing else just to show that you can do it so now we'll say uh Delete and then we'll do from here there we go so now we're going to delete these rows let's try to select them again and they are gone so we deleted the ones where the total laid off was blank and the percentage laid off was blank we just I can't trust that data I really can't um and let's go back down I'm right here semicolon so I sometimes I have to walk myself through these things um all right this ronum I mean comma we don't need that anymore let's get rid of it um so what we can do now it's a little bit different syntax we want to drop a column from this table so we have to do the alter table again we're going to alter table layoff staging two and then we're going to say drop column and row underscore num if we run this then we run the table again should be gone and it is so this is it this is our finalized clean data now in the next project we're going to be doing exploratory data analysis on this cleaned data we're going to finding Trends and patterns and running complex queries it's going to be phenomenal I'm super excited about it and I love this data cleaning one um I made some mistakes I'll be the first one to admit but cleaning data is not always a straightforward thing um you know you have to you kind of mess around with it figure it out and and you know that's what we did uh whoa took a while so just to recap we've removed duplicates we standardize the data we looked at the null values or blank values then we removed any columns or rows so we did a lot um if you go back and you actually scroll through here and look at some of this code that we wrote uh it's not super beginner stuff so if you're following along with these things and you are getting this project this is a fantastic project to put on your portfolio I myself would put this project on my portfolio because it's a very very relevant thing so I hope this was helpful I'm just going to keep scrolling while I talk I hope this was helpful I hope you learned something we did we did a lot of different things that we didn't even do in the lessons which I like doing because you can't cover every single aspect of my sequel in lessons right sometimes you just gotta get in there get into the nitty-gritty clean some data and you'll find or discover new things try new things um and now we're getting to the bottom and awesome work awesome awesome work uh this is an A1 project I think this should be in everyone's portfolio if I don't see it in your portfolio and you know you send it to me I'm gonna say it's the garbage portfolio so this is a good one so with that being said thank you guys so much for watching I if you made it all the way to the end you're still listening to me awesome work really awesome work for real it I you know you're just following along with a tutorial that's what it feels like but by the end of this I I just know you're learning a ton and you're you're trying new things and you're really pushing yourself Beyond just simple tutorials so trust me when I say this is not easy not everyone was able to make it to the end so great work getting here so I will see you guys in the next project when we actually explore this data walk through a lot of different ways to do that so thank you again for watching if you like this be sure to like And subscribe below I put out tons of content about all this stuff and I absolutely love it it is definitely one of my passions in life so go ahead and do that and I will see you in the next video foreign [Music]