Transcript for:
Comprehensive Data Analytics Project Guide

Hello everyone, welcome back to the channel. Today we are going to do an end-to-end data analytics project. We will use the Kaggle API to download a dataset from Kaggle using Python, load that data into pandas, and do some data cleaning — the data is dirty, so we have to clean it and create some new columns. After that processing we will load the data into SQL Server, and I have designed five to six questions based on that data which we will then solve in SQL. If you are a data engineer or data analyst, or aspiring to be one, this project is going to help you. I will share all the code on my GitHub and post the link so you can follow along, and I will explain everything step by step. So let's get started.

First of all, to use the Kaggle API we have to read the Kaggle documentation for the public API. I will share that link too, but let me quickly show you what you need to do. Go to kaggle.com — if you don't have a profile on Kaggle, sign up and create one — then click on Settings. If you scroll down in Settings there is an API section; click "Create New Token". It generates a token for you in the form of a kaggle.json file. The documentation says to ensure kaggle.json is in a specific location: inside your home directory there is a folder called .kaggle, and the kaggle.json file has to be placed there. So I go to Downloads, take the kaggle.json file, go to my home directory (on my machine that is under the C drive, then Users, then my user folder, where the .kaggle folder lives), delete the old copy, and place the new file there.

Once this is done, you have to get the link of the dataset that you want to download. We have a retail orders dataset — I will take you through the data later — and the dataset page shows who uploaded it and the dataset name, retail orders. I'll copy that link, because it is what we use to download the dataset.

Before I go there: I will be using Anaconda Navigator and launching Jupyter Notebook to do all my Python work. I have already launched it; under my Documents I created a "SQL plus Python project" folder, and you can create your own folder and a new notebook there. I have already created a notebook with some commands, so I will open it directly. This is the plan — we are going to do a lot of operations and learn a lot about pandas and then about SQL.

The first step is to have the Kaggle library installed: run pip install kaggle (prefixed with an exclamation mark when you run it from a notebook cell). I have already done it, so I'm not going to run it again, and then you say import kaggle. Once that is done you can run an API call to download the dataset. I have the command ready: kaggle datasets download, then the dataset identifier — from the dataset URL, it is the part after /datasets/ that you pick — and then the file I want to download, orders.csv.
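Roughly, the notebook cell looks like the minimal sketch below. This assumes kaggle.json is already in place in ~/.kaggle, and the dataset slug is a placeholder — substitute whatever follows kaggle.com/datasets/ in your dataset's URL.

```python
# Minimal sketch of the download cell (run inside a Jupyter notebook).
# <owner>/retail-orders is a placeholder slug -- copy the part of the
# dataset URL that follows kaggle.com/datasets/.
!pip install kaggle
import kaggle  # authenticates using ~/.kaggle/kaggle.json

# Download just the orders.csv file from the dataset; it arrives zipped.
!kaggle datasets download <owner>/retail-orders -f orders.csv
```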
That is the command we have to run, so I'll execute the cell. It says it is downloading orders.csv.zip, and if I go to my "SQL plus Python" folder you can see the orders.csv.zip file has been created — we have downloaded it from Kaggle to my local system. So the first task, downloading the dataset from Kaggle, is done; I'll mark it off.

Now notice this is a zip file. Inside it there is an orders.csv file, so I have to extract the data from orders.csv.zip. For that I will use the zipfile library. In the code, I import zipfile, open the file — orders.csv.zip, the file I need to extract — and call extractall. In extractall you can give a path; if you don't, it extracts to the same location where the zip file is. Currently my notebook and the zip file are in the same location, so I don't have to give a path; if your zip file is somewhere else, you can give the full path here (C:\Users\... or wherever). So I call extractall so the data is extracted in the same location, and then close the file. If I look at the folder again, the orders.csv file is now there. So our first task — downloading the data from Kaggle and extracting it from the zip to a CSV — is complete.

Now that the file is on my local system, I can read it directly into a pandas DataFrame. I will use the pandas library: import pandas as pd (if it's not installed, you'll have to pip install pandas first). Then I say df = pd.read_csv('orders.csv'), because this is a CSV file. orders.csv sits in the same path as my notebook, so I don't have to give a path; otherwise you would have to give the full path as well.
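A minimal sketch of the extract-and-load step, assuming the zip file and the notebook sit in the same folder:

```python
import zipfile
import pandas as pd

# Extract orders.csv from the downloaded archive into the current folder.
with zipfile.ZipFile('orders.csv.zip') as zip_ref:
    zip_ref.extractall()  # pass a path here if the zip lives elsewhere

# Read the extracted CSV into a DataFrame and peek at it.
df = pd.read_csv('orders.csv')
print(df.head(20))
```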
Let's look at the data: df.head(20) shows the first 20 records. This is how my data looks — order id, order date, and all the other columns are there. The important thing to look at is ship mode. In ship mode there are some 'Not Available' and some 'unknown' values, and we have to fix them because we want to treat them as null. In pandas nulls are represented as None/NaN, so we want all of those strings — 'Not Available', 'unknown' — to be read as null rather than kept as literal text.

First, let's see what distinct values exist in the ship mode column: df['Ship Mode'].unique() gives the distinct values (after I correct the spelling of the column name). We have Second Class, Standard Class, Not Available, unknown, First Class, and None. None is fine — it is already treated as null — but I also want 'Not Available' and 'unknown' treated as null. For that, if you press Shift+Tab on read_csv in the notebook you can see all the options, and I will use the na_values option: whatever values I pass to that argument are treated as null while the file is being read. So I read the file again with those two values passed to na_values and check the unique values — they are all treated as None/null now, because 'unknown' and 'Not Available' no longer appear in the data. That is the first thing we wanted to handle.

The second thing: when I pass the column name I have to type a space in the middle and capitalize the S and the M ('Ship Mode'). That is not good practice; ideally the column name should be ship_mode — everything lowercase, with an underscore instead of a space. So we will fix the column names of this DataFrame. Looking at df again, there are many columns — order date, sub category, and so on — that we basically need to rename. One method is df.rename, passing a dictionary of what to rename: columns={'Order Id': 'order_id', 'City': 'city'}, for example. If I run this, order id and city are renamed — but that way I would have to list all the columns, and there are around 15 of them, which is not a good way of doing it. So we will do it another way. We can see all the columns with df.columns, which gives a list of the column names. I can treat them as string values and apply string functions — for example str.lower(), which converts each value to lowercase — and assign the result back to df.columns. Checking df.columns again, everything is lowercase. Then we have to replace the spaces with underscores, so again I use the string replace function to convert spaces to underscores. Running this, that is fixed too — wherever there was a space we now have an underscore — and I assign it back to df.columns.
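A sketch of those two cleanup steps, assuming the raw headers use spaces and title case as described:

```python
import pandas as pd

# Re-read the file, treating the placeholder strings as proper nulls.
df = pd.read_csv('orders.csv', na_values=['Not Available', 'unknown'])
print(df['Ship Mode'].unique())   # the placeholders now show up as NaN

# Normalize the column names: lowercase, spaces -> underscores.
df.columns = df.columns.str.lower().str.replace(' ', '_')
print(df.columns)
```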
So that task is also done. If I look at the data again with df.head(5), all the column names are sorted out. We handled both things: the null values (those placeholder strings now come through as None) and the column names.

Now we have to derive some new columns. Currently we have three relevant columns for each order: cost price, list price, and discount percent. The cost price is what the item costs the company, the list price is the MRP, and on the MRP they give the discount percent. From the percentage we have to derive a discount column holding the exact discount amount. For example, if the list price is 260 and the discount is 2 percent, then 260 multiplied by 0.02 gives the discount amount. So I compute df['list_price'] * df['discount_percent'] / 100 (dividing by 100 is the same as multiplying by 0.01), and indeed 2 percent of 260 is 5.2. I put this in a new column, df['discount'] — the exact discount we are giving. Looking at df again, the discount column is created: 5.2, 21.9, 0.5, and so on.

Next we need a sale price, which is the list price minus the discount. So I say df['sale_price'] = df['list_price'] - df['discount']; running this, we have the sale price as well — 260 minus the 5.2 discount is 254.8. Next we have to find the profit: profit is the sale price minus the cost price, so df['profit'] = df['sale_price'] - df['cost_price']. Looking at df now, we have the profit — the company bought the item at 240 and sold it at 254.8 after the discount, so 14.8 (dollars or rupees, whatever the currency) is the profit.

So we have derived all the columns that are required. At this point the cost price, list price, and discount percent columns are no longer needed — we can do all the analysis using the three derived columns — so we will drop the columns that are not required.
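A minimal sketch of those derivations, assuming the renamed column names from the previous step:

```python
# Derive the absolute discount, the sale price, and the profit per order.
df['discount']   = df['list_price'] * df['discount_percent'] / 100
df['sale_price'] = df['list_price'] - df['discount']
df['profit']     = df['sale_price'] - df['cost_price']
```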
Before dropping them, one more thing: look at the data types. df.dtypes shows the data type of each column, and order date is an object (a string), even though it looks like a date in yyyy-mm-dd form. We want to convert it into a date, because when we load it into SQL Server it should go in as a date, not a string — if it goes in as a string we won't be able to do the date operations. So we use pandas' to_datetime method: we pass the column, df['order_date'], and tell it the format the string is in. My string is yyyy-mm-dd, so because the year has four digits I give %Y (capital), then %m and %d (lowercase), with hyphens in between — that is the format of my string, and I'm saying convert it to a date using that format. Running this shows a datetime data type, and I assign it back to order_date on the original DataFrame. If I look at df.dtypes now, order_date is a datetime column.

Now we drop those three columns, as I said we don't need them. I say df.drop and pass columns= with the column names (if you wanted to drop rows you would pass index= with index values instead): list_price is not required, discount_percent is not required, and cost_price is not required either. When I run the drop, it shows me the data after dropping the columns — but in the actual DataFrame they are still there, not dropped, because drop returns a new result by default; if you check, all the columns (list_price, discount_percent, cost_price) are still present. So we have to pass inplace=True, which means make the change in the DataFrame itself: it will not return anything, it will just drop the columns. Check again and they are gone — so remember to use inplace=True.

That is everything so far: all the cleaning, the new columns, and dropping the columns we don't need. The next step is to load this data into SQL Server, and for that I will use a few more libraries — pandas is already imported, and I will use SQLAlchemy to connect to my SQL Server. If you open the SQL Server instance and look at its connection properties you can see the server name; that is the server I will connect to. In the code I create a connection using SQLAlchemy's create_engine method. In the connection string I say mssql because it is Microsoft SQL Server (if it were MySQL you would say mysql, if it were Postgres you would say postgresql), then the server name, then master — master is the default system database; there are many databases, and I will connect to this one — and then the ODBC driver for SQL Server. If you don't know which ODBC driver you have, open "ODBC Data Sources" on Windows and it will show you the ODBC Driver for SQL Server (it is installed along with SQL Server); if you're using another database, just Google how to find its driver. Since this is Windows authentication, I don't have to pass any username or password.
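A sketch of those finishing steps plus the connection setup. The server name, instance, and ODBC driver version in the connection string are placeholders — substitute whatever your own SQL Server instance and installed driver are called:

```python
import pandas as pd
import sqlalchemy as sal

# Convert order_date from string to datetime using its yyyy-mm-dd format.
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')

# Drop the raw pricing columns we no longer need, in place.
df.drop(columns=['list_price', 'cost_price', 'discount_percent'], inplace=True)

# Windows-authenticated connection to a local SQL Server instance.
# 'YOUR_SERVER' (or 'YOUR_SERVER\\INSTANCE') and the driver name are placeholders.
engine = sal.create_engine(
    'mssql://YOUR_SERVER/master?driver=ODBC+Driver+17+for+SQL+Server'
)
conn = engine.connect()
```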
Then I say connection = engine.connect(), and if I run this it is connected — I have a connection created. Now I will load the data into SQL Server. The command is df.to_sql — df is my DataFrame, and I'm saying load this data into SQL; the table name will be df_orders, and con is my connection, because it needs the connection information for which SQL Server to load the data into. I also pass index=False: when you create a DataFrame, an index column (0, 1, 2, ...) is created, and I don't want that column in my table — I want it to start from order_id. And I say if_exists='replace', which means if the table is already there, replace it — drop and recreate it. Currently the df_orders table does not exist, so with the replace option it will create the table automatically. I run it, it takes a few seconds, and it is done. If I go to SQL Server — df_orders was not there before — and refresh, you can see df_orders has now been created. So I run select * from df_orders: the data is there, and those placeholder values are coming through as null, which is what we wanted; otherwise 'Not Available' and 'unknown' would have come through as text, which is not a good thing. The order date comes in as a date (a datetime, actually — we don't have a time component, but we can work with it), and everything is lowercase with underscores. So the data is clean and we can do analysis on it.

There is still one problem, though. If you look at the columns of the df_orders table, the data types are not proper — ship_mode, for example, is coming through as varchar(max), which takes more memory than needed, and quantity is coming through as bigint when int is enough. So we have to change this. The best practice is: don't use the if_exists='replace' option, because pandas will create the table with the widest possible data types (varchar(max), bigint, and so on). Instead, create the table yourself — once you have the DDL you can modify it and create the table — and then use append instead of replace, which means it will not create the table; it will append the data into whatever table already exists. So I drop the table (drop table df_orders) and use my DDL to create an empty table. How many columns are there? Counting them in the DDL gives 16, and df.columns on our DataFrame also shows 16 columns, so that lines up. With the empty table created, I change replace to append and run the load again — this time it just loads the data and does not try to create the table, because we said append. Then select * from df_orders: the data looks fine, and after a refresh the column types are also fine — we have a primary key, order_date is date instead of datetime, city and the other text columns are proper varchar(20) or varchar(50) rather than varchar(max), and quantity is int. We decided what data types we wanted, and then we loaded the data.
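A rough sketch of that approach. The transcript does not read out the full CREATE TABLE statement, so the column list and types below are illustrative assumptions — the point is simply to choose sensible types yourself and make the table match df.columns before appending:

```python
# Illustrative DDL -- pick your own sizes/types (varchar(20) instead of
# varchar(max), int instead of bigint, date for dates) and fill in the
# remaining columns so the table matches df.columns. Run it once in SSMS
# or via conn.execute before loading.
ddl = """
create table df_orders (
    order_id     int primary key,
    order_date   date,
    ship_mode    varchar(20),
    region       varchar(20),
    category     varchar(20),
    sub_category varchar(20),
    product_id   varchar(50),
    quantity     int,
    discount     decimal(7,2),
    sale_price   decimal(7,2),
    profit       decimal(7,2)
    -- ... remaining columns ...
)
"""

# With the table pre-created, append into it instead of letting pandas
# create the table with over-wide types.
df.to_sql('df_orders', con=conn, index=False, if_exists='append')
```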
So this first exercise is completely done. It is basically a small ETL project: we extracted data via the Kaggle API, did the data cleaning and transformation in Python, and loaded it into SQL Server. Now we will do some data analysis.

Here are the questions; we will tackle them one by one. (If you have watched the video up to this point, please do like it, subscribe to the channel, and share it with other people — it takes a lot of effort to create these kinds of videos.)

The first question is: find the top 10 highest revenue generating products. We have to find which products sold the most, based on product_id and the sale price. First, for each product, I get the total sales: select product_id, sum(sale_price) as sales from df_orders group by product_id. Running this, we have the sales per product. Now I order by sales descending — I can use the alias sales directly because, in the order of execution, order by runs after select, so the alias is available — and these are my products in descending order of sales. To get just the top 10 I add top 10. This is where the order of execution matters: order by runs after select (so the alias works), and top is applied at the end, after order by. So we order by sales descending and take the top 10, which gives the top 10 products by revenue. That answers the first question.
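A sketch of that query, run from the same notebook session against the df_orders table — pd.read_sql is just one way to execute it; you could equally run the SQL in SSMS:

```python
# Question 1: top 10 highest revenue generating products.
q1 = """
select top 10 product_id, sum(sale_price) as sales
from df_orders
group by product_id
order by sales desc
"""
print(pd.read_sql(q1, conn))
```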
Let's move on to the second question: find the top 5 highest selling products in each region. The previous result was overall; now we have different regions — select distinct region from df_orders shows there are four — and within each region I need the top 5 highest selling products. This is a very common interview question as well. I'll take the previous query and introduce region: within each region, for each product, what is the total sales? So I select region, product_id, sum(sale_price) as sales, group by region and product_id, and order by region and then sales descending. Because it sorts first on region and then on sales, the first five rows for the Central region are its top five products; scrolling down, the same holds for the East region, and the top products can be different in different regions. But I need a total of 20 records — five for each region — so I have to generate a rank.

To generate the rank, I wrap the query in a CTE (with cte as ...), remove the order by for now, and select from the CTE with a window function — row_number or rank, whichever you prefer — over (partition by region order by sales desc), aliased as rn (I had to close the bracket and add the alias). I partition by region because I want the ranking to start separately for each region. Running it, the top five products in the Central region get ranks 1 through 5; for the East region the ranking starts again from 1, because partition by region restarts the numbering per region, and the products with almost zero sales end up with the lowest ranks. So now I select * from this — either as a subquery with an alias or as another CTE — where rn <= 5, and for each region it gives the top five products by sales: top five for Central, East, South, and West, 20 records in total. That is the answer to the second question.
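A sketch of that query in the same style:

```python
# Question 2: top 5 highest selling products in each region.
q2 = """
with cte as (
    select region, product_id, sum(sale_price) as sales
    from df_orders
    group by region, product_id
)
select * from (
    select *,
           row_number() over (partition by region order by sales desc) as rn
    from cte
) ranked
where rn <= 5
"""
print(pd.read_sql(q2, conn))
```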
Next question: find the month over month growth comparison for 2022 and 2023 sales. The data we have covers 2022 and 2023 — we can check with select distinct year(order_date) from df_orders (I accidentally queried a different table first, which is why some other years showed up); against df_orders it confirms two years of data, 2022 and 2023. What we want is a comparison for each month: for January, the sales in 2022 and the sales in 2023 side by side, and similarly for February and the rest — in other words, we want to pivot the data. For that I first need the year, year(order_date), and the month, month(order_date), plus the sales, sum(sale_price) as sales, grouped by year and month. Running this quickly, we have the sales for each year-month combination; I add an order by and alias the columns as order_year and order_month, so I have 2022 months 1 through 12 and then 2023 months 1 through 12.

Now, for each month, I have to put the two years into separate columns: for order_month 1, the 2022 sales in one column and the 2023 sales in another. So I again use a CTE — with cte as (the query above) — and select from it. I want one row per month, so eventually I group by order_month, and I build the columns with case expressions: case when order_year = 2022 then sales else 0 end, and a similar column for 2023. If I run this without the group by (and after removing the order by, which is not allowed inside the CTE/subquery), I get two rows for each month: the 2022 row has the 2022 column populated and zero in the 2023 column, and the 2023 row has zero in the 2022 column and its value in the 2023 column — selecting order_year as well makes that clearer. Since each value is populated in only one of the two rows, I wrap the case expressions in sum() so they aggregate into a single row per month, add the group by order_month, and run it one final time (after removing order_year, which is no longer needed and caused a small error). Now, for each of the 12 months, we have the 2022 sales and the 2023 sales side by side — exactly what this question asked for.

Let's move on to the next question: for each category, which month had the highest sales? First, calculate the month-wise sales within each category — for Technology, for example, there are 24 months across 2022 and 2023, and we need the one with the highest sales; similarly for Furniture. The month has to go together with the year, because we have two years of data — if we ignored the year, the same calendar month from both years would get merged, which would be wrong. So we need a year-month value. One way is the format function: format(order_date, 'yyyyMM') formats the date as year plus month (you could also use two separate year and month columns, but this is a convenient function). Running it, the dates come out as values like 202303 or 202308 — year and month, no day — and I can aggregate directly on this. So I select it as order_year_month along with category and sum(sale_price) as sales, and group by category and the year-month. Adding an order by makes it clearer: for Furniture we see the sales for each month of 2022 and then 2023. Now we have to find, for each category, the month with the highest sales, so the result should have three rows, one per category. Again I turn the query into a CTE and rank: select *, row_number() over (partition by category order by sales desc) as rn from the CTE — partition by category so each category gets its own ranking, ordered by sales descending so the highest-selling month comes first (and again, order by is not allowed inside the CTE itself). In the output, Furniture's highest-selling month turns out to be a month in 2022, while for Office Supplies and Technology it is a month in 2023. Then, as before, I select * from this — you can use another CTE or a subquery with an alias — where rn = 1, since we want just the highest-selling month. That gives the result: the highest-selling month for each category. I hope that makes sense.
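Sketches of both queries, again run from the notebook; the pivot column aliases are just illustrative names:

```python
# Question 3: month over month sales comparison, 2022 vs 2023.
q3 = """
with cte as (
    select year(order_date) as order_year,
           month(order_date) as order_month,
           sum(sale_price) as sales
    from df_orders
    group by year(order_date), month(order_date)
)
select order_month,
       sum(case when order_year = 2022 then sales else 0 end) as sales_2022,
       sum(case when order_year = 2023 then sales else 0 end) as sales_2023
from cte
group by order_month
order by order_month
"""
print(pd.read_sql(q3, conn))

# Question 4: for each category, the month with the highest sales.
q4 = """
with cte as (
    select category,
           format(order_date, 'yyyyMM') as order_year_month,
           sum(sale_price) as sales
    from df_orders
    group by category, format(order_date, 'yyyyMM')
)
select * from (
    select *,
           row_number() over (partition by category order by sales desc) as rn
    from cte
) ranked
where rn = 1
"""
print(pd.read_sql(q4, conn))
```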
Now the last question: which sub_category had the highest growth by profit in 2023 compared to 2022? So we need to find which sub_category saw the highest growth. We have already done some year-wise analysis, so let me copy the query from the month-over-month question and make a few changes. The question is about sub_category, so first I introduce sub_category. We are not concerned about the month here — we only want it year-wise — so I remove the month from both the CTE and the outer query, keeping sub_category and the year, because we need to see which sub_category had the highest growth in 2023 compared to 2022. For each sub_category I want those two numbers, so I add sub_category to the select and the group by. Running this quickly: for each of the 17 sub_categories I now have the 2022 figure and the 2023 figure side by side. Now I just need to see which sub_category has the highest growth, so I make this another CTE (cte2), select from it, and compute the growth as (sales_2023 - sales_2022) / sales_2022 * 100. Running this gives the growth percentage for each sub_category — going from about 77,000 to about 83,000 is roughly an 8 percent increase, for Art the sales decreased, for Binders they increased, and so on. I can then use this expression in the order by, descending, because we want the sub_category with the highest growth, and finally say top 1, since we only need the single highest. (A sketch of this final query appears at the end of the transcript.) Running it, Supplies saw the highest growth in percentage terms — around 80 percent, from roughly 16,000 to 28,000.

One caveat: if the question is about absolute growth rather than percentage growth, you don't divide by the 2022 value, and the answer can be different. If I remove the division and look at plain growth, it is no longer Supplies — it is Machines, with a growth of about 35,000. So be careful about exactly what the question asks and compute accordingly.

So that is your end-to-end project; I hope you liked it. Please do subscribe to the channel, like the video, and share it with your friends. We have done everything from the Kaggle API, to downloading, cleaning, and preprocessing the data, to loading it into SQL Server and doing the data analysis. You will find the link in the description box to get all the documents and all the code. Thanks for watching, and have a good day.
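For reference, a sketch of that last query. It follows the walkthrough in comparing sum(sale_price) year over year; if you want growth strictly by profit, swap sale_price for profit, and drop the division by the 2022 value to rank by absolute rather than percentage growth:

```python
# Question 5: sub_category with the highest growth in 2023 vs 2022.
q5 = """
with cte as (
    select sub_category,
           year(order_date) as order_year,
           sum(sale_price) as sales
    from df_orders
    group by sub_category, year(order_date)
),
cte2 as (
    select sub_category,
           sum(case when order_year = 2022 then sales else 0 end) as sales_2022,
           sum(case when order_year = 2023 then sales else 0 end) as sales_2023
    from cte
    group by sub_category
)
select top 1 *,
       (sales_2023 - sales_2022) * 100.0 / sales_2022 as growth_pct
from cte2
order by growth_pct desc
"""
print(pd.read_sql(q5, conn))
```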