Transcript for:
Comprehensive Python Data Science Course

This comprehensive Python data science course covers the essentials through theory, demos, and real-world applications, with two detailed projects. It is designed to give you practical experience that prepares you for real-world data science: hands-on knowledge in data analytics, A/B testing, and business intelligence. If you are an aspiring data analyst or data scientist, or you want to get into machine learning or AI, then mastering the basics of data and analytics is your starting point.

In this comprehensive six-plus-hour course we start with the Python implementation of data analytics, covering the basics of data analytics in Python programming. After that we get into A/B testing theory, which is fundamental for any data analyst or data scientist who wants to drive experimentation-based changes in a product, from UX design to algorithms, using data.

The course consists of three parts. The first part is the Python data analytics course, where we cover the basics of performing data analytics, including data visualization and data processing in Python. The second part covers the theory of data analytics and data-driven experimentation, which is fundamental for any data analytics or data science professional. Here you will learn A/B testing theory, starting from the hypothesis and the business problem up to conducting the data analysis on the collected data, so you can make a data-driven decision for different kinds of online problems. In the third part we conduct two end-to-end portfolio projects that you can also put on your resume. The first is an end-to-end A/B testing analytics project: in roughly one and a half hours we go from the basics of analyzing A/B test results in Python to the actual implementation, conducting the data analytics in Python programming. It is an online-testing project where we use data analytics and Python programming to drive the UX design decision for the landing page of LunarTech.
The second portfolio project is another one-and-a-half-hour end-to-end data analytics project where we look into data analytics for a Superstore. Those two projects, about three hours in total, are a great way to put the theory into practice in an actual, real-life business setting.

I am a data scientist and AI professional, and I have been in this field for more than five years. I am a co-founder of LunarTech, where we are making data science and AI accessible to everyone: individuals, businesses, and institutions.

Here is what we are going to cover as part of this full data analytics course. In the first part we cover data analytics in Python programming, so you are expected to know some basics in Python, but not more. We learn how to load data in Python using pandas, how to do data wrangling and data preprocessing using libraries such as NumPy and pandas, and then the data processing techniques: sorting, filtering, and data aggregation; how to join data using different joins, including inner join, left join, left anti join, and right join; how to do statistics-related tasks, including calculating descriptive statistics for our data in Python; and how to do data sampling in Python using different sampling techniques. We also look into data visualization in Python, which is really important for a data analytics professional when it comes to bringing the theory of data analytics into practice.

Once we are done with the practical programming section for data analytics in Python, we get into the second part of the course, which is about A/B testing. Here we start with a quick high-level theory behind A/B testing and then dive deep into it: the idea of A/B testing and online experimentation, how data analytics is relevant for A/B testing, and the entire cycle of an A/B test from the design up to the data analysis of the final results. Be prepared to learn concepts like the primary metric, the design of the test, how to design a proper A/B test including choosing the right parameters, and the calculation of the minimum sample size. As a prerequisite for this part of the course you need some fundamentals in statistics: the basics of probability and probability theory, the concept of the normal distribution, and how you can use a sample to derive insights about an entire population.

Once we are done with the theory behind A/B testing and have learned how to conduct the data analysis for an A/B test, we are ready for the third and final part of this data analytics full course, where we conduct two end-to-end case studies. In the first one we perform data-driven decision making for the LunarTech landing page, using data analytics, data visualization, and A/B testing to understand whether we need to replace our current button. Expect to use Python here: in this one-and-a-half-hour project we conduct data wrangling, data preprocessing, and data visualization, and then analyze the results and make a decision by combining the theory from the second part of the course with the programming we learned in the first part of the course.
Once we are done with this first end-to-end portfolio project, we are ready for the second project of this final part, which is a pure data analytics case study: data analytics for a Superstore. We start with an overview of the analysis, then analyze the Superstore customers, see what techniques we can use to conduct a Superstore customer segmentation analysis in Python, analyze the revenue of the Superstore by customer segment, explore customer loyalty at the Superstore, and finish with the insights derived from this analysis of the customers and their sales by segment, before concluding. By the end of this course, expect to have learned all the essentials for your data analytics journey. So, without further ado, let's get started.

Hi there and welcome back. In this demo we are going to talk about how to load data and view it in order to obtain more information about the data provided to us. We are going to learn how to load CSV files, txt files, Excel files, and JSON files, and also how to load data from a SQL database. For this we are going to use primarily the pandas library, a library we spoke about in the previous demo, but we are also going to use some other libraries.

All right, so without further ado, let's go ahead and learn how to load CSV files in Python. The first thing I am going to do is import the pandas library: import pandas as pd. The next thing is to pick the name of the CSV file. You might notice that on the left-hand side in our PyCharm project we have a file called percent-bachelors-degrees-women-usa.csv.
This CSV file contains the following data: you can see we have information about the year, agriculture, architecture, art and performance, and so on. You might have already guessed that we are dealing with data describing the percentage of women who completed a bachelor's degree in the corresponding field in the corresponding year.

Let's go ahead and load this data into Python. For that I am going to use the pandas library, and as an alias we conventionally write pd. Let's name our data frame data_csv, and it will be equal to the library name, pd, then a dot, then read_. As you can see we get many options: read_csv, read_excel, read_html, read_json, read_parquet, read_pickle, read_sas, all sorts of file formats you can import. We are going to learn a few of them, the most popular file formats you can expect when entering the data science field. As we have a CSV file, we use the read_csv option, and within the parentheses we always need to specify the name of the file we are dealing with, as a string, so within quotation marks. Let's go ahead and print our data frame to see what is going on. Here we go: we are getting our data nicely. The header is recognized, so we see the column names, we also see the indices corresponding to our observations, and this is a great way to look into your data for the first time.

In the same way, using exactly the same read_csv function, we can also load a txt file. Txt and CSV files are pretty similar to each other. In the case of CSV files, which stands for comma-separated values, we usually do not need to specify that the separator is a comma: as you can see, this is a CSV file, and the values corresponding to each of the columns are separated by commas. With a txt file we do not know in advance what the separator is; sometimes it is a comma, sometimes a space, sometimes an entirely different character, so it really depends on the data provided to you. One simple way to load a txt file is to use exactly the same read_csv function. Here we have two different txt files, student_grades.txt and student_schools.txt, so let's use them: we write data_txt equal to pd.read_csv and then, within the parentheses, the path, student_schools.txt.
Before moving on to the other arguments, let's click on the file to see what it looks like. You can see we are dealing with a txt file that does have a header: we have name, school ID, and country, which are the names of the corresponding columns, and then here we have the names, the school IDs, and the countries. Another thing we can notice is that the separator is the comma. By the way, there is no separate read_txt function; we simply use read_csv here, for simplicity. So we mention that header is equal to zero, meaning the first row corresponds to the header and should not be counted as data, and then, just to know how to use it, the next argument is the separator: we specify that the separator used to separate each column's value within a row is the comma. If you were supplied with data in a more difficult format with a different separator, say the percent symbol, then you would specify that symbol instead: whatever character is used to separate your data is exactly what you need to put here, so that Python understands where it needs to cut and which value corresponds to which column. The same holds for the header: if your header is not present, you need to say so in the header argument.

All right, let's load this data and see what is underneath: print(data_txt). Here we go. As you can see, we nicely get all seven rows: the first name is Tina, the last one is Anna, and the countries go from Canada down to Armenia, the last one. Let's check that against the file itself. I always recommend checking the first and the last rows of the data to make sure that you have loaded your data correctly and are not missing any information.
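Below is a minimal sketch of the two read_csv calls from this demo. It assumes the CSV and txt files sit in your current working directory; the alternative-separator call at the end is purely illustrative.

```python
import pandas as pd

# CSV file: pandas infers the header from the first row and uses "," by default.
data_csv = pd.read_csv("percent-bachelors-degrees-women-usa.csv")
print(data_csv)

# txt file: loaded with the same read_csv function; here the header row and the
# comma separator are stated explicitly (both happen to be the defaults).
data_txt = pd.read_csv("student_schools.txt", header=0, sep=",")
print(data_txt)

# If a file used another separator, e.g. the percent symbol, and had no header:
# data_other = pd.read_csv("some_file.txt", sep="%", header=None)
```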
The next thing we are going to learn is how to load Excel files. In an Excel file you can have multiple sheets, with the name of your first sheet equal to, say, Sheet1 and the second one corresponding to another name. Let's say you have an Excel file and you want to load only the first sheet. I am not going to look into a specific Excel file here; feel free to search for one online, or create one yourself with your own sheets, and try to load it in Python. For now, let's assume we do have that Excel file in our PyCharm environment. So data_excel will be the name of the data frame in which we store the data, and the function we can use is pd.read_excel, for which PyCharm already gives us a recommendation. Here we have file.xlsx, where .xlsx is a common extension for Excel files and file.xlsx is the name of your Excel file; I am assuming it is inside the folder you are currently using in PyCharm. The next thing we need to specify is the exact spreadsheet we are looking into, because otherwise you will get an error and Python will not recognize where exactly it needs to look for the data. Therefore we use the argument called sheet_name, where you give the name of the exact sheet you want. It can be the default name used in Excel, Sheet1, which is the common convention whenever you have not changed the name of your spreadsheet, but if you or someone else has renamed it, say to "first spreadsheet", then you need to specify that specific name. And this is how you can read an Excel file. I won't be running this code because we do not have file.xlsx in our folder, but this is something you can experiment with yourself.

Another common file format you can expect is the JSON format. Here, once again, we follow the same idea as with Excel files, so feel free to look for a JSON file online, download it, and try to load it into your PyCharm environment. This is the way you can load JSON data: data_json is equal to pd.read_json, and then you specify the file name, say file_name.json. That is the only thing you need to specify, and once again I am assuming that file_name.json is actually in the folder you are currently running; otherwise you need to specify the exact path of the file. Once you write this, you should be able to successfully load your JSON data in PyCharm.
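Here is a minimal sketch of those two calls, assuming placeholder files file.xlsx (with a sheet literally named Sheet1) and file_name.json exist in the working directory; read_excel also needs an Excel engine such as openpyxl installed.

```python
import pandas as pd

# Excel: sheet_name selects which spreadsheet to load; "Sheet1" is Excel's
# default sheet name, so replace it with whatever your sheet is actually called.
data_excel = pd.read_excel("file.xlsx", sheet_name="Sheet1")

# JSON: only the file name (or the full path) is needed.
data_json = pd.read_json("file_name.json")

print(data_excel.head())
print(data_json.head())
```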
Finally, we will look into a way to load SQL databases. SQL databases are a common database format whenever you are working with big data, and this is very common in the field of data analytics, so I think it is still worth knowing at least the commands and the library you can use in Python to load this type of data. Let's go ahead and import the corresponding library, which is called sqlite3: import sqlite3. The first thing we need to do is make a connection with the SQL database, and that is exactly what we can do with this library: connection_db is equal to sqlite3.connect, and here we need to specify the name of the database we are dealing with, database_name.db, where the extension is .db for database. In this way you make a connection with the corresponding database.

The way SQL works is that in SQL we create databases, within each database we can have multiple tables, each table has its own name, and when we load a table we can run queries on it. I won't go into too much detail about what SQL is, how you can use databases, how you can create tables, or how you can run queries, because that is outside the scope of this course, but I would highly suggest that you at least learn the basics of SQL. It is not necessary for entering the field, and therefore it is also not included in this course, but it is good to know what SQL is, how you can use it, and what its role is in the entire world of data science. I will include some resources about SQL and its usage in the resources section. Just know that in order to be a technical data scientist, at least to enter the field of data science, you do not need to know SQL; it is something I would highly suggest you learn as you grow your career, but it is not a must-know.
Once we have made the connection with our database called database_name.db, we can specify the exact query we want to run. In this specific scenario, what we mean by a query is that we use the commands commonly used in SQL to select rows from a specific table. In our database we can have multiple tables, and here I will assume we have a specific table from which we want to import, let's say, the first column only. For that we write a query; let's call it our first query: query_1 is equal to, within quotation marks, SELECT, which is the common way of saying that we want to select specific variables from our table, then the name of the column we want to import, say col_1, and then FROM followed by table_name. This query will select the first column from the table called table_name. That is one way of running a query and selecting just one variable. We could also write a query, call it query_2, that selects all variables from table_name: SELECT and then a star. In SQL, whenever we say SELECT *, it selects all the columns included in that table, which is usually what we prefer instead of selecting just one variable; you would select certain variables only if you are specifically looking for those features, but otherwise I would suggest including all the columns.

All right. We then write the name of the data frame we want to load this data into, and then pd.read_, and as you can see we already get suggestions: there are three different functions we can use, read_sql, read_sql_query, and read_sql_table. They differ in the way they import the data; for one of them you can also specify a schema, for another you can specify the index column. The most generic one is read_sql, similar to read_csv, so we are going to use that one. The next things we need to specify are the query and the connection: we pass query_2 and then the connection. Once you run this code, it will make a connection with your SQL database, take the query, select all the features from the table called table_name, import all the variables, and store the result in a pandas data frame. This is all for this demo, and I will see you in the next one.
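A compact sketch of that flow, assuming a SQLite database file named database_name.db with a table table_name containing a column col_1 (all placeholder names from the demo):

```python
import sqlite3
import pandas as pd

# Make a connection with the SQLite database file.
connection_db = sqlite3.connect("database_name.db")

# Queries are plain strings: one selects a single column, the other all columns.
query_1 = "SELECT col_1 FROM table_name"
query_2 = "SELECT * FROM table_name"

# read_sql runs the query over the connection and returns a pandas DataFrame.
data_sql = pd.read_sql(query_2, connection_db)
print(data_sql.head())

connection_db.close()
```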
In this demo we are going to continue the process of looking into the data. In the previous demo we learned how to load different sorts of data, and in this one we are going to learn how to explore the data and how to preprocess it. We are going to discuss inspecting the data and getting information about it: getting to know what the shape of the data is, how to identify missing values, how to drop missing values, how to fill in missing values, how to get the types of the data you are dealing with, and how to access different rows in your data frame using the well-known iloc and loc, and what the difference between the two is.

From the previous demo we still have the CSV file, the data describing the percentage of bachelor's degrees completed by women in the USA. This is how the data looks: we had the year, agriculture, architecture, art and performance, and so on, and in total we have 18 columns and 42 rows in this data frame. The rows are the observations, and the columns are all the features included in the data frame; we have one feature describing the year, and the rest are the names of different fields of bachelor's study, such as agriculture, architecture, social sciences and history, public administration, and so on.

The first thing we are going to look into is how to use the head functionality in Python to get a snapshot of the data. What we do is write the name of the data frame, data_csv, then a dot, then head, and we can leave the parentheses empty. When we do that, we get the following output: it prints the first five rows with all the columns. If we want a specific number of rows in the snapshot, we can pass that to the head function; for instance, we can ask for the first 100 observations, but as we only have 42 rows it will print all of them, and if we change this to 20, say, we get the top 20 observations in the data. That is the head function, a good way to have a first look at what kind of variables you have, what the first and last columns are, how many observations and how many columns you have, and what the different data types in your data frame are. Just by visual inspection you can see, for instance, that the year column consists of integers, while agriculture and all the other variables are of floating-point type, meaning we have a number, then a dot, then the decimal part.

If the head function prints the top X observations, we can do exactly the same from the bottom using a function called tail: we can print, for instance, the last 20 rows using data_csv.tail, with the number of rows we want to see from the bottom up inside the parentheses. Let's print that, and as you can see, where the head function showed the first 20 rows, the tail function shows the last 20 rows, from index 22, 23, up to 41. This is a great way to see how the first few and last few rows look.

The next thing we can do is use the info function to obtain more information about our columns, specifically their data types. This is the output of the info function: it lists the columns, in this case year as the first column, agriculture as the second, and social sciences and history as the last, together with the count of non-null values. As you can see, all the columns have 42 non-null values, which means we do not have any missing observations. Then we have the data type corresponding to each feature, and we already saw from the snapshot that year was the only variable of integer type and everything else was of floating-point type, so this is exactly the confirmation of that observation.
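A minimal sketch of those inspection calls on the demo file (assuming it sits in the working directory):

```python
import pandas as pd

data_csv = pd.read_csv("percent-bachelors-degrees-women-usa.csv")

print(data_csv.head())     # first 5 rows by default
print(data_csv.head(20))   # first 20 rows
print(data_csv.tail(20))   # last 20 rows
print(data_csv.shape)      # (rows, columns) -- (42, 18) for this file
data_csv.info()            # column names, non-null counts and dtypes
```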
The next thing we can do is identify missing values and drop them. From the info output we can already see that we do not have any missing values, but let's learn how to do it anyway. Let's say we want to drop all observations with missing values: we take the name of the data frame, data_csv, and call dropna with parentheses. This drops all the rows where an observation has an NA for some column. As we do not have any missing values in our data frame this will not do much in our case, but it is really important to know how to drop missing values when you want to quickly remove them from your data.

Let's say you do not want to drop the missing values but want to fill them with a certain value instead. Then you can use the fillna function, and within the parentheses you specify the value you want to use to fill the NAs. Here you could, for instance, decide to put null instead of NA, and this will simply replace all the values where NA is written with null values.

Now let's say you have another issue with your data: rows that are exact copies of each other. One function you can use is drop_duplicates: you simply take the data frame's name and call drop_duplicates, and this quickly removes all the duplicates from your data frame. Let's actually change the CSV file and see whether we can nicely remove the duplicates. Let me copy-paste this row a few times, so we now have an exact copy of the third row in the fourth, fifth, and sixth rows, and let's check whether drop_duplicates really removes those duplicates by printing the data frame before and after. Here you can see that before, the data frame has 45 rows in total, because we added three rows to the original 42, while the number of columns stays the same, and after applying the drop_duplicates function we once again end up with 42 rows. If we look at the specific year we duplicated, we see only one row corresponding to that year, and this is how we know for sure that drop_duplicates really works and removes the duplicates from your data.
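A short sketch of those three cleaning operations; note that each call returns a new DataFrame rather than modifying data_csv in place:

```python
import pandas as pd

data_csv = pd.read_csv("percent-bachelors-degrees-women-usa.csv")

# Drop every row that contains at least one missing (NA) value.
no_missing = data_csv.dropna()

# Or keep the rows and fill the missing values instead, e.g. with 0
# (the demo uses the string "null"; any value works).
filled = data_csv.fillna(0)

# Remove rows that are exact copies of another row.
deduplicated = data_csv.drop_duplicates()

print(data_csv.shape, no_missing.shape, deduplicated.shape)
```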
The last thing we will look into in this demo is how to access certain rows in a data frame depending on their index type. Most of the time you will get a data frame that has an integer index, and that is also what we have in our data frame: as you can see, the index is 0, 1, 2, 3, 4, so it is in integer format. But there are also occasions when you will get a data frame with an index of string type, so instead of the indices 0, 1, 2, 3 you will see A, B, C or A1, A2, A3, and so on. Depending on the nature of the index you are dealing with, you can use either the iloc or the loc functionality in Python to access different rows in the data frame, and by the way, this is a common question you can expect during programming-related data science interviews: what is the difference between iloc and loc, and how do you use them?

Let's start with iloc, as our data frame already has an integer-based index, and let's say we want to access the data at index 10. We take the name of the data frame, data_csv, then .iloc, and then in square brackets we specify the row we want to access, which is 10. Let's print this to see whether we get the correct data, and let's verify it by looking at the year: the years are in incremental order and we do not have duplicate years, so if we see that the year is equal to 1980, we have selected the right data and accessed the right row. This is the output, and as you can see the year is equal to 1980, and this is all the information stored in the row at index 10. In this way you can access any row you want in your data frame: the first row, the last row, or a row somewhere in the middle.

If you want to access a specific column instead of a specific row, you can use the loc function. With data.loc you can provide two arguments instead of one: we always have the rows first, and then the columns. Before, we accessed a specific row, so we specified only that index; if we want to access a specific column, we also need to say which rows to include, so we specify both the row selection and the column. Working with a small data frame whose index consists of X, Y, and Z and whose columns are A1, A2, and A3 (we will create it in a moment), let's say I want to access column A2: I specify A2 as the column and a colon for the rows, meaning I want all the rows. Looking at the output, we get 4, 5, and 6, the values corresponding to the indices X, Y, and Z, and this is indeed column A2. In this way you can specify not only the rows you want to access but also the columns. If you only want a specific value, say the second row and the second column, then, as you have probably guessed, you specify the index of the row and the column you want: the row is Y and the column is A2, and as you can see it returns 5, which is exactly the value I was after.

Let's also look at the case when we are dealing with a string-based index. For that, let's create the small data frame I mentioned: as you can see, it has the indices X, Y, and Z. Let's say we want to access the data stored under index X, the very first row. Once again we take the name of the data frame, but instead of iloc this time we use loc, followed by the index name, which is X. Let's see the output: the first value is 1, the second value is 4, and the last value is 7. So by using the loc functionality we can access a row in a data frame whose index is of string type. If you use iloc here instead, you will get an error, because iloc does not let you look things up when your index label is a string; it works with integer positions, and for those cases you always need to use loc instead of iloc.
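A sketch of both access patterns; the small string-indexed frame is reconstructed from the values printed in the demo (rows X, Y, Z and columns A1, A2, A3), so treat those numbers as assumptions:

```python
import pandas as pd

# Integer-based index: iloc selects rows by position.
data_csv = pd.read_csv("percent-bachelors-degrees-women-usa.csv")
print(data_csv.iloc[10])        # the row at position 10 (year 1980 in this file)

# String-based index: a small reconstruction of the demo data frame.
data = pd.DataFrame(
    {"A1": [1, 2, 3], "A2": [4, 5, 6], "A3": [7, 8, 9]},
    index=["X", "Y", "Z"],
)
print(data.loc[:, "A2"])        # whole column A2 -> 4, 5, 6
print(data.loc["Y", "A2"])      # single cell -> 5
print(data.loc["X"])            # whole row X -> 1, 4, 7
# data.iloc["X"] would raise an error: iloc only accepts integer positions.
```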
Hi there and welcome back to another demo. In this demo we are going to talk about three very important tasks you will perform as part of your data analysis and data manipulation toolkit: filtering, sorting, and grouping in Python. Data analysis and manipulation involve working with large amounts of data, something you can definitely expect in your data science projects, and what you need to do is extract meaningful insights. In this context, filtering, grouping, and sorting are really important techniques that allow us to organize, extract, and analyze data efficiently, and Python has a very powerful library called pandas, which we also saw in the libraries demo, that lets us perform grouping, filtering, and sorting in a very simple way.

When it comes to grouping, grouping data helps us analyze and summarize information based on specific criteria. We need at least one variable to group on, but you can also add other variables so that you aggregate your data not on one variable but on several. Let's say you want to group your data by gender or by region and then perform some descriptive-statistics calculations: you can calculate the mean, the standard deviation, the variance, the median, the mode, the minimum, the maximum, you get the idea. You categorize your data into groups, and in this way you can obtain meaningful information and analyze your data; this is usually a very important part of your data preparation process.

Another thing we are going to talk about is how to filter data. Filtering helps you extract a subset of your data based on a specific condition. Let's say you know a specific year you are interested in, or a specific region, or specific characteristics; then you can use filtering to select only a subset of observations from your data and perform all your analysis, calculations, and model training on that subset. It can also be that you want to identify the outliers or the noise in your data: you want to find the 99th percentile or the first percentile, the largest or the smallest observations, and remove them from your data so that you are not dealing with a problem like overfitting, as an example.

The last thing we are going to learn in this demo is sorting. Sorting data helps you organize information in a specific order, ascending or descending. It can help you visualize your data, identify certain patterns, and find extremes or outliers; it is sometimes used as part of time-series analysis, or, for instance, to look into sales performance and identify the worst-performing or best-performing shops, and it is an essential part of re-ranking and recommender systems.
Whenever you are dealing with search engines, recommender systems, or anything that relates to order and importance, sorting data comes in really handy, because you want to show the best and most important information to your customers, and the way you do that is by sorting your data. As for ascending versus descending: ascending refers to the case where the smallest values are at the top, the values then increase, and the largest observations are at the bottom; with descending order the largest values are at the top, the values decrease, and the smallest values are at the bottom.

Here I created a very simple data frame using the pandas library with four different columns: the name, the age, the salary, and the department of an employee, and as you can see we are dealing with eight observations. The name and the department are of string type, and the age and salary are of integer type. What we want to do is sort our data frame, named data, by salary, such that at the top we have the employees with the smallest salary and at the bottom the employees with the largest salary. So what we expect is that this person, Sana, at the age of 19 and from the operations department, should be at the top, because she earns the least, and the highest-earning person, Bob, at the age of 20, should be at the very bottom, because we first want to sort our data by salary in ascending order. For this we can use the built-in pandas function sort_values, so we write data.sort_values.
As you can see, we already get a recommendation for this function from PyCharm. Here you need to specify which variable you are sorting by, and then whether the sort is ascending or descending. The reason we need to specify the variable name is that we could also have sorted the data by age, but instead we want to sort by salary so that the highest earners end up at the bottom and the lowest earners at the top; therefore we specify by equal to salary. The other argument of this function, ascending, is a boolean argument: when ascending is equal to True, which is the default value, we are sorting the data frame in ascending order, and if we want the data sorted in descending order we change this value, so the ascending argument should be equal to False.

Let's see this in the implementation. We specify by equal to salary, and then we mention ascending, the second argument: it is a boolean-valued argument, it can only take the values True and False, and the default is True, which means we are sorting in ascending order. I could also have skipped this argument; as you will see in a moment, we get exactly the same result whether we write ascending equal to True or leave it out, simply because the default value is True. So let's remove it in one case, keep it in the other, and check that both result in the same output; let's also add an empty line in between so everything prints nicely. Here we go: as expected, we get Sana at the top and Bob at the bottom, and this is how you can verify that the sorting happened successfully and in ascending order, because the salary is increasing, with the lowest salary at the top and the highest at the bottom. The other thing you can see is that we get exactly the same result when writing ascending equal to True: certain arguments have default values in Python, and whenever you want the default value you can skip that argument, which is why this line results in the same output as the other. But if we do want to change the order, so we want a non-default value for that argument, then we need to write it explicitly: if we want to sort this data by salary in descending order, we use the ascending argument and change True to False, because ascending equal to False means descending equal to True. Hope this makes sense. Let's print this: here we go, and we now have the exact opposite of what we had before, with the highest earner, Bob, age 20, with a salary of 220K from the tech department at the top, the salaries decreasing accordingly, and at the very bottom Sana, age 19, with a salary of 10K from the operations department, because she earns the least. This is all you need to keep in mind about sorting: the ascending versus descending behaviour and how you can use the parameters of this function to sort your data accordingly.
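A small sketch of sort_values; the employees data frame is assumed for illustration (only a few of the eight rows, with names and figures loosely following the ones mentioned in the demo):

```python
import pandas as pd

# Assumed example data, loosely based on the employees mentioned in the demo.
data = pd.DataFrame({
    "name":       ["Bob",   "Sana",       "Ellis",      "Anna"],
    "age":        [20,      19,           65,           30],
    "salary":     [220_000, 10_000,       170_000,      95_000],
    "department": ["Tech",  "Operations", "Healthcare", "Tech"],
})

# Ascending sort on salary; ascending=True is the default, so both calls match.
print(data.sort_values(by="salary", ascending=True))
print(data.sort_values(by="salary"))

# Descending sort: the highest earner ends up at the top.
print(data.sort_values(by="salary", ascending=False))
```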
Another thing we are going to learn today is how to group your data. Let's say we want to group our data by department: we want to know the number of employees per department, or perhaps the average salary per department. Let's learn how to do that in Python.

First, let's count the number of people in each department. We take the name of the data frame, data, then a dot, and then we use the function called groupby, a pandas function we can use to group our data. Within groupby we specify the variable we are grouping by, and as we want the number of employees per department, we are aggregating the data by department, so I write department as the grouping variable. The next thing is another dot and then the operation we are performing; in this case we are counting the number of employees, which means I can use the salary column, or the age, or the name, to obtain the number of observations per department, so let's use count. By the way, we can even skip selecting a column and simply count how many times each department name appears, because that also gives the number of employees per department; let me show you what I mean and print it. Here we go: as the index of this new data frame we get the names of the departments, the variable we grouped by, so Healthcare, Operations, and Tech, and what you get is the number of times each department appears, computed for each of the variables name, age, and salary, so you can see the name column first, then age, then salary. If we instead specify the exact column we want to aggregate, say name, then the aggregation is done only on that variable: we get a new result saying that the Healthcare department has only one person, Operations has two, and Tech has five. If you count in the original data you can verify that: Tech appears five times, Healthcare once, and Operations twice, exactly what we got here. So you can count how many times each group appears based on the variable you have chosen, and you can either select a specific column for the aggregation or leave that out, in which case the operation is applied to all the columns: the name, the age, and the salary.
Now let's calculate the average salary per department. For that I change the column on which I want to do the aggregation, and instead of the function count I use the function mean; the mean is the same as the average, so this calculates the mean salary per department. As you can see, the Healthcare department has an average salary of 170K, Operations has 20K, and Tech has 113K. One way to verify that this is correct is to look at Healthcare, which is easy because Healthcare has only one observation, and the average of one observation is that observation itself: the only person from the Healthcare department is the person named Ellis, with a salary of 170K, exactly the number we got here.

We could do the same calculation with the minimum instead of the average: let's try it, and we see we get an error because we forgot a parenthesis; after fixing that, we get not the average but the minimum salary per department. You could also compute the maximum per department, or the minimum or average age per department to see the age groups. For that you change salary to the variable called age, because we want to aggregate on age, and change the function accordingly, and in this way we can calculate the average age per department: in the Tech department people are on average 41 years old, in Operations they are on average 22 years old, and in Healthcare they are around 65 years old. And this is how you can do grouping in Python.
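A sketch of those groupby aggregations on the same assumed employees data frame:

```python
import pandas as pd

# Assumed example data, as in the sorting sketch.
data = pd.DataFrame({
    "name":       ["Bob",   "Sana",       "Ellis",      "Anna"],
    "age":        [20,      19,           65,           30],
    "salary":     [220_000, 10_000,       170_000,      95_000],
    "department": ["Tech",  "Operations", "Healthcare", "Tech"],
})

# Number of employees per department: count over every column, or one column.
print(data.groupby("department").count())
print(data.groupby("department")["name"].count())

# Average, minimum and maximum salary per department.
print(data.groupby("department")["salary"].mean())
print(data.groupby("department")["salary"].min())
print(data.groupby("department")["salary"].max())

# Average age per department.
print(data.groupby("department")["age"].mean())
```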
The final part of this demo is filtering: how we can use filtering in Python to select observations based on specific criteria. Let's say we want to keep only the people in this data frame whose salary is higher than a certain threshold, say larger than 100K. For that, as always, we write the name of the data frame that should be filtered, and then we add square brackets; this part, the data frame name followed by square brackets, says: look into this data frame and keep only specific observations. Inside the square brackets we put the actual condition an observation should satisfy in order to be kept and not filtered out, and for that we specify the column the condition is based on: data, square brackets, and then the name of the variable we are filtering on, which is salary, and we say that the salary should be larger than 100,000. What we are telling Python is: look into the data frame data, look into its salary column, select all the observations for which the salary is larger than 100,000, and return only the data corresponding to those observations. Let's check whether this does everything correctly. Here we go: we remove all the observations that do not have a salary higher than 100,000 and keep only the ones above the threshold. We no longer have, for instance, Anna in the data frame, we no longer have Karan, or the fifth observation, which corresponds to Kevin, and Sana is not included either; only the people whose salary is larger than 100K remain.

Let's say we want to add an extra condition: we not only want people whose salary is larger than 100,000, but we also want to filter out people whose salary is too high, for instance larger than 200K. In this case we expect Bob to be removed from this data frame; with the current filtering Bob is still included, but if we also remove people whose salary is larger than 200K, Bob should disappear. To do multiple filtering we use the AND operation, because we want both conditions to be satisfied, and we need to wrap each condition in parentheses. I add the second condition, once again based on the variable salary, saying that the salary should be smaller than 200,000, and join the conditions with the & symbol, which stands for AND. As you can see, we no longer get Bob's information, because Bob has a salary above 200K. So you can specify not just one but multiple conditions that an observation should satisfy in order to be kept in this new data frame.

The final thing we will look into is how to filter your data not on larger or smaller, but on specific values. Let's say you are interested in the data of people whose age is equal to 65 or 20. You can no longer simply say that the age should be larger or smaller than a certain value, or even if you did, it would be much more complicated than just telling Python to select all the observations for which the age equals these specific values, and in those cases the isin functionality comes in very handy. Once again we take the name of the data frame, then the name of the variable we need to look into, which is age, and then we write .isin. What this function does is look for specific values of age and keep only the observations that match those values. Here I said I want information only for people with an age of 65 or 20, so I put those two values inside a list, because there is not one value but two. Here we go: we now get only the data for people whose age is equal to 65 or 20. This is how you can use the isin functionality to filter for specific values in your data, and it becomes even more handy when you are dealing with string-type variables, where you cannot say that a column's value should be larger or smaller than something; in those situations isin can be really useful, so it is really worth knowing how to do filtering based on specific values.
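A sketch of those three filters on the same assumed employees data frame:

```python
import pandas as pd

# Assumed example data, as in the previous sketches.
data = pd.DataFrame({
    "name":       ["Bob",   "Sana",       "Ellis",      "Anna"],
    "age":        [20,      19,           65,           30],
    "salary":     [220_000, 10_000,       170_000,      95_000],
    "department": ["Tech",  "Operations", "Healthcare", "Tech"],
})

# Keep only the rows whose salary is above 100,000.
high_earners = data[data["salary"] > 100_000]

# Combine conditions with & (AND); each condition sits in its own parentheses.
mid_earners = data[(data["salary"] > 100_000) & (data["salary"] < 200_000)]

# Filter on exact values with isin: people aged 65 or 20.
selected_ages = data[data["age"].isin([65, 20])]

print(high_earners)
print(mid_earners)
print(selected_ages)
```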
This is all for this demo, where we learned about grouping, filtering, and sorting in Python. Stay tuned, and I will see you in the next demo.

Hi there and welcome to another demo. In this demo we are going to talk about descriptive statistics: we are going to learn how to calculate the mean, the standard deviation, the variance, the mode, the median, and different percentiles and quantiles for arrays, and we are also going to learn how to get the descriptive statistics table for a pandas data frame. Descriptive statistics play a very important role in analyzing and summarizing your data; here are a few reasons why I believe knowing how to calculate them, and how to use them, is important.

First of all, descriptive statistics help us summarize our data. They provide a concise summary of the main features of your data set and help you understand the data by providing measures such as the mean, the dispersion (the variance), and the shape of the distribution. Summarizing your data helps you gain insight in an efficient way, making it much easier to interpret and to communicate to other people in presentations whenever you want to explain something about your data. It is an essential part of every data science and data analytics project: the first thing you do is obtain the descriptive statistics of your data and present them to your stakeholders in order to tell a story about your data.

Another thing you can do with descriptive statistics is data exploration. Descriptive statistics are a good starting point for exploring and understanding your data: they give an overview of the data distribution, help you identify patterns, help you identify outliers by looking at the mean as well as the minimum and maximum of your variables, and help you spot potential data issues such as outliers, noise, and missing values. They also help you understand what types of variables you are dealing with, because the output tells you whether a variable is a categorical string variable or a numeric one, a floating-point number or an integer, and so on, which helps you investigate your data further and understand what your next steps should be to clean and prepare the data in the best possible way for your machine learning model.

Another thing you can do with descriptive statistics is check whether the sample you have drawn from the main population is a good representation of that population. As part of the fundamental statistics section of this course we learned about the difference between sample and population, and how we randomly sample a small part of the entire population in order to draw conclusions about the population; the requirement is that your sample should be a true, unbiased representation of the population. Using descriptive statistics you can compare, for instance, the mean and the variance of your sample to those of the actual population in order to understand whether you have a good sample or whether you need to go back and sample again to get a better one.
Another thing you can do is visualize your data: using descriptive statistics you can present your data to different stakeholders with histograms, box plots, bar charts, or pie charts, representing it in a clear and understandable way.

So without further ado, let's learn how to calculate different statistics for an array. Here I have created an array consisting of the following numbers: 100, 20, 5, 20, 45, -100, and 46. The first thing we are going to calculate is the mean. The mean is the average of a set of numbers, in this case the mean of all these numbers, and it is calculated by summing all the numbers in the set and dividing by the number of observations. The length of this array is seven, which means we sum all the values and divide by seven, and that is our mean, or average. We can do this using the NumPy library: np.mean is the function that calculates the average, and within the parentheses we specify the variable, the array, whose average we want. Let's calculate it and see what the value is; as you can see, the mean of this array is printed, around 19.4.

The next thing we are going to learn is how to calculate the median. Sometimes the distributions we deal with are skewed, left-skewed or right-skewed, and in those cases calculating the mean might not be the smartest thing to do; it is better to calculate the median, because the median is then a better representation of the overall data than the mean. It is usually handy to calculate both, and whenever the mean differs a lot from the median, it suggests you are dealing with a skewed distribution. The median is the middle value in a set of numbers arranged in ascending or descending order; if there is an even number of values, the median is the average of the two middle elements. The median is also the second quartile, or the 50th percentile, in statistical terminology. Let's calculate the median of this array: median_ is equal to np.median, a very straightforward function, as the recommendation already shows, and let's print it. Here we go: the median of this array is equal to 20. One thing we can also see is that the mean is very close to the median, indicating that we most likely are not dealing with a skewed distribution.

The next thing we are going to learn is how to calculate the mode. The mode is the value, or values, in the provided set of numbers that occurs the most. In our data array there is a single value that appears the most, two times, and that is the value 20, so that is the mode. To calculate the mode of our array we are going to use a different library: not NumPy but SciPy, because NumPy does not contain a function for the mode. The reason is that the mode is not as popular a measure of central tendency; usually we only calculate the mean or the median of the data and we are good to go.
The next thing we are going to learn is how to calculate the mode. The mode is the value (or values) in the provided set of numbers that occurs the most. In our data array we can see that there is a single value that appears the most, two times, and that's the value 20, so this is the mode. To calculate the mode of our array we are going to use a different library: we are not going to use NumPy, and instead we are going to use SciPy. The reason for this is that NumPy does not contain the corresponding functionality to obtain the mode, because the mode is not a popular measure of central tendency; usually we only calculate the median or the mean of the data and we are good to go. But if you want to calculate the mode, it's still useful to know how to use the corresponding library to do that. So let's go ahead and import from SciPy the library called stats, and from there we will use the stats.mode function to calculate the mode of our array. As you can see, the output is slightly different, but the idea is the same: the mode of this array is the value 20 and it appears two times. So we are getting two outputs: the actual value, so the value that appears the most, and also how many times it appears, the count, which is equal to two because we got two of those 20s in our data.
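As a minimal sketch of this step, assuming the same reconstructed array as above:

```python
import numpy as np
from scipy import stats  # NumPy has no mode function, so we use SciPy

data = np.array([100, 20, 5, 20, 45, -100, 46])

mode_ = stats.mode(data)
print(mode_)  # the value 20 with a count of 2
# (depending on your SciPy version, the value and count may be wrapped in arrays)
```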
The next thing we are going to learn is how to calculate the variance and the standard deviation. Variance measures the spread, or the dispersion, of the data: it quantifies how far your numbers are from the mean. A higher variance indicates greater variability in your data and a lower variance indicates smaller variability in your data. The standard deviation is highly related to the variance; the two basically explain the same thing, only the standard deviation is the square root of the variance, and it is at the level of your numbers, in the same units, which makes it preferable when it comes to interpreting the results. It is the square root of the variance, as I mentioned, and it provides a standard way to explain and understand the average distance between each data point and the mean. Therefore it's almost always preferred to use the standard deviation when you are explaining your data and how much variability there is in it. So let's go ahead and use the NumPy library once again in order to calculate the variance and the standard deviation of this data: variance_ is equal to np.var of data, and the standard deviation, sd_, is equal to np.std of data. Let's go ahead and print them; here we go, as you can see we are getting the variance of our data and the standard deviation of our array. All right, so this was about calculating various statistics given an array. Another thing that is worth knowing is how to get the descriptive statistics whenever you are dealing with a DataFrame. Let's go ahead and bring in the data that we used previously: you might recall the data set describing the percentage of women that completed certain bachelor studies. Let's also import the pandas library, as we need it to load this DataFrame as well as to compute the descriptive statistics. This is how the data looks, and most of the time, whenever we are dealing with a pandas DataFrame, we want to get a nice descriptive statistics table that describes our data. We can simply do that by using a nice piece of functionality in pandas: we need to specify the name of the DataFrame and then .describe(), and this will go ahead and print for us the descriptive statistics of this DataFrame. Let's go ahead and print that. Here we go, this is how the descriptive statistics table looks for a pandas DataFrame. We are getting the count, so the number of observations per variable; we are getting the mean, which we just saw for an array, now per column in our DataFrame; then we have std, which stands for the standard deviation; then we have min, which is the minimum value per column in our DataFrame; then we have the 25th percentile, which is the value below which the lowest 25 percent of your column falls, basically the first quartile from the statistical point of view; then we have the 50th percentile, which is the median, what we also just saw when we were calculating the median of the array, and the 50th percentile is also the second quartile in statistical terminology; then we have the 75th percentile, which is the third quartile; and then we have the maximum corresponding to that column. As you can see, this is a great way to summarize your data. You can look, for instance, at the year, and you can see that the minimum of the year is 1970 and the maximum is 2011, which means that you have data spanning from 1970 till 2011. Then you have the mean, which in the case of the year is not really meaningful, but when you look at other columns, for instance architecture, which describes the percentage of women that completed the study of architecture across different years, you can see that across the years, spanning from 1970 till 2011, on average about 34% of those who completed this study were women. In this way you can tell a story about your data, and you can also identify some problems in your data.
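A short sketch pulling these steps together; the toy DataFrame here is a made-up stand-in for the bachelor-degrees data set used in the demo.

```python
import numpy as np
import pandas as pd

data = np.array([100, 20, 5, 20, 45, -100, 46])  # reconstructed demo array

variance_ = np.var(data)  # average squared distance from the mean
sd_ = np.std(data)        # square root of the variance, in the units of the data
print("variance:", variance_, "standard deviation:", sd_)

# describe() builds the whole table: count, mean, std, min, 25%, 50%, 75%, max.
# These numbers are illustrative only, not the real data set from the demo.
df = pd.DataFrame({
    "Year": [1970, 1990, 2011],
    "Architecture": [11.9, 36.0, 42.8],
})
print(df.describe())
```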
So this is all for this demo, where we learned how to calculate different statistics for an array as well as how to get the descriptive statistics table whenever we are dealing with a pandas DataFrame. This is all for this demo and I will see you in the next one. If you're looking for machine learning, deep learning, data science, or AI resources, then check out the free resources section on the LunarTech website or our YouTube channel, where you can find more content and you can dive into machine learning and AI. Hi there and welcome to another demo. In this demo we are going to learn how we can combine, so how we can merge, different tables that we have in our database, because most often we get our data not in one file but in multiple files, and sometimes we need to do some preprocessing, some filtering, and then at the end we need to join multiple tables together in order to end up with a single table that we can use in our analysis, our data visualization, or our machine learning training process, where we will train our model on a single data set. For that you need to know what all the different joins out there are, what the possible combinations are, and how you can do each of them in Python. In this picture you can see the most popular joins: we have a left join, an inner join, a right join, a left anti join, and a right anti join. Here we will be looking at two tables: a table X, which will contain certain features and certain observations, and a table Y, which will be a different table with different observations and different features. Our goal is to merge the two tables in different ways, and the way we do that really depends on what kind of join we want. So let's go through each of these joins one by one. We will go into the definitions, and we will assume that there is one key, a key identifier present both in table X and table Y, that we can use in order to find out whether a certain observation exists in X and in Y or not, because whenever we are trying to merge two tables we need at least one key identifier to do the merge on. Let's say we have a table containing the sales of a shop and a table containing the customers of the shop: we at least need to know the identifier of the shop in order to say that this customer in this shop has bought this item, and that this shop had the corresponding sales. In this way the shop identifier will be the key identifier used to merge the sales data with the customer data based on the shop. Using this idea we will look into the different joins. First we will look into the inner join. By definition, an inner join returns only the matching rows, so what you see here as the intersection between the two tables, based on a common column, which will be the key identifier. If we have certain observations that are in table X but not in table Y (all the observations not highlighted), and certain observations that are in table Y but not in table X (the part of Y that is not highlighted), then all these observations will not be included in the final joined table, and we will only end up with the observations that are in the intersection, so they appear both in X and in Y. In terms of the example that I just mentioned, it means that we will only be keeping the data for the shops for which we have both the sales information and the customer purchase information; the result will include only the rows where the key values are present in both table X and table Y. Let's now look into the left join. Here you can see the left join, and the definition of the left join is that a left join returns all the rows from the left
table and the matching rows from the right table, in this case table Y, based on the common column, so the key identifier. If there are certain observations in table Y that we do not have information about in table X, then those observations will not be included in the final output; instead, all the observations in X, independent of whether they are in table Y or not, will be included. So basically we are selecting all of this part of the two tables, and this will be our output. The result will include all rows from the left table, so table X, and the matching rows from the right table, table Y, and if there is no match it includes NaN values for the columns of the right table. So you will see some NaN values appearing in your end result, because there will be cases where an observation contains information from table X but not from table Y. Another interesting join to look into is the right join. The right join is basically the exact opposite of the left join: a right join returns all the rows from the right table, so table Y, and the matching rows from the left table, so table X, based on the common column. So those are all the matching rows, plus all the rows that are only in table Y, and this will be the output of our table. Then we have the left anti join and the right anti join. The left anti join is close to the left join, but it is basically a derivation from the left join and the inner join. Unlike the left join, where we were including both the matching observations and the observations that were only present in X, in the case of the left anti join the output will be only the observations which are not matching and which are only in X, so this highlighted part. By definition, a left anti join returns all the rows from the left table, so table X, that do not have a match in the right table, based on a common column, so it will be only this part. The right anti join is the exact opposite of the left anti join: by definition, a right anti join returns all the rows from the right table, so table Y, that do not have a match in the left table X, based on a common column. Let's say we are looking at the example of a shop for which we have the sales information and the customer purchase information. If we only want to have the customer information for the shops that do not appear in the sales data, so the sales data is table X and the customer purchase data per shop is table Y, and we want to have only the shops and their customer purchase information for which we do not have the corresponding sales data, then in those cases you can use the right anti join. It might not be that reasonable for this specific example, but sometimes, when we are working with tables in our database, using the left anti join and the right anti join can be handy, and therefore it's worth knowing how to do that in Python. So here we have two different small DataFrames that I created in Python, data one and data two. In the first DataFrame we have in total seven observations, with the keys a, b, c, d, e, f, g, and in the second one we have the keys c, d, e, f, g, h, where the corresponding indices are from 8 to 13, whereas for data one the indices are from 1 to 7. Note that we do not have any intersecting indices, because we are going to merge this data and we want to avoid
the case when we have a different value corresponding to the same index. So without further ado, let's first learn how to do an inner join between these two DataFrames, but before that let's actually look at them. Let's print data one and data two, and as you can see we are getting our two DataFrames: this is data one and this is data two. What we are going to do is learn how to do an inner join, so how to do a merge where our way of joining is the inner join. For that, let's first name the corresponding DataFrame that we want to create, so merge and then inner join, and this will be equal to the pandas merge function. Whenever we are using this, we first need to specify the DataFrames that we want to merge, in our case data one and then data two, and then we need to use the argument on to specify the key, so the identifier that we are going to use in order to merge the two DataFrames. In this case it is the key, because it makes sense to join the DataFrames on a variable that the two tables have in common, and as you can see certain rows, for instance the letters c, d, e, and f, appear in both data one and data two, and as we have only two variables, only one of them makes sense to be used as an identifier. So here we will use the key as the variable on which we are going to do the join, and then we need to specify the exact way we want to do our join: the inner join has the corresponding parameter value inner, so whenever you write inner as the value for the argument how, pandas understands that you want to do an inner join between DataFrame one and DataFrame two. Let's go ahead and actually print this, so merge and then inner join, and before printing the output let's understand what the expected output is. There are common keys that appear both in data one and data two, and those are the rows corresponding to the keys c, d, e, f, and g, so we expect those keys that appear in both data one and data two, and their corresponding values, to be in the inner join. Let's go ahead and print it; here we go. As you can see, as expected, we are getting the value one and value two entries corresponding to the keys c, d, e, f, g in this output, because those are the ones that appear both in table data one and in table data two.
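A minimal sketch of the inner join described above; the two DataFrames are toy stand-ins whose keys mirror the demo.

```python
import pandas as pd

data1 = pd.DataFrame({"key": list("abcdefg"), "value1": [1, 2, 3, 4, 5, 6, 7]})
data2 = pd.DataFrame({"key": list("cdefgh"), "value2": [8, 9, 10, 11, 12, 13]})

# Inner join: keep only the keys present in both DataFrames (c, d, e, f, g).
merged_inner_join = pd.merge(data1, data2, on="key", how="inner")
print(merged_inner_join)
```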
Another thing that we are going to learn is how to do a left join. As we learned, a left join will give all the keys that appear in the first table, the left table, as well as the matching values that appear both in the left table and in the right table. What we need to do for that is basically the same call, so left join; let's rename it to avoid confusion, and then the only thing we are going to change is the how argument to left. Let's go ahead and print this, and this is our left join. As you can see, we are getting all the keys that we got in data one, and for the keys that also appear in data two, so the keys c, d, e, f, g, you can see that the matching value two entries, corresponding to 8, 9, 10, 11, and 12, appear here as well. But the other keys in data two, for which we do not have matching observations in table data one, do not appear here, and here you can see that we have NaN corresponding to the keys a and b, because those keys do not appear in the second table, the right table, data two. So this is the output of the left join: as you can see, the columns that come only from the right table end up getting some NaN values, but all the fields and observations that were included in the first table, data one, do appear in full here. Now let's go ahead and do our right join. As we just learned, the right join is basically the opposite of the left join, which means that we can expect to get all the rows from data two, the right table, but some of the rows that were in data one and not in data two will not appear; we will get only the matching ones and the ones that are present in data two. So let's change the value of the argument how to right, and we should see some NaN values in value one, because value one is a variable appearing only in data one. As expected, here we are getting all the matching keys that are present in both data one and data two, as well as some NaN values corresponding to value one, because this row does not appear in data one but is present in data two, and then all the rows that are present in data two are here.
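The same sketch extended to the left and right joins; only the how argument changes.

```python
import pandas as pd

data1 = pd.DataFrame({"key": list("abcdefg"), "value1": [1, 2, 3, 4, 5, 6, 7]})
data2 = pd.DataFrame({"key": list("cdefgh"), "value2": [8, 9, 10, 11, 12, 13]})

# Left join: every row of data1; value2 becomes NaN for the keys a and b.
merged_left_join = pd.merge(data1, data2, on="key", how="left")

# Right join: every row of data2; value1 becomes NaN for the key h.
merged_right_join = pd.merge(data1, data2, on="key", how="right")

print(merged_left_join)
print(merged_right_join)
```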
Finally, let's learn how to do the left anti join and the right anti join. I will leave the right anti join to you, but the idea is the same, as conceptually they are very similar. In the case of the left anti join we need to go the extra mile and do some extra steps. The first thing we are going to do is a left join, and then from the left join we want to remove the intersection part. If you bring back the diagram that we just saw, you might recall that in the case of the left anti join we are doing something very similar to the left join, only instead of also choosing the matching observations that appear in both the left and right tables, we are selecting only the observations that appear in the left table and not the matching observations, so we are removing the intersection part. Therefore, what we are first going to do is the left join: let's call it merge left anti, equal to pd.merge, the same function, and here we specify the left table and then the right table, then once again the on argument, where we specify the variable on which we are doing the join, and then how we are going to join, which is equal to left, because to get a left anti join we first want to do a left join. This time we also want to save the indicator that comes as a result of the join. By default the indicator argument is set to False, and we are going to set it to True. What this indicator does is show whether an observation belongs to the left table only or to the matching part, so it shows whether the observation was in table data one only, or whether it belongs to both data one and data two, so it is in the intersection part. Using this indicator, and the classification that comes from it, we can then identify all the observations that were part of the intersection, remove them, and end up with all the observations that belonged only to the left table, data one, which is exactly the point of the left anti join. Let's go ahead and print the output of this table to show you what the indicator does. As you can see, now, per observation, besides the left join of the two tables we are also getting this _merge column, which says whether the observation belongs only to the left part or to both parts, so whether it is in the intersection zone. As you can see, we got two observations, with the keys a and b, that belong only to table data one; those are the observations that we want to keep, and we want to remove all the observations from the intersection zone, so the keys c, d, e, f, and g. The next thing we are going to do is to define our left anti join data: let's keep the name merge left anti, since that makes more sense, and here we are going to apply the filtering that we learned previously. We are going to look into the variable _merge that just came from the indicator, and look at all the cases where this variable is equal to left_only, and this will keep only the observations that appear in data one. Let's go ahead and print this, and you will see that we end up keeping only the observations for which _merge is equal to left_only. Here we go. Of course, we don't want to keep this _merge column anymore, because we have already used it and there is no purpose in keeping it, so what I will do is drop it. Let's do it in a new line: I will put the name of the DataFrame, then .drop, and then '_merge', because this is the variable that I want to drop, and I need to specify the axis. As it's a column I want to remove, I need to specify that the axis should be equal to 1: in pandas, axis equal to 0 means rows and axis equal to 1 means columns, and since _merge is a column I'm setting axis equal to 1. Let's also go ahead and overwrite this DataFrame, as I want to keep just one copy of it, and this is the output. Here we go, this is our left anti join.
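A sketch of the three left anti join steps just described, on the same toy DataFrames:

```python
import pandas as pd

data1 = pd.DataFrame({"key": list("abcdefg"), "value1": [1, 2, 3, 4, 5, 6, 7]})
data2 = pd.DataFrame({"key": list("cdefgh"), "value2": [8, 9, 10, 11, 12, 13]})

# Step 1: left join with indicator=True, which adds a _merge column telling us
# whether each row comes from the left table only ("left_only") or from both ("both").
merged_left_anti = pd.merge(data1, data2, on="key", how="left", indicator=True)

# Step 2: keep only the rows that exist in data1 but not in data2.
merged_left_anti = merged_left_anti[merged_left_anti["_merge"] == "left_only"]

# Step 3: drop the helper _merge column (axis=1 means we drop a column, not a row).
merged_left_anti = merged_left_anti.drop("_merge", axis=1)

print(merged_left_anti)  # only the keys a and b remain
```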
As you can see, it is a bit more complicated than the left join or the right join, but I think it's worth knowing how to do it, because sometimes it can be very useful in practice. As I said, I will leave the right anti join to you, and this actually completes our demo for today, where we learned how to do a left join, inner join, right join, and also a left anti join. This is all, and I will see you in the next demo. Hi there and welcome back to another demo. In this demo we are going to learn how to perform data visualization with Matplotlib in Python. Data visualization is a very important technique for gaining insights from your data and for effectively communicating your findings to your audience, whether you are presenting to your stakeholders or putting it in your case study or your paper. It's really important to know how to plot those visuals using Python, because it's a simple way to go from your data analysis to your data visualization, which we sometimes call exploratory data analysis, or EDA. EDA can be an essential part of a case study, to showcase your data and to find some correlations; it can also be a stepping stone towards the next step in your case study, whether that's a closer analysis or modeling. It can help you to identify features that explain your dependent variable, to identify unimportant features, or to identify noise in your data. So it's really important for you to know how to make those visuals. The first type of visualization we are going to learn is line plots. Line plots are a great way to visualize trends or patterns in the data, and they are a great way to visualize time series, so whenever you are dealing with a graph where the x-axis is in the form of time and the y-axis holds the values that evolve over time. This can be, for instance, stock prices or stock returns, or the ROA of a company; you get the idea. For that, what I have here is a set of x values and y values that I created in the form of arrays, and what we're going to do now is plot these arrays. Therefore I've imported here the matplotlib.pyplot library; pyplot is a module in the Matplotlib library, and as a way of shortening the name of this library we call it plt, similar to the idea of using pd for pandas and np for NumPy. Let's go ahead and use the library: it is plt. and then plot, and here we need to specify first the x values and then the y values, so x_values and then y_values. The idea is that for each specific x we need to have the corresponding y: if your x array has a different length than your y array, which means that for certain x's you don't have the corresponding y values, or the other way around, then you will get an error. The two arrays should have the same length, and for each x you need to have the corresponding y value. Let's go ahead and run this, and you will quickly see that you're not getting any output. The reason is that in Python, whenever you are using the pyplot library, you need to call plt.show such that your plot will actually be visualized. Here we go, this is the plot that we are getting: as you can see, those are the x values starting from one and ending with 10, and the y values starting with one and ending with 20. As you can see, this plot is very basic; we don't have any extra information explaining it, but we want our plots to be self-explanatory, and we don't want to add
too much information; we want our audience to look at the graph and the visuals and understand what it's about. For that you need to make use of extra functionality in pyplot to add more information to your visuals. For instance, it would be really handy to know what the x-axis represents, or what the y-axis represents, or to have a title at the top of the visual saying what this graph is about. All of this can be done using plt: here we can say plt.xlabel, and in this way we can add a text to our x-axis saying, for instance, what variable the x-axis represents, so 'X axis placeholder'. The same we can do with the y-axis, only here we need to change x to y, so plt.ylabel. Let's say you're visualizing the time series of stock prices, and your x-axis represents the time and your y-axis represents the stock prices; in this case you can put in the x-axis placeholder that it is the date, and the y-axis placeholder can be the stock price and the name of the stock that you are looking at. Finally, we can also add a title with plt.title, and here you can put the title placeholder; for now I will put it as some text, but you can replace it with the title, let's say the stock prices of stock X over the time period X to Y. In this way you can add more information to your graph. Let's see how this looks in the actual visualization. Here we go: here you have the title, here you have the text under the x-axis, and here you have the text beside the y-axis. Another thing you can do is work with your plot and make it nicer, for instance by changing the way the line is represented, so you can go from a line to dots or dashes, or you may want to change the color of your plot. This one is really popular and I think it's really worth knowing: let's say your presentation is in green, so it's in a green palette, and you want your visual to match the color of your palette. What you can do for that is to use the argument color, and here you can specify that it should be green. Here we go: as you can see, the graph then changes color, the plot is in green, and this looks much more appealing compared to what we had before. All right, so this is about line plots.
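A minimal sketch of the line plot; the arrays and the placeholder texts are assumptions standing in for the ones used in the demo.

```python
import matplotlib.pyplot as plt

x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]      # placeholder x array
y_values = [1, 3, 5, 7, 9, 11, 13, 15, 17, 20]  # placeholder y array, same length

plt.plot(x_values, y_values, color="green")  # the color argument restyles the line
plt.xlabel("X axis placeholder")             # e.g. the date for a stock-price series
plt.ylabel("Y axis placeholder")             # e.g. the price of stock X
plt.title("Title placeholder")
plt.show()
```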
The next thing we are going to learn is how to plot scatter plots. Scatter plots can be really useful when we are trying to visualize the relationship between two variables. Let's say we have two features in our data and we want to understand whether there is a certain relationship between the two, whether there is a correlation, because we want to know whether we have a strong or perfect correlation between them, which is something we need to check as part of a linear regression model: we don't want two features to be perfectly multicollinear. We also sometimes want to check the relationship between a feature and the dependent variable, because we want them to be highly correlated. In all those cases we can use scatter plots as a way to identify this correlation and to see whether there is a pattern or not, and scatter plots can be super handy in exactly those cases. So let's say we have exactly the same data, and instead of the line plot we want to have a scatter plot. For that, what we need to do is plt.scatter: instead of plot we are using the scatter function, and once again we are specifying the x values and the y values; let me go ahead and repeat the rest. Here we go, as you can see now we are getting the scatter plot, and this is basically the same plot that we saw before, only instead of lines we are now getting dots. Here is a scatter plot coming from our first case study, where we were looking into what factors make a playlist successful, and this is a vivid example of a scatter plot that helps us to understand whether there is a relationship between the number of albums in the playlist versus the average active usage, and you can also see that there is a positive relationship here. In this way, using a scatter plot, you can identify whether there is a relationship between a pair of variables, whether it's two of your independent variables or one independent variable and your dependent variable.
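The same data plotted as a scatter plot, a minimal sketch with assumed labels:

```python
import matplotlib.pyplot as plt

x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_values = [1, 3, 5, 7, 9, 11, 13, 15, 17, 20]

plt.scatter(x_values, y_values)  # dots instead of a connected line
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Relationship between two variables")
plt.show()
```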
The next type of visualization we are going to learn is the bar chart, so how to plot bar charts in Python. Bar charts can be really useful for comparing different categorical values: if you're dealing with categorical data, where you have a certain variable with categories and the corresponding values, then you can visualize it nicely using bar charts. For this I have created some sample data, a very basic one, where the categories are the names of animals, so we have cat, dog, horse, and mouse, and then we have the corresponding values, which represent, for instance, the weight of each animal, so the weight of a cat, the weight of a dog, the weight of a horse, and the weight of a mouse, and we want to visualize this. Let's go ahead and once again use the library and the function plt.bar, which stands for bar chart, and here we need to first specify the categories and then the corresponding values, so categories and then values. Once again we need to add the labels, we need to add the title, and we need to call plt.show. Here we can set the x label to, for instance, animals, here we can add the weight of the animal, as a title here we can say weight per animal, and finally we call plt.show in order to show it. I also would like to add a color to this visualization; let's say I want the color to be forest green. By the way, if you're wondering what all the possible colors are that can be used for this color argument, what you can simply do is ask ChatGPT and search for the different colors available in pyplot and Matplotlib, and you will get the names of all the possible colors that you can use, so you can pick a nice color palette that matches your presentation or your case study. Let's go ahead and run this. As you can see, we are getting this bar chart, this nice visualization, and we can see that the cat has its corresponding weight, and then we see the dog, the horse, and the mouse. We can see, for instance, that the mouse has the smallest weight and the horse has the largest weight, and in this way you can nicely visualize categorical data. Let's now learn how to plot histograms. Histograms are useful for visualizing the distribution of numerical data. It can be that you want to visualize your population distribution, or you want to visualize your sample data and compare, for instance, your sampling distribution to the population distribution, to know how representative your sample is of your population, whether it's an unbiased and true representation of the population, which means that your sampling distribution should be close to your population distribution. For all those kinds of tests you can definitely use histograms, and by the way, this is a very common question during data science interviews, when you are asked to randomly sample from a normal distribution or from a uniform distribution and plot this distribution in Python using histograms; well, that's exactly what we are going to learn today. You might recall from a demo where we learned how to randomly generate data and create a simulated version of a data set that we used the SciPy library to randomly sample from a normal distribution, and here we are sampling from the standard normal distribution, with a mean of zero and a standard deviation of one, and we are drawing 100 observations. Now we are going to plot this distribution using histograms, and here once again I'm using plt. and then hist, and here I need to specify the values that I want to plot, and then once again I need to specify the x label. As you can see, here we are getting our sample data, and here we have the frequency, so how often we are getting the corresponding value, and we can see that the distribution of the sample is symmetric around zero, as expected, because the normal distribution is symmetric and bell shaped and it's always symmetric around its mean, and we sampled data from the standard normal distribution with a mean of zero. As you can see, the standard deviation, so how spread out the observations are from the mean, is very close to one. One thing that you will also notice is that as we increase this amount, so as we increase the sample size, making it for instance 2,000, this histogram should look more and more like the actual normal distribution. Here we go.
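A sketch of the bar chart and the histogram described above; the animal weights are made-up numbers, and the sampling here uses scipy.stats.norm, which is assumed to match the demo.

```python
import matplotlib.pyplot as plt
from scipy.stats import norm

# Bar chart for categorical data (illustrative weights in kilograms).
categories = ["cat", "dog", "horse", "mouse"]
values = [4, 20, 450, 0.02]
plt.bar(categories, values, color="forestgreen")
plt.xlabel("Animals")
plt.ylabel("Weight of the animal")
plt.title("Weight per animal")
plt.show()

# Histogram of 100 random draws from the standard normal distribution.
sample = norm.rvs(loc=0, scale=1, size=100)
plt.hist(sample)
plt.xlabel("Sample data")
plt.ylabel("Frequency")
plt.show()
```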
Let's actually go ahead and do something a bit more advanced, to show you how this looks when comparing the histogram with a plot. What I want to do here is visualize the sampling distribution, where we are generating our own sample randomly drawn from a normal distribution, and compare it to the actual population distribution, by using the norm function that comes from the library called SciPy. This one will be the so-called population distribution, and what we had before will be the sampling distribution. Let's go ahead and plot this. Here we go: in this part, as you can see, we are no longer plotting the frequencies but the actual probabilities corresponding to the sampling distribution, and in the same graph we are also visualizing the population distribution, so the actual normal distribution with the same parameters. The reason I wanted to show this is that in this way we can compare the population distribution to the sampling distribution, and one thing that you will see as you change the number of observations that you sample is that the higher the number of observations, so the sample size, the more this histogram, so the green bars, will look like the actual population distribution; this is also the entire idea behind what we call the central limit theorem. If you're wondering what this normal distribution is, what the sampling distribution and population distribution are, and what the central limit theorem is, then head towards the fundamentals of statistics section of this course to learn everything about these topics. One thing that you will notice in this specific graph is that a legend was also added to the visualization. Legends are really helpful when you are plotting not just one but two data sets in the same plot and you want to explain the difference between them. In our case we had the sampling distribution and the population distribution, and we had to specify that the bars correspond to the sampling distribution and the plot, so the line, corresponds to the population distribution. For that we use what we call a legend, and as you can see, the way I'm doing it is by using plt.legend. What I'm doing here is simply specifying the x values using the NumPy arange function, which generates values between the minimum and the maximum with the corresponding increment, and then we have the y values, which use the norm function coming from SciPy to generate the corresponding probability density values, so the probabilities. Then I'm specifying the number of bins that I want to be visualized. Here I'm using the histogram, and I'm specifying first the histogram that I want to plot, which is similar to what we had before; the only difference is that I'm specifying that we are dealing with a density, which means that it's going to plot the probabilities instead of the frequencies. Then I'm specifying the color and the label. The way the legend works is that I need to specify, per plot, the name of that plot by using this label argument, and then, once you have added all these labels and you call plt.legend, it will then, per plot, so per type of visualization, put the name of the plot somewhere in the right or left corner of the figure.
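A sketch of that combined figure, assuming the standard normal parameters from the demo:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Sampling distribution: random draws from the standard normal distribution.
sample = norm.rvs(loc=0, scale=1, size=2000)

# Population distribution: the exact normal pdf on a grid of x values.
x = np.arange(-4, 4, 0.01)

plt.hist(sample, bins=30, density=True, color="green", label="Sampling distribution")
plt.plot(x, norm.pdf(x, loc=0, scale=1), label="Population distribution")
plt.xlabel("Value")
plt.ylabel("Probability density")
plt.legend()  # uses the label arguments to annotate each plot
plt.show()
```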
And if you want to see similar visualizations, for instance how you can randomly draw observations from a Bernoulli distribution, a binomial distribution, an exponential distribution, a geometric distribution, a normal distribution, a Poisson distribution, a Student's t distribution, or a uniform distribution, and how you can visualize them using histograms, then head towards the GitHub repository that I will post as part of the resources, where you can see all those visualizations and their corresponding Python code; this is part of the mathematics, statistics, and data science GitHub repository that I created. And this actually concludes this demo, where we learned how to create different visualizations in Python using Matplotlib: we learned how to do line plots, scatter plots, bar charts, and histograms, and also how to combine, for instance, the plots and the histograms in a single visualization. This video was sponsored by LunarTech. At LunarTech we are all about making you ready for your dream job in tech, making data science and AI accessible to everyone, whether it is data science, artificial intelligence, or engineering. At LunarTech Academy we have courses and boot camps to help you become a job-ready professional. We are also here to help businesses, schools, and universities with top training, curriculum modernization with data science and AI, and corporate training including the latest topics like generative AI. With LunarTech, learning is easy, fun, and super practical; we care about providing an end-to-end learning experience that is both practical and grounded in fundamental knowledge, and our community is all about supporting each other and making sure you get where you want to go. Ready to start your tech journey? LunarTech is where you begin. Students or aspiring data science and AI professionals can visit the LunarTech Academy section to explore our courses, boot camps, and our programs in general. Businesses in need of employee training, upskilling, or data science and AI solutions should head to the technology section on the LunarTech page. Enterprises looking for corporate training, curriculum modernization, and customized AI tools to enhance education, please visit the LunarTech Enterprises section on the LunarTech website
for a free consultation and a customized estimate. Join LunarTech and start building your future one data point at a time. A/B testing is an important topic for data scientists to know because it's a powerful method for evaluating changes or improvements to products or services. It allows us to make data-driven decisions by comparing the performance of two different versions of a product or a service, usually referred to as treatment and control. For example, A/B testing allows data scientists to measure the effectiveness of changes to a product or a service, which is important as it enables data scientists to make data-driven decisions rather than relying on intuition or assumptions. Secondly, A/B testing helps data scientists to identify the most effective changes to a product or a service, which is really important because it allows us to optimize the performance of a product or a service, which can then lead to increased customer satisfaction and sales. A/B testing also helps us to validate certain hypotheses about what changes will improve a product or service; this is important because it helps us to build a deeper understanding of the customers and the factors that influence customer behavior. Finally, A/B testing is a common practice in many industries, such as e-commerce, digital marketing, website optimization, and many others, so data scientists who have knowledge and experience in A/B testing will be more valuable to these companies. No matter which industry you want to enter as a data scientist and what kind of job you will be interviewed for, and even if you believe more technical data science is your cup of tea, be prepared to know at least a high-level understanding of this method; knowing the details behind it will definitely help you when you are speaking with product owners, stakeholders, product scientists, and other people involved in the business. Let's briefly discuss the perfect audience for this section of the course and the prerequisites. There are no prerequisites for this section in terms of A/B testing concepts that you should already know, but knowing the basics of statistics, which you can find in the fundamentals to statistics section, is highly recommended. This section will be great if you have no prior A/B testing knowledge and you want to identify and learn the essential A/B testing concepts from scratch; this will help you to prepare for your job interviews. It will also be a good refresher for anyone who does have A/B testing knowledge but wants to refresh their memory or fill in the gaps in their knowledge. In this lecture we will start off the topic of A/B testing, where we will formally define what A/B testing is and look at a high-level overview of the A/B testing process step by step. By definition, A/B testing, or split testing, originated from randomized control trials and is one of the most popular ways for businesses to test new UX features, new versions of a product, or a new algorithm, in order to decide whether your business should launch that new UX feature, productionize that new recommender system, or create that new product, that new button, or that new algorithm. The idea behind A/B testing is that you show the variant, or the new version of the product, to a sample of customers, often referred to as the experimental group, and the existing version of the product to another sample of customers, referred to as the control group. Then the difference in the product performance in the experimental versus the control group is tracked to identify the effect of these new versions of the product
on the performance of the product. The goal is then to track the metric during the test period and find out whether there is a difference in the performance of the product and what type of difference it is. The motivation behind this test is to try new product variants that will improve the performance of the existing product and will make this product more successful and optimal, showing a positive treatment effect. What makes A/B testing great is that businesses are getting direct feedback from their actual users by presenting them the existing versus the variant product version, and in this way they can quickly test new ideas. In case the A/B test shows that the variant version is not effective, at least businesses can learn from this and can decide whether they need to improve it or look for other ideas. Let us go through the steps included in the A/B testing process, which will give you a high-level overview of the process. The first step in conducting A/B testing is stating the hypothesis of the A/B test; this is a process that includes coming up with the business and statistical hypotheses that you would like to test with this test, including how you measure success, which we call the primary metric. The next step in A/B testing is to perform what we call power analysis and design the entire test, which includes making assumptions about the most important parameters of the test and calculating the minimum sample size required to claim statistical significance. The third step in A/B testing is to run the actual A/B test, which in a practical sense for the data scientist means making sure that the test runs smoothly and correctly, and collaborating with engineers and product managers to ensure that all the requirements are satisfied; this also includes collecting the data of the control and experimental groups, which will be used in the next step. The next step in A/B testing is choosing the right statistical test, whether it is a Z-test, t-test, chi-squared test, etc., to test the hypothesis from step one by using the data collected in the previous step, and to determine whether there is a statistically significant difference between the control and experimental groups. The fifth and final step in A/B testing is to continue analyzing the results and find out whether, besides statistical significance, there is also practical significance. In this step we use the second step's power analysis, so the assumptions that we made about the model parameters and the sample size, and the fourth step's results, to determine whether there is practical significance besides the statistical significance. This summarizes the A/B testing process at a high level; in the next couple of lectures we'll go through the steps one at a time, so buckle up and let's learn about A/B testing. In this lecture, lecture number two, we will discuss the first step in the A/B testing process, so let's bring our diagram back. As you can recall from the previous lecture, when we were discussing the entire process of A/B testing at a high level, we saw that the first step in conducting A/B testing is stating the hypothesis of the A/B test. This process includes coming up with a business and statistical hypothesis that you would like to test with this test, including how you measure success, which we call the primary metric; so, what is the metric that we can use to say that the product that we are testing performs well? First we need to state the business hypothesis for our A/B test from a business perspective. Formally, the business hypothesis describes what the two products are that are being
compared and what the desired impact or difference for the business is, so how to fix a potential issue in the product where a solution of this problem will influence what we call a key performance indicator, or the KPI of interest. The business hypothesis is usually set as a result of brainstorming and collaboration of relevant people on the product team and the data science team. The idea behind this hypothesis is to decide how to fix a potential issue in the product where a solution of this problem will improve the target KPI. One example of a business hypothesis is that changing the color of the learn more button, for instance to green, will increase the engagement of the web page. Next we need to select what we call the primary metric for our A/B test. There should be only one primary metric in your A/B test, and choosing this metric is one of the most important parts of the A/B test, since this metric will be used to measure the performance of the product or feature for the experimental and control groups and then will be used to identify whether there is a difference, or what we call a statistically significant difference, between these two groups. By definition, the primary metric is a way to measure the performance of the product being tested in the A/B test for the experimental and control groups; it will be used to identify whether there is a statistically significant difference between these two groups. The choice of the success metric depends on the underlying hypothesis that is being tested with this A/B test. This is, if not the most, then one of the most important parts of the A/B test, because it determines how the test will be designed and also how well the proposed ideas perform; choosing poor metrics might disqualify a large amount of work or might result in wrong conclusions. For instance, the revenue is not always the end goal, and therefore in A/B testing we need to tie the primary metric to the direct and higher-level goals of the product. The expectation is that if the product makes more money then this suggests the content is great, but in achieving that goal, instead of improving the overall content of the material and the writing, one could just optimize the conversion funnel. One way to test the accuracy of the metric you have chosen as your primary metric for your A/B test could be to go back to the exact problem you want to solve. You can ask yourself the following question, what I tend to call the metric validity question: if this chosen metric were to increase significantly while everything else stays constant, would we achieve our goal and would we address our business problem? Is it higher revenue, is it higher customer engagement, or is it higher views that we are chasing in the business? The choice of the metric should then answer this question. Though you need to have a single primary metric for your A/B test, you still need to keep an eye on the remaining metrics to make sure that all the metrics are showing a change and not only the target one. Having multiple metrics in your A/B test will lead to false positives, since you will identify many significant differences while there is no effect, which is something you want to avoid; so it's always a good idea to pick just a single primary metric but to keep an eye on and monitor all the remaining metrics. If the answer to the metric validity question is higher revenue, which means that you are saying that higher revenue is what you are chasing and better performance means higher revenue for your product, then you can use as your primary metric what we call the conversion
rate. Conversion rate is a metric that is used to measure the effectiveness of a website, a product, or a marketing campaign. It is typically used to determine the percentage of visitors or customers who take a desired action, such as making a purchase, filling out a form, or signing up for a service. The formula for the conversion rate is: conversion rate is equal to the number of conversions divided by the total number of visitors, multiplied by 100%. For example, if a website has a thousand visitors and 50 of them make a purchase, the conversion rate would be equal to 50 divided by 1,000, multiplied by 100%, which gives us 5%; this means that our conversion rate in this case is equal to 5%. Conversion rate is an important metric because it allows us and businesses to measure the effectiveness of a website, a product, or a marketing campaign. It can help businesses to identify areas for improvement, such as increasing the number of conversions or improving the user experience. Conversion rate can be used for different purposes: for example, if a company wants to measure the effectiveness of an online store, the conversion rate would be the percentage of visitors who make a purchase, and on the other hand, if a company wants to measure the effectiveness of a landing page, the conversion rate would be the percentage of visitors who fill out a form or sign up for a service. If the answer to the metric validity question is higher engagement, then you can use the click-through rate, or CTR, as your primary metric. This is, by the way, a common metric used in A/B testing whenever we are dealing with an e-commerce product, a search engine, or a recommender system. Click-through rate, or CTR, is a metric that measures the effectiveness of a digital marketing campaign, or the user engagement with some feature on your web page or your website, and it's typically used to determine the percentage of users who click on a specific link, button, or call to action (CTA) out of the total number of users who view it. The formula for the click-through rate can be represented as follows: the CTR is equal to the number of clicks divided by the number of impressions, multiplied by 100%. It is not to be confused with the click-through probability, because there is a difference between the click-through rate and the click-through probability. For example, if an online advertisement receives a thousand impressions, which means that we are showing it to the customers a thousand times, and there were 25 clicks, which means 25 out of all the impressions resulted in clicks, then the click-through rate for this specific example would be equal to 25 divided by 1,000, multiplied by 100%, which gives us 2.5%; this means that for this particular example our click-through rate is equal to 2.5%. Click-through rate is an important metric because it allows businesses to measure the effectiveness of their digital marketing campaigns and the user engagement with their website or web pages. A high click-through rate indicates that a campaign, web page, or feature is relevant and appealing to the target audience, because they are clicking on it, while a low click-through rate indicates that the campaign or the web page needs improvement. Click-through rate can be used to measure the performance of different digital marketing channels, such as paid search, display advertising, email marketing, and social media; it can also be used to measure the performance of different ad formats, such as text advertisements, banner advertisements, video advertisements, etc.
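The two formulas from these examples, as a small Python sketch:

```python
def conversion_rate(conversions, total_visitors):
    # conversions / total visitors, expressed as a percentage
    return conversions / total_visitors * 100

def click_through_rate(clicks, impressions):
    # clicks / impressions, expressed as a percentage
    return clicks / impressions * 100

print(conversion_rate(50, 1000))     # 5.0  -> 5% of 1,000 visitors made a purchase
print(click_through_rate(25, 1000))  # 2.5  -> 2.5% of 1,000 impressions were clicked
```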
Next, and as the final task in this first step of the A/B testing process, we need to state the statistical hypothesis based on the business hypothesis we stated and the chosen primary metric. In the fundamentals to statistics section of this course, in lecture number seven, we went into detail about statistical hypothesis testing, including what the null hypothesis is and what the alternative hypothesis is, so do have a look to get all the insight about this topic. A/B testing should always be based on a hypothesis that needs to be tested. This hypothesis is usually set as a result of brainstorming and collaboration of relevant people on the product team and the data science team, and the idea behind it is to decide how to fix a potential issue in a product where a solution of this problem will influence the key performance indicator, or the KPI of interest. It's also highly important to prioritize among the range of product problems and ideas to test, since you want to pick the one where fixing the problem would result in the biggest impact for the product. We put the hypothesis that is subject to rejection, so the one that we want to reject in the ideal world, under the null hypothesis, which we define by H0, while we put the hypothesis subject to acceptance, so the desired hypothesis that we would like to have as a result of the A/B test, under the alternative hypothesis, defined by H1. For example, if the KPI of the product is to increase customer engagement by changing the color of the learn more button from blue to green, then under the null hypothesis we can state that the click-through rate of the learn more button with the blue color is equal to the click-through rate of the green button, and under the alternative we can state that the click-through rate of the learn more button with the green color is larger than the click-through rate of the blue button. So ideally we want to reject this null hypothesis and accept the alternative hypothesis, which will mean that we can improve the click-through rate, so the engagement of our product, by simply changing the color of the button from blue to green. Once we have set up the business hypothesis, selected the primary metric, and stated the statistical hypothesis, we are ready to proceed to the next stage in the A/B testing process. In this lecture we will discuss the next, second step in the A/B testing process, which is designing the A/B test, including the power analysis and calculating the minimum sample sizes for the control and experimental groups. Stay tuned, as this is a very important part of the A/B testing process, commonly appearing during data science interviews. Some argue that A/B testing is an art, and others say that it's a business-adjusted common statistical test, but the bottom line is that to properly design this experiment you need to be disciplined and intentional, while keeping in mind that it's not really about testing but about learning. The following are the steps you need to take to have a solid design for your A/B test, so let's bring the diagram back. In this step we need to perform the power analysis for our A/B test and calculate the minimum sample size in order to design our A/B test. The A/B test design includes three steps: the first step is the power analysis, which includes making assumptions about the model parameters, including the power of the test, the significance level, etc.; the second step is to use these parameters from the power analysis to calculate the minimum sample size for the control and experimental groups; and then the final
third step is to decide on the test duration, depending on several factors. So let's discuss each of these topics one by one. The power analysis for A/B testing includes three specific steps. The first one is determining the power of the test, our first parameter. The power of a statistical test is the probability of correctly rejecting the null hypothesis, so power is the probability of making a correct decision: rejecting the null hypothesis when the null hypothesis is false. If you're wondering what the power of the test is, what these different concepts we just talked about mean, what the null hypothesis is, and what it means to reject the null hypothesis, then head to the fundamentals of statistics section of this course, as we discuss this topic in detail there. The power is often defined as 1 minus beta, which is equal to the probability of not making a type II error, where a type II error is failing to reject the null hypothesis while the null is actually false. It's common practice to pick 80% as the power of the A/B test, which means that we allow a 20% type II error rate: we are fine with failing to reject the null hypothesis, so not detecting a true treatment effect while there is an effect, 20% of the time. However, the choice of the value of this parameter depends on the nature of the test and the business constraints. Secondly, we need to determine a significance level for our A/B test. The significance level, which is also the probability of a type I error, is the likelihood of rejecting the null, hence detecting a treatment effect, while the null is actually true and there is no statistically significant impact. This value, often denoted by the Greek letter alpha, is the probability of making a false discovery, often referred to as the false positive rate. Generally we use a significance level of 5%, which indicates that we accept a 5% risk of concluding that there exists a statistically significant difference between the experimental and control variant performances when there is no actual difference; we are fine with five out of 100 cases detecting a treatment effect while there is no effect. It also means that if you find a significant difference between the control and experimental groups, you do so with 95% confidence. As with the power of the test, the choice of alpha depends on the nature of the test and the business constraints that you have. For instance, if running the A/B test comes with a high engineering cost, the business might decide to pick a higher alpha so that it would be easier to detect a treatment effect; on the other hand, if the implementation costs of the proposed version in production are high, you can pick a lower significance level, since the proposed feature should then really have a big impact to justify the high implementation cost, so it should be harder to reject the null hypothesis. Finally, as the last step of the power analysis, we need to determine a minimum detectable effect for the test. The last parameter we need to make assumptions about as part of the power analysis is what we call the minimum detectable effect, or delta. From the business point of view: what is the substantive counterpart to statistical significance that the business wants to see as a minimum impact of the new version to find this variant investment-worthy? The answer to this question is the amount of change we aim to observe in the new version's metric compared to the existing one in order to
make the recommendation to the business that this feature should be launched in production, that it is investment-worthy. An estimate of this parameter is what is known as the minimum detectable effect, often denoted by the Greek letter delta, which is also related to the practical significance of the test. So the MDE, or minimum detectable effect, is a proxy for the smallest effect that would matter in practice for the business, and it's usually set by stakeholders. As this parameter is highly dependent on the business, there is no common level for it. The minimum detectable effect is basically the translation from statistical significance to practical significance, and here we want to answer the question: what is the percentage increase in the performance of the product we are experimenting with that will tell the business this is good enough to invest in this new feature or new product? This can be, for instance, 1% for one product and 5% for another; it really depends on the business and the underlying KPI. A popular way of referring to the parameters involved in the power analysis for A/B testing is: 1 minus beta for the power of the test, alpha for the significance level, and delta for the minimum detectable effect. To make sure that our results are repeatable, robust, and can be generalized to the entire population, we need to avoid p-hacking, to ensure real statistical significance and to avoid biased results. So we want to make sure that we collect a large enough number of observations and run the test for a minimum, predetermined amount of time. Therefore, before running the test, we need to determine the sample size of the control and experimental groups, and, as we will see later in this lecture, how long we need to run the test. This is another important part of A/B testing, which needs to be done using the defined power of the test (the 1 minus beta), the significance level, and the minimum detectable effect, so all the parameters we decided upon when conducting the power analysis. The calculation of the sample size also depends on the underlying primary metric you have chosen for tracking the progress of the control and experimental versions of the product, so we need to distinguish two cases here. When discussing the primary metric, we saw that there are different ways to measure the performance of different types of products; if we are interested in engagement, then we are looking at a metric such as click-through rate, which is in the form of an average. Case one is where the primary metric of the A/B test is in the form of a binary variable, for instance conversion or no conversion, click or no click. Case two is where the primary metric of the test is in the form of proportions or averages, for example mean order amount or mean click-through rate. Today we will cover only one of these cases in detail; you can find more details on the other case in my blog post, which I have also linked in the resources section. That blog post contains all the details you need to know about A/B testing, including the statistical tests and their corresponding hypotheses, descriptions of different primary metrics that go beyond what we cover in this section, and many more details about A/B testing. So let's look at case two, where the primary metric of the test is in the form of proportions or averages. Let's say we want to test
whether the average click-through rate of the control group is equal to the average click-through rate of the experimental group. Under H0 we have that mu control is equal to mu experimental, and under H1 we have that mu control is not equal to mu experimental, where mu control and mu experimental are simply the averages of the primary metric for the control group and the experimental group respectively. This is the formal statistical hypothesis we want to test with our A/B test, and we can assume that mu control is, for instance, the click-through rate of the control group and mu experimental is the click-through rate of the experimental group. If you haven't done so already, I would highly suggest you head to the fundamentals of statistics section of this course, where in lectures seven and eight I go into detail about statistical hypothesis testing, means and averages, the significance level, and so on. The same goes for the theorem that the upcoming calculation is based upon, called the central limit theorem: check out the last lecture on inferential statistics, where I covered the central limit theorem, which we will also use in this section, and check lecture number five in that section, where we cover the normal distribution, another thing we will use here. The central limit theorem states that, given a sufficiently large sample size from an arbitrary distribution, the sample mean will be approximately normally distributed, regardless of the shape of the original population distribution. This means that the distribution of the sample means will be approximately normal if we take a large enough sample, even if the distribution of the original data is not normal. So when we are dealing with a primary performance-tracking metric that is in the form of an average, such as the click-through rate we are covering today, and we intend to compare the means of the control and experimental groups, we can use the central limit theorem and state that the sampling distributions of the means of both the control and experimental groups follow a normal distribution. Consequently, the sampling distribution of the difference of the means of these two groups will also be normally distributed: the mean of the control group and the mean of the experimental group follow normal distributions with means mu control and mu experimental and variances sigma control squared and sigma experimental squared respectively. Though the derivation is out of the scope of this course, we can state that the difference between the means of the two groups, X-bar control minus X-bar experimental, also follows a normal distribution, with mean mu control minus mu experimental and variance sigma control squared divided by n control plus sigma experimental squared divided by n experimental, where n control and n experimental are the sample sizes of the control and experimental groups. Hence the sample size needed to compare the means of two normally distributed samples, using a two-sided test with pre-specified significance level alpha, power level, and minimum detectable effect, can be calculated with the standard formula, whose mathematical representation is written out below.
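As a compact restatement of what was just described, using the notation from this lecture (a sketch of the standard result, not a derivation):

\bar{X}_{\text{con}} - \bar{X}_{\text{exp}} \;\sim\; \mathcal{N}\!\left(\mu_{\text{con}} - \mu_{\text{exp}},\; \frac{\sigma_{\text{con}}^2}{n_{\text{con}}} + \frac{\sigma_{\text{exp}}^2}{n_{\text{exp}}}\right)

n \;=\; \frac{\left(\sigma_{\text{con}}^2 + \sigma_{\text{exp}}^2\right)\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta^2}

where n stands for the minimum sample size.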
Here the alpha, the beta, and the delta are the values we made assumptions about as part of the power analysis, and sigma control squared and sigma experimental squared are estimates of the variances that we can come up with using so-called A/A testing. The z one minus alpha over two and z one minus beta terms are just two constants that come from the standard normal distribution tables. I would say you do not necessarily need to know this derivation, as there are many online calculators that will ask you for the alpha, beta, and delta values, as well as the sample estimates for sigma squared control and sigma squared experimental, and will then automatically calculate the minimum sample size for the control and experimental groups for you. If you're wondering what A/A testing is and how we can come up with sigma control squared and sigma experimental squared, as well as all the other values, make sure to check out the blog post I mentioned before, where I explain all these values in detail, and also check the resources section, where I've included many resources on this. One example of such a tool is an A/B test online calculator, but if you Google it you will find many others that will ask you for the minimum detectable effect, the statistical significance, and the statistical power, and will then automatically calculate the minimum sample size you should have in order to have a valid A/B test with statistical significance. One thing to keep in mind: you will notice that the statistical significance level is sometimes set to 95% in these tools, which is not what we saw when discussing the alpha significance level. These online calculators sometimes interchangeably use the significance level and the confidence level, which are opposites: the significance level is usually at the level of 5% or 1%, while the confidence level is around 95%, which is basically 100% minus alpha. Therefore, whenever you see 95% there, know that it means your alpha should be 5%. It's really important to understand how to use these calculators so you don't end up with the wrong minimum sample size, conduct an entire A/B test, and only realize at the end that you used the wrong significance level. The final step is to calculate the test duration. This question needs to be answered before you run your experiment, not during the experiment; sometimes people stop the test as soon as they detect significance, which is what we call p-hacking, and that's absolutely not what you want to do. To determine a baseline for the duration, a common approach is to use the formula: duration equals n divided by the number of visitors per day, where n is the minimum sample size we just calculated in the previous step and the number of visitors per day is the average number of visitors you expect to see as part of your experiment. For instance, if this formula results in 14, this suggests that running the test for two weeks is a good idea. However, it's highly important to take many business-specific aspects into account when choosing when to run the test and for how long; simply using this formula is not enough.
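If you prefer to do this calculation in code rather than with an online calculator, here is a minimal sketch in Python using statsmodels; the baseline click-through rate, the expected lift, and the daily traffic below are hypothetical numbers used purely for illustration, not values from this lecture.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

alpha = 0.05        # significance level (type I error rate)
power = 0.80        # 1 - beta, so a 20% type II error rate
p_baseline = 0.20   # assumed baseline click-through rate (hypothetical)
p_target = 0.22     # baseline plus the minimum detectable effect (hypothetical)

# effect size for comparing two proportions (Cohen's h)
effect_size = proportion_effectsize(p_target, p_baseline)

# minimum sample size per group for a two-sided test
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)

# rule-of-thumb duration: sample size divided by expected daily visitors
# (assuming the daily visitors are split across the two groups)
visitors_per_day = 1_000
duration_days = (2 * n_per_group) / visitors_per_day
print(round(n_per_group), round(duration_days, 1))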
For example, suppose you want to run an experiment at the end of December, around the Christmas break, when a higher than expected, or in some cases lower than expected, number of people are usually checking your web page. Depending on the nature of your business or product, this external and uncertain event can have an impact on page usage: for some businesses it means a huge increase in page usage, for others a huge decrease. In that case, running the A/B test without taking this external factor into account would produce inaccurate results, since activity during that period would not be a true representation of common page usage, and we would no longer have the randomness that is a crucial part of A/B testing. Besides this, when selecting a specific test duration there are a few other things to be aware of. Firstly, too short a test duration can result in what we call novelty effects. Users tend to react quickly and positively to all types of changes, independent of their nature; this is referred to as the novelty effect, it wears off over time, and it is considered illusory, so it would be wrong to ascribe this effect to the experimental version itself and to expect it to persist after the novelty wears off. Hence, when picking a test duration, we need to make sure we do not run the test for too short a period, otherwise we can end up measuring a novelty effect. The novelty effect can be a major threat to the external validity of an A/B test, so it's important to avoid it as much as possible. Secondly, if the test duration is too long, we can get what we call maturation effects. When planning an A/B test, it's usually useful to consider a longer test duration to allow users to get used to the new feature or product; in this way you will be able to observe the real treatment effect, by giving returning users more time to cool down from an initial positive reaction or spike of interest caused by the change introduced as part of the treatment. This helps avoid the novelty effect and gives better predictive value for the test outcome. However, the longer the test period, the larger the likelihood of external effects impacting the reaction of the users and possibly contaminating the test results. If you like this content, make sure to check all the other videos available on this channel, and don't forget to subscribe, like, and comment to help the algorithm make this content more accessible to everyone across the world. If you want free resources, check the free resources section at LunarTech, and if you want to become a job-ready data scientist and you are looking for an accessible bootcamp to get you there, consider enrolling in the Ultimate Data Science Bootcamp at LunarTech.
You will learn all the theory and the fundamentals to become a job-ready data scientist, you will implement the learned theory in multiple real-world data science projects, and, after learning the theory and practicing it with real-world case studies, you will also prepare for your data science interviews. And if you want to stay up to date with recent developments in tech, the headlines you may have missed in the last week, the open positions currently on the market across the globe, and the tech startups making waves, make sure to subscribe to the data science newsletter from LunarTech. This is what we call the maturation effect, and therefore running the A/B test for too short or too long a period of time is not recommended. This is a topic we could talk about for hours; it's a very important part of the A/B test and also something that is asked about a lot during data science and product data scientist interviews. Therefore I highly suggest you check out the book about A/B testing referenced in this section, which is a hands-on tutorial on everything you need to know about A/B testing, and also the interview preparation guide in this section, which contains the 30 most popular A/B testing related questions you can expect during your data science interviews. Looking to elevate your data science or data analytics portfolio? Then you are in the right place: with this end-to-end A/B testing case study you can showcase your A/B testing and coding skills in one place. I'm a data scientist and AI professional and the co-founder of LunarTech, where we are making data science and AI accessible to everyone: individuals, businesses, and institutions. In this case study we are going to complete an end-to-end A/B test, where we will test, in a data-driven way, whether it's worth changing one of the features of the UX design on the LunarTech landing page. This is a real-life data science case study that you can conduct and put on your resume to showcase your experience in data-driven decision making, your statistical and experimentation skills with A/B testing, and your coding skills in Python using libraries such as statsmodels, but also pandas, NumPy, matplotlib, and seaborn. We are going to start with the business objective of this case study, then translate the business objective into a data science problem, and then start with the actual coding: we will load the libraries, look into the data, and visualize the click data. We will look into the motivation behind choosing our specific primary metric, which is the click-through rate, and then talk about the statistical hypothesis for our A/B test. I will also teach you, step by step, all the calculations, starting from the calculation of the pooled estimate of the click-through rate, then the computation of the pooled variance and the standard error, the motivation behind choosing the statistical test I will be using, the two-sample Z test, and then how you calculate the test statistic, how you calculate the p-value of the test statistic, and how you use it together with the significance level to test the statistical significance of your A/B test. After this we will also compute the confidence interval, comment on the generalizability of the A/B test, and at the end we will test for the practical significance of the A/B test. Then we will conclude, wrap up, and make
a decision based on our data-driven approach, using the A/B test, on whether it's worth changing a feature of the UX design on the LunarTech landing page. So without further ado, let's get started. Let's now begin our case study. Here, on the left-hand side, I have the current version of our landing page, which is our control version, the existing version, where you can see the free trial offer and our button, Secure Free Trial. On the right-hand side we have the new, experimental version that we would like to have, which uses the Enroll Now button. As we saw in the introduction, what we are trying to understand is whether our customers click more on the new version, the experimental one, than on the existing version, the control one. As of the day of recording and conducting this case study, our landing page has the Secure Free Trial button, but what we want to test with our data is whether Enroll Now is more engaging, such that we can move from the Secure Free Trial version to the Enroll Now version. For this specific case, and also in general, as we know from A/B testing, whenever we have an existing algorithm, feature, or button, we refer to the group that will be exposed to this existing version of the product as the control group; so all the users to whom we show the existing version of our landing page are the control group participants. Then, on the right-hand side, we have our experimental version and our experimental users: the existing customers selected to take part in our experimental group, who will be exposed to this new version of our landing page containing the Enroll Now button. Our end goal in business terms, as we saw in the introduction, is to understand whether we should release the new button, whether it will end up being more engaging, which means a higher CTR, more clicks coming from our users' side, which automatically means better business, because we want highly engaged users. If they are clicking on this button, it means it interests them more than the control version, and if something on our landing page, in this case our call to action, is more interesting and engaging, it means we are doing something right: our users might make use of our free products, purchase our products, or simply stay engaged with us and keep LunarTech in mind, and whenever there is someone interested in data science or AI solutions or products, they can at least refer their friends; if they are just clicking to understand and learn more about our products, that's also a possibility. From a business perspective, we are therefore using as our primary metric the click-through rate, the CTR of this specific button, which in our control version is Secure Free Trial and in our experimental version is Enroll Now, and what we want to understand is whether this new button will end up having a higher CTR or not, because a higher CTR from the technical perspective translates to higher engagement from the business perspective. So here we are making the translation from business to technical. When it comes to A/B testing, we can have different sorts of primary metrics: we can have a click-through rate as a primary metric, we can have a
conversion rate as a primary metric, or any other primary metric. Choosing the metric, the single measure we will use to compare our control and experimental groups and understand which version performs better, starts with understanding what this definition of better is and how it translates back to the business. If engagement is what we mean by better business, and I will explain in a bit why we think engagement is what matters for us at LunarTech, then the click-through rate can be used as a primary metric. It is a universal metric that has been used across different web applications, search engines, recommender systems, and many other digital products to understand whether a specific algorithm, feature, or web design is more engaging or not, and in this specific case study we are also going to use the CTR because we are interested in engagement. At LunarTech we really care about engagement with our users: we want our users to make use of our products, but ultimately to engage with us, because if they engage with us it means our products are being seen, our landing page is being visited, and the user is actually interested enough to click on that button, take that action, and then start either the free trial or the enrollment to see what is going on; all of these are signs of interest coming from the user's side. In the control version our call to action is Secure Free Trial, which directly leads the user to the free trial of our Ultimate Data Science Bootcamp, but given that we are expanding, which means we now offer more courses and more free products, and we also have enterprise clients, businesses who want data science and AI solutions and corporate training, we want to go from this niche version of the landing page, Secure Free Trial, to Enroll Now, because we already have a lot of engagement on the free trial and we want to make the call to action more general. That's the business perspective. At the same time, we want to see whether this generalized version will end up leading to higher engagement, not only for the other products but also for the free trial itself, because we are always looking to educate people and provide these free trials so that people can make use of our flagship product, the Ultimate Data Science Bootcamp. So now that we understand why we care about engagement here at LunarTech, and why we want to check whether this new button in our UX design will increase engagement or not, we can make the translation back into data science terms. From the business perspective all we care about is whether the experimental version of the product performs better or not, but this means we need to conduct an A/B test and understand whether the idea, the speculation that Enroll Now, a more general call to action, will be better than the Secure Free Trial version, is actually true from the customers' perspective. Because if we want to call ourselves a data-driven company, we cannot just base our conclusions and our decisions for our products, or in general for our product road
map, on intuition or logic. We want this to be data-driven, which means the customers come first: we are customer-driven, and our customers need to tell us whether the new button is better or not. So here we have conducted an A/B test, and I won't be using the real data; I will be using proxy, simulated data that I generated myself, which has a similar structure and carries the same idea as the data we got when we were conducting our A/B test and collecting that data. What is our business hypothesis? As our business hypothesis we can say that we see at least a 10% increase in our click-through rate, so 10% higher engagement, when we have the Enroll Now button versus the Secure Free Trial version of the product. This means that the CTR of the Enroll Now button will be at least 10% higher than that of Secure Free Trial: there is at least a 10% difference in engagement when we compare this new version of the button with the old one. When we translate this into a statistical hypothesis, under the null hypothesis we state what we ideally want to reject: that there is no statistically significant difference between p control and p experimental, the click-through probabilities, or click-through rates, of the control group and the experimental group. Under the alternative hypothesis, H1, we say that we do have a difference, which means that the control group's CTR is different from the experimental group's CTR, and one key point to mention here is that they are not just different, but statistically significantly different.
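Written out in notation, this is simply a restatement of what was just said, with p denoting the click-through probability of each group:

H_0:\; p_{\text{con}} = p_{\text{exp}} \qquad \text{vs.} \qquad H_1:\; p_{\text{con}} \neq p_{\text{exp}}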
When it comes to starting the case study, first things first: we load the libraries. In this case study we are going to use NumPy and pandas, as usual for any sort of data analytics or data science case study; pandas will be needed for data wrangling, to load, process, and inspect the data, and NumPy will be used to work with arrays and parts of the data. Then we are going to use scipy.stats, from which we will import the norm function; later on we will see that we use this to visualize the rejection region of our test, to understand whether we need to reject our null hypothesis or not. In this case study we also want to visualize our results and our data, for which we need the visualization libraries from Python, seaborn and matplotlib. Let's look into our data. In our data we have four different columns, and of course this is a filtered dataset that contains only the information we need; in general you can have a larger database with many more metrics, but for conducting the pure A/B test you actually need only the following information. You need your user ID to identify which user you are dealing with, so user one, user two, user ten; it may be that you refer to your users another way, for instance with long strings, but given that our case study is a simple one, we just have a user ID, an integer that goes from one up to the end of our data. Here we have 20,000 users in total, so the user ID goes up to 20,000, and those 20,000 users include both the experimental and control users. Then we have our click variable, a binary variable that can be either one or zero, where one means the user clicked on the button and zero means the user didn't click; this is the primary metric for our A/B test. Then we have the group reference, which is a string variable that tells us whether the user comes from the experimental group or from the control group; it can contain only two different values, two strings, a short label referring to the experimental group and another referring to the control group, and if you look here you can see the three-letter label of the experimental group first, and, since the experimental users come first and then the control ones, further down you can see the control group's label. We also have a timestamp, which is not relevant here, so we'll be skipping it for now. Again, the data we have here is not our actual data but a synthetic one, similar in structure and in the nature of its variables, and you can implement exactly the same steps when you have your own data coming from your own A/B test and you are conducting your case study. What we are going to make use of the most are the click variable and the group variable, because we want to find out, per group, which users clicked on the button, and to be more specific we are looking at averages: we are not so much interested in whether a specific user from a specific group clicked or not, which is something we can explore later, but for now we are interested in the higher-level picture, the percentages, the click probability or click-through rate per group, and here we have the experimental and control groups, as in any sort of A/B test. Once we have conducted our A/B test, I will also give you more insight into what you can do with your data, especially with
this user ID, to learn more about the ideas behind these different decisions, or whether your A/B test results differ per segment. The idea is that with this A/B test, by following all the steps and ensuring that the pitfalls are avoided, we are making a decision that represents the entire population: we are using a sample that is large enough to make a decision for our product and our business that will generalize and be representative when we apply this decision to our population. So let me close this part, because we no longer need it, and let's go ahead and load the data. Here I'm using the pandas library, with the common abbreviation pd, and I'm calling pd.read_csv, referring to the name of the file that contains my click data, which you can see here: ab_test_click_data.csv. I will be providing you this data, because you won't have it in your own Google Colab: you will get the link to this Colab notebook and the data, so you can download it first from my source and then upload it here using the upload button; after that, go to the folder where you downloaded the data and you will have the corresponding CSV file available, and the code will run smoothly. Here I'm loading that data and putting it under the name df_ab_test, basically the data frame containing my A/B test click data. What I want to do next is show you how the data looks. Here you see the head; given that I haven't provided any argument, it shows just the top five rows, so I see only the first five users, from the experimental group, some of whom clicked and some of whom didn't, along with the corresponding user ID and the timestamp at which they performed the click action. Then, when we look at the describe function, it gives us a more general idea of what the data contains, not just what the top five rows look like, which is great for understanding what kind of data and variables you are dealing with, the high-level picture, the descriptive statistics. Here we can see that in total we have 20,000 users included in this data, so 20,000 observations, 20K rows. The mean of the user ID is 10,000, which of course is not relevant. A more interesting number is the average click: when we look at users in both the experimental and control groups together, it is about 40%, or 0.4052, so 40.52 percent. However, this is not what we are most interested in, and it should not be confused with the click-through rate per group; what we are interested in is the click-through rate, the mean click, for the experimental group and the control group separately. Then we have our standard deviation; we see a high standard deviation, which is understandable given the large variation in click behavior across the control and experimental groups in our data.
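Putting the imports and this loading step together, a minimal sketch of the code described so far could look like this; the file name and column names are written the way they are read out above, so treat them as assumptions and adjust them to your own files:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm   # used later to visualize the rejection region

# load the simulated A/B test click data into a pandas DataFrame
df_ab_test = pd.read_csv("ab_test_click_data.csv")

print(df_ab_test.head())       # first five rows: user id, click (0/1), group, timestamp
print(df_ab_test.describe())   # descriptive statistics, e.g. the overall click mean of ~0.405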
Then we have the minimum and the maximum, which don't give us much information, because the click variable is binary: it contains zeros and ones, so naturally the minimum is zero and the maximum is one. As for the rest, the 25th percentile (the first quartile), the 50th percentile (the median), and the 75th percentile (the third quartile) are not that relevant here. For this kind of filtered data the descriptive statistics are not super informative, but if you had a larger dataset with more metrics besides click, which is your primary metric, and you had also measured other metrics, which is recommendable, then you would see more values worth looking at: not only the click rate, but for instance the mean or median conversion rate, the average amount of time a user spent on your landing page, or how much time a user spent before deciding to click. Those can all be very interesting metrics to look into from a product data science perspective, to understand the decision process and the funnel behind these clicks, but for this case study what we are purely interested in is our primary metric, the click event. What we can also see here is that in the control group we have 1,989 users out of all control users who ended up clicking, versus the experimental group, where 6,116 users clicked. Do not confuse this with the total number of users per group: this number comes from grouping the data, using group by on the group variable and then summing the click variable per group. Given that click is a binary variable, we know from the basics of Python that we are effectively counting the number of click events, because if you have a binary variable of zeros and ones, adding the zeros has no impact, so you end up summing all the ones and getting the total number of cases where the click variable equals one, in other words the number of click events. Therefore we can see that in the experimental group 6,116 users out of all experimental users ended up clicking, while in the control group this number is much lower, only 1,989 users. Let's now go ahead and visualize this data. I want to show, in a bar chart, the distribution of clicks per group: the click event counts for the experimental group and the control group next to each other. As you can see, we get our bar chart, where the yellow bars correspond to no, meaning there was no click, and the black bars correspond to yes, meaning there was a click; whenever you see a yellow bar, that amount corresponds to no click, no engagement, from the user's side, per group. This is what we refer to as the click distribution in our experimental and control groups. Here is how I generated this bar chart.
First I create a list with the colors I want to assign to each value of the click variable: zero corresponds to yellow and one corresponds to black, which means that if my click variable equals zero, so there is no click, I visualize it in yellow, and otherwise, when there is a click, in black; as you can see, yes, meaning a click, is shown in black. Then I initialize the figure size, saying I want a figure of 10 by 6; you can skip this, but I think it's always good to set the size of a figure to make sure you get it the way you want, so that later you can download it or take a screenshot. Then, as you can see, I'm using a combination of the matplotlib.pyplot library and seaborn, because seaborn has much nicer colors, and I'm using seaborn to create a count plot, since we are going to count and show, per group, the number of clicks versus no clicks for the experimental group and for the control group. I specify that the hue should be on click, which means we color by the click variable, and that the data is df_ab_test, so we look into this data, select the variable called click, and group the data by the group variable: you can see that the grouping is done on the variable called group, because the argument x is set to group. Basically what I'm saying is: take the df_ab_test data, group it by group, so experimental versus control, and then count the click events per group, the number of times we have a no, a zero, and the number of times we have a yes, a one, as the value of the click variable. As the palette I'm using the custom palette I just created, which should be in the form of a list, as you can see here; if my target variable had a third or fourth possible value, I would of course need to extend this color palette, because I need as many colors as there are possible values, and in this case click has only two possible values, zero and one, so I only specify two colors in my list. Then we have the title of the plot, always nice to add, and our labels: as the x label I have group, because on my x-axis I will have either the experimental or the control group, and on my y-axis I of course have the count, the number of times I got a no-click versus a click event. Note that the y-axis is in terms of counts, so here you can see 8,000, 7,000, 6,000, 5,000, which means we are talking about raw numbers, the counts, rather than percentages, and this is important because another thing I'm also doing is going the
extra mile and, besides these counts, adding the corresponding percentages on top of each bar, to visualize and clarify what they are. It's always good to enhance your data visualization with percentages, because percentages are easier for the person following your presentation to understand: for instance, if the bars for the experimental group show 6,000 and 4,000 users, people might not quickly work out that you have 10,000 users in total, of whom 6,000 clicked and 4,000 didn't. By adding the percentages we can see directly that 61.2% clicked in the experimental group and 38.8% did not. Of course this is simulated data, and I deliberately picked extreme values so that we can clearly see the difference in click-through rates; in reality a click-through rate of 10% to 14% is usually a good number, and a click-through rate of 40% is great, but it really depends on the underlying user base and the kind of product you have: with a very large user base 10% can be a good click-through rate, while with a very small user base maybe 61% is considered good or average. So here we just have simulated data, and I've added these percentages using the following code. I won't go into too much detail here; feel free to check it, and if something doesn't make sense, go back to our Python for data science course, which contains a lot of information on the basics of Python. Quickly, what I'm doing is calculating the percentages and annotating the bars. Per group, I take the total number of users, I find the number of cases where the click variable equals one, so the click events, and the number of cases where there was no click, so the click variable equals zero, and then I use the group total to turn those counts into percentages. For instance, in this specific case I filter the data for the experimental group, I look at the total number of users in that group, which is 10K, I count how many of those 10,000 users ended up clicking the button, so the click-equals-one cases, and then I take that number, divide it by the total number of users in the experimental group, and multiply by 100 to get a percentage; that's the calculation you can see here. One important detail is how I identify, for the current bar, whether we are dealing with the experimental or the control group: I do that by looking at p, where p here is one of the bar patches of the plot. So I'm basically saying: for the bar belonging to the experimental group, calculate the total number of observations, take the number of clicks, divide the two numbers, and multiply by 100 to get the percentage. I do this for each bar: I have two groups, and within each group I have clicks and no clicks, so I'm calculating four different percentages, one per bar, and annotating them on top of the bars.
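Roughly, the plotting and annotation code described above might look like the sketch below; the column names, the exact colors, and the way each bar is matched to its group size are my own assumptions (here I simply rely on both groups having the same number of users, 10,000 each, as in this case study):

# bar chart of clicks vs. no clicks per group, with percentages on top of each bar
palette = {0: "yellow", 1: "black"}          # 0 = no click, 1 = click

plt.figure(figsize=(10, 6))
ax = sns.countplot(x="group", hue="click", data=df_ab_test, palette=palette)
plt.title("Click distribution in experimental and control groups")
plt.xlabel("Group")
plt.ylabel("Count")

# annotate each bar with its share of one group's users
group_size = len(df_ab_test) / df_ab_test["group"].nunique()   # 10,000 users per group here
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.annotate(f"{100 * height / group_size:.1f}%",
                    (p.get_x() + p.get_width() / 2, height),
                    ha="center", va="bottom")
plt.show()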
So I not only have the raw counts represented in my visualization, I also add the corresponding percentages at the top, purely for visualization purposes. I wanted to put this out there because it can strengthen your data visualization toolkit, and it will make the audience of your presentations more thankful to you when you are telling the story of your data. So this is the data we have: we see that 38.8% of our experimental group users have not clicked on the button versus 61.2% who have clicked, based on the simulated data, and in the control group we have quite the opposite situation, with the majority of users, 80.1%, not clicking and the remaining 19.9% actually clicking the button. So we have a huge difference, a dissonance, between the experimental group and the control group. This already gives us an indication that something is going on here, and a high-level intuition of what the remaining analysis will look like, namely that there most likely will be a difference in the CTRs of the control versus experimental groups and their corresponding buttons. But let's continue: the entire goal of A/B testing is to ensure that our conclusions are based on the data rather than on our intuition. So what are the parameters I'm using for conducting our A/B test? When I was designing this A/B test, the first step was of course to do all the different translations that we learned in the A/B testing part of this course and to conduct it properly, which means coming up with three different parameters when doing the power analysis. Usually this should be done in collaboration with your colleagues, your product managers, your product people and domain experts, because they have a lot of information on what thresholds you need to pass in order to say that, for instance, the new version of your feature is considerably different from the existing one. To understand this and make these conclusions, we need to come up with the three parameters that help us conduct a proper A/B test, as we learned when we were looking into designing one. First we have our significance level, the alpha, the Greek letter we use to refer to the significance level, which is also the probability of a type I error. We have chosen this value following the industry standard, which is 5%: given that we didn't have any prior information or specific reason to choose a different, lower or higher, significance level, we went with the industry standard of 5%. This means we will compare the p-value of our statistical test to this 5% and then say whether we have a statistically significant difference between the control and experimental groups at this 5% significance level. Let's refresh our memory on this alpha: the significance level is the probability of a type I error, so it's the amount of error we are comfortable making when we reject the null hypothesis while the null hypothesis is actually true, which means we are detecting a difference between the experimental and control versions while there is no difference, and we are
making that mistake. Here we are saying that we are comfortable making this mistake at most 5% of the time; anything higher than that is not allowed, we are not comfortable with an error rate above 5%. The next parameter is the beta, the probability of a type II error, which is the opposite of the type I error: the false negative rate, the proportion of the time we end up failing to reject the null hypothesis while the null hypothesis is false and should have been rejected. Then 1 minus beta is the power of the test: how often we correctly reject our null hypothesis and correctly state that there is indeed a statistically significant difference between our experimental group and our control group. For this we have also chosen the industry standard, which is 80%. However, for the results analysis we are conducting in this case study, that part of the power analysis is not relevant: we use the power when calculating the minimum sample size, but we don't need it when conducting the results analysis, therefore I'm not initializing it as part of this code. So here I'm only providing my program with the value of my significance level, which is 0.05, the same as 5%, and then the delta, the third parameter. This delta is our minimum detectable effect: the Greek letter delta, the minimum detectable effect, helps us understand whether, besides having a statistically significant difference, this difference is large enough for us to say that we are comfortable making the business decision to launch this new button. It can happen that, when conducting an A/B test, we find that the experimental group indeed has higher engagement than the control group, we get a p-value smaller than the alpha, and since p is less than the alpha level we can reject the null hypothesis and say that the CTR, the click-through rate, of the experimental group is statistically significantly different from that of the control group at the 5% significance level. But we know from the theory of A/B testing that this alone is not enough: statistical significance alone is not enough for the business to make the important decision to launch an algorithm or a feature, in this case to change the button on our landing page from Secure Free Trial to Enroll Now. We want to have enough users and a large enough difference in our click-through rates, enough users telling us they are happier with the new version of the landing page, before we go and change our feature. And what is this definition of enough, what is the difference in the click-through rate that we need to detect, after we have established statistical significance, in order to say that we also have practical significance, meaning that practically we are also comfortable making the business decision of launching this new feature and changing our landing page button? That is exactly what we have under our delta, the minimum detectable effect. In this case we have chosen a delta of 10%: you can see here 0.1, which is 10%, so our delta, or MDE, the minimum detectable effect, is 10%. This means we are saying that not only should there be a statistically significant difference between the experimental group and the control group, but this difference also needs to be at least 10%, our chosen MDE.
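In code this amounts to just two constants, mirroring what was just described:

# power-analysis parameters used in the results analysis
# (the power of 1 - beta = 0.80 was only needed earlier, for the minimum sample size)
alpha = 0.05    # significance level: the maximum type I error rate we tolerate
delta = 0.10    # minimum detectable effect: the practical-significance threshold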
In other words, we need to have detected that the experimental version of the landing page results in at least a 10% higher click rate compared to the control version before we go ahead, launch this new version, and deploy this new UX feature. This is really important, because many people check for statistical significance only: they set their alpha, check whether the p-value is below the alpha, say hey, we have a statistically significant difference, and consider themselves done. That's not correct. After you have conducted your statistical significance analysis and detected that your experimental version has a statistically significantly different CTR than the control version at your alpha significance level, the next thing you need to do is ensure that you also have practical significance besides the statistical significance, and this practical significance you can check by using your MDE, your delta, and comparing it to the confidence interval that you have calculated, something we also learned as part of the theory of conducting a proper A/B test. Once we come to that point, after we have checked for statistical significance, I will explain exactly how to do this check, and at the same time we will refresh our theory on practical significance. So let's now go ahead and calculate the total number of clicks per group by summing up the clicks, using group by, just to showcase how you can do it on your own. What I'm doing here is taking my A/B test data and grouping it by group, the variable that tells us whether we are dealing with the experimental group or the control group. As you know from our Python for data science course and demos, whenever we want to group a pandas data frame, we first write the data frame name, then .groupby with the grouping variable in parentheses, which in this case is group, and then, within square brackets, the name of the variable we want to apply the operation to. So I group my data on the group variable and count the number of clicks in my control group and in my experimental group; these will be my X control and X experimental variables, so X control will contain the number of clicks in my control group and X experimental will contain the number of clicks in my experimental group. Given that I want to refer to the name of the group after doing the grouping, since the result is this kind of indexed data frame, I then need to use the .loc function to properly retrieve the amount corresponding to each index, and given that my index consists of strings, I'm using .loc here, something we also learned as part of our Python for data science course. Then there is basically just the printing, writing out the results nicely: the number of clicks for my control group is 1,989, and, double-checking against what we saw before, yes, we get the same number, so we are dealing with the same dataset, and the number of clicks for the experimental group is 6,116 clicks.
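A sketch of that step in code, assuming the group labels in the data are the strings "exp" and "con" (the exact labels are an assumption; use whatever your group column actually contains):

# total number of clicks per group, via groupby + sum on the binary click column
clicks_per_group = df_ab_test.groupby("group")["click"].sum()

x_con = clicks_per_group.loc["con"]   # clicks in the control group (1,989 here)
x_exp = clicks_per_group.loc["exp"]   # clicks in the experimental group (6,116 here)
print(x_con, x_exp)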
Then we calculate the pooled estimate of the click probability per group, let me quickly fix a typo here, so the pooled estimate based on the experimental group and the control group. Let me also quickly add how to calculate the total number of users per group: what is the number of users in the experimental group and what is the number of users in the control group. Here I filter the test DataFrame on the group column being equal to the experimental label and count the rows, and then I copy this and change it to the control label, which gives me the number of users in each of the two groups as well: the number of users in control and the number of users in experimental. Then I simply print the number of users per group together with the number of clicks per group. Now that we have this, we are ready to calculate the estimated click probability per group, meaning for the control group and for the experimental group. For that we take the number of clicks in the control group and divide it by the number of all users in the control group, X control divided by N control, and we refer to this as p control hat, because an estimate of this click probability always carries a hat; that is just how we denote it in statistics and in A/B testing. It is an estimate, something we are estimating, therefore we say "hat". We do the same for the experimental group: the estimate of the experimental group's click probability is X experimental divided by N experimental. Then, in order to calculate the pooled estimate, the pooled click probability, which is the single value describing the control group and the experimental group together, we follow this formula: we take X control and add X experimental to it, which is the numerator, and then we divide it by the sum of the sizes of the two groups, N control plus N experimental. This is the common formula for the pooled estimate in this type of experiment, when the primary metric is in the form of zeros and ones. If you want to refresh your memory on these formulas, make sure to also check our A/B testing course, because in the lesson dedicated to A/B test results analysis we go through all of them in detail, including how to calculate the pooled estimate of the click probability. What we get is a value of about 0.40. This number should look familiar, because it is the mean we saw in the descriptive statistics table earlier; now we are simply calculating it manually, because we need a variable that will hold this value. It is just the sum of all clicks in the control and experimental groups, the total number of clicks, divided by the total number of users, N control plus N experimental. Now that we have this, we are ready to calculate what we refer to as the pooled variance, also something we learned as part of the theory of A/B testing.
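Continuing the same sketch, the per-group user counts and the hat estimates, including the pooled click probability, could be computed like this (same assumed df_ab and labels as above):

```python
# number of users per group
N_con = (df_ab["group"] == "con").sum()
N_exp = (df_ab["group"] == "exp").sum()

# estimated click probabilities per group (the "hat" estimates)
p_con_hat = X_con / N_con
p_exp_hat = X_exp / N_exp

# pooled click probability across both groups: total clicks / total users
p_pooled_hat = (X_con + X_exp) / (N_con + N_exp)

print("Users per group:", N_con, N_exp)
print("p_con_hat:", p_con_hat, "p_exp_hat:", p_exp_hat,
      "p_pooled_hat:", round(p_pooled_hat, 2))
```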
The pooled variance is equal to the pooled estimate of the click probability, the p hat we just calculated, multiplied by one minus p hat, so the estimate of the click probability multiplied by the estimate of no click. We already know from the idea of the Bernoulli distribution that the variable describing this click / no-click process follows exactly that pattern: we have a probability of a click and a probability of no click, which is one minus the click probability. That is the intuition behind this part of the formula. Then we multiply this by one over N control plus one over N experimental. Here I'm purely following the formula for the pooled variance; if you want more details and explanations, make sure to check the corresponding theory lecture, because there we go into the details of each of these formulas and why we calculate the pooled variance and the pooled estimates in this specific way. By following the formula I get the pooled variance shown here. So this is, in a nutshell, how I calculated my pooled click probability and the pooled variance of the click event, and we are going to need both in the next, very important step: calculating the standard error and the test statistic, because we are dealing with a case where the primary metric is in the form of zeros and ones.
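And the pooled variance, following the formula just described, using the same assumed variables:

```python
# pooled variance of the click (Bernoulli) event, following the
# two-proportion formula: p * (1 - p) * (1/N_con + 1/N_exp)
pooled_variance = p_pooled_hat * (1 - p_pooled_hat) * (1 / N_con + 1 / N_exp)
print("Pooled variance:", pooled_variance)
```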
Before conducting the actual calculation of the standard error and the test statistic, let's quickly talk about the choice of the statistical test. Here I went for the two-sample Z test, and let me explain the motivation. As we learned in the theory, whenever the primary metric is in the form of an average, like here, where the click-through rate is the average number of clicks per group (we calculated the average click per experimental group and per control group), the form of the primary metric already dictates that we should look at either a parametric test for averages or a non-parametric test for averages. In this case I went for the parametric option, because it has better properties when we have information about the distribution of our data. Why do we have this information, and how does it dictate the choice of the statistical test? Well, my sample size is well over 30 per group, and 30 is the threshold we tend to use in statistics and A/B testing to decide whether we have a large sample or not. If the sample is not large, so it contains fewer than 30 users per group, which happens as well, then we need to go for statistical tests specific to those cases, because we can no longer rely on statistical theorems like the central limit theorem, which lets us use inferential statistics and draw conclusions about the distribution of the population from just a sample. In this specific case I have 10,000 users per group, which is definitely larger than 30, so by making use of the central limit theorem I can say that my sampling distribution is normally distributed; this is simply applying the central limit theorem, something we also covered when looking at inferential statistics in the fundamentals of statistics course at LunarTech. This is a powerful theorem that we use in A/B testing to make our life easier: when we have a sample larger than 30 per group, then even if we don't know the name of the exact distribution that the random variable describing the number of clicks or the average click-through rate follows, we can still say that the sampling distribution follows a normal distribution, given that the sample size is large enough. And this tells us that it doesn't matter whether we use the two-sample Z test or the two-sample t test; we can use either of them to conduct our analysis. We had a specific template for making this choice easier in our A/B testing course at LunarTech, where we walk through all of these decisions: if the sample size is this, do this; if the sample size is that, do that. Following that structured and organized approach, I found that my sample size is large, larger than 30, so I can rely on the central limit theorem, I know that the random variable describing the click-through rate follows approximately a normal distribution, and therefore whether I use a t test or a Z test doesn't really matter: I'm going to end up with the same conclusions. So I will simply go with the two-sample Z test because it is easier to do. You can also go with the two-sample t test, and you can even tweak this case study, make it your own and put it on your resume in that way, making it more unique; that is totally fine, because you will end up with exactly the same conclusions as we do here. If you have a large enough sample, it won't matter whether your parametric test is the two-sample Z test or the two-sample t test, and if you want to know exactly why this matters, with all the detailed statistical insights, make sure to check the course dedicated to A/B testing, because there we cover all of this and you will become a master in the field of A/B testing. Now that we know these decisions and the motivation behind choosing the two-sample Z test, let's go ahead and do the actual calculations. First we have the standard error, which we calculate by taking the pooled variance and taking its square root; this is again using the formulas we learned as part of the A/B testing theory. Taking the square root of the pooled variance gives us the standard error, which comes out to approximately 0.0069.
Then we calculate the test statistic for our two-sample Z test: the test statistic is equal to p control hat minus p experimental hat, divided by the standard error. Here you can see the motivation behind computing not only the pooled estimate but also the individual p control hat and p experimental hat: I take p control hat, subtract p experimental hat, and divide the result by the standard error to get my test statistic. Once I do this, the test statistic for our two-sample Z test comes out to roughly -59.56. Next we can compute the critical value of our Z test using the norm function we loaded from SciPy. This gives us the value from the standard normal distribution table that we need in order to create our rejection regions and decide whether we can reject our null hypothesis or not: to conduct the test we need a critical value to compare our test statistic against, and this critical value is based simply on the standard normal distribution. That is this norm.ppf, the percent point function, which is the inverse of the cumulative distribution function of the standard normal distribution, evaluated at one minus alpha divided by two. Why divide by two? Because we have a two-sided test (two-sided, not two-sample, sorry). If you want to understand the difference between a two-sample test and a two-sided test, please check out the fundamentals of statistics course at LunarTech, because we cover this topic in detail there; it is quite involved and contains many complexities from a statistical point of view, so I won't spend too much time on it in this case study. I'm assuming here that you know this formula already, but if you don't and you quickly need to run your own A/B testing case study, feel free to just copy this line, which computes the value, based on the chosen statistical significance level, that we compare our test statistic against. So our test statistic is the value above, and the value we compare it to is the Z critical value, which is equal to 1.96. This is actually a very common value that we know even without looking at a standard normal table: once you use this test often enough, you know that the critical value corresponding to a two-sided test based on the normal table is 1.96. And even without calculating the next step, the p-value, we can already say what decision we need to make in terms of statistical significance, because one way to test a statistical hypothesis is to compute the test statistic and check whether its absolute value is larger than the critical value. Our test statistic is -59.56, its absolute value is 59.56, and that is much larger than our critical value of 1.96. This already tells us that we can reject our null hypothesis at the 5% statistical significance level.
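A sketch of the standard error, test statistic, and critical value computations just walked through, continuing from the earlier snippets:

```python
import numpy as np
from scipy.stats import norm

# standard error of the difference in click probabilities
SE = np.sqrt(pooled_variance)

# two-sample Z test statistic for H0: p_con = p_exp
test_stat = (p_con_hat - p_exp_hat) / SE

# critical value of the two-sided Z test at significance level alpha (~1.96 for 0.05)
z_critical = norm.ppf(1 - alpha / 2)

print("Standard error:", round(SE, 4))
print("Test statistic:", round(test_stat, 2))
print("Z critical value:", round(z_critical, 2))
```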
But I want to go on to the next step, because that is the more structured and organized way of conducting experiments: in the industry we tend to make use of p-values rather than this more econometric, textbook-statistics approach of comparing the test statistic to the critical value directly. So once we have calculated our test statistic, the next thing we need to do is calculate the p-value, compare it to the significance level alpha, and then decide whether we reject the null hypothesis and say that we have statistical significance, or we cannot reject the null hypothesis and have to say that we don't have statistical significance, that we don't have enough evidence against it. The idea is to use the norm object again, and specifically norm.sf, the same norm from scipy.stats, but this time the survival function, which is one minus the cumulative distribution function of the normal distribution; this again comes from statistics. By passing in the absolute value of our test statistic and multiplying the result by two, given that we have a two-sided test, I calculate my p-value. This is the same formula we saw when we studied the A/B test from a technical point of view: the p-value is the probability of observing a test statistic at least as extreme as the one we got, in either direction, under the null hypothesis. We want to know this probability because it represents the chance of getting such a large test statistic purely due to random chance, and not because there is an actual statistically significant difference between the click-through rate of the experimental group and that of the control group. That is the idea behind the p-value: what is the chance that we are being fooled by a random fluctuation, seeing a large test statistic and claiming statistical significance when there is no real effect and the large test statistic is purely the result of random chance? If the probability of getting such a large test statistic by random chance is small, so if the p-value is small, then we can say that we have statistical significance. We calculate this p-value and store it in a variable called p_value.
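The p-value computation described here, as a short sketch with SciPy:

```python
from scipy.stats import norm

# two-sided p-value: probability, under the null hypothesis, of a test statistic
# at least this extreme in either direction; sf is the survival function (1 - CDF)
p_value = 2 * norm.sf(abs(test_stat))
```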
Now I want to compare this p-value to my statistical significance level alpha and assess whether I have statistical significance. We know from the theory that if the p-value is small, specifically smaller than or equal to 5%, or 0.05, which is the significance level, this indicates strong statistical evidence that the null hypothesis is false and should be rejected, so we have strong evidence against the null hypothesis. Otherwise, if the p-value is larger than 0.05, so larger than the 5% we chose as the maximum threshold for that mistake, then the p-value exceeds the significance level and this indicates that you don't have enough evidence against the null hypothesis; your evidence is weak, and you fail to reject the null hypothesis. That is what I'm doing in the code here with a helper function, is_statistically_significant, which takes two arguments: the p-value I just calculated for my test statistic, and the statistical significance level alpha I want to use for my test, which comes from the power analysis I mentioned before, the 5%. The function first prints the p-value, rounded to three decimals with the round function, and then determines whether we have statistical significance or not: if my p-value is smaller than or equal to alpha, it prints that there is statistical significance, which indicates that the observed differences between the experimental and control groups are unlikely to have occurred due to random chance. In other words, this is not random noise; we have strong evidence of a statistically significant effect, which suggests that the new version of our landing page with the Enroll Now call to action is better and results in a statistically significantly higher click-through rate than the existing control version; there is a real effect. Otherwise, if the p-value is larger than alpha, it prints that there is no statistical significance and that the observed difference in the click-through rates is not due to a real difference in performance but simply due to random chance. Then we call the function with the p-value we just calculated and the alpha we initialized from our power analysis, 0.05.
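A sketch of that helper function; the exact wording of the printed messages is illustrative:

```python
def is_statistically_significant(p_value, alpha):
    """Compare the p-value of the two-sample Z test with the significance level alpha."""
    print("P-value of the two-sample Z test:", round(p_value, 3))
    if p_value <= alpha:
        print("There is statistical significance: the observed difference in CTR "
              "between the experimental and control groups is unlikely to be due "
              "to random chance.")
        return True
    print("There is no statistical significance: the observed difference in CTR "
          "may simply be due to random chance.")
    return False

is_statistically_significant(p_value, alpha)
```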
What we get is that our p-value is so small that it rounds to zero. This means there is evidence suggesting that, at the 5% statistical significance level, the click-through rate of the experimental group is different from the click-through rate of the control group. Note that I'm not saying higher or lower, because our statistical test was two-sided: under the null hypothesis we had that p control is equal to p experimental, and under the alternative that p control is not equal to p experimental. We have now rejected the null hypothesis; we have found evidence that it can be rejected, since our p-value is practically zero and smaller than the 5% statistical significance level, which means we can reject H0 and say that there is enough evidence that p control is not equal to p experimental. And given that we saw from the visualizations and from our calculations that the click-through rate of the experimental group is much higher than that of the control group, we can also say that we have found evidence, at the 5% significance level, of a statistically significant difference between the experimental and control groups' click-through rates, and that the experimental group's click-through rate is statistically significantly higher than the control version's. This is really important, because it suggests that the difference in click-through rates is not due to random chance alone; there is statistical evidence supporting the hypothesis that there is a true difference between the performance of the experimental version of the product, in our case the landing page with the Enroll Now button, and the control version, the existing landing page with the Start Free Trial button. Besides calculating the p-value, it is always great practice to also visualize your results. This is great for an audience that is technically sound and knows these concepts: you don't just show a single number, the p-value, and say "hey, I have statistical significance"; you also show the actual picture of what you got, what your test statistic is, what significance level you used, so you can tell a story around your numbers. That is the art behind data science, I would say, so let's go ahead and do some art. What I'm doing here is using the standard normal distribution, or the Gaussian distribution as we call it in statistics: my mean mu is equal to zero and my sigma, the standard deviation, is equal to one. I plot the standard normal distribution by generating my x values, the points on my x-axis between roughly -3 and 3, and computing the PDF, the probability density function of the normal distribution, for those x values using SciPy, which gives me the corresponding y values of the Gaussian curve. Then I add to this graph the corresponding rejection regions, which you can see here: with this part of the plot I fill in the rejection regions, saying that whenever a value is lower than the negative threshold, the Z critical of -1.96, or larger than +1.96, we are in the rejection region. If the test statistic falls in the rejection region, then we can reject the null hypothesis, and in this case you can see that we are in the far left: the test statistic is about -59.6, much lower than the threshold marked by the left blue line, so it falls in the rejection region. The entire shaded area on the left is the rejection region, and anything inside it means the test statistic falls in the rejection region. If we had gotten a test statistic that was very large and positive, we would be in the right part of the figure, again in a rejection region: anything beyond the positive critical value is also part of the rejection region.
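A sketch of that visualization with matplotlib and SciPy, reusing the z_critical and test_stat variables from the earlier snippets:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 0, 1                      # standard normal distribution
x = np.linspace(-3, 3, 500)           # x values for the curve
y = norm.pdf(x, loc=mu, scale=sigma)  # Gaussian density at those points

plt.plot(x, y, label="Standard normal distribution")

# shade the two-sided rejection regions beyond the +/- z_critical thresholds
plt.fill_between(x, y, where=(x < -z_critical), color="red", alpha=0.3)
plt.fill_between(x, y, where=(x > z_critical), color="red", alpha=0.3)

# mark the critical values and the observed test statistic (far left in this case)
plt.axvline(-z_critical, color="blue", linestyle="--", label="-z critical")
plt.axvline(z_critical, color="blue", linestyle="--", label="+z critical")
plt.axvline(test_stat, color="black", label="Observed test statistic")

plt.legend()
plt.show()
```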
Being anywhere in the rejection region means we can reject the null hypothesis and say that we have a statistically significant result. Now that we have our statistical significance, it is always a great idea, and actually mandatory, to go on to the next step, because statistical significance alone is not enough; practical significance matters too, as I mentioned at the beginning of this case study. For that we are first going to calculate the confidence interval of the test. This confidence interval will, first of all, let us comment on the quality of our test, the generalizability of the results to our entire population, and the accuracy of our results; and then we will use the confidence interval to test for practical significance in our A/B test. So let's go ahead and calculate the confidence interval. As we learned in the lectures, the confidence interval can be calculated from the p experimental hat, the p control hat, the standard error, and the Z critical value: we need the two estimates of the experimental group's and the control group's click-through rates, the standard error of our two-sample Z test, and the critical value. We first calculate the lower bound of the confidence interval and then the upper bound, and given that the statistical significance level we are using is alpha and the Z critical value is based on it, we are calculating the 95% confidence interval. The lower bound is obtained by taking p experimental hat, subtracting p control hat, and then subtracting the standard error multiplied by the Z critical value, rounding to three decimals; the upper bound is calculated the same way, only with a plus sign. This is purely following the formula for the confidence interval, and then we print the resulting interval.
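A sketch of that confidence interval calculation, reusing the earlier variables:

```python
# 95% confidence interval for the difference in click probabilities (experimental - control)
diff = p_exp_hat - p_con_hat
lower_bound_CI = round(diff - z_critical * SE, 3)
upper_bound_CI = round(diff + z_critical * SE, 3)
CI_95 = (lower_bound_CI, upper_bound_CI)
print("95% confidence interval for the CTR difference:", CI_95)
```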
What we see is a confidence interval that goes from 0.399, so about 0.4, up to roughly 0.43: quite a narrow confidence interval, I would say, which is actually a good sign. The confidence interval provides the range of values within which the true difference between the control and experimental groups' proportions, their click-through rates, is likely to lie, with a certain level of confidence, in this case 95%. This one is very narrow, and a narrow confidence interval means the accuracy of our results is higher: the results we are getting based on our smaller sample will most likely generalize well when we deploy these changes and put the new product in front of the entire population of users. Right now we are running this experiment on a small group, on a sample, and the fact that the confidence interval is narrow rather than wide means that the results we are getting are more or less accurate and are most likely a true representation of the entire population. That is the idea behind the width of the confidence interval: the narrower it is, the higher the quality of your results and the more generalizable they are. Let's now go on to the final stage of our case study, which is to test the practical significance of our results. Now that we know the statistical significance is there, the experimental version of our feature is statistically significantly different from the control version in terms of click-through rate, and we have seen that the confidence interval is narrow, meaning our results are quite accurate, we can comment on the practical significance of our results. This means we want to see whether the significant difference we obtained is actually large enough, from the business perspective, to say that it is worth putting our engineering resources, our money, and our product through this change, and that changing this button and putting it into production in front of our users is worthwhile from the business point of view. And here we are not only talking about the engineering resources it will take to make the change, the deployment and the monitoring, but also about the quality of the product we are providing to our users, because whenever we make a change to our product it is a risk: we are changing what our users are used to seeing, and that can always be scary from the business side, because we don't want to scare our customers away. Therefore we also need to check for practical significance. For that I'm creating a Python function that takes two arguments: the minimum detectable effect and the 95% confidence interval I just calculated. I'm calling this function is_practically_significant; it checks whether practical significance is there or not, returns True or False, and also prints whether we have practical significance.
We know from the theory of A/B testing that whenever the lower bound of our confidence interval is greater than or equal to the MDE, the Delta, it means that even the lowest plausible value of the difference, based on the results we obtained from our sample, is at least as large as the minimum detectable effect we assumed before even conducting the A/B test, and this suggests that we have practical significance: the minimum difference we can expect to obtain is large enough to give us the motivation to make this change in our product. So in the function I first take my 95% confidence interval and extract its first element, because a confidence interval is a range, a pair of two numbers, the lower bound and the upper bound, and all I care about for practical significance is comparing the lower bound of the 95% confidence interval to my Delta, the minimum detectable effect. I store this lower bound in a variable, lower_bound_CI, and compare it to my Delta: if the lower bound of the confidence interval is larger than or equal to Delta, then we can say that we have practical significance. I want to use the Delta I initialized at the beginning, the 10% you might recall, so I don't re-initialize it here; I simply call the function with that Delta of 10% as my MDE, and whenever the lower bound of the confidence interval I just obtained is at least that Delta, the function prints that with an MDE of 10% the difference between the control and experimental groups is also practically significant. You can see that the lower bound is about 0.4, the value we obtained above, and comparing that to the Delta we conclude that we also have practical significance.
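A sketch of that check, using the criterion that the confidence interval's lower bound should be at least the MDE; the printed messages are illustrative:

```python
def is_practically_significant(delta, ci_95):
    """Practical significance check: the CI lower bound must be at least the MDE (delta)."""
    lower_bound_ci = ci_95[0]
    if lower_bound_ci >= delta:
        print(f"With an MDE of {delta}, the difference between the control and "
              "experimental groups is also practically significant.")
        return True
    print(f"With an MDE of {delta}, the observed difference is statistically "
          "significant but not large enough to be practically significant.")
    return False

is_practically_significant(delta, CI_95)
```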
So, amazing, we have come to the end of this case study. In this involved case study we conducted an entire A/B test results analysis end to end, going from loading the data and understanding the business objective of the A/B test, where we were testing whether the Enroll Now button, the new experimental version, should replace the existing Start Free Trial button. Based on this case study, we found statistical significance at the 5% significance level, suggesting that we can reject the null hypothesis and say that there is indeed a statistically significant difference between the click-through rate of the experimental group and that of the control group, and specifically that the experimental Enroll Now button results in a statistically significantly higher click-through rate than the Start Free Trial button. Besides this, we also checked the accuracy of our results by looking at the confidence interval; we saw that it was quite narrow, suggesting that the results we obtained were quite accurate and that the results we got on the sample will generalize to our population of users. Finally, we also checked the practical significance of our results by using the 95% confidence interval and comparing the lower bound of that interval with our minimum detectable effect Delta, and we saw that the experimental group's CTR is higher than the control group's CTR by at least our 10% MDE. This suggests that, from the business perspective, we also have motivation: besides the statistical significance we also have practical significance, giving us enough reason from the business side to put this new button into production. So we can conclude that, based on this data-driven approach and by conducting an A/B test, we see a clear motivation for deploying the new Enroll Now button and replacing the existing Start Free Trial version, and we then expect to see more users clicking on it and engaging with our product. For now, this will be all for this case study. If you want to learn more about A/B testing, make sure to check our A/B testing course as well as the Ultimate Data Science Boot Camp, and don't forget to try our free trial, this time using our Enroll Now button. If you want to see more case studies like this, check out our case studies; we have many of them included as part of the Ultimate Data Science Boot Camp, where we go into detail on these different steps and conduct different sorts of case studies to put data science theory into practice, including NLP, machine learning, recommender systems, advanced analytics, A/B testing, and soon also AI. For now, thank you for staying with me and conducting this case study. Happy learning!

This video was sponsored by LunarTech. At LunarTech we are all about making you ready for your dream job in tech and making data science and AI accessible to everyone, whether your interest is data science, artificial intelligence, or engineering. At LunarTech Academy we have courses and boot camps to help you become a job-ready professional, and we also help businesses, schools, and universities with top-notch training, curriculum modernization with data science and AI, and corporate training, including the latest topics like generative AI. With LunarTech, learning is easy, fun, and super practical; we care about providing an end-to-end learning experience that is both practical and grounded in fundamental knowledge, and our community is all about supporting each other and making sure you get where you want to grow. Ready to start your tech journey? LunarTech is where you begin. Students and aspiring data science and AI professionals can visit the LunarTech Academy section to explore our courses, boot camps, and programs in general; businesses in need of employee training, upskilling, or data science and AI solutions should head to the technology section of the LunarTech page; and enterprises looking for corporate training, curriculum modernization, and customized AI tools to enhance education can visit the LunarTech Enterprises section for a free consultation and a customized estimate. Join LunarTech and start building your future, one data point at a time.
Hi, I'm Vah, and in this project we will learn how to understand your customers better, track sales patterns, and present those results. If you like working with data or you own a store, this video will show you how to use information to make better choices and get better results. You will split your customers into smaller groups based on how they shop; this helps you send the right messages to the right people and give them offers they will like. Loyal customers are the best: you will use data to find your biggest supporters and those who are ready to spend more, and then you can reward your best customers with programs that fit their shopping habits, which keeps them happy and stops them from going to other stores. You will use data to guess what people will buy and when they will buy it, find sales patterns hidden among different items, and figure out what new products people will want. This lets you always have the right stock at the right time: you won't have too many items, everything will sell, and customers will be surprised by how well you know what they need. We'll look at how sales change throughout the year; this helps you plan for busy times, catch slowdowns early, and know exactly when to run big sales. We will use location data and what people say about you to find places where sales are going well and where you could grow, and you'll even show it all on a map. This helps you spend your advertising money wisely, find great spots for new stores, and even choose the perfect things to sell in each place. So let's get started. All right, let's now go over the data I will be using. We are using the Superstore sales dataset, which has about 9,800 rows and columns such as Order ID, Order Date, Ship Date, Ship Mode (Standard Class, Second Class, and other classes), Customer ID, Customer Name, and Segment, meaning who bought the product: a consumer, a corporate buyer, or a home office. The clients mainly come from the United States, and the dataset also specifies which city in the United States they come from. We shall import this into our Google Colab notebook and start working on it. Let's import the necessary Python libraries: we import pandas as pd, numpy as np, matplotlib.pyplot, and seaborn as sns, and let's also bring in the dataset itself. Perfect: this is how it looked on Kaggle, and this is also how it looks once we have imported it. Let's now look at the DataFrame's shape and info. Everything seems to be consistent except the postal code: it seems that 11 postal codes are missing. What we can do is fill in those null values, and as you can see we have replaced the missing postal codes, for the customers that didn't have any postal code, with a zero. Let's now move on to checking for duplicates: we take the DataFrame, call duplicated, sum the result, and check whether it is greater than zero; if there are duplicates we print that duplicates exist, and if not we print that no duplicates were found. As you can see, there are no duplicates, so let's move on to customer segmentation. Let's first create a variable named types_of_customers and extract the column called Segment out of our DataFrame. As you can see, we have a Segment column within our data, and it contains the types of customers in our DataFrame: consumer, corporate, and home office customers.
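A minimal sketch of these loading and cleaning steps; the file name "superstore_sales.csv" and the exact column names ("Postal Code", "Segment") are assumptions based on the description above:

```python
import pandas as pd

# hypothetical file name; in Colab the Kaggle CSV can be uploaded and read like this
df = pd.read_csv("superstore_sales.csv")

print(df.shape)   # roughly 9,800 rows and the columns described above
df.info()         # "Postal Code" shows missing values

# fill the missing postal codes with 0
df["Postal Code"] = df["Postal Code"].fillna(0)

# check for duplicated rows
if df.duplicated().sum() > 0:
    print("Duplicates exist")
else:
    print("No duplicates found")

# extract the customer segment types
types_of_customers = df["Segment"].unique()
print(types_of_customers)   # Consumer, Corporate, Home Office
```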
So let's get started with customer segmentation. The main problem is that many large businesses struggle to understand the contribution and importance of their various customer segments. They often lack precise information about their main buyers, relying on intuition rather than data, and this leads to misallocation of resources, resulting in revenue loss and decreased customer satisfaction. For example, if your store primarily caters to consumers, it's crucial to tailor your marketing and customer satisfaction efforts to resonate with their needs and preferences. By focusing your resources on understanding and catering to your consumer base, you can avoid misallocating resources to large corporates; this ensures you're providing a satisfying customer experience for your primary demographic, ultimately leading to increased customer loyalty and revenue growth. We can also create a pie chart or bar chart to clearly illustrate the contribution of each customer segment, which will allow us to direct more of our marketing and customer satisfaction resources accordingly. Once you've completed customer segmentation, the next step depends on your strategic goals. Here are a few ways to proceed. Focus on your most valuable segment: if your customer segmentation reveals a particularly profitable segment, such as consumers, tailor your marketing, product offerings, and customer service to deepen your engagement with that group. Target new segments: if you want to attract more corporates or home offices, you'll need to understand their unique needs and pain points; start by researching these segments, what their challenges are and what solutions would appeal to them, then develop tailored messaging and consider offering specialized products or services to attract these new customer types. All right, so let's get started. First we extract the types of customers from our DataFrame: it's consumer, corporate, and home office, those are all the values in our Segment column. Next, let's count the unique values in the Segment column and store the result in a variable called number_of_customers. What this does is count the unique values in Segment and reset the index to turn them into a column, after which we rename the columns: we want to give the Segment column a name like Total Customers or Type of Customer, and I will go with Type of Customer, so we say number_of_customers equals number_of_customers.rename, renaming the column named Segment to Type of Customer. If we print this, there are 5,101 consumers, 2,953 corporate buyers, and 1,746 home offices. And if we want to create a pie chart out of this, we can plot it with plt.pie, basing the pie chart on the counts and using Type of Customer as the labels. Perfect. As you can see from this pie chart, our main consumer segment makes up 52% of orders, 30% of our orders come from corporates, and 18% from home offices. You can see exactly who we have to focus on, which is consumers; but while consumers hold the majority, focusing solely on them overlooks significant potential within the corporate and home office segments. Let's explore how to balance resource allocation across all three segments to maximize growth.
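A sketch of the counting, renaming, and pie chart steps just described; note that the name of the counts column produced by value_counts().reset_index() depends on the pandas version (recent versions call it "count"):

```python
import matplotlib.pyplot as plt

# count how many orders come from each customer segment
number_of_customers = df["Segment"].value_counts().reset_index()

# rename "Segment" to the friendlier "Type of Customer"
number_of_customers = number_of_customers.rename(columns={"Segment": "Type of Customer"})
print(number_of_customers)

# pie chart of the share of orders per segment
plt.pie(number_of_customers["count"],
        labels=number_of_customers["Type of Customer"],
        autopct="%1.0f%%")
plt.title("Orders by customer segment")
plt.show()
```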
To gain even deeper insights, we should integrate our customer data with sales figures. This analysis will help us identify which segments generate the most revenue per customer, the average order value, and overall profitability, the customer lifetime value. Additionally, we can segment customers by purchase frequency and basket size to understand their buying behavior within each segment. Here are some additional questions to consider for a more comprehensive analysis: customer acquisition cost (CAC), how much does it cost to acquire a customer in each segment; customer satisfaction, how satisfied are customers in each segment; and churn rate, the rate at which customers leave in each segment. By analyzing these factors alongside revenue and customer lifetime value, we can create a customer segmentation model that prioritizes segments based on their overall value and growth potential. We can also plot a bar graph of the total sales for each customer type: we group the data by the Segment column and calculate the total sales for each segment. Right now you don't see the exact sales numbers, but with the bar chart you can see the total sales for each customer type, so let's plot it. There are around 1.2 million in sales from our consumers and around 600,000 to 700,000 from our corporates. This bar chart effectively illustrates the distribution of sales across our customer segments: consumers account for the largest portion of sales, followed by corporates and then home offices. While the chart is clear, a deeper analysis can help us optimize our marketing efforts.
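A sketch of the grouping and bar chart just described, with the same assumed column names:

```python
import matplotlib.pyplot as plt

# total sales per customer segment (assuming a numeric "Sales" column)
sales_per_segment = df.groupby("Segment")["Sales"].sum().reset_index()
print(sales_per_segment)

# bar chart of total sales per segment
plt.bar(sales_per_segment["Segment"], sales_per_segment["Sales"])
plt.xlabel("Type of Customer")
plt.ylabel("Total Sales")
plt.title("Total sales per customer segment")
plt.show()
```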
Customer lifetime value (CLTV): calculate the CLTV of each segment to identify which segments generate the most revenue over time; this will help prioritize customer segments for marketing efforts. For example, if you find that the home office segment has a higher CLTV than the consumer segment, you may want to invest more resources in marketing campaigns targeting home office customers. Market research: conduct market research to understand the specific needs and preferences of each customer segment; this will inform the development of targeted marketing campaigns. For instance, you might discover that consumers in your data are price sensitive while corporate customers are more interested in bulk discounts and reliable service, and you can use this knowledge to tailor your marketing messages to each segment. Average order value: analyze average order value by segment to identify opportunities to increase revenue per customer. Say your analysis reveals that corporate customers have a higher average order value than consumers; you could then develop marketing campaigns that encourage consumers to purchase bundles or higher-priced products to increase their average order value. Customer acquisition cost (CAC): how much does it cost to acquire a customer in each segment? Knowing the CAC can help determine the return on investment (ROI) of your marketing efforts. Here's an example: let's say it costs $100 to acquire a new corporate customer but only $20 to acquire a new consumer customer. If the CLTV of a corporate customer is significantly higher than the CLTV of a consumer customer, then spending $100 to acquire a corporate customer may still be profitable; however, if the CLTV of the corporate customer is only slightly higher than the CLTV of the consumer customer, you may want to focus your marketing efforts on acquiring more consumers, because the cost of acquisition is much lower. Customer satisfaction: how satisfied are customers in each segment? Understanding satisfaction levels can help identify areas for improvement and reduce churn. For example, you can conduct surveys or collect customer feedback to understand satisfaction levels; if you find that corporate customers are less satisfied than consumer customers, you may want to investigate the reasons for their dissatisfaction and make changes to improve their experience. This could involve improving your customer service, offering more competitive pricing for corporate customers, or developing products or services that better meet their needs. We can also create a pie chart for our sales, using plt.pie on the total sales per segment with the customer type as the labels: 51% of our sales come from our consumers, 30% from our corporates, and 19% from home offices. All right, let's now move on to customer loyalty. As a business you want to make sure that your most loyal customers stay happy; this ensures those customers keep coming back, keep bringing new people, and keep placing new orders. You will decrease the cost of acquiring new customers, because these are already existing customers, and you'll also be able to make sure that your revenue either stays at the same level or increases by keeping your most loyal customers happy. We can identify them in a couple of ways: we can rank the most loyal customers by the number of orders they have placed or by the total amount they have spent. Say you have analyzed your data and pinpointed your 30 most loyal customers; this represents a significant opportunity to strengthen these relationships and maximize their lifetime value. Here's a powerful approach: design a targeted email specifically for these high-value customers and proactively offer personalized support with inquiries such as "how can we assist you today?"; this demonstrates your commitment to their success, proactively addresses potential issues, and fosters a deep sense of loyalty. Loyalty programs: consider a tiered loyalty program that offers exclusive rewards tailored to your most valuable customers; this could include early access to new products, personalized discounts, or even point-based reward systems. Personalized experiences: leverage your data insights to go beyond email; consider personalized website recommendations, targeted promotions based on past purchase history, or even handwritten thank-you notes for high-value customers. Customer feedback loops: make sure your top customers feel heard; implement surveys or invite them to participate in exclusive focus groups, which demonstrates that you value their input and are actively using feedback to improve the customer experience. Community building: depending on your business model, fostering a community among your most loyal customers can create a sense of belonging; this could involve access to online forums, exclusive events, or opportunities to network with like-minded individuals. This strategy extends beyond customer satisfaction: prioritizing the experience of your top customers directly correlates with increased retention, positive referrals, and ultimately improved revenue. Now let's dive deeper and see who our most loyal customers are.
All right, let's get started with that. Let's first display the first three rows of our DataFrame. As you can see, there is a column called Sales, and each customer has a specific ID and a specific name. If we count the number of times a customer shows up, we also get their total number of orders, which we can then use however we want, so let's start with that. We group by the customer and count the Order ID column, and then we rename the columns: the column holding the counts, Order ID, should be renamed to Total Orders, with inplace set to True. Now let's identify the repeat customers, the customers with an order frequency greater than one: the repeat customers are the rows of our customer order frequency table where Total Orders is greater than one. Next we want to organize this so it is sorted, which we do with sort_values on the repeat customers, and then we print repeat_customers_sorted.head(12) to display our top 12 customers, with the index reset. The customer named William Brown, who is a consumer, has placed a total of 35 orders. So this is the list of your top customers, and as a business, as a superstore, you can decide exactly how many total orders a person or a business has to place in order to be considered a loyal customer, and then tailor your services accordingly. The data clearly reveals that a small group of customers place orders with considerably higher frequency, 30 or more: we have William Brown with 35 orders, home office customers with 34, and many consumers plus one corporate customer with 32. This shows clearly that we have a loyal group of customers. There is also significant potential in our home office segment: several of our most loyal customers belong to it, which implies that the home office segment has strong potential for customer loyalty and deserves targeted marketing efforts. It also shows that we don't just have one dominant group of loyal customers; we have home offices, consumers, and corporates, and while there are many consumers, that doesn't mean we have to focus on a single segment; we still have to devise a plan that caters to multiple segments. So, some recommendations: we can prioritize loyal customers, segment customers by order frequency, and develop exclusive offers, rewards, or early-access programs tailored to our most loyal customers, for example exclusive discounts, tiered reward programs, and early access. We can also target more home offices, because we see that home offices keep coming back: we have been able to satisfy several of them, which means we have catered to their needs and provided a good enough service for them to return, which in turn means our product works well for home offices. So we can target more home offices using content marketing, social media ads, or other marketing strategies, and we can also analyze the way we have been serving these customers, because it has worked out well, and if we provide the same kind of service to our newly arriving customers, we increase the chance that they also become loyal customers.
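To recap the ranking steps above in code, here is a minimal sketch with the same assumed column names:

```python
# orders placed per customer
customer_order_frequency = (
    df.groupby(["Customer ID", "Customer Name", "Segment"])["Order ID"]
      .count()
      .reset_index()
      .rename(columns={"Order ID": "Total Orders"})
)

# repeat customers: more than one order placed
repeat_customers = customer_order_frequency[customer_order_frequency["Total Orders"] > 1]

# rank them so the most frequent buyers come first and show the top 12
repeat_customers_sorted = repeat_customers.sort_values(by="Total Orders", ascending=False)
print(repeat_customers_sorted.head(12).reset_index(drop=True))
```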
So those are several conclusions we can draw. Now, we can also identify loyal customers by sales. So far we have identified them by the total number of orders they placed, but we can also use the sales amount, the total amount spent, because a person can come and place 35 orders, but if those are 35 one-dollar orders, that's just 35 bucks; the order count alone says nothing about the sales amount. Ideally you also want to organize customers by the sales amount to identify your actual top-spending and loyal customers. That said, a significant customer, say someone who has spent around 25,000, could have done that in a single order, so that alone doesn't make them a repeat customer; it just makes them a top spender. Let's start identifying our top-spending customers. First we create a variable customer_sales: we take the DataFrame and group by Customer ID, and we also want to see the Customer Name and the Segment, what type of customer they are; we do this on the Sales column and sum those up. Then we identify our top spenders by ranking them in descending order with sort_values, so the biggest spenders appear at the top. Sean Miller has spent the most, a home office customer with a total of around 25,000 USD. William Brown placed the most orders, 35 of them, but he is nowhere to be found here, and the same goes the other way around: Sean Miller, the customer who has spent the most in our superstore, was nowhere to be found in the repeat-customer list. This means that being a repeat customer doesn't really define a customer's spending habits. It depends on how you run your superstore: obviously I would want customers to come back, but I would dedicate my resources to the customers who spend the most, because those are the customers who bring the most business to me, meaning those are the customers I have to keep happy. The total number of orders is useful, but it doesn't say that much about their spending habits and about their value to your store. All right, let's now go over to the next chapter, which is shipping. As a superstore you also want to know which shipping methods customers prefer and which are the most cost-effective and reliable; knowing this impacts your customer satisfaction and therefore also has a great impact on your revenue. For example, Amazon has multiple shipping methods, but its most popular shipping method is the one that keeps the largest number of customers happy and also makes Amazon the most money. As a superstore you want to know which of your shipping methods plays that role. So let's create a variable for the shipping mode: we take the Ship Mode column of the DataFrame, count its values, and of course reset the index. Standard Class is the most popular, almost four times more popular than First Class, while First Class and Same Day are the least used. Let's also create a pie chart of this with plt.pie on the shipping mode counts: these are the shipping methods, and the most popular one is Standard Class, with about 60% of orders using Standard Class shipping while the remaining modes make up the other 40% or so.
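A sketch covering both the top-spender ranking and the shipping-mode breakdown just described, with the same assumed column names:

```python
import matplotlib.pyplot as plt

# total sales per customer, ranked from the highest spender down
customer_sales = (
    df.groupby(["Customer ID", "Customer Name", "Segment"])["Sales"]
      .sum()
      .reset_index()
)
top_spenders = customer_sales.sort_values(by="Sales", ascending=False)
print(top_spenders.head(10))

# popularity of each shipping mode
shipping_mode = df["Ship Mode"].value_counts().reset_index()
print(shipping_mode)

# pie chart of the share of orders per shipping mode
# ("count" is the counts column name in recent pandas versions)
plt.pie(shipping_mode["count"], labels=shipping_mode["Ship Mode"], autopct="%1.0f%%")
plt.title("Orders by shipping mode")
plt.show()
```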
So as a superstore, or as any store, you invest in your shipping: you end up signing deals with delivery companies like UPS and others, and sometimes you end up recommending the wrong option to your customer. Let's say Second Class is fast but ends up costing the customer way too much; the customer ends up not buying your product, and that hurts your store. But if you know that Standard Class is the most popular option, then you can show a button saying "this is our most popular option", which is Standard Class, and most of the time people choose the most popular option. This helps the superstore save on the cost of investing in the other classes, or dedicate resources according to how much each class brings in, and it also lets the superstore recommend its most popular option, which is Standard Class. Now, the problem that many superstores have is that they run stores in many locations, in many states, but they don't know how well each one is performing. On a dashboard, for example, you could have that, but without it they have no idea how well the stores in each state are doing, leaving them clueless about where there is underperformance or, for example, where there is a high-potential area in which they could open a new store. So let's move on to the next chapter, which is geographical analysis. Many stores have a hard time identifying high-potential areas or identifying stores that are underperforming. Chains like Walmart or Target have many branches, and they want to know how well each branch is doing. The perfect way to do this is by counting up the number of sales for each city and for each state; this lets you see which of the states or cities is performing the best, which is performing the least, and dedicate your resources accordingly. Say the store in one city has simply been losing money for years; then you will want to adjust your strategy, maybe close the store, or change it in a way that it starts bringing in more profit or revenue. So let's get started with that. All right, as you can see the most popular state is California and the least popular here is New Jersey. So maybe you can go over this list and look at a few of the states where there's still high potential for a profitable store, say Washington. From this you can see that Washington is performing fourth and New Jersey is performing the least of our top 20. You can conclude that you might have to work on New Jersey more to increase the order count, which also lets you increase the revenue, or you can see that California is your most popular state, so you want to keep California happy. And you can also do it per city: city_df equals the City column's value counts, reset the index, and print the top 15. The most popular city is New York with an order count of 891, then Los Angeles, and Jackson is the least popular of our top 15; you can also extend this to the top 25. So not only can you focus on the states; within each state you can also focus on the cities that are underperforming or overperforming. This lets you dedicate your resources to the city you want, maybe to increase your revenue or your potential. Or maybe there is a city, for example Long Beach, where there's high potential but you're not using any of your resources there.
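A short sketch of the order-count breakdowns just discussed, again assuming 'State' and 'City' columns in df (setting the column names explicitly keeps it working across pandas versions):

```python
# Order counts per state, top 20
state_counts = df['State'].value_counts().reset_index()
state_counts.columns = ['State', 'Order Count']
print(state_counts.head(20))

# Order counts per city, top 15 (increase to 25 if you want a longer list)
city_counts = df['City'].value_counts().reset_index()
city_counts.columns = ['City', 'Order Count']
print(city_counts.head(15))
```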
Now we can also organize it by sales per state. Let's create state_sales: previously we did it by order count, and now we do it by sales amount, so we group by state, sum the sales up, reset the index and rank it in descending order. Perfect. As you can see, our top state is still California, then New York, and then the order changes a bit, with Texas next. So this is the popularity of the states according to the sales amount. Let's also sort it per city: the most popular city is New York, then LA, Seattle, San Francisco; this is roughly the same as our previous order-count analysis of the cities, nothing has really changed. All right, so as a store you also want to be able to track your most popular category of products, your best-selling products, and sales performance across categories and subcategories: find the sweet spots where strong categories also have top-selling subcategories, and spot weaker subcategories within otherwise strong categories that might need improvement. You also want to see product popularity fluctuations, how popularity trends up and down across seasons, which helps to forecast future demand; or you can group it by location, because each location might have a different popular product that you want to place prominently to maximize your store's revenue. So let's get started with finding our top-performing products and their categories. Let's first extract the categories of our products from our data frame with unique and print them. Right now in our data frame we have only three categories, as you can see, and each one has subcategories such as cases and chairs, but the main three categories are Furniture, Office Supplies and Technology. Now let's go over the types of subcategories: we print the product subcategories, and there are cases, chairs, tables and a bunch more. Next, let's group the data by product category and count how many subcategories each one has; we want to see, for example, whether Office Supplies has 20 subcategories or Furniture has five. There are nine subcategories for Office Supplies, four for Furniture and four for Technology, so Office Supplies is the more granular category. We can also see our top-performing subcategories: we group by subcategory and sum up the sales. Our most popular subcategory is in Technology, specifically Phones, which has the highest amount of sales, followed by Furniture's Chairs and Office Supplies' Storage. From this you can see your most popular subcategories and which ones you want to recommend on a front page or in the store. Now let's see which of our main categories has the most sales: we group by product category. As expected, Technology is the most popular one, then Furniture and then Office Supplies. So maybe inside your store you give it a slightly larger department, or place it in the first row, right in front of the customers, so you present your most popular option immediately. This will, of course, help you increase your revenue and sales.
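Here is a rough sketch of the aggregations described above, under the same assumptions about df and its 'State', 'City', 'Category', 'Sub-Category' and 'Sales' columns:

```python
# Total sales amount per state and per city, highest first
state_sales = df.groupby('State')['Sales'].sum().reset_index()
state_sales = state_sales.sort_values(by='Sales', ascending=False)
print(state_sales.head(20))

city_sales = df.groupby('City')['Sales'].sum().reset_index()
city_sales = city_sales.sort_values(by='Sales', ascending=False)
print(city_sales.head(15))

# Product structure: categories, sub-categories, and how many
# sub-categories each category contains
print(df['Category'].unique())
print(df['Sub-Category'].unique())
print(df.groupby('Category')['Sub-Category'].nunique())

# Sales per sub-category (keeping the parent category) and per category
subcategory_sales = (
    df.groupby(['Category', 'Sub-Category'])['Sales'].sum()
      .reset_index()
      .sort_values(by='Sales', ascending=False)
)
category_sales = (
    df.groupby('Category')['Sales'].sum()
      .reset_index()
      .sort_values(by='Sales', ascending=False)
)
print(subcategory_sales)
print(category_sales)
```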
If you want to create a pie chart for this, you can call plt.pie on the top product category sales, with the category names as labels. Well, it seems that Technology performs a little better than the other two, but it's not that different; the three categories are fairly close. All right, so let's now see which of our subcategories is the most popular one. Remember that we already saw which subcategory had the most sales; now let's create a bar graph out of it (sketched in code below). We can do this by sorting by sales and then creating the bar graph with the subcategory on one axis and the sales on the other, plus a title, and we're done. This shows very clearly that our most popular options are Phones and Chairs, and since these generate the most sales, customers are clearly willing to pay for them. So you can end up spending more of your marketing resources on phones and chairs, because it's already been shown that the resources you put into phones and chairs work, meaning if you increase the amount you spend on them, your sales will likely increase accordingly. You can also conclude that Art, Envelopes and Labels aren't that popular, so maybe you give a discount now to get rid of them and buy fewer of them in the future, so you can buy more of the popular options, for example phones and chairs. Or you can investigate why they are not popular: maybe those are the worst envelopes you could have bought, or maybe it's not the right kind of art; maybe people don't like that kind of art, but if you were to choose a completely different kind, customers would end up buying it. So this shows exactly how stores can use this data to optimize their sales and how their resources are allocated, so you end up making more money and more sales.
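A minimal sketch of the two charts just mentioned, reusing the same groupings as above (recomputed here so the block stands on its own):

```python
import matplotlib.pyplot as plt

# Same aggregations as in the previous sketch
category_sales = df.groupby('Category')['Sales'].sum().reset_index()
subcategory_sales = df.groupby('Sub-Category')['Sales'].sum().reset_index()

# Pie chart: share of total sales per main category
plt.pie(category_sales['Sales'],
        labels=category_sales['Category'],
        autopct='%1.1f%%')
plt.title('Sales share per product category')
plt.show()

# Horizontal bar chart: total sales per sub-category,
# sorted so the biggest sellers end up at the top
subcategory_sales = subcategory_sales.sort_values(by='Sales', ascending=True)
plt.barh(subcategory_sales['Sub-Category'], subcategory_sales['Sales'])
plt.xlabel('Total sales')
plt.title('Total sales per sub-category')
plt.tight_layout()
plt.show()
```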
Now, businesses love making sales; they love seeing revenue and profits increase. That's all lovely, but you should also be able to track your sales so you can see what kind of situation you are in and adjust to it, and what better way than a pie chart, a bar graph or a simple line graph to see how much growth or decline you're experiencing. For example, if revenue is declining year over year or month over month, you can see that there is a problem and allocate resources towards fixing it, whether that means investing more in understanding customer preferences, spending more on marketing or on things that keep your customers satisfied, or adopting new technologies. Those are all things you can do whenever you see declining revenue, but first you must be able to see it coming. Businesses also struggle with unstable growth: they may grow one month and see no growth, or even a decline, the next, and as a business you want to see that so you can stabilize the growth and keep growing continuously. There are also missed seasonal opportunities: if a business isn't aware of how sales change throughout the year, it could miss out on maximizing profits during peak seasons; maybe in some seasons a certain product is in high demand but you don't have enough stock to cover it, so you fail to meet the demand and lose out on revenue and profits. Those issues concerned the yearly sales trends, and there are also problems you only see at quarterly or monthly granularity. For example, cash flow issues: many businesses experience them, where one day they look at their bank account and see they are out of money and cannot invest more in their business. There's also inventory imbalance and ineffective marketing. With cash flow, drastic dips in sales during specific quarters or months can lead to cash crunches, making it hard to pay suppliers, employees or ongoing expenses. With inventory imbalance, in some periods you're overstocked and have to give those items away, and in other periods you're understocked and unable to meet demand. And maybe your marketing is ineffective: if you spend a significant amount on marketing and don't reach the desired outcome, there's a major issue with your marketing campaign, and you can see that from the sales you're making; for example, if you spend an increased amount on marketing and there's no significant increase in sales, something is wrong with the campaign. There's also a lagging response to emerging trends: monthly sales data can highlight new trends or drops in demand much more quickly than yearly overviews, so you can react faster. For example, if a certain product was released in 2024 and is suddenly in high demand in many countries, you want to be able to adjust to that demand and get the supplies for the product, but you can't do that if you only track yearly sales or don't do any tracking at all. So those are all the problems that exist if you are not able to track your sales, be it monthly, quarterly or yearly, and we intend to solve that problem by graphing the sales and drawing conclusions from the results. All right, so let's get started.
Let's convert the Order Date column to datetime format: we set the order date column equal to pd.to_datetime of that column, with dayfirst equal to true. Now let's group the data by year and calculate the total sales amount for each year. We create a yearly sales variable by grouping by the year of the order date, taking the sales and summing them, then resetting the index. I also want to give the columns appropriate names, because right now the first column is called order date when it should be named year, and the sales column should be named total sales. Let's print this out: this is the total amount of sales for each year, and we can also plot a bar graph of it, with the year on one axis and the total sales on the other. From this bar graph a few conclusions can be made: there's steady growth from 2016 to 2018, which might be explained by effective new product launches, economic factors or marketing efforts. Those are all explanations a person could offer, but you can only really draw such conclusions when you have more data available, and in this data frame we don't have the marketing cost or any other cost involved, so our conclusions are fairly limited. What we can say is that this bar graph, combined with other data, for example marketing cost, would let a business draw a good number of conclusions. We can also plot this as a normal line graph. It looks a little different, and I actually prefer this sort of graph over a bar graph for tracking yearly sales, because it shows the size of each increase or decrease much more clearly. Now we can also focus on the quarterly sales, like I said, to be able to react fast to emerging trends or to any kind of change. We again convert the order date to date format and group by quarter. From our quarterly sales we can see a steady increase quarter over quarter, and then from July onwards it jumps to new heights. From this graph we can see that Q3 and Q4 did very well and Q1 and Q2 didn't, so something might have changed: for example a seasonal trend, or increased marketing, or a new product, or targeting a specific customer segment. As a business it's really important to know that, because you can expect that if you follow a similar line of actions, you might again see higher demand in Q3 and Q4, so you might want to overstock for them, or analyze this further and replicate the successful strategies in future quarters. This helps the business grow steadily and increase its revenue. For Q1 you can see that the year starts out pretty slow; businesses may want to start the year more quickly, so they can investigate this too: is it seasonal for the industry, with a certain product not in high demand in certain seasons, or did a competitor run some kind of marketing or strategy that drove customers to them, or maybe our own marketing worked for Q3 but was not productive for Q1 and Q2, so maybe that's it.
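Here is a rough sketch of the yearly and quarterly breakdowns described above. It assumes the Superstore 'Order Date' and 'Sales' columns; dayfirst=True follows the walkthrough, but you may need to adjust it to your file's date format, and grouping by to_period('Q') is one reasonable way to get quarterly totals rather than the instructor's exact code:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Parse the order date (dayfirst as in the walkthrough; adjust if needed)
df['Order Date'] = pd.to_datetime(df['Order Date'], dayfirst=True)

# Total sales per year, renamed to friendlier column names
yearly_sales = df.groupby(df['Order Date'].dt.year)['Sales'].sum().reset_index()
yearly_sales.columns = ['Year', 'Total Sales']
print(yearly_sales)

plt.bar(yearly_sales['Year'], yearly_sales['Total Sales'])
plt.title('Total sales per year')
plt.show()

# The same data as a line graph, which makes the size of each
# year-over-year change easier to read
plt.plot(yearly_sales['Year'], yearly_sales['Total Sales'], marker='o')
plt.title('Yearly sales trend')
plt.show()

# Total sales per quarter
quarterly_sales = (
    df.groupby(df['Order Date'].dt.to_period('Q'))['Sales'].sum().reset_index()
)
quarterly_sales.columns = ['Quarter', 'Total Sales']
plt.plot(quarterly_sales['Quarter'].astype(str),
         quarterly_sales['Total Sales'], marker='o')
plt.xticks(rotation=45)
plt.title('Quarterly sales trend')
plt.tight_layout()
plt.show()
```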
All right, so maybe you want to investigate this more deeply, not quarterly but monthly, so let's do that. We start the same way, converting the Order Date column to datetime format, and then group by month. From this graph you can see that sales grow month over month, apart from the first month of 2019 and the third month of 2018. This is generally an upward trend, which might suggest healthy sales, and it looks like August and December might be your seasonal peaks, so you might want to overstock products for them. You can also see seasonal dips in the third month of 2018 and 2019 and in November 2018, so you might consider seasonal promotions to stimulate off-season sales, or diversify your product and service offerings to reduce the reliance on seasonal demand; maybe you deploy new marketing strategies there, or try to target new customer segments by introducing new products, so that you offset the seasonal dips. Overall it seems pretty consistent, so it looks like a healthy sales trend, and you might want to invest more in your proven strategies, which might be a certain marketing tactic, promotion or product offering for a certain month. Like I said, for the dips the store might deploy new marketing strategies or introduce new products to target new customer segments, which might offset the dip. That said, it's important to consider a longer time frame: one year certainly gives a good amount of data, but it does not reveal seasonal patterns reliably, because what holds for one year might be completely the opposite the next. So you might want to look at a longer sales line graph, for example over five years, and see whether the pattern holds; if it does, that suggests a genuine seasonal trend, and if not, you act accordingly.
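A minimal sketch of the monthly view, under the same assumptions as the previous block (the datetime conversion is repeated so the block is self-contained):

```python
# Parse the order date if it has not been converted yet
df['Order Date'] = pd.to_datetime(df['Order Date'], dayfirst=True)

# Total sales per calendar month
monthly_sales = (
    df.groupby(df['Order Date'].dt.to_period('M'))['Sales'].sum().reset_index()
)
monthly_sales.columns = ['Month', 'Total Sales']

plt.figure(figsize=(12, 4))
plt.plot(monthly_sales['Month'].astype(str),
         monthly_sales['Total Sales'], marker='o')
plt.xticks(rotation=90)
plt.title('Monthly sales trend')
plt.tight_layout()
plt.show()
```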
All right, we have covered the sales trends, so let's move on to the next chapter, which is mapping. We want to create a map of the sales per state: each state is colored according to its sales amount, yellow for a high amount of sales and blue for a low amount. The question is, why would someone want to do this? Companies looking to expand into new geographic areas face the challenge of identifying the most promising states and regions for their products or services. For example, how do you know whether your product will sell in a certain state? One tactic people often use is to check whether a similar store already operates in that state or city; if there is a similar store working there and it's not a saturated market, meaning there are still plenty of people who might buy your product, then it's a good idea to go there. For example, a company that manufactures athletic apparel is considering expanding its retail footprint; by analyzing total sales data by US state, it can see that states with a high concentration of fitness centers and an active population, like California, Texas or Florida, might be good candidates for new stores, and if there are currently no sports stores there, even better. Or, for example, you're a business that wants to strategically allocate its marketing budget and sales team: you have stores all over the states and want to optimize for each one. Maybe one state is performing well and another isn't; from the map you can see which state is underperforming and allocate your resources accordingly, to maximize your return on investment. But if you don't know which states are performing well and which are not, you have no information on where to optimize. For example, a national pizza chain wants to optimize its marketing spend; the sales data reveals that its pizzerias in the Midwest consistently outperform those on the West Coast, which suggests they might need to allocate more marketing budget to increase brand awareness and sales in the western states. You might also want this for competitor analysis: staying ahead of the competition means understanding where your competitors are having the most success, and analyzing their sales patterns across states can reveal their geographic strengths and weaknesses. For example, a coffee roasting company notices a competing coffee brand experiencing high sales in the Pacific Northwest; this could indicate the competitor has established strong partnerships with local grocery stores or launched successful marketing campaigns in that region, and the company can use this information to target similar grocery stores or develop competitive marketing strategies for the Pacific Northwest. So without further ado, let's get started. Here I will walk you through the code instead of writing it out. We first import the plotly graphing library and initialize plotly for the Jupyter notebook. We create a map for all 50 states, and we add an abbreviation column to the data frame, because right now there isn't one. We then calculate the amount of sales for each state, grouped by state, which is exactly what we need, add the abbreviations to the summed sales, and finally plot it. This is how the map looks: the blue areas are the ones with low sales, and the yellow area, which is California, has high sales. From this map you can see which areas your main sales come from, and you can optimize accordingly. Say you have a chain of pizza stores and you want to see which of your states is performing best and which worst, because you want to spend most of your energy optimizing what doesn't work. From this you can see that California is doing great, so you can leave it alone, but Texas, for example, is not performing that well, so you might allocate more marketing budget or resources there to start getting more sales, because in Texas there are still plenty of people who eat pizza but they are not buying from you, so why is that?
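The instructor walks through the plotting code rather than typing it; a rough equivalent using plotly.express (an assumption, not the exact code shown on screen) could look like the following. The state_abbrev dictionary is illustrative and only partially filled in; in practice it would cover all 50 states:

```python
import plotly.express as px

# Map full state names to two-letter abbreviations, which the
# 'USA-states' location mode expects (only a few entries shown here)
state_abbrev = {
    'California': 'CA', 'Texas': 'TX', 'New York': 'NY', 'Florida': 'FL',
    'Washington': 'WA', 'New Jersey': 'NJ',  # ... remaining states omitted
}

# Total sales per state, plus the abbreviation column the map needs
sales_per_state = df.groupby('State')['Sales'].sum().reset_index()
sales_per_state['Abbrev'] = sales_per_state['State'].map(state_abbrev)

fig = px.choropleth(
    sales_per_state,
    locations='Abbrev',
    locationmode='USA-states',
    color='Sales',
    scope='usa',
    color_continuous_scale='Viridis',  # low sales = dark blue, high = yellow
    title='Total sales per US state',
)
fig.show()
```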
And if this were a completely different business, say a retail chain for sporting goods with a store in each of these states, you could draw conclusions from the same map: California is performing really well, so it's probably not a good idea to enter there, since the market might already be saturated, but you might go to Florida, for example, and start selling similar sporting goods there, because that market is still new, or at least not saturated. All right, so that was that. We can also create a bar graph out of it; from this you can see the total sales by state, where California is doing the best and North Dakota is doing the worst. And of course, as you remember, we previously showed how large each of our categories is, and we did the same for our subcategories, but we never showed both in the same plot. So here we display our main categories of products, which are Furniture, Office Supplies and Technology, and inside each category we show its subcategories, sized and organized by their sales. It seems that Chairs is the largest seller in our Furniture category, followed by Tables, with a small amount of Bookcases and Furnishings. In Office Supplies, Storage is performing the best and Envelopes and Labels the worst, and in our Technology category Phones perform the best, then Machines, Accessories and Copiers. From this you can see that Phones is overall the best subcategory; I think it's even a little larger than Chairs. So this is a much better way to display the data if you're trying to make an argument, and of course you can also do it this way (a short code sketch for the state bar graph and this nested category plot is included after the sponsor note below). All right, I hope you enjoyed this project, I definitely did, and I will see you in the next video. This video was sponsored by LunarTech. At LunarTech we are all about making you ready for your dream job in tech, making data science and AI accessible to everyone, whether it is data science, artificial intelligence or engineering. At LunarTech Academy we have courses and boot camps to help you become a job-ready professional. We also help businesses, schools and universities with top-notch training, curriculum modernization with data science and AI, and corporate training, including the latest topics like generative AI. With LunarTech, learning is easy, fun and super practical. We care about providing an end-to-end learning experience that is both practical and grounded in fundamental knowledge, and our community is all about supporting each other and making sure you get where you want to go. Ready to start your tech journey? LunarTech is where you begin. Students and aspiring data science and AI professionals can visit the LunarTech Academy section to explore our courses, boot camps and programs in general. Businesses in need of employee training, upskilling or data science and AI solutions should head to the technology section on the LunarTech page. Enterprises looking for corporate training, curriculum modernization and customized AI tools to enhance education can visit the LunarTech Enterprises section for a free consultation and a customized estimate. Join LunarTech and start building your future, one data point at a time.
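As referenced above, here is a hedged sketch of the last two plots from the project: the total-sales-by-state bar chart and the nested category/sub-category view. The nested view is drawn here as a plotly treemap, which is one way to get the "sized by sales" layout described in the walkthrough, not necessarily the exact chart type the instructor used:

```python
import matplotlib.pyplot as plt
import plotly.express as px

# Bar chart of total sales by state, highest first
state_sales_sorted = df.groupby('State')['Sales'].sum().sort_values(ascending=False)
state_sales_sorted.plot(kind='bar', figsize=(14, 4), title='Total sales by state')
plt.tight_layout()
plt.show()

# One figure showing the main categories and, nested inside them,
# their sub-categories sized by total sales
fig = px.treemap(
    df,
    path=['Category', 'Sub-Category'],
    values='Sales',
    title='Sales by category and sub-category',
)
fig.show()
```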