Data nerds, welcome to this full course tutorial on intermediate SQL for data analytics. This is the course for those who understand the basics of SQL but want to take it to the next level, perfect for those who took my first course. To master this tool, we'll break down more advanced SQL concepts in short, ten-minute lessons, and you'll be working right alongside me completing real-world exercises. Following each lesson you'll have the option to do interview-level practice problems, to not only prep you for the job but also reinforce your learnings. And by the end of the course, we'll have used SQL to build a fully customizable portfolio project that you can share to demonstrate your experience.

SQL is by far the most popular tool of data analysts. For those in the United States, it's the top skill, appearing in almost half of all job postings for data analysts, and demand only increases for senior data analyst roles, coming in at two out of every three job postings. In related data jobs it's much the same: for data engineers it appears in two of three job postings, and for data scientists it's in almost half. SQL, or "sequel," is the language used to communicate between you and a database. It's my most-used tool as a data analyst, from my first job working for a global Fortune 500 company to my most recent role working with MrBeast. Yeah, even Jimmy uses it. This tool is so imperative that I use it all the time with Python, Excel, Power BI, and Tableau to connect to my databases. Over the years I've been cataloging everything I've found helpful with this tool, and I've put it all into this course.

Now, you're probably wondering: who is this course for? If you were fortunate enough to take my first course, here are some items you should definitely know: keywords used for data retrieval; functions and keywords used for aggregations and grouping; the different types of joins, along with unions; keywords used for logic and conditions, along with date manipulation; data schema control; and finally, subqueries and CTEs. As for the math required, if you have a secondary education (such as high school in the United States), you have the requisite knowledge; we'll be doing at most some basic algebra and statistics.

Now let's get into the course structure. We're breaking this down into two halves. In the first half, we'll have an intro that gets you set up and comfortable with the database we'll be using throughout the entire course. Next, we'll jump right into pivoting data using CASE statements, transforming and analyzing data using aggregation and statistical methods. Then we'll get into intermediate date and time functions, because frankly, you can't get away from date and time data in databases. We'll wrap up the first half covering window functions, by far the most requested topic I've gotten, covering basic and complex aggregations. For the second half of the course, we'll shift gears: we'll not only install Postgres on your machine, but we'll also be working through these lessons toward building our portfolio project. We'll start by setting up the database locally and installing a top editor for running SQL queries. With this environment set up, we'll build our first view, which will help us solve our first portfolio problem. After this, we'll transition into learning the most popular functions for transforming messy data, to solve our second portfolio problem.
Then we'll wrap it all up with query optimization, understanding how to use keywords like EXPLAIN to optimize queries. So by the end of this, you'll have a real-world project to showcase your newfound skills and demonstrate your experience.

Now, I'm a firm believer in open-sourcing education and making it accessible to everyone, so this course is completely free. I've linked all the resources you need below, including the SQL environment and all the files you need to run the queries remotely and locally. Oh, and I also include my final project that you can model yours after. Unfortunately, YouTube isn't paying the bills like it used to, so I have an option for those who want to contribute and thus help fund more future tutorials like this. For those who use the link below to contribute, you'll get additional resources: specifically, after each lesson you'll have access to interview-level SQL problems that will not only reinforce your learnings but also prep you for job interviews. In there you'll also get community access, so you can ask questions of fellow students, along with access to the queries and notes behind each lesson so you can follow right along as I go through it. Finally, at the end you'll receive a certificate of completion that you can share to LinkedIn. For those who have bought the supporter resources: you'll continue watching the course here on YouTube, but you can go to my site to work through all the problems and access the notes and the community.

All right, we're about to get into the first lesson, but before we do, I want to cover some common questions and answers. First: what database are we even using? Well, every year Stack Overflow surveys a bunch of nerds to find out their top technologies, and around 50,000 chose Postgres as their top option to use, and to keep using over the coming year. According to this visual, it's not only the most admired database but also the most desired to learn. So for this course, we're going to be using and learning with Postgres. Now that we know the database, how the heck are we going to be running these SQL commands? As I mentioned, this course is broken into two halves, and for the first half we're using an option that gets you up and running quickly: Google Colab. It's free, and it gives us an environment where we can not only load the database but also query it. I've linked this notebook below, and it includes all the code necessary to install the database and get to querying it. Don't worry if you haven't used Colab before; I break it all down in the next lesson (and for those who bought the course perks, you get access to these lessons as Jupyter notebooks). For the second half of the course, we'll shift gears, install Postgres locally on your computer, and run all the queries from there. We'll get you set up with pgAdmin, Postgres's own GUI for interacting with databases, but from there we'll set you up with the most popular database editor, DBeaver, which is used by over 8 million users; this is where we'll be running our queries. I like this editor because it's not only free but also connects to a host of different databases, so whatever you use and learn in this course with this editor, you can apply to
other databases. Now that we know the database and the editor, what dataset are we going to be using? Well, I present to you Contoso, a dataset created by Microsoft to imitate real business data. Jumping back into DBeaver, we can see the ERD, or entity relationship diagram, which shows how the dataset revolves around sales data (that's the fact table), with four dimensional tables that relate to it. This is going to be great for analyzing business transactions in a real-world scenario; we'll go over everything you need to know about it after our Google Colab lesson.

Now that we've got that out of the way, let's get into the resources you have available, starting with those who decided to support the course. First, I'll walk you through how to get access to the course notes, which detail all the topics and code I use within each lesson. Next, inside the course platform you'll have access to interview-level SQL practice problems: after each lesson I provide a bunch of practice problems that range in difficulty for you to test your skills. If you get stuck, feel free to jump into the comment section below and talk with other students in the course. Speaking of help, how the heck do you get help in this course? Well, you could jump into the YouTube comment section and hope somebody comes and answers your question, or you can get a really quick answer by going to a reputable chatbot like ChatGPT. I use this bad boy all the time with my coding issues, and it gets you an answer quick. All right, the next one isn't really a question, it's more of a statement: people tell me all the time, "Luke, this video is too long, I can't navigate it." Well, I think you just don't know how. First, I include chapter markers for all the different lessons throughout. Next, keyboard shortcuts: I like to use J and L to jump backward or forward ten seconds. And finally, if you need more precise navigation, you can click and drag up on the video's progress bar for precise seeking. Pretty cool. Last question: who helped build this course? I'd be remiss if I didn't give a shout-out to Kelly Adams; she was the brains behind putting together the lessons and a lot of the practice problems. This course wouldn't have been possible without her help.

All right, let's get into the first lesson. In this lesson, we're going over how we'll run SQL queries in this first half of the course using Google Colab, which is a type of Jupyter notebook. Linked below is a blank notebook; opening it up, it's not fully blank, but it's blank enough to get started writing SQL queries. Let's do a quick demo of how we're going to use this. First, I need to run this cell up top, and it gives me a warning that this notebook was not authored by Google. It's fine, run it anyway; it's from me, you can trust me. It should take about 40 to 50 seconds to run this cell, which we'll go through more later in this video; basically, it's loading the database and getting it set up for us to run SQL commands. Now, inside this code cell, let's provide a command. We begin by writing our command underneath the %%sql syntax at the top, and I'll provide a query looking into the sales table, pulling the top ten rows. I can run it by pressing the play button or pressing Shift+Enter, and in less than a second I have the results pictured below.
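For reference, that first query cell looks roughly like this (the sales table name comes straight from the demo; the rest is just the standard magic-plus-query pattern we'll use throughout):

```sql
%%sql
-- pull a quick sample of the sales table
SELECT *
FROM sales
LIMIT 10;
```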
If I want to run another cell, I can come underneath it, click Code, make sure I add that %%sql at the top of the cell (it won't work otherwise), and then run my next command right underneath. All right, with that out of the way, let's dive deeper into understanding what Google Colab is, what Jupyter notebooks are, how to use them to run SQL queries, and what the heck is going on with all that code in those cells. Now, if you're already familiar with Google Colab or confident using Jupyter notebooks, and you feel like what I'm about to cover isn't relevant to you, that's fine, go ahead and skip to the next lesson; this is focused on those without any background in Jupyter notebooks.

So let's start with Jupyter notebooks. Here I have a Jupyter notebook of this actual lesson inside of VS Code. Don't worry, you don't need to open it in VS Code; I'm just showing this for demonstration purposes. Personally, I love Jupyter notebooks for performing analysis, because not only can I have text cells like the ones pictured here, but scrolling down further, at the bottom I have a SQL cell along with its SQL output. I love these because I can use SQL to extract and analyze the data I need and then, if needed, use something like Python to visualize it. Moving into Google Colab, which you can see here in my web browser: this is the same exact file I had in VS Code, and it has the same functionality, where I can write Python code in cells, use SQL, and see the outputs below. I really like Google Colab because it makes it super easy to share and collaborate with others. This isn't just a static document; I can come in and run all the cells in this notebook, and if somebody wanted to, they could modify a query further. Right now this one is only looking at years; say we wanted the actual total revenue, I could just add that line to the command, run it, and get the results right below. Super easy to collaborate with others. Now, you may be wondering why we're using Colab for running these SQL commands. Basically, this code that I have in this cell (which we're going to cover, I promise) loads in our database so you have access to it immediately, without having to install anything locally on your own computer, so we can get up and running with all these SQL commands really quickly.

Let's start with a blank notebook to walk through the process of using notebooks. Navigate to colab.research.google.com; this is where we start a new notebook, and it will have prompted you to log into Google at this point anyway. Click New Notebook, and it starts a new one titled Untitled0; you can go up top and change it, and I'll change it to something like "Colab 101 Quick Overview." Before diving into the center portion, note that we have a typical menu at the top with a bunch of options, and a sidebar on the left-hand side that gives us a lot of options as well. The center is where the actual notebook lives.
Here I can do things like add a code cell or a text cell. If I wanted to, I could type into it: "This is a text cell." I can also change its formatting by highlighting it and toggling it to a heading; there are multiple other options available as well. Whenever I'm done, all I have to do is press Shift+Enter, and it starts another cell below, a code cell. Now, in Colab these are exclusively Python cells; we have to do some "magic," if you will, to get them to run SQL. You may not realize it, but you actually know some Python even if you don't know it: I can type 2 + 2, press Shift+Enter, and it runs the cell and we get the result of 4. If I don't need certain cells, like this one at the top, I just click into it and click the trash can; similarly for the one below.

Now let's go over the menus, and for this I'll demo using the actual lesson-plan notebook, because it's more interactive for showing the capabilities. On the left-hand side, if I click this icon, I get the table of contents; based on how I formatted the lesson notes, I can scroll through and see all the relevant topics. If I want to find something, I can search, say, for "markdown," and as expected it takes me to all the markdown mentions in here. We also have other things like Variables, Secrets, and Files; those are more for when you're going in depth with Python, and you won't really need them, nor the three icons at the bottom, in this SQL course. Up at the top, the File, Edit, View, and Insert menus are all pretty standard. Runtime is the one I use the most and find the most important: any time I open a notebook, I do Run All, which you can also do with the shortcut Cmd/Ctrl+F9, and it runs all the cells. Down at the bottom it gives you a status update of what's going on, along with the time it's taking so far. Scrolling through the cells, I can see they all executed properly, but sometimes we run into bugs and they don't. In that case, go up to Runtime, and I recommend choosing Restart Session and Run All; it will ask you to confirm, and yes, go ahead: it basically clears everything out and runs it again. That's only if you're having problems. You shouldn't, but if you do, now you know.

In this last section of the lesson, let's understand what's going on with how we're running SQL queries inside this notebook. For this, I want you to open up that blank SQL notebook and load it in your window so you can follow along with me. If you haven't already, go up to Runtime and click Run All. As I mentioned earlier, all of this code here, which is Python, installs and sets up your database. We'll walk through it really quickly, but the important thing to understand is not the code itself, or that you need to write it yourself; it's what's going on behind the scenes. First, it imports some important libraries we need. Next, if it's running in Colab (which we are), it installs Postgres, so Postgres is actually running inside this environment; it sets up a user and a password, and then installs the database itself, which you can get at the link shown. From there, it imports a SQL library for running SQL commands, specifically one called JupySQL; with JupySQL, we load the extension, connect to the database we loaded above, and do some other fancy things that get formatting and everything else set up properly.
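As a rough sketch of what that setup cell is doing (the package names, user, and connection string here are illustrative assumptions, not the exact course code; the real cell is in the linked notebook):

```python
# Illustrative sketch of the setup cell (not the exact course code)
!apt-get -qq install postgresql        # install Postgres inside the Colab VM
!service postgresql start              # start the database server
# ...then create a user/password and load the Contoso dump from the linked file...
%pip install jupysql                   # the library that provides the SQL magic
%load_ext sql                          # enable the %sql / %%sql magics
%sql postgresql://user:password@localhost/contoso   # connect (illustrative URL)
```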
So, same as before, below this magic command of %%sql I can write a SQL query. As you write these SQL commands, you should get autocomplete; in this case I have SELECT, and if I want to use it, all I have to do is press Tab. Once I have everything I need, I again press Shift+Enter. Now, that magic command is really important. If I were to copy this and paste it below without it, one, I get all this highlighting saying things are misspelled and there are syntax errors, and two, when I actually try to run it, I get real syntax errors. So it's very important that you put the magic command at the top. And so people don't think I'm crazy: "magic commands" is the actual official term, and we're not limited to the SQL magic; there's a host of other ones. Say I want to use %%timeit, which measures the execution time of code. I type the magic command %%timeit, some help pops up explaining the module we're using (which is actually pretty useful), and underneath it I can put some Python; I'll do something simple like 2 + 2. Running this with Shift+Enter, we can see this special command reports that it took about 9.93 nanoseconds. With magic commands you can also use just one percent sign, which means it applies only to the line it's on. In this case I could do %timeit 2 + 2 on a single line, press Shift+Enter, and it still times it (in this case it looks a little faster). But if I have only one percent sign and the 2 + 2 is on another line, pressing Shift+Enter just outputs the 4 without timing it; it's not until I use two percent signs and run it that it actually times the cell, and now we're back up to about 9.85 nanoseconds.
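In other words, the line-magic and cell-magic forms look like this (a minimal sketch; in IPython and Colab, a cell magic has to be the very first line of its cell):

```python
%timeit 2 + 2   # line magic (one %): times only this expression

# The cell-magic form (two %%) must start its own cell and times the
# whole cell body, i.e. a separate cell containing:
#   %%timeit
#   2 + 2
```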
I keep reinforcing this because it's very important that you remember that %%sql before any SQL command. And if you're a nerd like me and want to dive deeper into the documentation of JupySQL, the brains behind that SQL magic, I've linked it below. All right, for those who purchased the supporter resources, you now have some practice problems to get more familiar with using Jupyter notebooks and SQL queries together. In the next lesson, we're diving deeper into the database to understand all the different tables and what comes along with them. With that, I'll see you in the next one.

In this lesson, we're getting an intro to the database we'll be using for the entirety of this course: the Contoso database. We're not only going to explore why we're using this dataset, but also its components, exploring all the different tables using things like the ERD, or entity relationship diagram. We'll use this lesson as a warm-up to get ready for intermediate SQL, so along the way I'll cover past topics you should know, to get you up to speed as fast as possible if you haven't used SQL in a while. By the end, we'll build a query from scratch to dive into the most popular tables, while using Google Colab and some additional AI features to speed up your workflow. Now, the Contoso database we're using is based on a dataset from Microsoft, which they've used for years whenever they launch products, specifically SQL products, so you can explore their functionality. This database is really robust because it contains a lot of different information: sales transactions, product information, store details, and even date and time data. It's great because it not only lets us explore all the intermediate SQL topics we'll be using, but it's also based on a real-world set of business data, so what you learn in this course you can apply to the real world. And you may be like, "Luke, how the heck do I get this database installed?" Well, if you remember from the last video, we have that Python code up top that installs the database; the SQL file for loading it is located at the linked URL, and the script loads it into this Colab notebook (I've conveniently linked a blank notebook below that you can use to follow along with any of the lessons).

So, this diagram shows, via the lines between the tables, how they're related, and there are actually a lot of columns inside these tables themselves; we put ellipses at the bottom to signify all the columns that aren't shown. Let's get into breaking this bad boy down. We have a total of six tables in the Contoso database. Our main fact table is the sales table, which contains all the quantitative business metrics we'll be analyzing and inspecting as we go through the course, so it's probably the most important table to know. Then we have four related tables, commonly known as dimensional tables, which hold descriptive attributes we can use in our analysis. For example, store: we relate the sales and store tables using the store key, and the store table has information on, well, the stores. The same can be said for the product and customer tables. The date table is slightly different in that it relates to our date columns, specifically the order and delivery dates. The last table is the currency exchange table, and it's not related to our fact table at all; we'll show why in a little bit. Now, you may be wondering how you can actually see what this database looks like and what tables are in it. We'll explore tools for this later in the course; this right here is pgAdmin, where I can visualize that ERD showing how our sales fact table relates to all the dimensional tables. It's also pretty nice because I have this contoso_100k database, and I can go into Schemas, then down to Tables, and explore all the other tables further, even looking at things like the columns of the sales table. But we're getting ahead of ourselves; we'll learn how to do that in a bit, and I'll teach you some shortcuts for doing something similar in Colab.

So let's get into running some queries. First, run all the cells in your notebook to get the database loaded into your environment. We want to explore what tables are in the database we just loaded, and I'm going to use Gemini for this. If it's your first time using this AI model from Google, it prompts you with a privacy notice; make sure you click Continue. I prompt it with: "What SQL query shows the tables in a database?" We can access all the table names by looking in information_schema, which is a metadata catalog, specifically by querying the tables view inside it.
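The query Gemini suggests comes out roughly like this (assuming the Contoso tables live in the default public schema; we'll reuse the same catalog later in this lesson to list columns):

```sql
%%sql
-- list the tables we just loaded
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'public';

-- later in this lesson, the same idea for columns:
-- SELECT * FROM information_schema.columns WHERE table_name = 'customer';
```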
From here, you can either click "copy cell" or "add code cell." Remember, we'll have all the syntax-highlighting issues because we don't have that magic command we need at the top, specifically %%sql, so I'll copy that from here, paste it up top, and then run this bad boy. From this, we confirm we do have six tables in our database. If I want, I can convert this DataFrame output into an interactive table, and we also have an option to visualize it, which we'll be doing later down the road.

Let's explore the sales table first, as that's the most important part of this whole puzzle. I want to see all its columns, so I'll use SELECT and then the star, and we'll do this FROM that sales table. Now, if you notice, I have some autocompletion happening: I typed "sales" and there's "_fact" suggested underneath; this is the AI autocompletion. Especially when I'm learning how to use SQL, I don't find this very helpful, and actually quite distracting, so let's turn it off real quick: go into Settings, and under AI Assistance, uncheck "Show AI-powered inline completions." When we close this, it no longer pops up. Now, this query is good enough as is, but any time I do a SELECT-star type of thing, it's very resource intensive, especially if there are a lot of columns and rows, so I'll limit it to just the first ten rows and press Ctrl+Enter. I also don't need this query panel off to the side, so I'll close that. With this sales table, we can see it has all those relations to the other tables, such as the date, customer key, store key, and product key. From there, we have information on what is actually sold in the sale: the quantity, the price, the cost, and the currency used with its exchange rate. In our example at the end of this lesson, we'll calculate the net revenue and see how we need to multiply, or use, all of this together to get it.

Let's get through exploring these tables, starting with the easiest one: the currency exchange table. If you recall, it's not related to the sales fact table in any way, so what the heck is in it? Exploring it, we can see it has a date column, a from-currency, a to-currency, and an exchange rate: at a specific point in time, it tells you how you could convert one currency to another and what rate to use. Conveniently, our sales table already includes this exchange rate, which was calculated from this table.
So technically, this table is only needed if you want to go back and dive into the exchange rate and how it's trending over time. All right, we have four tables left, and they're all dimensional tables related to our sales table. Let's start with store: it's related to the sales table on the store key, and it has information on where the store is located, such as country, the name of the store, and even its size. Next up is our product information, related to the sales table on the product key; it has information on the product: what its name is, who the manufacturer is, how much it weighs, and what categories and subcategories it falls into. Next is our customer table, related to the sales table on the customer key, with a bunch of information about the customer: where they're located, what their name is, what their birthday is, blah blah blah. What you'll notice here in the middle is the ellipses; there were so many columns that it didn't show them all. Previously, when we were looking for the tables in the database, we queried that metadata inside information_schema. So what I'll do is copy that query (Cmd+C) and paste it right in here, but instead of tables, we want columns. Running Shift+Enter on this only gives us table-name information, so I'll change it to SELECT * and run Shift+Enter again so we can see everything available. Inside this columns view, we have a table name and a column name, so I can filter for the customer table: I'll specify WHERE table_name is equal to 'customers'... running this with Ctrl+Enter: got a typo, it's 'customer'; running Ctrl+Enter again. Now I have a way to view all the column names without them being cut off. I'm not finding anything that great in here for now, but there's other stuff we'll use in the future. The last table to explore is the date table. It relates via its date column to the sales table's order date and delivery date. This table has a lot of pre-built ways to aggregate date data, like day of week, month, or year. That's great, especially if you're using a tool like Power BI and want to grab something quickly to filter, say, for January 2015 data. But in this course, we'll be diving deeper into date functions, so we're not going to rely on this table at all, because you won't always have a date table available when you need to investigate things. So basically, just ignore this bad boy.

Now let's wrap this lesson up with an investigation of how we can use all these tables together in a common example. Say my boss, who's not so good at SQL, comes to me and wants some revenue data that has information about customers and the products they're purchasing, and whether the orders are high-value or low-value items. We're going to walk through this example, calculating the net revenue and putting this all together using all the different tables. First, we need to calculate net revenue, so let's look back at that sales table; we'll use that same query as below, and we get the table we saw previously.
Now, how do we want to calculate that net revenue? We need to use the net price. You'll notice the net price is less than the unit price; that's because the net price is the price after all the discounts, promotions, or adjustments. Basically, it's what we actually charge the customer when they pay for the product. With this net price, we need to multiply it by the quantity, so I'll put a comma here, go to the next line, and multiply quantity times net price, and we'll label this net_revenue. When I name new columns, I put underscores between the words; I just find it easier to read than the naming convention the Contoso database uses. And it looks like I have a typo, which is actually good timing, because this is how I troubleshoot. First, it reports a runtime error; any time you run a failing query you'll see that, so you can ignore that part, but it also points a caret to where the issue is: specifically this line, where quantity isn't spelled correctly at all. Running this again with Ctrl+Enter: okay, now we have it, and over to the side we have that net revenue. Double-checking, the numbers are all being calculated correctly. There's one last step: we need to convert to a common currency. Right now you can see pounds here and US dollars below; we want it all the same, and I'm in America, so we're using US dollars. All we have to do is multiply by the exchange rate; I've gone ahead and added it in, and now it is, in fact, adjusted the way it needs to be. We'll be adding customer and product information using the customer key and product key, but this table is already getting sort of large, so I want to condense it down to the columns I'm sure I'll use; really, the only other thing I care about is the order date, so we'll simplify the table down to that.

Next, we move into our second of five steps: we want to filter for recent sales, specifically 2020 and later. For this I'll use a WHERE clause: I want the order date from the sales table to be greater than or equal to January 1, 2020. Let's try running this, and it looks like it works. Now, to be safe: if you're ever working with date data and you're not sure it was converted to the date type, in Postgres you can use the double-colon operator (::) and specify the data type you want, in this case date, so the order date is cast as a date. It's going to work just fine either way, but that's a tip for you.
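At this point, the query looks roughly like this (column names written without underscores to match the Contoso tables; verify the exact spellings with information_schema.columns if yours differ):

```sql
%%sql
SELECT
    orderdate,
    quantity * netprice * exchangerate AS net_revenue
FROM sales
WHERE orderdate::date >= '2020-01-01'   -- the ::date cast is the "to be safe" tip
LIMIT 10;
```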
All right, the next thing my boss wants added is the customer info for whoever placed each order. For this, we need to use a join, and there are four major types: LEFT JOIN, RIGHT JOIN, INNER JOIN, and FULL OUTER JOIN. In our case, we want a LEFT JOIN, because table A is our sales table, and we want any data from table B (the customer table) that relates to it returned. So let's add this LEFT JOIN between the FROM and WHERE: I'll add a LEFT JOIN on the customer table and give it the alias c, to make it easy. Similarly, I want to give sales an alias: I'll bring it down, indent it over, and give it the alias s. For this LEFT JOIN, we join on the customer key from the sales table and the customer key from the customer table. To use good naming conventions, I'll add "s." to the front of order date, and likewise to quantity, net price, and exchange rate. Running this to see if it works: it does, but we don't have anything from the customer table yet, so I'll bring in all its columns using c.* notation. From this list, I can see a few columns my boss asked about: specifically, the given name (first name), surname, country, and the continent they're from. All right, second-to-last step: we need to add the product information, and similarly, we'll perform a LEFT JOIN, giving the product table an alias of p and connecting on the product key of the sales table and the product table. Once again, I want to see everything from the product table, so I'll do p.*; running Ctrl+Enter, I can see we connected it properly, with all the product information. And once again, I don't want all the columns, only a select few my boss wants: these four columns of product key, product name, category name, and subcategory name.

Looking pretty good; only one last step and we'll have all the information we need. We want to classify whether an order is high value or low value based on the net revenue, basically binning orders by whether they're under $1,000 or over $1,000. To accomplish this, we need a CASE WHEN statement, which we'll add as that last column. We'll say CASE WHEN, and we want to check the net revenue, but we can't use an alias inside the SELECT statement where it's defined, because it's not necessarily computed yet, so we just take that whole expression from below, paste it in, and say: when it's greater than 1,000, then 'high', else 'low'. We END this and give it the alias high_low. Real original, I know. Running it with Ctrl+Enter and inspecting, we can see our formula is working: values greater than a thousand are marked high. So this has everything my boss needs. Remember, right now we're doing LIMIT 10, and we actually need all the data, so I'll remove that and press play; looks like we have about 124,000 rows. And if I want to export this for my boss, I can click here to convert it into the interactive table, and what's really convenient is that I can copy the entire table, which lets us export to CSV, JSON, or even Markdown. CSV is most common, so I'll use that.
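Putting the whole walkthrough together, the final query comes out roughly like this (a sketch: the dimension column names are approximated from the lesson, so double-check them against information_schema.columns):

```sql
%%sql
SELECT
    s.orderdate,
    s.quantity * s.netprice * s.exchangerate AS net_revenue,
    c.givenname,
    c.surname,
    c.countryfull,
    c.continent,
    p.productkey,
    p.productname,
    p.categoryname,
    p.subcategoryname,
    CASE
        WHEN s.quantity * s.netprice * s.exchangerate > 1000 THEN 'high'
        ELSE 'low'
    END AS high_low
FROM sales s
LEFT JOIN customer c ON s.customerkey = c.customerkey
LEFT JOIN product p  ON s.productkey  = p.productkey
WHERE s.orderdate::date >= '2020-01-01';
```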
So that's our initial dive into the Contoso dataset. You now have some practice problems to go through and get even more familiar with it. In the next chapter, we'll be diving into using the CASE statement to pivot data. Super exciting! All right, with that, see you in the next chapter.

Welcome to this chapter on pivoting with CASE statements; specifically, we'll be using statements like CASE WHEN plus aggregation to pivot data. But what the heck is pivoting data? Take this simple example, focusing on the first table. Typically, our data comes in a long format; here we have columns of date, category, and sales, with different categories A and B. It's very common to pivot on the category (A and B) so we get a wider format, as shown below. This is not only easier to read, understand, and analyze, but also easier to visualize, which we'll be doing. So what will we cover in this chapter's lessons? In this lesson, we focus on the basics of using aggregation methods such as COUNT and SUM to pivot data: we'll use COUNT to analyze the number of customers per region, and SUM to calculate the net revenue across categories in different years. In lesson two, we build this up further, looking at statistical functions such as MIN, MAX, median, and average, with an example calculating the median sales across categories. Finally, in lesson three, we jump into advanced use cases of CASE statements, specifically segmentation: we'll analyze multiple AND conditions to bucket certain years by revenue, and use multiple WHEN conditions to analyze revenue tiers and see how they apply across categories. Now, I just showed a bunch of visuals; the goal of this course is not learning how to build visuals (though I'll show some), but I do want to show that with the insights we're gaining, you can take it a step further and visualize them. All right, with that, let's get into it.

In this first example, we'll review COUNT, and also COUNT DISTINCT, to calculate the total number of customers per day in 2023; this is the final table we'll end up with. As always, if you want to follow along, open up that blank SQL notebook and run all the cells so we can get started. Remember, we want the total number of customers per order date, and we can uniquely identify customers by the customer key. So: add a SELECT statement; I'll add order date, followed by customer key, and we'll get this FROM that sales table. Starting with this first, we can see this is January 1, 2015, and we have duplicate customer keys along with a bunch of different ones. We'll start simple with a COUNT of all the customer keys: I'll wrap customer key in COUNT and give it the alias total_customers. Let's run this... and it isn't going to work, right? Because we need a GROUP BY. So, adding that GROUP BY statement, we'll group by the order date and run it again. Now we have order date by total customers, but I notice the dates aren't in order, so I'll add ORDER BY order date. Not too bad, but remember: earlier we saw the customer key is duplicated, and we want to find the unique customers, so we want something like DISTINCT. Going back into our original query, all I do is add DISTINCT inside the COUNT and press Ctrl+Enter, and the numbers drop, because we've removed all the duplicates. The last thing to do is add a WHERE condition filtering for dates in 2023. I'll reference the order date, and I recommend the keyword BETWEEN so we don't have to do the greater-than/less-than mess, putting in between January 1, 2023 and December 31, 2023. Running this, we can check the contents: yep, January 1 to December 31.
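The finished query from this example looks like this (column names per the Contoso sales table):

```sql
%%sql
SELECT
    orderdate,
    COUNT(DISTINCT customerkey) AS total_customers
FROM sales
WHERE orderdate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY orderdate
ORDER BY orderdate;
```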
One quick note on visualizing this: you can use this button right here to draft different visualizations and try to understand what's going on with the data. It gives you different previews; in our case this is time-series data, so I know that's the best choice of visualization. When I select it, it autogenerates all the Python code needed to visualize the data; then all you have to do is click Add Cell, run it, and you can see it in more detail right below. That's why I really like Colab for this: it has Gemini built in, which makes it super simple to go ahead and visualize.

Now let's get into actually pivoting using COUNT as an aggregation. We'll look at something similar to the last example, understanding how many daily customers we have, but broken down by region, specifically three continents: Europe, North America, and Australia. We'll be building this final table, with order date in the leftmost column and the customers of each region in their own individual column. First things first: what continents do we actually have in our database? That's in the customer table, and when we run this, we can see, as previously reported, Europe, North America, and Australia. So let's add this table into the query we made in the last example. To do that, we'll perform, very commonly, a LEFT JOIN with the customer table, aliased c, on our customer key. Because we now have two tables in here, we'll also need to assign an alias to our sales table and to all the columns that come from it. Running this to check for errors... I accidentally messed up order date; run it again; okay, everything's working fine now. Next, we need to create individual columns of total customers per continent. How are we going to do this? Focus on this syntax: we'll use the COUNT DISTINCT we used previously, and inside it we'll throw a CASE WHEN statement: CASE WHEN a condition, THEN the output we want (the column, in this case), then END, and finally assign an alias. So I'll copy this and insert it on the next line underneath, and then we fill it out. The condition checks whether the row equals a certain continent, so for the continent column from the customer table, we check whether it equals 'Europe', and the column we want out of the CASE is the customer key, so I'll put that in, and we'll alias this one eu_customers. Let's try this bad boy out, and bam: now we have our European customers in here. Let's add the other two as well, North America and Australia... all right, I've got those in, the North American and the Australian customers. Running this and scrolling down, we can see that Europe, North America, and Australia add up to the total customers on each line, so the total_customers field is now somewhat redundant. I'll go ahead and remove it, and this will be our final query.
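The final pivoted query comes out roughly like this (continent spellings assumed to match the values we saw in the customer table; a CASE with no ELSE returns NULL, which COUNT DISTINCT simply ignores):

```sql
%%sql
SELECT
    s.orderdate,
    COUNT(DISTINCT CASE WHEN c.continent = 'Europe'        THEN s.customerkey END) AS eu_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'North America' THEN s.customerkey END) AS na_customers,
    COUNT(DISTINCT CASE WHEN c.continent = 'Australia'     THEN s.customerkey END) AS au_customers
FROM sales s
LEFT JOIN customer c ON s.customerkey = c.customerkey
WHERE s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY s.orderdate
ORDER BY s.orderdate;
```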
Now, on visualizing this one: unlike the last, this one has multiple columns, and the visualizations Colab drafts aren't that good; in these time series, it's broken up so that this one's Europe, this one's North America, and this one's Australia, not all on the same graph. So, unfortunately, Gemini isn't that strong at producing graphs here. If you really want to visualize it, and you want to know my method: click that table button, then Copy, which copies the table to your clipboard as CSV. We'll need to put it into some sort of document, because it's pretty long; on Mac I'll paste into something like TextEdit, on Windows something like Notepad, paste the contents with Cmd+V, and save it. Then, in your favorite chatbot (I really like ChatGPT, but you could use Gemini, Claude, whatever), I give it the simple prompt, with the document attached, "visualize this as a line chart." With it visualized (I like going into the interactive mode on ChatGPT), you can see all three regions together, and if you want to download the graph, you can just click it.

Last example for this lesson: we'll use the SUM function with CASE WHEN to look at the total revenue by category, using CASE WHEN to compare 2022 versus 2023. This is the final table we'll create, with category in the leftmost column and the total net revenue for 2022 and for 2023 right next to it. For this, I don't want to start from scratch, so I'll take that last query and paste it into a cell, making sure it runs properly. Now, I want to start simple and first look at total revenue by order date, so I'll remove the WHERE clause (we'll be doing 2022 and 2023 later), remove the customer table join since we don't need it, and remove the COUNT DISTINCTs we did for all the different customers. With it simplified like this, I'll run it and make sure... yep, everything appears, with all the order dates in it. Okay, now let's get the total revenue, and that's done using SUM. If you remember from a couple of lessons ago, you need three things for this: quantity, net price, and exchange rate, so I'll add all three of them here using multiplication, and I'll assign
this the alias of net_revenue. Remember, any time we're doing an aggregation, we need that GROUP BY. Let's run this: not bad, but right now we're aggregating by order date, and we actually want to break this down by category. Just as a refresher (you don't need to run this query), what we need from the product table is this category name. So: one, we need to merge the product table with our sales table, and two, extract that category name. Inside our original query, we'll do a LEFT JOIN, connecting the product table with the alias p, specifically on that product key. We'll run this to make sure it's still executing properly: okay, good. We didn't bring anything in from the product table yet; we need to do that now, replacing order date with category name, and it's in three places, so let me show you a shortcut real quick. I'll highlight the first occurrence (you can see it's only selected on the top one), then press Cmd+Shift+L on Mac (Ctrl+Shift+L on Windows), and now all of them are selected: when I press Backspace, all of them are removed, and as I type, all of them get typed in. Super convenient; saves a lot of time. Then I run this again with Ctrl+Enter, and bam, we have category name and net revenue for each category across the entire dataset. We still need to filter down, but this is pretty good so far. Now, to break this out by 2022 and 2023, we use a syntax similar to what we used before: wrapped in our SUM function, we use CASE WHEN; when a row meets the condition of being in 2022 (or 2023), we provide the net revenue, ELSE, if it's not that year, zero, so when we sum it all up, we only sum that year. And finally, we END it with an alias. I'll copy this template and paste it down below. Our first condition checks whether the date is in 2022, so we use that order date column and check if it's between certain dates, specifically between January 1, 2022 and December 31, 2022 (had a brain fart there). For the value, we use what's above, inside our net revenue: all three of those columns multiplied together. And finally, the alias: we'll call it total_net_revenue_2022. Let's run this, and bam, we have it for 2022; this is looking good. Let's put a comma on the end, copy this, and paste it right below; then I just need to update it so that instead of 2022 we're using 2023, making sure to also change the alias. Okay, running this, pressing Ctrl+Enter, we get almost our final result. Once again, we don't need that plain net_revenue column in there (it's not telling us what we need), so we'll remove it, and now we have our final table. From this, we can see that, for some strange reason, from 2022 to 2023 all of these categories went down. That's not really good; I'll leave that one to my boss to figure out.
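Here's the finished pivot, roughly as built in the lesson (the ELSE 0 makes sure out-of-year rows contribute nothing to each year's SUM):

```sql
%%sql
SELECT
    p.categoryname,
    SUM(CASE WHEN s.orderdate BETWEEN '2022-01-01' AND '2022-12-31'
             THEN s.quantity * s.netprice * s.exchangerate
             ELSE 0 END) AS total_net_revenue_2022,
    SUM(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
             THEN s.quantity * s.netprice * s.exchangerate
             ELSE 0 END) AS total_net_revenue_2023
FROM sales s
LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY p.categoryname;
```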
All right, it's your turn now: we have some practice problems lined up to get you more familiar with using CASE WHEN statements to pivot data with these different aggregation methods. In the next lesson, we'll be building on this, focusing on statistical functions such as MIN, MAX, AVG, and median, and diving further into revenue. With that, see you in the next one.

Now that we're warmed up using basic functions to analyze pivoted data, we're going to shift our focus to statistical functions. Specifically, we'll cover these: we'll warm up with the easy ones first, AVG, MIN, and MAX, to pivot the database and surface some data insights, and from there use the percentile continuous function to analyze the median revenue of sales, continuing the same trend of analyzing across categories to see the highest-performing category. So where can we find out what statistical functions are available to us? We go to the source: the Postgres documentation has a section on all the aggregate functions, which includes the statistical functions. Scrolling down, we see MAX and MIN, which we'll use shortly, which find the maximum or minimum value of an expression across all non-null input values; and similarly, there's a whole host of in-depth statistical functions, covering correlation, R-squared, standard deviation, and even variance. So let's get into actually analyzing with MIN, MAX, and AVG; I need you to start up a blank notebook to work with. What are we analyzing? If you remember from last lesson, we calculated the total net revenue by category, broken down for 2022 and 2023. We'll take a very similar approach to keep it simple with MIN, MAX, and AVG, so if you still have that query, go ahead and copy it from above, as we'll be reusing and modifying it to apply these new functions. In my blank notebook, I'll paste it right here and run it with Ctrl+Enter. The first function we'll try is AVG, which finds the arithmetic mean of all non-null input values, so it's pretty simple. We keep the query mostly the same, but instead of SUMs, we perform averages, and accordingly I rename the aliases appropriately, like avg_net_revenue_2023. Pressing Ctrl+Enter to run this, and bam, we have our average values. This is pretty neat: if you remember, Computers had the highest total revenue, and yet Home Appliances has the highest average net revenue. Anyway, if you want to visualize this, we could click that chart button and try it below, but since this is categorical data, I don't find the drafts it provides that good. Because this table is so small, I can sometimes just copy the entire contents and paste it right inside the chat itself, since it doesn't take up much space, with the prompt to visualize it; and bam, we can see the average values for the different categories across the years and actually compare them, seeing things like Computers being lower in 2022 than 2023.
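The AVG version of the pivot looks roughly like this. One assumption worth flagging: I drop the ELSE 0 here, so rows outside the year become NULL and AVG ignores them; keeping ELSE 0 would dilute each year's average with zeros:

```sql
%%sql
SELECT
    p.categoryname,
    AVG(CASE WHEN s.orderdate BETWEEN '2022-01-01' AND '2022-12-31'
             THEN s.quantity * s.netprice * s.exchangerate END) AS avg_net_revenue_2022,
    AVG(CASE WHEN s.orderdate BETWEEN '2023-01-01' AND '2023-12-31'
             THEN s.quantity * s.netprice * s.exchangerate END) AS avg_net_revenue_2023
FROM sales s
LEFT JOIN product p ON s.productkey = p.productkey
GROUP BY p.categoryname;
```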
Now that we understand average, let's explore MIN and MAX. It's going to be a very similar syntax, so I'm going to use some AI to automate this: I'll copy this query, open up Gemini, and give it the prompt "add in MIN and MAX statements, similar to what was done with these AVG statements," pasting the query below it. Let's see if it can actually do this. Expanding it out to inspect... and it looks like it got it done. We don't need to be all repetitive; any time there's a repetitive task, hand it to AI. I'll insert this into the notebook and close out of Gemini, and then I have all these syntax errors, because, remember, we don't have the SQL magic command at the front. Adding it and running with Ctrl+Enter: bam, now in one table we have not just the average but the min and also the max, all formatted and typed out correctly. Pretty neat, with AI.

So let's crank this up a notch and do a similar analysis using the median. For those not familiar with what a median is: if you have a list of numbers and you sort them in order, the median is the middle number. With the seven numbers shown here, 6 is the middle number, whereas with eight numbers we take the average of the fourth and fifth numbers (say, 4 and 5), and the median becomes 4.5. Now, the median is extremely important when you're working with data. Here we're looking at a salary distribution (this is from my Python course): salaries go up, come back down, and then trail out to high outliers way past $350,000. If we just use the average, it gets pushed toward a higher number and isn't representative of the actual data. The median fixes this issue by sorting all those numbers and taking the middle one, giving a more representative figure of what you'd expect to see; in this case, for salary, I wouldn't want to expect a higher salary when I know I'm going to get a lower one. Now, you may think, "This is pretty easy: all I have to do is change all these AVG calls to MEDIAN and run it." Unfortunately, there's no such function as MEDIAN, hence the "no function matches the given name and argument types" error, and that's common across all databases, whether you're using Postgres, SQL Server, or MySQL. Instead we'll use PERCENTILE_CONT, the percentile continuous function, and that "continuous" portion is a key part, because PERCENTILE_CONT is not only an aggregate function but an ordered-set aggregate function. What does that all mean? Not only do we give PERCENTILE_CONT the fraction we want (for the median, half, so 0.5), we also have to use this other syntax: WITHIN GROUP, and then ORDER BY. So let's break that syntax down. The percentile continuous function needs to be given a list of ordered values; it won't sort them itself and pick one out like other aggregate functions. Because of this, we first have, in parentheses, an ORDER BY column, which specifies how to order the values we're picking the median from. But that only sorts the values; we also need to bind them to the PERCENTILE_CONT function, and that's what the WITHIN GROUP portion does: it binds that ordered set, if you will, to the function. Let's take a simple example first, using net price. I have a simple SELECT statement of net price from the sales table, and it looks like we have almost 200,000 rows of net price. Let's get the median value: I'll start with that percentile continuous function (notice there's also PERCENTILE_DISC, the discrete version, for when you don't want it to average two middle numbers but to pick an actual value; I mostly stick with continuous), and we're finding the median, the value at the 50th percentile, hence 0.5. We bind the values using WITHIN GROUP, and in parentheses, ORDER BY that net price, and we'll assign the alias median_price. Let's run this bad boy, and bam, we get a median price of about $191. Just out of curiosity, I'll compare it to the average net price, and similar to the salary data I showed earlier, the average price is much higher. That's because high-value items that aren't purchased as often drive the average up, so the median here is a lot better at giving a representative understanding of the common net price people are seeing.
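The median-versus-average comparison from this demo, in one query:

```sql
%%sql
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY netprice) AS median_price,
    AVG(netprice) AS avg_price
FROM sales;
```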
It looks like we have almost 200,000 rows of net price; let's get the median value from this. I start with that PERCENTILE_CONT function. Notice there's also a PERCENTILE_DISC (percentile discrete): use that if you don't want the two middle numbers averaged and instead want an actual value picked from the set. I mostly stick to continuous. We're finding the median, the value at the 50th percentile, hence 0.5. Then we bind the ordered values to it, so we use WITHIN GROUP and, in parentheses, ORDER BY, specifically on net price, and then assign an alias of median price. Let's run this bad boy, and bam, we get a median price of $191. Just out of curiosity, I'll compare it to the average net price, and similar to what we saw with that salary data earlier, the average price is much higher. That's because high-priced items that aren't purchased as often drive the average up, so the median here is a lot better at giving a representative understanding of the common net price people are paying.

So what are we calculating in this last example? We want to find the median sales by category, comparing 2022 to 2023. Notice I'm saying "sales" here; I'll use that word more frequently, and while this is technically net revenue, on the business side we typically just call it sales. I don't like starting from scratch, so I'll work from that very last query where we found the average, min, and max for the different categories: copy it all and paste it into a new code cell. Remember, we're not using AVG, MIN, or MAX anymore, so I remove them. Let's start by getting the median net revenue (or sales) across all the years, then filter down to 2022 and 2023 after. I define the median by specifying PERCENTILE_CONT, use the binding clause WITHIN GROUP, and then ORDER BY on the net revenue, which is quantity times net price times exchange rate, giving it the alias median sales. All right, let's see if this bad boy works: Ctrl+Enter. Not bad, we have our median sales for the categories.

But jumping ahead, remember we want columns for 2022 and 2023. Starting with 2022 first, I'll put year 2022 in the alias so we know what we're working with. Basically, we want to provide only the 2022 values inside the parentheses, and for that we need a CASE WHEN statement. I add CASE, press Enter, indent, and fill in the pattern: WHEN a column meets a condition, THEN we want the net revenue value, and after that we END it. The condition to meet is verifying the order date is in 2022, so I define it on order date using the BETWEEN keyword, from January 1st, 2022 to December 31st, 2022; when that's met, the expression equals the net revenue. For good measure, to make it a little more readable, I'll wrap that CASE expression in parentheses.
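Here's a rough sketch of both median queries from this section, again with my assumed table and column names. Note that ordered-set aggregates like PERCENTILE_CONT ignore nulls in their sorted input, which is why the no-ELSE CASE trick works for the year filter too:

```sql
-- median vs. average net price across the whole table
SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY net_price) AS median_price,
    AVG(net_price) AS avg_price
FROM sales;

-- median sales per category, filtered to 2022 via a CASE inside the ORDER BY
SELECT
    p.category_name,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY
        (CASE WHEN s.order_date BETWEEN '2022-01-01' AND '2022-12-31'
              THEN s.quantity * s.net_price * s.exchange_rate END)
    ) AS median_sales_2022
FROM sales s
LEFT JOIN product p ON s.product_key = p.product_key
GROUP BY p.category_name;
```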
So just to be clear: inside this ORDER BY, at the pink parenthesis right here, a CASE statement checks whether a row falls within a certain date range; if it does, the expression equals the net revenue value, else it contributes nothing. Let's run this. All right, not too bad. Now I want to add in 2023, and I don't feel like retyping all the values, so I'll use Gemini: I paste in the code with a prompt to add in 2023 the same way, and it looks like it did it correctly. I insert it, double-check it (yeah, looking good), and running this final query, we have the median sales for 2022 and 2023.

Taking the analysis a step further: comparing those median sales to the total net revenue (which is also total sales), we can see some interesting insights, specifically for Computers. From 2022 to 2023 the median sale actually went down, and correspondingly, the total sales of that same Computers category went down. Maybe something you could bring up to the SVP of Computers.

All right, you now have some practice problems to get even more familiar with using these statistical functions when applied to pivoting with CASE WHEN statements. In the next lesson we're getting into advanced segmentation, learning how to use the AND keyword and multiple WHEN clauses to break down analysis even further. With that, I'll see you in the next one.

Welcome to this last lesson on using CASE statements to pivot data; in this one we're going into advanced segmentation. What is segmentation? It's a really important data analytics concept: taking large data sets and breaking them into smaller pieces in order to analyze different behaviors. As a data analyst, I apply this concept all the time with large data sets so I can dive deeper into the details and understand different behaviors.

How are we going to do this? We'll start off easy. In our first example, we'll learn to use AND within a CASE WHEN statement to evaluate multiple conditions. For the net revenue of all those eight categories, we'll segment by year (2022 or 2023) along with whether an order sits below or above the median price: namely, the net revenue for orders less than the median value, and the net revenue for those greater than it. For the second and final example, we'll look at how to use multiple WHEN clauses within a single CASE block. This is particularly important whenever you need different outcomes based on different conditions; for our example we'll break the revenue down into multiple tiers, so instead of looking at orders below and above the median value, we'll take it a step further and bucket orders by where they fall within certain percentiles.

Before we get into both those examples, I want to quickly demonstrate how to use these two concepts. The first: we can use AND to combine multiple conditions within a CASE WHEN statement, and this is simply done by writing condition one AND condition two, using that AND keyword. For this demo we're looking at two things, quantity and net price, and I'll run this query right here.
Alongside our order date we can see things like the quantity and the net price. What we want to do is create a new column classifying whether an order is a high-value order, basically a quantity greater than or equal to two and a net price greater than or equal to, say, $50; if it's not that, we just call it a standard order. I insert a new line underneath the SELECT statement starting our CASE statement, then indent in and insert WHEN. We're checking two things: that the quantity is greater than or equal to two, and, adding the AND keyword, that the net price is greater than or equal to 50. When this multiple-condition check is met, we categorize it as a high-value order; ELSE we categorize it as a standard order. We END the CASE statement and give it the alias order type. Running this, pressing Ctrl+Enter, we can see this lets us apply multiple conditions, and the rows meeting both thresholds are categorized as high-value orders.

The second concept to understand: we can use multiple WHEN clauses within a single CASE block. There's no limit to the number of WHENs we can put in here, but every time you have a WHEN you then need a THEN specifying the result you want, and you usually end with some sort of ELSE. I find myself using this approach of multiple WHEN statements whenever I have to break things into several categories. For example, we'll follow the same approach, looking at quantity and net price, but now with finer categories. Previously we were just categorizing high-value orders (at least two items and at least $50) with everything else a standard order. Instead, we want to classify more precisely: "multiple high-value items" means a quantity of two or more, with the price threshold now changed to $100 or more; "single high-value items" are those at $100 or more but with a quantity under two, so one; "multiple standard items" are a quantity of two or more but under $100; and everything else is a "single standard item."

We can build on the query we already wrote. One thing to note, like I said: the threshold is no longer 50, I change it to 100 and call that first branch multiple high-value items. From there, I enter a new line, enter a WHEN checking whether the net price is greater than or equal to 100; in that case we call it a single high-value item. Next, WHEN quantity is greater than or equal to two, THEN multiple standard items. And finally, the ELSE is a single standard item. Let's run this bad boy, and scrolling down, we can see it appropriately classified rows based on these multiple conditions. Pretty neat. If there's anything you get out of this video, it's these two concepts, because we're about to get into more technical examples demonstrating how you'd use them in the real world.
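Here's a compact sketch of that second demo, the four-tier CASE. A CASE evaluates its WHENs top to bottom and stops at the first match, which is why the later branches can get away with simpler checks; column names as assumed earlier:

```sql
SELECT
    order_date,
    quantity,
    net_price,
    CASE
        WHEN quantity >= 2 AND net_price >= 100 THEN 'Multiple High-Value Items'
        WHEN net_price >= 100                   THEN 'Single High-Value Item'
        WHEN quantity >= 2                      THEN 'Multiple Standard Items'
        ELSE 'Single Standard Item'
    END AS order_type
FROM sales;
```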
As always, I like to start with what we're aiming to achieve, and similarly we're continuing our analysis of the categories. This time we want to break it down not only by year, looking at revenue in 2022 versus 2023, but (and this is the reason we need the AND condition) segment it further in order to understand low revenue versus high revenue. What do I mean by that? In the last lesson we looked at the median value for a single order; basically, we want to see the total net revenue for orders less than the median (low) and for those greater than the median (thus high). Now, technically, as we showed last lesson, we have median values for each of the different categories, but to keep it really simple while we learn how this AND condition works, we're just going to calculate a single median value across all the categories and apply it to all of them.

Let's start by calculating that median value. We start with SELECT, use that PERCENTILE_CONT function, then WITHIN GROUP as our bridge to the ORDER BY, calculating our net revenue of quantity times net price times exchange rate. We'll name it median for right now. We need this FROM the sales table, and since we're working between 2022 and 2023, we add a WHERE clause as well, specifying the order date falls in that range. I have a syntax error because I applied an alias of s to sales; we'll be using it later, so we'll keep it. I add that s, run it again, and we get a median value of 398.

So remember what we want to do: with that median value of 398, calculate the total revenue for orders less than 398 (low revenue) and likewise the high revenue, doing this for both 2022 and 2023. So I go back to the query we used previously, where we calculated those median sales at the end of the last lesson, copy it, and back in our blank notebook, paste it in underneath our median calculation. We're not going to use those median sales columns as we calculated them there, so let's just run this to see what we have right now; it should show all eight of our categories.

Now we can add in those calculations. For this step I don't care about 2022 versus 2023 yet; I just want to calculate the low net revenue and the high net revenue. We'll start with low. We know we're adding everything up, so we use the SUM function, and inside it a CASE WHEN with that pattern of CASE WHEN condition THEN value END, finished with the alias low net revenue. The condition is that the order value is less than that 398, so I add quantity times net price times exchange rate and make sure it's less than 398. This line is getting a little long, so I bump it down and indent it. Now, what is the value? The value is the revenue itself, so I copy the same formula we used above (Cmd+C) and place it in the value slot. Let's see if this works as written; it does. Do these numbers make sense? Yeah, they do. Now let's add a statement for the high net revenue: I copy this all, add a comma, paste it in, change the comparison to greater than or equal to 398, and change the alias to high.
Then, running it to make sure we have no errors, it looks like we have everything: we've at least gotten that low net revenue and that high net revenue. The next thing is to actually segment it down by 2022 and then 2023, and this is where we finally get to what we're trying to teach with this concept: using AND in a CASE WHEN condition. Specifically, inside our CASE WHEN we can check two different conditions using that AND keyword, adding the second condition before the THEN. So I press Enter, indent in, and start that second AND condition. We've already checked the first condition (is it less than the median value?); the next thing to check is that the order date is between January 1st, 2022 and December 31st, 2022. I can copy that date check and place it in this condition as well, then update the aliases so both say 2022. Let's run this and make sure it's working properly... I got a little bit of a typo, I added two BETWEENs in there, I don't know what I was thinking. Try again; okay, now it's working.

This is looking pretty good; now all we have to do is add in 2023. I'll use Gemini for this because I don't really want to retype all that code: paste in the query with the prompt "add in two columns for 2023," it generates them, I add that code cell in, add in the SQL magic command, run it pressing Ctrl+Enter, and bam: now we have the net revenue for 2022 and 2023, whether low or high.

So that's how you use the AND condition, but I'm going to be honest: in this query I don't like hard-coding values like this, so I'll show you real quick a little advanced technique so you don't have to. If we scroll back up to our original query that calculates our median value, we can use a CTE to insert that value into this query. As a refresher, CTEs start with that WITH command, and from there you assign the name of the CTE: I name this one median value, use AS, put an opening and closing parenthesis, and paste in that median query, indented to make it easier to read. We're not using it yet, but I press Ctrl+Enter to make sure my query still runs correctly; it does. Then all I need to do is insert it into the FROM as median value with the alias mv, and, remembering the column name is median, replace that 398 (we can press Cmd+Shift+L to select all instances) and type in mv.median. Okay, let's run this bad boy and hope it works... and it does: we get our final values with no hard-coded numbers.

This is pretty good. Now that we can see these breakdowns between the years based on high and low net revenue, visualizing it helps us actually understand what happened with the Computers sector. Remember, Computers dropped in revenue; well, it wasn't from the low-revenue side, those orders valued less than the median. What really happened is they saw a drop in the high-value orders, those greater than the median. A pretty interesting insight, and we'll break it down even further in the next example.
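Putting those pieces together, here's roughly what the finished query looks like with the CTE swapped in for the hard-coded 398. I'm using a CROSS JOIN to bring in the one-row median (the lesson just adds it to the FROM list, which amounts to the same thing), and showing only the 2022 columns; the 2023 pair repeats with the dates changed:

```sql
WITH median_value AS (
    SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (
               ORDER BY quantity * net_price * exchange_rate) AS median
    FROM sales
    WHERE order_date BETWEEN '2022-01-01' AND '2023-12-31'
)
SELECT
    p.category_name,
    SUM(CASE WHEN s.quantity * s.net_price * s.exchange_rate < mv.median
              AND s.order_date BETWEEN '2022-01-01' AND '2022-12-31'
             THEN s.quantity * s.net_price * s.exchange_rate END) AS low_net_revenue_2022,
    SUM(CASE WHEN s.quantity * s.net_price * s.exchange_rate >= mv.median
              AND s.order_date BETWEEN '2022-01-01' AND '2022-12-31'
             THEN s.quantity * s.net_price * s.exchange_rate END) AS high_net_revenue_2022
    -- ...same two columns again for 2023
FROM sales s
LEFT JOIN product p ON s.product_key = p.product_key
CROSS JOIN median_value mv
GROUP BY p.category_name;
```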
In this example we're building on that last one by using multiple WHEN clauses within a CASE block. Previously we were only using one WHEN clause, but we're going to step it up a notch and use two. Why is this important? It's going to allow us to segment a column into, in our case, different revenue tiers: everything will be categorized as high, medium, or low, along with the total revenue calculated for each. What does this high, medium, and low revenue even mean? We're segmenting based on where an order falls within the percentiles: if an order's revenue is less than the 25th percentile we categorize it as low, between the 25th and 75th is medium, and greater than the 75th is high. Why the 25th and 75th percentiles? It's actually pretty common in statistics to use these values to bucket things into their quartiles; technically, statisticians call that range between the 25th and 75th percentiles the interquartile range. We're just going to call it medium, with everything below it low and everything above it high. So that's where these numbers come from. Now you may be asking, "Luke, how the heck do I calculate the 25th and 75th percentiles?" Remember, the median is just the 50th percentile, which is why we used 0.5 in that case; the same function handles any fraction.

So let's copy what we used before, and I'll put in 0.25 and call it the 25th percentile. Running this, we get an error, because silly me, you can't start an alias with a number in SQL; it needs to start with a letter, so I'll just say this is revenue 25th percentile, and now it runs. Okay, not bad. Let's do the same for the 75th percentile, and adding it all in and pressing Ctrl+Enter, we have the 25th and 75th percentiles; the 75th should be much greater than the 25th, and as expected it is. Looking good.

We're going to be using these values in our final query, and we're not going to hard-code them in, so I'll create another CTE: I tab the query over, create a CTE named percentiles, and assign the query inside the parentheses. Running it to check it works... it won't yet, because a CTE needs a query below it. I don't want to start that bottom query from scratch, so I scroll up to the previous query we just created, copy it, paste it in here, and modify it. First I get rid of all of those conditions we created, along with that median value CTE, because we're no longer using them, but I do bring in the CTE of percentiles, naming it something like pctl. Okay, let's try to run this, and I can see the error: I have a comma after the END, before the FROM. Try again; okay, we got all the categories as expected, and we have our CTE wired in.

The first thing, and I think the easiest because we already have this GROUP BY in place, is to add an aggregation providing the total revenue: we use the SUM function, add in quantity times net price times exchange rate, and give it the alias total revenue. All right, let's run this. We have our categories and total revenues. We now need one more step, and it's what we're actually trying to learn to implement: breaking all these different categories down into those revenue tiers.
For this we're going to use multiple WHEN statements within our CASE. I copy the CASE skeleton from earlier and paste it in between here. The first thing we want to check is whether we meet that low-tier condition, basically that the order revenue is at or below that 25th percentile: for the condition, I copy the revenue calculation, paste it in, and compare it against the revenue 25th percentile we're calculating up above, brought in under that alias of pctl, and we assign it the value of low. For the next condition I copy that line, paste it in as condition two, and check that it's greater than or equal to the 75th percentile; in that case it's high. Everything else gets classified as medium. For the alias, we name it revenue tier.

All right, looks like everything's in order; let's run this bad boy, pressing Ctrl+Enter... and we get an error, which I catch because it's pointing at the WHEN clause. We're aggregating, right? We're doing the SUM of the total revenue based on these different tiers, so technically we also need to GROUP BY the tier. Underneath the category I add revenue tier to the GROUP BY, run this again, and bam, now we've got it. Not too bad. Now, technically, I'm a little nitpicky and I'd like the order to be low, medium, high, so I'm going to put number values at the front of these labels so they sort in the correct order. I add that in, put revenue tier underneath the category name in the ORDER BY, run Ctrl+Enter, fix a missing comma, run Ctrl+Enter again, and okay, now we have the tiers in the right 1-2-3 order.

This is pretty neat, because we're able to do multiple segmentations to analyze these different revenue tiers, and when we visualize it in something like a 100% stacked column chart, the high tier shows in light blue, with medium and low getting darker. Something like that Computers sector we keep talking about is very reliant on revenue coming from high-ticket items, those greater than the 75th percentile, whereas something like Games and Toys is highly reliant on low- and medium-value items. Pretty interesting insight.

One last technical note, for both the first and last problems: we used these 25th and 75th percentiles across the entire range of categories, and similarly, in that first problem, we used a single median value across all the different categories. Technically this isn't necessarily best practice; if we went back to that first problem, you'd actually want to calculate the median for each of the different categories and then segment it down and break it down further from there. But when you look at that query, it gets a lot more complex and isn't really the focus of this lesson, which is adding multiple WHEN conditions or using that AND condition. If you'd like to see the nitty-gritty technical details, they're in the notes.
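For reference, here's a sketch of that finished tier query, again under the same naming assumptions (the number prefixes on the labels are just the sorting trick from the lesson; pick whatever labels you like):

```sql
WITH percentiles AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (
            ORDER BY quantity * net_price * exchange_rate) AS revenue_25th_percentile,
        PERCENTILE_CONT(0.75) WITHIN GROUP (
            ORDER BY quantity * net_price * exchange_rate) AS revenue_75th_percentile
    FROM sales
)
SELECT
    p.category_name,
    CASE
        WHEN s.quantity * s.net_price * s.exchange_rate <= pctl.revenue_25th_percentile THEN '1-Low'
        WHEN s.quantity * s.net_price * s.exchange_rate >= pctl.revenue_75th_percentile THEN '3-High'
        ELSE '2-Medium'
    END AS revenue_tier,
    SUM(s.quantity * s.net_price * s.exchange_rate) AS total_revenue
FROM sales s
LEFT JOIN product p ON s.product_key = p.product_key
CROSS JOIN percentiles pctl
GROUP BY p.category_name, revenue_tier
ORDER BY p.category_name, revenue_tier;
```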
All right, you now have some practice problems to go through and get more familiar with using these multiple different conditions for segmentation. With that, we'll be moving into the next chapter on dates, where we'll work with a lot of different functions. See you in the next one.

Welcome to this chapter on date calculations. In this one we're going to learn how to use different date and time functions and keywords in order to analyze data. In the first lesson we'll get an intro to how this is useful for performing time series analysis; specifically, we'll use things like DATE_TRUNC and TO_CHAR to calculate things like the number of unique customers or the net revenue per month. In the second lesson we'll fine-tune how we can extract certain components of a date and use things like the current date, or NOW(), to investigate certain time periods relative to when we're analyzing. And in the final lesson we'll cap it off with keywords like INTERVAL and functions like AGE to calculate things like average processing time and compare that to the number of orders we have. This is all very important, and Kelly and I included it in these beginning chapters because, as you'll see throughout the rest of the course, a lot of the date-manipulation concepts we're learning get used in those future chapters. Date and time data is everywhere; you can't get away from it. I highly encourage you, for any of these functions or operators, if you're curious to learn more, to go into the source documentation, which I link.

So, we're going to be using the DATE_TRUNC function. What does it do? In DATE_TRUNC you provide the field you want the date truncated to, whether something like seconds, minutes, hours, days, weeks, or whatnot, and you provide the source, usually a date or timestamp. Let's go over a simple example first. I'm opening up a blank notebook with a simple query where we're looking at the order date, since that's what we'll be manipulating with this DATE_TRUNC formula, from the sales table, with only 10 outputs, and I run it with Ctrl+Enter. Okay, this is good, but all these dates are the same, and I want to see a variety of different dates. A little trick you can do: I add ORDER BY and then use the function RANDOM(). Now whenever I run this, pressing Ctrl+Enter, I get random dates right here, so we can better see whether the function applies across a lot of different data.

Anyway, let's get into that DATE_TRUNC function. Typing it out, the editor hints that we need the date part and the date expression, and we can also use the documentation (which I just clicked open) to investigate further what's going on here; pretty convenient. First we input the date part, and since it's a string it needs single quotes: I put in 'month'. From there we put in the source, order date. Now we run this, pressing Ctrl+Enter, and looking at these order dates, we can see everything is truncated to just the month. But notice the data type: it's getting converted to a timestamp, which is a little inconvenient and a little too verbose for me. We can clean this up: if you remember from a few lessons ago, the double colon is the cast operator, and we can cast this as a date instead of a timestamp. Running Ctrl+Enter, now it outputs as a date. One step further, I also rename this as order month, and we're good to go.
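A quick sketch of that warm-up, including the random-ordering trick and the cast:

```sql
-- truncate each order date to the first day of its month,
-- then cast the timestamp back down to a plain date
SELECT
    order_date,
    DATE_TRUNC('month', order_date)::date AS order_month
FROM sales
ORDER BY RANDOM()   -- shuffle so we see a variety of dates
LIMIT 10;
```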
All right, so what are we actually trying to calculate in this exercise? This is the final table we're aiming to get to: it has the order month, which I just showed you how to build, and with it we're going to use GROUP BY to analyze the net revenue and the total unique customers. Let's start with that query we just built and add onto it, first calculating the net revenue. Remember, we use the SUM function for this, calculated by multiplying quantity times net price times exchange rate. I want to make sure this is operating correctly, and... silly me, we're doing an aggregation of SUM, so we need to perform some sort of GROUP BY. So instead of that ORDER BY, I put in a GROUP BY, specifically calling out that we want it by the order month. I'll also go ahead and remove the raw order date so we don't have to add it to that GROUP BY. Run Ctrl+Enter; okay, not bad, we're getting net revenue per order month. We only have 10 results right here because of the LIMIT 10; mainly I do that when building queries so they run more quickly, and I'll take it off at the end. The next thing to get is total unique customers: we need a distinct count of the customer keys, so COUNT, then DISTINCT, specifying customer key, and then we assign the alias total unique customers. Running this: total unique customers, net revenue, and order month, exactly what we needed. We can now remove that LIMIT 10 and press Ctrl+Enter. Bam.

Now, DATE_TRUNC is really great when you just want to specify one granularity to extract, such as month, as we did; you could also do quarter, year, decade, century, or even millennium. When I'd like to customize it more, I like to use the TO_CHAR function: you provide it something like a timestamp plus a text format, and it outputs the value in that text format. Scrolling down the documentation, there's a host of different options you can use, things like hour of the day, year, even month, and the good thing is you can actually combine these together in whatever format or order you want.

Let's show this simply by taking the order date and formatting it as month and year. We press Ctrl+Enter and we have our random dates again. Now let's add this new function on its own line: I enter TO_CHAR, then the field itself, and then what we want the output to be. In the case that we want just the year, I put the format in single quotes and run it, and unlike the last one, where we had to cast to a date to strip off the time, it just outputs what we need. And if I want something more, say not only the year but also the month, I can just add it to the format string (a double M in that case). Running Ctrl+Enter, now we have the month and year. That formatting table in the docs is super helpful for understanding the different formatting options you have.

So, back to the original example we were working with: we can replace that entire DATE_TRUNC line with this TO_CHAR function, specifying order date and how we want it formatted, and of course we give it that same alias of order month so the GROUP BY keeps working properly. I press Ctrl+Enter and... I forgot a comma after this. Fixed, and it runs.
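Here's the shape of that monthly roll-up with TO_CHAR swapped in. The exact format string in the lesson is a month-and-year combination; I'm using 'YYYY-MM' here as an assumption, since it also happens to sort chronologically:

```sql
SELECT
    TO_CHAR(order_date, 'YYYY-MM') AS order_month,
    SUM(quantity * net_price * exchange_rate) AS net_revenue,
    COUNT(DISTINCT customer_key) AS total_unique_customers
FROM sales
GROUP BY order_month
ORDER BY order_month;
```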
And now I feel this output is a lot more readable for the order month, since it removes the day and we can actually see, for each of these months, the revenue and total customers. Because we aggregated on a monthly basis versus the daily basis we were previously using, we removed a lot of noise, and we can see that in 2020 we obviously had some sort of worldwide event that caused an impact in the number of unique customers and the net revenue we had, but it looks like as of 2022 these numbers returned to normal, except for a slight dip in 2023; maybe something we'll have to investigate later. All right, it's your turn to go and test these out; we have a few short practice problems for you to get more familiar with this. In the next lesson we're jumping into even more dynamic functions such as CURRENT_DATE or even NOW(). With that, see you there.

In this lesson we're going to build further on what we learned in the last lesson, specifically by understanding more about how we can actually filter dates, and even do it dynamically. First we'll learn two more functions for formatting and pulling apart dates, and for this we'll dive into analyzing things like the net revenue for each year for every category; then we'll use things like CURRENT_DATE and NOW() to filter data by a time frame measured back from right now. Pretty neat.

First of the two functions is DATE_PART, and this one extracts a specific component from a date or timestamp: as we can see, it's the DATE_PART function with the unit and then the source, your column name, with a host of different unit options. With the sample query we can look at things like the year, month, or day: we use that DATE_PART function, specifying those different components on the applicable column, and give it appropriate aliases. One thing to note, which is not necessarily my favorite part, is the values come back with precision, so they have decimals after them, and I don't necessarily want that depending on what unit I'm working with.

Because of that, I prefer to use something like EXTRACT. It has a very similar format to DATE_PART (it's actually based on it), and going through the docs, we can do things like day, decade, dow, and whatnot; basically everything we can use in DATE_PART we can use in EXTRACT. The syntax is slightly different, though: instead of passing the unit as a string, we write it as an uppercase keyword, and then we say FROM the source, in our case our column name. So this bottom query uses EXTRACT to do exactly the same thing we did just above with DATE_PART: we specify year, month, and day from order date and provide the appropriate aliases. Let's run this bad boy, and like I mentioned, I like this one a lot better, especially when dealing with years, months, or days, where I want integer values.

So let's use this EXTRACT function to analyze the net revenue per order month. Previously, remember, we used that TO_CHAR function to analyze net revenue per order month as one formatted column; instead, let's create separate columns for the month and for the year.
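Here are the two extraction styles side by side, as I understand them from the lesson (in the lesson's notebook DATE_PART comes back with decimal points while EXTRACT displays clean integers; exactly how they render can vary by Postgres version and client):

```sql
-- DATE_PART: unit passed as a string
SELECT
    DATE_PART('year', order_date)  AS order_year,
    DATE_PART('month', order_date) AS order_month,
    DATE_PART('day', order_date)   AS order_day
FROM sales
LIMIT 10;

-- EXTRACT: unit written as a keyword, with FROM
SELECT
    EXTRACT(YEAR FROM order_date)  AS order_year,
    EXTRACT(MONTH FROM order_date) AS order_month,
    EXTRACT(DAY FROM order_date)   AS order_day
FROM sales
LIMIT 10;
```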
I remove that TO_CHAR portion and put in EXTRACT; it gives us a hint up here that we put in the part first, which is the year, next the keyword FROM, and then the date expression, specifically order date, with the alias order year. Next we get into adding the month one: we write the EXTRACT formula for month FROM order date, with the alias order month. Now, for the GROUP BY, we're doing two columns, so it becomes order year and order month. This looks good, let's run it... and it's all over the place, so I'll change this to add an ORDER BY at the end so we can get some semblance out of this data. Bam, this looks a lot better, and now that we have these in different columns, whoever I give this data to can slice and dice it even more easily.

All right, let's get into some new concepts and implement dynamic filtering using things like the current date, or the time right now, to filter backwards. CURRENT_DATE first: typing a simple SELECT statement with CURRENT_DATE, the documentation pops up and basically says, hey, it returns the current date as of the specified or default time zone, and parentheses are optional when called with no arguments; basically, you can provide a time zone if you want. Anyway, running this, we can see we get the current date, and apparently I'm filming this on Valentine's Day. That reminds me, I need to call my fiancée and actually wish her a happy Valentine's Day, so I'm glad I saw that. Let's go to the next one, the function NOW(): similarly, I run a SELECT statement just calling the function (make sure you do have an opening and closing parenthesis), and we can see it's Valentine's Day at 2:30 in the morning. Okay, I'm not actually filming this at 2:30 in the morning; this is Greenwich Mean Time, which is over in England, so that's what time it is there. That's also why CURRENT_DATE gives you the option to throw in a time zone to adjust it appropriately.

So what are we going to be calculating? The short answer: we're going to look at the net revenue per category for orders within the last 5 years, counting back from today; we're basically building this table. The important thing to understand is that this is a dynamic filter, and these are very important to understand, because sometimes you'll have workflows set up that run queries automatically at midnight, and you don't want to be pulling in all the data, maybe only the data for the last 5 years; things like this are great for that. So let's start with a base query we've seen time and time again. First we extract the order date and the category name and perform a calculation for net revenue; we do this from the sales table, left joining it with the product table on the product key; and because we're doing an aggregation for the net revenue, we need to group by the order date and the category name. Pressing Ctrl+Enter, we've got it, though it looks like these dates are unordered, so I throw in an ORDER BY and run it again; now we have the dates in order. We only need one more step: we want to include only orders within the last 5 years, so basically we shouldn't be seeing anything from 2015 (I'm in 2024 as of filming this). We're going to create a WHERE filter in here to do exactly that.
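For orientation before we build the filter, here's roughly the base query so far, plus the two dynamic date functions (same naming assumptions as before):

```sql
-- base query: net revenue per day and category, no year filter yet
SELECT
    s.order_date,
    p.category_name,
    SUM(s.quantity * s.net_price * s.exchange_rate) AS net_revenue
FROM sales s
LEFT JOIN product p ON s.product_key = p.product_key
GROUP BY s.order_date, p.category_name
ORDER BY s.order_date;

-- the two dynamic date values we'll filter against
SELECT CURRENT_DATE;  -- today's date only
SELECT NOW();         -- full timestamp with time zone
```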
But I want to break it down slowly to show what's going on step by step, so I'm going to build up the components we're going to use to filter within the WHERE clause. We'll start simple first: look at CURRENT_DATE and see what it outputs here; as expected, we see it's Valentine's Day. Now, in order to capture the last five years, we need the year of the current date and the year of the order date. For this we use that EXTRACT command: we want the part, which is the year, then FROM, and we'll keep it simple first with just the order date, with the alias order year. Running this, it's working just fine; we're getting that order year for all those different order dates. Next, let's extract the year from the current date: we put in the keyword CURRENT_DATE, give it the alias current year, and run it. Okay, not bad. Now we want the year we'll actually filter by, which should be 5 years ago. So I copy that last expression (we'll get rid of all these scaffolding columns later), subtract 5 from it, and set the alias as minus 5. Clever, I know. Running Ctrl+Enter, we can see the minus-5 column is in fact five behind.

So what we need in our WHERE clause is to combine these in a way that filters on that. WHERE clauses go underneath the FROM (or the LEFT JOIN), and for this we want the order year, so I copy that expression from above and check that it's greater than or equal to that current-year-minus-5 value we built, then drop it into the WHERE. Let's run this, Ctrl+Enter, and bam: it's a little bit hard to read, but just looking at the order date column, these are the orders for the last five years. I'll go ahead and remove those unnecessary scaffolding columns now; they were just for building this, and we can see the query is built for the last 5 years.
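Assembled, the dynamic filter looks something like this:

```sql
SELECT
    s.order_date,
    p.category_name,
    SUM(s.quantity * s.net_price * s.exchange_rate) AS net_revenue
FROM sales s
LEFT JOIN product p ON s.product_key = p.product_key
-- keep only orders whose year is within 5 years of the current year
WHERE EXTRACT(YEAR FROM s.order_date) >= EXTRACT(YEAR FROM CURRENT_DATE) - 5
GROUP BY s.order_date, p.category_name
ORDER BY s.order_date;
```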
Now you may be like, "Luke, it's Valentine's Day right now, but this goes all the way back to January 1st, 2020. What if we want to be very precise about that?" And I'll say, "Aha, we'll answer that in the next lesson." All right, you now have some practice problems to go through and get more familiar with these functions and with creating your own dynamic filters. In the next lesson we're going to get into date differences, basically using functions like AGE to measure the time between dates. With that, see you in the next one.

In this third and final lesson of the chapter on date and time functions, we're going to get into understanding how to calculate intervals. In the first half, we'll continue on from that problem in the last lesson, and instead of that really verbose way of calculating the last 5 years, we'll use the keyword INTERVAL to write much more succinct, readable queries for the span we want. In the second half, we'll get into a pretty interesting business problem of exploring average processing time: to calculate that interval between the order date and the delivery date, we'll use functions like AGE, and also show what can be done with that previous function, EXTRACT.

First, let's explore how to use this keyword INTERVAL. An interval can represent a span of time such as days, months, hours, minutes, decades, or even weeks, and we use it by writing the INTERVAL keyword followed by a value and unit. Let's test this bad boy out: I run a simple SELECT statement, specify the keyword INTERVAL, and try something like five centuries. We can see it gets the title of interval and calculates it to be 182,500 days. The normal output for this is in days, whether you're using centuries or something like months; running Ctrl+Enter, it comes out in days.

So how can we use this in the query from the last lesson to filter for orders within the last 5 years? I've simplified the query: basically, we're pulling the current date and the order date from that sales table, and we use that formula of pulling the year out of the order date and comparing it to the year of the current date minus five. Run this, and we can see the current date is Valentine's Day and the order dates are within the last 5 years. But notice, as I called out last time, this goes all the way back to January 1st, 2020, so technically it's slightly more than 5 years. Let's write this query a lot more succinctly: I remove that portion, and since we want the order date we're filtering for to be greater than or equal to the date 5 years ago, we can once again use CURRENT_DATE and subtract from it the INTERVAL of 5 years. Running this bad boy, we can see this one now covers exactly the last 5 years, as shown by the order date, and it's very precise: it filters right down to February 14th, Valentine's Day, in 2020.

Getting into cleaning up that full query from last time, this is actually it right here: if we run it again, it has the current date, order date, category name, and net revenue, breaking the different net revenues down by category. This one, remember, technically wasn't exactly 5 years, so we go back and replace that portion with the newly formatted clause we came up with, and whenever we run it, we can see that now we have it for exactly the last 5 years.
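The before-and-after of that filter, in miniature:

```sql
SELECT INTERVAL '5 centuries';             -- rendered as 182,500 days in the lesson's notebook
SELECT CURRENT_DATE - INTERVAL '5 years';  -- five years back, to the day

-- verbose version: whole calendar years only
-- WHERE EXTRACT(YEAR FROM order_date) >= EXTRACT(YEAR FROM CURRENT_DATE) - 5

-- succinct version: exactly five years back from today
SELECT order_date
FROM sales
WHERE order_date >= CURRENT_DATE - INTERVAL '5 years';
```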
All right, before we get into AGE and a review of that EXTRACT function, let's look at what we're actually trying to solve in this portion of the lesson. If you recall, we have two columns, an order date and a delivery date column (I forgot to put a comma here, so one wasn't appearing; now it is). With this type of information, we can calculate something like the average processing time for a customer to receive an order, a very important metric in business analytics. What we're going to do by the end of this is show, on a yearly basis, not only the net revenue (those blue bars) but also the average processing time for those years, and we find that it's going up.

So what are we using for this? The AGE function: you provide it two timestamps (in this case we provide dates) and it outputs an interval. Let's do a simple example first, right inside a SELECT statement: we use the AGE function and provide it two dates. Now, I have a couple of errors with this. Running it first, pressing Ctrl+Enter, we get an error saying, hey, no function matches the given name and argument types, specifically age(integer, integer). I don't want it evaluated as integers, I want dates, and the problem is that the dates have to be provided as strings, so make sure you have single quotes around them. Running Ctrl+Enter, we have it. But if you notice, I went from the 8th of January to the 14th of January and it's saying minus 6 days: for the AGE function to give a positive value, you need to provide the end date first and then the start date. Placing these in the correct order and running it, we get 6 days.

Now let's say we want to do some math with this. I currently have six days, and say I want to subtract, I don't know, five days from it. If I put a minus five after this and try to run it, I get an error, specifically that the operator does not exist: interval minus integer. Right now this is an interval and we're trying to subtract an integer; we need to convert this portion to an interval... sorry, I mean to an integer, I need more coffee. Well, we can use the EXTRACT function. I cut the expression out, write EXTRACT, and we need the part, then the FROM keyword, then the date expression. For the part we specify DAY (it doesn't need to be in single quotes here; it understands that keyword), then FROM, and for the date expression I paste in that AGE call. Running this, pressing Ctrl+Enter, we can see we get the six out of it as a number, and now I can do the minus five: pressing Ctrl+Enter, we get one.
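A sketch of that warm-up. The lesson names the 8th and 14th of January without a year, so the year here is a placeholder I've filled in for runnability:

```sql
-- end date first, then start date, as quoted strings
SELECT AGE('2024-01-14', '2024-01-08');                        -- 6 days (an interval)

-- EXTRACT turns the interval's day component into a number we can do math on
SELECT EXTRACT(DAY FROM AGE('2024-01-14', '2024-01-08')) - 5;  -- 1
```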
So let's get into calculating the average processing time by year, and we're doing this for the last 5 years. Similar to the target table, we also need to calculate the net revenue, but we're not going to do that until the very end, because joining in the table with the revenue data makes our query a lot longer. Let's start simple: just the order date and delivery date, from the sales table, plus a new column for the processing time using that AGE function; remember, we put the end date first, so that's the delivery date, followed by the start, the order date, and we name it processing time. This query is getting quite long, so I'm going to throw in a LIMIT 10 just to start with. Running it, we're getting basically zero processing time, everything delivered the same day it's ordered, and I want to see some more varied orders, so I throw in ORDER BY RANDOM() and run it again; now we get some that actually have days in there, showing that this is working.

So let's start getting the average processing time, aggregated by year. Because of this we need the year: I use the DATE_PART function, and as the first argument we specify what we actually want, the year, and we want it out of the order date, when the order actually started, with the alias order year. The next thing is aggregating this age. But remember, processing time is of the data type interval ('3 days', '0 days'), and we're not going to be able to average that directly, so we need that EXTRACT function first: I specify the part, which is the days, the FROM keyword, and then AGE of delivery date and order date; I put a closing parenthesis on it and rename the alias to average processing time. Now, we're doing an aggregation, so one, we need a GROUP BY, and two, we can't keep the raw order date and delivery date columns since we're aggregating past them, so I remove them along with that ORDER BY RANDOM() and throw in a GROUP BY specifying order year. Also, this won't have a lot of outputs, so I remove the LIMIT statement and run it. And silly me, reading the error: the delivery date must appear in a GROUP BY clause or be used in an aggregate function; basically, I forgot to use the actual aggregate function right here. Oopsies. Throwing in AVG and running it again, we are getting the average processing times.

All right, we still need to clean this up: the years are out of order and the number of digits is just unreadable. I throw in an ORDER BY clause underneath, ordering by the order year, and running this, we now have it in order. Now let's clean up the average processing time; I only really want two digits. If you remember from the basics course, we went over the ROUND function: you provide the value or column x, and after that, n is the number of decimal places you want (if you don't provide anything, you get zero decimal places). We want two, so I put a 2 on there, run Ctrl+Enter, and bam, the average processing time is handled. Next, we want just the last 5 years, and since I don't like typing code if I don't have to, I copy that WHERE clause from our previous example up above and put it underneath the FROM statement; running this, we have the average processing time for the last 5 years.

All right, now just one last thing to do: add in the revenue for each of those years. For this we use the SUM function, multiplying quantity times net price times exchange rate, with the alias net revenue. Let's run this... not too bad, but typically with numbers this high I don't care about the two decimal places, so I'll use the ROUND function again, this time without specifying the second argument at all.
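Here's roughly the full processing-time query at this point. One defensive tweak in the sketch: I cast the average to numeric before rounding, since ROUND with a decimal count needs a numeric input (on older Postgres versions EXTRACT returns a double precision value):

```sql
SELECT
    DATE_PART('year', order_date) AS order_year,
    ROUND(AVG(EXTRACT(DAY FROM AGE(delivery_date, order_date)))::numeric, 2)
        AS avg_processing_time,
    SUM(quantity * net_price * exchange_rate) AS net_revenue
FROM sales
WHERE EXTRACT(YEAR FROM order_date) >= EXTRACT(YEAR FROM CURRENT_DATE) - 5
GROUP BY order_year
ORDER BY order_year;
```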
And okay, it looks like it's still giving two decimal places, just with zeros. It seems I can't get what I want out of ROUND here, but what I can do instead is use CAST and specify that I want to cast it as an integer. Ctrl+Enter, and bam, now we have what I want. And if we graph this, we can see the average processing time has gone up over time, from around a little less than one day up to 1.6; even with the dip in revenue (and, I think, the number of orders, probably loosely correlated), we still saw the average processing time go up. So this is a good little data point to keep track of and pass on. All right, it's your turn: you have some practice problems to get more familiar with the EXTRACT and AGE functions and the INTERVAL keyword, and from there we're going into the next chapter on window functions, a pretty complex topic I'm looking forward to getting into. See you there.

All right, welcome to this chapter on window functions. This is probably the most requested topic for me to cover in this intermediate course, so I was super excited to get into it. We're going to break this chapter up into five lessons, starting with this one focusing on the syntax of window functions, doing some simple examples and explaining them to you; then we'll start picking up the pace, looking at things like aggregation, ranking, and lag and lead; and finally we'll close it out looking at how we can use things like frame clauses. Well, none of that matters unless you really understand what window functions are, so let's look at a simple example.

Here's a query breaking down the net revenue by order. In it I'm listing things like the customer key, then the order key and the associated line number for that order, calculating the net revenue, and getting it from sales, ordered by the customer key. So what's going on here? Okay: customer 15 made only one purchase, of over $2,000, but customer 180 made three separate purchases, two of which belong to the same order, just with different line numbers. Now let's say, based on all these individual orders, we wanted to find out (because we want to do some deeper analysis) what the average order is: specifically, the average net revenue for an order. What I could do is run an aggregation function, removing all those other columns, and see that the average order value is about 1,032. But I want it in this table, and I can't get it there if the aggregation collapses everything into that single-row format. This is where window functions come in: instead, all I have to do is insert a window function (don't worry, we'll go over it in a second; it's using the OVER clause) and run it... I have an error because I've got a stray comma in here... and we can see it's the same value as below, 1,032, but it's now in our original table, just like we want, so we can do even more calculations with it.

So why use window functions? They let you perform calculations across a set of rows related to the current row, like we just showed, and they don't group the results into a single output row. This is very beneficial, as we'll demonstrate in some future exercises, for things like running totals, ranks, or even averages. Anyway, let's get into the syntax. We start by defining the window function, or what we want to do.
If it's an aggregation, that's something like SUM or COUNT; if it's ranking, it's something like RANK or DENSE_RANK. Next is OVER, and this defines the window for the function; inside it we have the keyword PARTITION BY, which we'll get to in a second. Let's walk through that window function we just saw, first without PARTITION BY. I create a new line, and the window function we use is AVG, because we're calculating the average net revenue; inside it go our variables of quantity, net price, and exchange rate, then OVER, and it's very important that after this we include opening and closing parentheses even if we're not going to put anything in them. Then I give it a very verbose title to make sure we understand what it means: the average net revenue of all orders. Running this, we can see, similar to before, it's at 1,032.

Now let's say we wanted to scope this window function further, maybe by something like the customer key, and this is where PARTITION BY comes into effect: it divides the results into partitions, or better said, divides them into separate groups, without having to use something like a GROUP BY clause to do it. Going back to our previous example, I copy the whole expression so we can compare, add a comma and a new line, and paste it in; on this one we add PARTITION BY customer key and call it the average net revenue of this customer. Let's run it, and scrolling down, we can see customer 15 had only one order, so that order is its own average net revenue, whereas customer 180 had multiple, and the average of those was 836.

Let's briefly explore the power of window functions with a simple example. I'm not going to ask you to follow or understand the syntax, because we'll get to it later on, but this demonstrates the features we're heading toward in this chapter. Here we're looking at things like customer key, order date, and net revenue, and I run it. Say for each of these customers we wanted to rank their orders from highest to lowest based on net revenue: I could use something like this window function with ROW_NUMBER (and some other stuff we'll get to), and we can see it calculates the rank, with the highest at rank one and the lowest at three for customer 180. Now let's take it to another level: we could calculate a customer's running total. In this one we can see, with 180's multiple orders, first it's at 525, then it jumps up to 2,500, and since those orders are on the same day, the running total is the same for both. We could also get the customer's total net revenue, and while that one isn't really that impressive on its own (yes, we can see it gets the total), I personally like it because I can then use it to calculate, hey, what percent is this order of the customer's total net revenue? In that case I just add it in, dividing the net revenue by the window function. Running the table, now I can see the percent of the revenue for each customer's orders, and this shows even more why window functions are so powerful.

So let's actually get into applying those concepts so you can write the kinds of queries I just ran through.
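Before we do, here's roughly what those first two demo queries boil down to (empty parentheses after OVER mean the window is the whole result set; PARTITION BY narrows it per group):

```sql
SELECT
    customer_key,
    order_key,
    line_number,
    quantity * net_price * exchange_rate AS net_revenue,
    -- one window over every row in the result
    AVG(quantity * net_price * exchange_rate)
        OVER () AS avg_net_revenue_all_orders,
    -- one window per customer
    AVG(quantity * net_price * exchange_rate)
        OVER (PARTITION BY customer_key) AS avg_net_revenue_customer
FROM sales
ORDER BY customer_key;
```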
So let's briefly explore the power of window functions by looking at a simple example. I'm not going to ask you to follow or understand the syntax, because we'll get to it later on, but this demonstrates the features we're covering in this chapter. Here we're looking at things like customer key, order date, and net revenue, and I'm going to go ahead and run this. Let's say for each of these customers we wanted to rank their orders from highest to lowest based on net revenue. Well, I could use a window function with ROW_NUMBER and some other things we'll get to, and in this case we can see it actually calculates the rank: for customer 180 it has the highest at rank number one and the lowest at three. Now let's take it to another level: we could calculate the customer's running total. In this one we can see that 180, which has multiple orders, starts at 525 and then jumps up to 2,500 — those later orders are on the same day, so the running total is the same for both. We could also get the customer's total net revenue. That one isn't really that impressive on its own, because yes, we can see it gets the total, but personally I like it because I can then use it to calculate, hey, what percent is this order of the customer's total net revenue? In that case I just add it in, dividing the net revenue by the window function, and now I can see what percent of a customer's revenue each order represents. This shows even more of why window functions are so powerful. So let's actually get into applying these concepts so you can write queries like the ones I just ran through. What we're going to be finding is the percent daily revenue of each order line item. To do this, we already know how to calculate net revenue; we'll need a window function to calculate daily net revenue, and then we'll calculate a percentage from that. So let's start building this query out. We'll list the order date, the order key, and the line number — an order key can have multiple different shipments with it, so it has different line numbers — and then finally we'll start with the net revenue. We'll get this from the sales table, and we're just going to limit it to the first 10 rows. All right — as mentioned, the line number just says 0, 1, 2, 3, and the highest number here is six. I don't really like that these are two separate columns, so I'm actually going to combine them: I'm going to take the order key and multiply it by 10, which basically adds a zero to the end, and then add in that line number, and I'll give it the alias order line number. Looking at the result, it accomplishes what we want, so I can go ahead and remove line number and order key. The first thing we want to calculate in our final table is that daily net revenue. To do that, we use the SUM function on our net revenue, then OVER, and inside of OVER a PARTITION BY. Remember, we want the daily net revenue, so we're going to partition by the order date, and from there we get that daily net revenue. Now we finally want to calculate the percent that an order line item is of the daily net revenue. This one's pretty simple, as it's mostly copy and paste: first I'll drop in the net revenue, multiply it by 100 so the percentage is a bigger digit, and then divide by this whole window function right here, which I'm going to copy and paste in, and we'll give it the alias percent daily revenue. Not too bad — but it's all over the place. I want to see this ordered within each day to see what's highest, so I'm going to put in an ORDER BY, specifying the order date and then the percent daily revenue, but for that one I want it in descending order. And bam, we get this final table showing the percent daily revenue for each order line. I went ahead and graphed it so we could compare the different order line items for a day, and we can see that some of these orders take up anywhere from around 10% up to 20% of the entire daily net revenue, and conveniently up at the top of the chart we have the total daily net revenue. It's pretty convenient that window functions let us get all of this into a single table — it makes things a lot easier later on when we dive deeper and maybe visualize it. One minor note on this query: it's getting a bit verbose, in that a lot of it repeats things we could reuse, and that can be fixed using something like a CTE or a subquery. What I can do is put the core items we need — the order date, order line number, net revenue, and daily net revenue — into their own subquery. So I'm going to tab this over so we can do a SELECT * FROM, put it all within parentheses, and give it the alias revenue by day. It still has the same output below, but now, instead of repeating all that code like I did to calculate the percentage, I just come up here and insert a new row specifying that I want 100 times the net revenue divided by the daily net revenue, with the alias percent daily revenue. Running this — boom, this has everything we want out of it, and in my mind it's slightly easier to read and get through when sharing with others.
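Here's roughly where that refactor lands — a sketch with the same column-name assumptions as before (orderkey, linenumber, orderdate, and so on):

SELECT
    order_date,
    order_line_number,
    net_revenue,
    -- each line item's share of that day's total
    100 * net_revenue / daily_net_revenue AS percent_daily_revenue
FROM (
    SELECT
        orderdate AS order_date,
        orderkey * 10 + linenumber AS order_line_number,  -- fold the two IDs into one
        quantity * netprice * exchangerate AS net_revenue,
        SUM(quantity * netprice * exchangerate) OVER (PARTITION BY orderdate) AS daily_net_revenue
    FROM sales
) AS revenue_by_day
ORDER BY order_date, percent_daily_revenue DESC;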
So let's now get into performing a cohort analysis in this last example. Cohort analysis is going to come up a lot throughout this project, because it's pretty popular in business analytics. All right, what the heck is this? Well, a cohort is a group of people or items sharing a common characteristic, and the analysis examines the behaviors of that group — whether people or items — over time. So what does this even look like? For this example, what we're going to be doing is putting people into cohorts based on the year of their first purchase, and then calculating what portion of the net revenue they're contributing. Down at the bottom is the purchase year, and over on the y-axis is the net revenue. In 2015 — where this data starts — everybody contributing is from the 2015 cohort, so that's all of the net revenue. Then in 2016 we have the 2016 cohort, along with a small contribution from the 2015 cohort. We work our way all the way to 2019, and we can see, once again, that the cohort of that year is the largest contributor, whereas those from previous years contribute less. Now, it's important you understand that, for this example, the cohort is based on the first year a customer made a purchase. So what are we aiming to get out of this? We want a final table with the cohort year (the year of their first purchase), the purchase year (the year the revenue actually came in), and the total revenue for that cohort. We're going to start simple with our first query, using a window function to extract the cohort year for each customer. We'll start with the SELECT statement, specifying the customer key; we'll also include the order date so we can see what's going on; and we'll get this from the sales table. We're going to limit this to only 10 values — we also have a stray comma right here that we don't want. Pressing Ctrl+Enter — bam, okay, we're getting customer keys and order dates. For good measure, I'm also going to order by the customer key, so we can make sure we're looking at this properly — specifically so any grouping, like customer 180 here, appears all together. Now let's use a window function to get our cohort year. We'll start by just getting the minimum order date for a customer; in the case of 180, I'd expect it to be this order in July. So we'll start this window function with the minimum of the order date, then OVER, and inside the parentheses we want PARTITION BY customer key. I know it's a full date right now, but we're going to name the alias cohort year anyway, because we know we're about to change it. Okay, pressing Ctrl+Enter — looking at 180, it is in fact that lowest order date showing as the cohort year. Now we want to extract the year from that, so I'm going to run the EXTRACT function on it. We need to specify the part first, which is YEAR, then FROM, and everything after that is the date expression, so I'll put a closing parenthesis all the way at the end and run this. One thing real quick: I did show how 180's was 2018, but now look at this — if we go to something like customer key 387, we can see their first purchase was in 2018, and later on, in 2023, they still have that cohort year of 2018. So we know our formula is working.
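As a sketch, the cohort-year expression we just assembled looks something like this (same column-name assumptions):

SELECT
    customerkey,
    orderdate,
    -- earliest order date per customer, then just the year of it
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM sales
ORDER BY customerkey
LIMIT 10;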
So let's go ahead and clean this query up, because I don't need the order date at all for what we're doing. I'll run Ctrl+Enter — and now I'm noticing we have duplicates, because each customer key appears more than once. What I can do is add DISTINCT right after the SELECT, and now when I run it — boom — I'm only getting distinct combinations of customer key and cohort year, so this table's a lot more concise. We also don't need the LIMIT 10 anymore, and technically we don't need the ORDER BY either. This is a good start: we've figured out the cohort year for each customer. This, once again, is the final table we want to get to. Now, in order to add those two additional columns — the purchase year and the net revenue — we need to calculate them separately and join in that cohort year, using something like a CTE. So let's put this all into a CTE that we can then join to the sales table, to attach the associated cohort year onto all these different customer keys; then we can aggregate and find the total revenue. We'll use the WITH keyword, give it the alias yearly cohort, use AS and an open parenthesis, tab the query over, and put a closing parenthesis at the end. I just want to make sure this works properly, so I'm going to do a SELECT * FROM that yearly cohort above and run it to confirm it outputs correctly. Yep, still the same table. Okay, now let's join this onto our sales table. For this, it's very important that we get the correct join. Specifically, we're connecting to the sales table, so I'll put that right here, and for the join we're going to use a LEFT JOIN — a left join because we want to make sure we keep all of the sales rows — and we're joining the yearly cohort, which has been distilled down to remove any duplicates, so we're not creating duplicate rows. I'm going to give them aliases, this one s and this one y, and as far as how we're joining, it'll be on the customer key from both tables. Let's do a LIMIT 10, because there's a lot of data. All right, inspecting the table, we can see we have the customer key of 947009, and it's joined on that customer key along with its associated cohort year. Just as a reminder of where we need to get to: we need the cohort year, then the purchase year (the year of the order date), and we need to aggregate all of this to calculate the net revenue. When we do this aggregation we're not going to use a window function this time — we're going to use a GROUP BY on these year columns. So let's clean up the columns we're actually using: we'll keep that cohort year, and then for the purchase year, which is based on the order date, we need to extract the year from the order date, and we'll name it purchase year. Let's just make sure this is correct before doing our aggregation next. Okay, good — we have all the cohort years and purchase years. Now we want to move into getting a sum of all the revenues.
We'll use the SUM function, specifying the quantity times the net price times the exchange rate, and we'll give this the alias net revenue. Because of this aggregation, I'm going to do a GROUP BY, specifically on that cohort year and also the purchase year — for which we'll need to put the EXTRACT function itself in here. Going ahead and running this, we can see we have our results. Any time you're getting to your final results, you need to make sure the numbers make sense: I know the net revenue for 2015 was around $7 million, and I can see from this that 2015 is indeed around 7 million, so this checks out. Also, silly me — I could have just used the purchase year alias in the GROUP BY here; it doesn't necessarily have to be the function itself. All right, we want all the values for this because we're almost there, so I'm going to remove that LIMIT 10 and run it. And bam, we have our final values, showing, for each purchase year, how much a certain cohort contributed to the revenue of that year. This leads to the final visualization, and we can see — interestingly enough — that the cohorts from previous years don't really contribute that much to the overall net revenue. So we have a little bit of a retention issue.
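Pulling it all together, the finished cohort revenue query is roughly this shape — treat the column names as assumptions, as before:

WITH yearly_cohort AS (
    SELECT DISTINCT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
    FROM sales
)
SELECT
    y.cohort_year,
    EXTRACT(YEAR FROM s.orderdate) AS purchase_year,
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
LEFT JOIN yearly_cohort y ON s.customerkey = y.customerkey
GROUP BY y.cohort_year, purchase_year;  -- PostgreSQL lets you group by the output alias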
All right, it's your turn now to go in and get more familiar with window functions by doing practice problems. In the next lesson we'll be diving deeper into aggregate functions, fine-tuning our knowledge of how to use them with window functions. With that, I'll see you there. Welcome to this lesson on aggregate functions inside window functions. Last lesson we started applying aggregation techniques to window functions, and we're going to build further on that here, covering three key concepts. In the last lesson we used the MIN aggregate function to analyze the impact of yearly cohorts on total revenue as the years went by. In this one, we'll first do a similar analysis, reinforcing what we learned with cohorts, but this time using the COUNT function to analyze the total number of unique customers and how each cohort contributes in future years. Next, we'll move into the AVG aggregate function, focusing on the long-term value of a customer: using window functions to calculate the total amount of revenue a customer has contributed, which we can break down not only customer by customer but also from a cohort-year perspective. And finally, we'll wrap up with some simple examples on filtering window functions — basically, where you should apply your WHERE clause in order to filter a window function properly. All right, let's get into it. For this example we're using COUNT to aggregate our window function; the syntax is all the same, so the major concepts remain the same. Here's what we're solving: we want to find the number of unique customers — using the customer key for this — and, based on their cohort (the first year they placed an order), see how they contribute in future years. This graph shows the total number of unique customers every single year, broken down by cohort: in 2019 we still had customers from 2015, '16, '17, and so on. For the final output we're going to do something very similar to last time: we want a table of the cohort year, the purchase year, and the number of unique customers for that cohort year and purchase year. Now, you could start with the query from last lesson and modify it to fit, but I find this question is actually simpler than the last one, so we're going to start from scratch — it's also just good practice. For this we want to get not only the customer key, but also, for each customer, its cohort year (the first year it made an order) and the purchase year (the year of each purchase); we'll ultimately use that customer key to get the unique count. Anyway, let's start programming. Start with a SELECT statement, adding in our customer key. Next we want the cohort year, so, similar to last time, we first get that minimum order date for our window function, and then we do it OVER a PARTITION BY customer key. I'm going to name this alias cohort year right now, since we'll eventually get it into a year format — right now it's obviously just a date. We're getting this from our sales table; let's run it just to see. Scrolling down to that 180 customer key, we can see it's all from July, so this is looking good. Next we want to extract the year out of this, so we'll do an EXTRACT — we need the part, which is YEAR, then FROM, then the date expression, which we wrap in parentheses, with one more closing parenthesis on there for the EXTRACT. Ctrl+Enter, and we have the years. Next, let's get the purchase year — and we don't need a window function for this, right? We just need to extract the year from our order date. So I'll do an EXTRACT, the part is YEAR, then FROM, and the date expression is the order date; we'll name it purchase year and run it. Okay, overall looking good — everything looks like it's calculating correctly. Now we need to move into actually getting the count of unique customers using that customer key, and you may falsely think you could just use a GROUP BY to do this: specifically, count the distinct customer keys and then add a GROUP BY on cohort year and purchase year. However, when we go to run that, we get the following error message: window functions are not allowed in GROUP BY. Basically, we can't combine these window functions with a GROUP BY, so trying to use a GROUP BY to calculate this and save some code isn't going to work here. Just wanted to point that out.
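Just so it's concrete, this is the shape of the query that fails — the window function ends up in the GROUP BY via the alias, which PostgreSQL rejects:

-- ERROR: window functions are not allowed in GROUP BY
SELECT
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year,
    EXTRACT(YEAR FROM orderdate) AS purchase_year,
    COUNT(DISTINCT customerkey) AS num_customers
FROM sales
GROUP BY cohort_year, purchase_year;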
But what we can do is make this into a CTE and then run a window function on that CTE to count the customer keys. To create our CTE, I'll give it the alias yearly cohort, open a parenthesis, and put all the syntax in there. Next, just to make sure everything is working correctly, I'll run a simple SELECT * FROM yearly cohort — and it's still running properly, okay, we're good to go. Let's refine this now. Remember, this is the final output we want: cohort year, purchase year, and number of customers. So let's build this table. I'll remove the star, put in cohort year first, then the purchase year, and next let's build the window function to count the unique customer keys. We don't want it based on just one thing — we want it based on both the cohort year and the purchase year. In this case, customer key 180 was a unique customer in 2018, which is also its cohort year, but as a member of the 2018 cohort it was also a unique customer in 2023. One thing I'm noticing, though: this still has duplicate data, and we want unique customers. Technically, for 180, I'd only want to see one 2018/2018 row and one 2018-cohort row for 2023 — I wouldn't want duplicates. So first I can add a DISTINCT up here onto this query. Running Ctrl+Enter, it's not really showing, because we removed the customer key, so I'll put that back in for the time being. Running this again, we can now see that customer key 180 is down to just those two entries. So for this we're going to do a COUNT of the customer key — or you can do COUNT(*) — then OVER, and this is where we get into our window function, using PARTITION BY on both the purchase year and the cohort year, and we'll name it num customers. Let's run this. All right, not too bad. The one thing to note is that this still has multiple duplicate lines, because we have all those customer keys in there. So once again we can add a DISTINCT right after the SELECT statement to filter it down to only unique values, and from that we can see that — okay, this is almost in order now. Now we need an actual ORDER BY, so we'll add one by cohort year and then purchase year, and scrolling down, we have what we want. I know from previous calculations that in 2015 we had 2,825 unique customers, so this validates, or at least helps confirm, that the data is correct. If I visualize it, this is what we get — and it comes out very similar to our total net revenue chart. Overall, not a lot of unique insights from this compared to net revenue, but at least we explored it.
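Here's a sketch of the working version — the CTE de-duplicates first, then the window function counts on top of it:

WITH yearly_cohort AS (
    SELECT DISTINCT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year,
        EXTRACT(YEAR FROM orderdate) AS purchase_year
    FROM sales
)
SELECT DISTINCT
    cohort_year,
    purchase_year,
    -- one row per customer per year inside the CTE, so a plain COUNT works
    COUNT(customerkey) OVER (PARTITION BY cohort_year, purchase_year) AS num_customers
FROM yearly_cohort
ORDER BY cohort_year, purchase_year;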
Before moving on, I want to touch on a short example around what we just talked about — that error message we got saying window functions are not allowed in GROUP BY. Technically that isn't entirely accurate, as I'll show in a second: we are able to run window functions alongside a GROUP BY. But overall, I don't recommend using window functions and GROUP BY within the same query, and we won't be doing it for the remainder of the course, so I want you to learn this major concept now. Let's look at a simple example. We're going to select the customer key, and then, using a window function, count how many orders a customer has — partitioning by the customer key — and this will be the total orders. I'll run this, and we can see that customer key 15 has one order and 180 has three, so their total is three orders. Everything's working fine. But now let's say we want to calculate the net revenue, and this time we want to group it all up, because I'm tired of all these separate rows; we're going to use a GROUP BY with this to find the net revenue. So I'll put in the average of the quantity times net price times exchange rate, with the alias net revenue. We're doing an aggregation, so like usual we've got to use a GROUP BY, on the customer key. Now, when we go to run this query, it is going to work — remember, 180 has a total of three orders — well, first I'm getting an error message because, silly me, I didn't put a comma after this column. Running it again, this query magically does work, right, even though we have a GROUP BY with a window function in it. But looking at customer 180, they now show a total of only one order. In fact, if you look at all of them — and I can move into this bigger table — every single value in this total orders column is one. So what's going on here? What's happening is that window functions run after the GROUP BY: everything gets grouped together first, and then the window function runs. This causes a major issue of conflicting aggregations, so even if you're not getting an error message stopping you, you're probably not going to get the right results. There are better alternatives, which we're going to go over — mainly using CTEs or subqueries and breaking up your queries to separate the two steps. With our previous query, what happened was: first, the GROUP BY aggregation ran to find our net revenue, condensing all those customer keys down to one row each, and only then did the window function run. That's why there's a value of one for all these total orders — the window function only sees one row per customer after the GROUP BY. So let's fix this up so it actually works, and see how we're going to be doing this. In this case I want to run the CTE first, so I'm going to get rid of this AVG aggregation and give the raw value the alias order value; I'm also removing the GROUP BY, because we're not doing the aggregation inside anymore. We'll then put this all into a CTE — so there's no aggregation inside this one — and then query that table, getting the customer key and total orders, to make sure it's calculating properly. Pressing Ctrl+Enter, we can see that 180 does in fact have three. Now we can go through and actually do our aggregation on top of that CTE: we get the average order value, give it the alias net revenue, and then perform a GROUP BY on customer key and total orders. Running this bad boy, we can see we have 180 with three orders and the correct value there — so it's working. For the remainder of the course, you'll see that any time I need to do a new GROUP BY or a new window function, I'll just create a new CTE and do it there.
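As a sketch, the pattern we just landed on — window function inside the CTE, aggregation outside — looks like this:

WITH order_counts AS (
    SELECT
        customerkey,
        quantity * netprice * exchangerate AS order_value,
        COUNT(*) OVER (PARTITION BY customerkey) AS total_orders  -- window runs on the raw rows
    FROM sales
)
SELECT
    customerkey,
    total_orders,
    AVG(order_value) AS net_revenue  -- group-by aggregation happens afterwards
FROM order_counts
GROUP BY customerkey, total_orders;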
All right, getting into the second exercise, we're going to focus on the AVG function in our window functions. Overall the syntax is the same, so nothing changes there, but we're going to introduce a new business concept: customer lifetime value. As the name implies, it's the total revenue generated by a customer over their lifetime with a company. We're also going to touch on some other concepts, such as average order value — the typical amount spent per transaction — but that's less the focus here; it's more about LTV. So the main concept of concern is lifetime value, abbreviated LTV: the total revenue generated by a customer for a business over their entire relationship with that company. What are we ultimately calculating? We want to find, for each customer — in this chart the x-axis is the customer ID — their total lifetime value, and even compare it, via this dashed line, to the average lifetime value for their particular cohort. That calculation is a little trickier to get to, but we'll get there. The final table we're aiming for is this: for a particular customer key, we want to extract not only the cohort year, but also the customer's lifetime value and the average lifetime value of their cohort, so we can compare the two. This type of analysis is really useful, because we could do something like target the high-LTV customers, since they're more likely to make purchases — and that's typically what businesses do: they don't go after the lower numbers, they go after the higher ones. Enough talk, let's actually get into it. For this first query we're just going to focus on three things. First, getting the customer key. Next, we want the cohort year, which, as we've seen before, we get by extracting the year — and note there's an aggregate function in this, the minimum order date, so we will need a GROUP BY after this. The next thing we want is an actual sum of the customer's total purchases, so I'll run a SUM function with quantity times net price times exchange rate inside, and give it the alias customer LTV — specifically, we're summing up all of the revenue for that customer. We're taking this all from the sales table, and like I said, we need a GROUP BY, specifically on the customer key. Let's go ahead and run this. All right, not too bad — we have the first three of the four columns we need. Speaking of that fourth column, what we want is the average customer lifetime value for a particular cohort. But remember what we learned: we shouldn't mix window functions with a GROUP BY, so I need to put this all into a CTE and then do the window function on top. We'll start with the WITH keyword, assign the alias yearly cohort, and put it all within parentheses like usual. I want to double-check my work, so I'll just do a SELECT * FROM that yearly cohort — and everything's outputting like we saw previously, good to go so far. Inside our main query, we first want all the columns, so I'll do a SELECT * — it's only three, so I feel fine doing this. But next is where we do that window function: AVG, and for this we use the value of customer LTV, then OVER, and in the parentheses a PARTITION BY — and we want to partition by cohort year. We'll assign this the alias average cohort LTV. Okay, let's run this. Not too bad. One thing is I probably want this ordered in a particular way, specifically on the cohort year and the customer key, so I'll stick an ORDER BY in there specifying cohort year and customer key. And bam — this is the final table we need, and it can give us some statistics we can now put to use: like I said, we could send this off to the sales or marketing department, and they could run targeted ads at these high-value customers to have more of an impact on revenue. I also graphed the average lifetime value for each of the cohorts, and as expected it decreases over time — that makes sense, because newer cohorts have had less time in the cycle than, say, somebody from 2016 or 2017. I guess the one thing to note is that 2015 and '16 are lower than '17, so in the first few years we started off, we didn't do as well with customer retention and extracting value — so, pretty neat insight out of this.
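In sketch form, the LTV query we just built looks roughly like this (same column-name assumptions):

WITH yearly_cohort AS (
    SELECT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate)) AS cohort_year,
        SUM(quantity * netprice * exchangerate) AS customer_ltv
    FROM sales
    GROUP BY customerkey
)
SELECT
    *,
    -- average LTV across each customer's cohort, for comparison
    AVG(customer_ltv) OVER (PARTITION BY cohort_year) AS avg_cohort_ltv
FROM yearly_cohort
ORDER BY cohort_year, customerkey;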
All right, last part in this lesson: we're going to focus on two quick examples of how to use WHERE within a query to filter properly — and this is really important, especially for the practice problems, so you understand that you're filtering correctly. Let's start easy: how can we filter before a window function? Well, we can use a WHERE statement here, and it applies before we actually invoke the window function itself. Let's test this with a simple example. Here I'm querying the customer key, and we've seen this window function before, extracting the minimum year to get the cohort year for each customer. Running this, there's trusty number 180 — we see it's in cohort year 2018, so we know the query worked properly. Now let's say we have a scenario where we only want to look at cohorts from 2020 onward; basically, we don't want to put people into cohorts before 2020, we just want to analyze from 2020 on. This is a great example of filtering before a window function: we can use the WHERE clause and specify that the order date is greater than or equal to January 1st, 2020. Now let's see what happens with customer key 180. Previously it was bucketed by its first order in 2018, but now, since we're saying, "hey, don't pay attention to that anymore, I just want to focus from this point onward," it's getting reclassified as 2023. So that's how you filter before window functions: the WHERE clause can be right there in the statement underneath the SELECT, and it gets applied before the window function runs. Now contrast that with filtering after a window function. In this case we need something like a CTE or a subquery — we'll do a CTE here — and then filter on top of it. Let's continue with this same example to show when we might want this. The first thing I'll do is remove the WHERE, because we're not doing anything with it, and put this portion into a CTE with the alias cohort, wrapping it in opening and closing parentheses; then we'll do a SELECT * FROM cohort. Now let's insert our WHERE — and in this case we want to filter after the window function. Remember, customer 180 previously had 2018 as its cohort year; this time, let's say we don't want to look at customers at all if their cohort year falls before 2020 — we don't want to reclassify them, we want them excluded entirely. For this we specify WHERE the cohort year is greater than or equal to 2020. What I'd expect is that that customer key disappears completely. Also, this FROM is messed up right here — it should be FROM cohort. Okay, let's run this, and bam: we notice that 180 is now removed, because we applied the filter after the window function, dropping those earlier cohort years based on the original purchases.
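Side by side, the two patterns look like this — a WHERE in the base query runs before the window function, while filtering on the window's output needs a CTE or subquery:

-- filter BEFORE the window function: pre-2020 orders never enter the window
SELECT
    customerkey,
    EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
FROM sales
WHERE orderdate >= '2020-01-01';

-- filter AFTER the window function: cohorts are computed first, then rows are dropped
WITH cohort AS (
    SELECT
        customerkey,
        EXTRACT(YEAR FROM MIN(orderdate) OVER (PARTITION BY customerkey)) AS cohort_year
    FROM sales
)
SELECT *
FROM cohort
WHERE cohort_year >= 2020;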
Anyway, wrapping your mind around how WHERE filters apply before or after window functions does require some practice, so we've got some practice problems for you to go through, test out, and get more familiar with. All right, in the next lesson we're jumping into functions around ranking — and that one is pretty exciting — so we'll see you in there. All right, we're continuing on in this chapter on window functions, now focusing on how we can rank different values and use a certain order to rank them. For this we're going to cover three main types of ranking functions: ROW_NUMBER, RANK, and DENSE_RANK. I'll explain the difference between all of these, and we'll do it by ranking customers based on how many orders they've completed — the more orders completed, the higher they rank. But before we get into any of those functions, we first need to understand how to use ORDER BY within a window function, in order to get the correct ranking that we want. Previously in SQL you've seen ORDER BY used after the FROM statement to order your result rows, but we can also use it inside the window function, where it orders the values within the specific partition the window function is running on. As a reminder, ORDER BY always defaults to ascending, but you can specify descending. So what are we going to do? Well, with the sales table — you've already seen the customer key, order dates, and net revenue — we can build a running order count. What this does is count order by order: in this case, 180 has its first order in June — or sorry, July — so we have our first order, and then in August, with two more orders on the same date, it bumps up to three. Similarly, customer 387 has four orders on their first day, and the next time they complete an order, it bumps up to five. So let's build something similar. I'll throw in a SELECT statement with customer key, order date, and net revenue — we've seen all this before — from our sales table, and run it to see what it outputs. Okay, sweet, not bad. Now let's get a count of the orders using a window function based on that customer key. I'll insert a COUNT function — we'll just do COUNT(*) — then OVER, then a set of open parentheses, which I'll indent down a line to make this easier to read, and in here we're going to PARTITION BY customer key. We'll add the ORDER BY in a second; I want to see this first. Now, we've seen this before, but I want to call out some things. Notice the customer keys are ordered differently from before — it actually went through and sorted by that PARTITION BY — except it didn't order by the order date, because this July entry for 180 comes after the August one. Anyway, that's something to note for later, but we do get the correct calculations, because 180 has three orders and we can see all of them here. So now let's insert our ORDER BY, right after the PARTITION BY, specifically on that order date — this controls the row processing order. Let's run this bad boy. Inspecting the table, we can see that not only are the customer keys in order, the order dates are in order too, and when we step through the count for 180, it has one, and then the next time, in August, there are two more, so it increases to three. So our function's working. For aggregate functions like COUNT or AVG, the ORDER BY determines how values accumulate row by row. Also, I realize I never gave this an alias, so I'm going to call it running order count.
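Here's a sketch of that running count, plus the running average we're adding next (same column assumptions):

SELECT
    customerkey,
    orderdate,
    quantity * netprice * exchangerate AS net_revenue,
    -- the ORDER BY inside the window makes the count accumulate row by row
    COUNT(*) OVER (PARTITION BY customerkey ORDER BY orderdate) AS running_order_count,
    AVG(quantity * netprice * exchangerate)
        OVER (PARTITION BY customerkey ORDER BY orderdate) AS running_avg_revenue
FROM sales;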
Anyway, let's use one more aggregate function to demonstrate this further: I want to do a running average of the net revenue. Basically, after the first order I expect it to be around 525, and after the next two orders I'd expect an average of all three — and it does this line by line. For this I'll insert a new line, and I'm actually going to copy a lot of the previous code, because the boilerplate is approximately the same. Instead of a COUNT we want an AVG, specifically on that net revenue, so I'll copy that and paste it in. We're still partitioning by the customer key and ordering by the order date in order to carry out that row-by-row execution, and we'll name this running average revenue. Okay, let's run this. Inspecting it, we can see that, yes, for the first row there's only one order — 180's first-order average is the same as the order itself — but when we move to the next round of orders, it's averaged among all three. So for 180, at this point in their order history, the average order revenue is around $836. With that knowledge, let's get into our ranking functions — first focusing on ROW_NUMBER and understanding how importantly it interacts with ORDER BY. Now, PostgreSQL has quite a few different ranking window functions, but in this lesson we're focusing on the top three: ROW_NUMBER, RANK, and DENSE_RANK. ROW_NUMBER returns the number of the current row within its partition, counting from one. So let's just label the rows in our sales table with a row number. We'll select all the rows — SELECT * FROM the sales table — and since we're pulling everything, I'll limit this to the top 10. Ctrl+Enter — okay, we've got our values back. Now let's use this to assign a row number at the front. One quick note about the result grid that appears here: the 0, 1, 2, 3, 4, 5 down the side is just the display index, and it's not actually callable in a query, so we can't use it — that's why we have to generate our own. Anyway, after the star I'll start a new line, so all the actual columns still appear, and I'll specify ROW_NUMBER, then OVER, with just an open and close parenthesis — we're not putting anything inside it just yet. Running this, if we scroll all the way over, we now have the row number right here: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, as expected, and it looks like our data set is in order. I don't really like where this row number is appearing, though — I want it at the front of the result — and I'm also going to give it the alias row num. One important thing to note: yes, this does provide the row numbering, but it does so in a, well, chaotic order — it's not guaranteed to be assigned based on how the data set sits in the system. So I don't recommend doing this; I always recommend, any time you're using any of these numbering functions, using an ORDER BY.
So what is this order going to be based on? Well, every order has a unique order key and line number; additionally, I'm going to use the order date to make sure we're maintaining date order, since I'm not 100% sure whether the order keys might mix around at certain points in time. So I'll press Enter and now I can enter my ORDER BY — indenting it in to make it easier to read. We'll specify the order date first and run it as-is just to show this: it's still providing row numbers in sequential order by order date. But like I said, we want to be very specific to make sure it's doing this correctly, so I'll also add in that order key and then that line number. Ctrl+Enter — and bam, this has what we need as far as ROW_NUMBER goes. Let's take this a step further, combining it with PARTITION BY. Say we want a number for something like our daily order number, where every new day the numbering starts over. Let's first look at where it's at on the 2nd of January 2015 — we do this by inserting a WHERE statement saying, hey, filter for where the order date is greater than this date; this is only done to peek at the next day. Running Ctrl+Enter — and silly me, I already forgot what we learned in that filtering lesson: when we put the WHERE right here underneath, it applies the filter before the window function, so in this case it automatically did restart the numbering at one — generally, without that WHERE statement, it wouldn't. Anyway, it's not necessary for you to follow along and run this part; I'm just doing it for demonstration purposes. I put our original query inside a CTE called row numbering, then queried the first 10 values from it. With the same query below, I now filter for the second — where the order date is greater than the first — and running this one, we can see that the row numbering starts at 26, 27, or thereabouts. What we want is a window function that assigns fresh numbering each day. So let's go back to our original query, remove that CTE to make it easier to run through, and add a PARTITION BY right above the ORDER BY — and since we want the numbering to restart every single day, we give it the order date as the column. Okay, running this, we actually won't see any difference in what's displayed up top, but we do see down here at row 420 that the numbers only go up to around 97, so it does look like it's working. What I'll do now is copy this all, paste it into that CTE so we can demonstrate again, and run it. Remember, previously the numbering on the second was starting around 24 or so, but now on the second it starts at one — so this is working.
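In sketch form, the deterministic numbering with the daily restart looks like this — drop the PARTITION BY line to number the whole table instead:

SELECT
    *,
    ROW_NUMBER() OVER (
        PARTITION BY orderdate                      -- restart the numbering each day
        ORDER BY orderdate, orderkey, linenumber    -- deterministic tie-breaking
    ) AS row_num
FROM sales;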
So let's now get into comparing the three major ranking functions we're using: ROW_NUMBER, RANK, and DENSE_RANK. Note that all of them return an integer (bigint) value, but they each have a different way of ranking, depending on your use case. Let's look at a simple example. Say we went through and calculated how many orders each customer placed — this one up here has a total of 31 orders — and we're going to rank them using ROW_NUMBER, then RANK and DENSE_RANK, and see how they differ from each other. First, let's pull in the information we need with a simple SELECT statement: we'll use the customer key, and to count all the orders we'll do a COUNT(*) — basically counting each line as an order — and assign it the alias total orders. We'll get this from the sales table, and since we used an aggregation, we need a GROUP BY on the customer key. Let's check this bad boy out. With this we can see the different customers, but the total orders are all over the place — we'll eventually want this ordered, but since we're going to be ordering inside our window function, I don't want to do that just yet. Also, this is just too many values, so I'll do a LIMIT 10 for the time being. All right, let's build our window function, first with ROW_NUMBER, to assign a row number based on the total number of orders. I'll start a new line, insert the ROW_NUMBER function, then OVER — and inside here we don't need a PARTITION BY, because we've already grouped by customer key to get the totals; all we need is an ORDER BY, specifically on that COUNT(*). We'll give it the alias total orders row num — real original, I know. Let's run this. All right — what we have here is, well, not really what we want: these all have total orders of one, and it's assigning row numbers one, two, all the way to ten, but I actually want the ranking from highest to lowest. As you'll remember from the beginning, we can insert the DESC keyword right in that ORDER BY. Running it now, we get the total orders row number starting at the highest, 31, and working down. Notice that when we get to repeated values — 26, 26, and 26 — it still assigns four, five, and six, even though these are technically tied ranks. That's why we need to learn other functions like RANK. So I'm going to insert another line and copy this ROW_NUMBER, since a lot of it is boilerplate we can reuse; I'll change the function to RANK and update the alias from row num to rank. Running this now — okay, we've got the total orders rank, and you'll notice that when we hit the repeating 26s, it assigns four, four, four, and then when we get to 25 it assigns seven — it skips five and six. Now you're probably like, "well, Luke, what happens if I don't want to skip five and six in my ranking and just want to continue counting from there?" That's where DENSE_RANK comes in. Once again, I'll start a new line, paste in the old code, change the function to DENSE_RANK, and update the alias to dense rank. Running this one — I had a stray comma at the end that we don't need, so running again — we now have a new column where, when we get to the repeating fours, the next value stays consistent and moves to five, and then the similar values of 24 are all sixes. So when I need to order or number something, it really depends on my criteria whether I use ROW_NUMBER, RANK, or DENSE_RANK.
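Here are the three functions side by side on the same ordering — the comments note how each one handles the tied 26s:

SELECT
    customerkey,
    COUNT(*) AS total_orders,
    ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) AS total_orders_row_num,    -- ties still get 4, 5, 6
    RANK()       OVER (ORDER BY COUNT(*) DESC) AS total_orders_rank,       -- ties get 4, 4, 4, then skips to 7
    DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS total_orders_dense_rank  -- ties get 4, 4, 4, then continues at 5
FROM sales
GROUP BY customerkey
LIMIT 10;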
That's why we've got some practice problems for you now, to go through and get more familiar with these functions. In the next lesson we're getting into LAG and LEAD — so far we've been using functions that only look at the current row, but we can actually use functions that look at rows before or after it, which is pretty powerful. All right, with that, I'll see you there. All right, welcome to this fourth of five lessons on window functions. In this one we're getting into functions like LAG and LEAD — the kind that, instead of looking at the current row, let us look at things like the row above it or the row below it. We're going to explore these main functions, first with a very simple example where we look at our 2023 monthly revenue and evaluate our month-over-month growth, since we can now look at the row before or after and calculate from there. Then we'll shift into our final scenario, which is slightly more complex: analyzing the growth of cohorts over the years and seeing how they change year to year. So what functions are we exploring? There are generally about five you can use. Let's start with the easy ones: FIRST_VALUE, LAST_VALUE, and NTH_VALUE. FIRST_VALUE returns the value evaluated at the first row of the window frame; LAST_VALUE, obviously, does the last; and for NTH_VALUE you specify an integer and it returns the value at that row. As for LAG and LEAD, they're very similar to each other, except one pulls from a row that lags behind and the other from a row that leads ahead — but we're already getting too deep into it; we need to explore this with an example. What I want to do first is calculate the 2023 monthly net revenue, and then we'll apply these functions in a window function to try out first, last, lag, lead, and all the rest. Let's start simple by building this query. I'll start with the SELECT statement, and the first column we want is the month. I'm going to use the TO_CHAR function for this — I really like its output, it's a lot easier to read. We'll run it on the order date, in this format, and give it the alias month. Next we want the net revenue, so we'll use the SUM function — as always, quantity times net price times exchange rate — with the alias net revenue. We're getting this all from the sales table, and since we're doing an aggregation, we need a GROUP BY, specifically by month. Let's run this bad boy. Okay, the first thing I'm noticing is that it comes back ordered all over the place, so I'll insert an ORDER BY, also by month. And we only want to analyze 2023, so I'll insert a WHERE and use EXTRACT to pull out the year and compare it to 2023: with EXTRACT we specify the part, then FROM, then the order date, and we set it equal to 2023 — it's a number, so we don't have to put it in a string. And bam, we have what we need. Now, because we're going to be running window functions on this, I'm going to put it all into a CTE: we use the WITH statement, give it the alias monthly revenue, wrap it in opening and closing parentheses, and then do a SELECT * from the CTE to make sure it still outputs. Yep, still good to go.
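The setup query in sketch form — I'm assuming a 'YYYY-MM' format string for TO_CHAR, so swap in whatever format you prefer:

WITH monthly_revenue AS (
    SELECT
        TO_CHAR(orderdate, 'YYYY-MM') AS month,
        SUM(quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    WHERE EXTRACT(YEAR FROM orderdate) = 2023
    GROUP BY month
)
SELECT *
FROM monthly_revenue
ORDER BY month;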
So let's get into exploring these, starting with the easy ones — FIRST_VALUE, LAST_VALUE, and NTH_VALUE. I'll specify FIRST_VALUE, and we need to put something inside it: the value expression we want out of this, which in our case is the net revenue, right here inside our CTE. Next is OVER, and inside it we don't need a PARTITION BY, because everything's already grouped, but we do need an ORDER BY — where do you want the first value chosen from? In our case we want the first value based on the months, not on the value of net revenue, so I'll put month in there, and we'll give it the alias first month revenue. Running this, we can see that we do in fact get that first month repeated down all the rows. Now let's look at LAST_VALUE — change the function name and the alias and run it. This one, unfortunately, when you read it you find out, oh heck, this is not the last month's revenue on each row. That's because the syntax inside LAST_VALUE needs to change: I'm going to delete the plain ORDER BY and put this in — we need the special condition ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. When we look at it now, we can see it really is the last month's net revenue. I just wanted to demonstrate that, because this one takes a little more fine-tuning; we haven't covered unbounded preceding or unbounded following yet — we're covering that in the next lesson, so stay tuned. All right, next up is NTH_VALUE. I'll copy the FIRST_VALUE line, insert it underneath, and change the function name to NTH_VALUE. For this we also need to specify the integer for which row to return — for now we'll do three rows down — and we'll call it third month revenue. Running this bad boy, we can see we get the third month's revenue, though we do have null values at the top. Once again, if you want to fix this — which we'll cover in the next lesson — we can insert this statement right after it, and running that, we now have the third month filled in all the way down. Don't worry, those will all make sense when we cover frames; for now you've got the basics. Let's wrap up the simple example with LAG and LEAD. LAG looks backward, so it will give us the previous month: it returns the value evaluated at the row that is offset rows before the current row within the partition. Optionally — it's in square brackets in the docs — you can pass an offset integer, similar to what we did for NTH_VALUE, but we're not going to use an integer for these examples; we just want the previous month, and then we'll use LEAD to find the next month. Going back to our query, I removed some of the other columns because it was getting to be too much in there. We'll specify LAG with the parameter net revenue, still ordering by month, and this will be the previous month revenue. Let's go ahead and run this — and yep, this is in fact the previous month's revenue. Like I said, this takes an optional offset parameter: if I wanted to do two, running Ctrl+Enter, it's now offset by two rows. We're not using that for our case — we're eventually trying to calculate month over month — so we're going to leave it at one and run this.
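Here's a sketch of the navigation functions we just tried, running on that monthly_revenue CTE — note the frame clause on LAST_VALUE and NTH_VALUE is my assumption of the fix shown on screen, and it's explained properly next lesson:

WITH monthly_revenue AS (
    SELECT TO_CHAR(orderdate, 'YYYY-MM') AS month,
           SUM(quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    WHERE EXTRACT(YEAR FROM orderdate) = 2023
    GROUP BY month
)
SELECT
    month,
    net_revenue,
    FIRST_VALUE(net_revenue) OVER (ORDER BY month) AS first_month_revenue,
    LAST_VALUE(net_revenue) OVER (
        ORDER BY month
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING  -- widen the frame to the whole set
    ) AS last_month_revenue,
    NTH_VALUE(net_revenue, 3) OVER (
        ORDER BY month
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING  -- fills in the early-row nulls
    ) AS third_month_revenue,
    LAG(net_revenue) OVER (ORDER BY month) AS previous_month_revenue,
    LEAD(net_revenue) OVER (ORDER BY month) AS next_month_revenue
FROM monthly_revenue;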
Okay, let's now use LEAD. I'll specify the function, and instead of previous month, it's going to be the next month. Running this, we can see it really is the next month's revenue on each row. Now you're probably like, "Luke, why does this even matter? Sure, you can find this out, but how does it actually help me as an analyst?" Well, say I wanted to find my month-over-month growth, which is pretty common in the finance industry for evaluating performance. Here I have a chart: the bars show the revenue each month, and the line shows the rate of revenue growth month over month. We can see that in May we had a substantial increase in growth — largely because April was pretty low. Anyway, it's a pretty big predictor in the business world, so let's get into calculating it. To do this, we care about the previous month's revenue, so we don't really need the LEAD function — I'm going to remove it. First, we take the net revenue and subtract this LAG value from it, to get the change each month, and we give it the alias monthly revenue growth. Now we can see it calculating the change: an easy row to check is row two, where we went from the previous month at 4 million to 2 million, so we lost 2 million — the calculations are doing well. But we want a rate of change, so we need to take this value — net revenue minus the previous month revenue — and divide it by the original value, which is the previous month revenue. I'll copy that expression again and paste it down here. Running this — bam, there we have it. We're now calculating the rate of change, and we can see that in March we had about a 50% reduction in revenue, whereas in May we had almost a 150% rate of growth. Just to be clear, these are decimals representing the percent; if I wanted to, I could multiply the value by 100 so we can read the percentage a little more easily.
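Put together, the month-over-month calculation is roughly this:

WITH monthly_revenue AS (
    SELECT TO_CHAR(orderdate, 'YYYY-MM') AS month,
           SUM(quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    WHERE EXTRACT(YEAR FROM orderdate) = 2023
    GROUP BY month
)
SELECT
    month,
    net_revenue,
    LAG(net_revenue) OVER (ORDER BY month) AS previous_month_revenue,
    -- (current - previous) / previous, scaled up to a percentage
    100 * (net_revenue - LAG(net_revenue) OVER (ORDER BY month))
        / LAG(net_revenue) OVER (ORDER BY month) AS monthly_revenue_growth
FROM monthly_revenue
ORDER BY month;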
So let's get into an example that shows the benefit of these kinds of functions. Previously, in the second lesson of this chapter on aggregation, we covered the average lifetime value based on cohort — recall that customers are assigned a cohort based on the year of their first purchase — and we noticed this trend: from around 2016 onward, the lifetime value drops, which makes sense, because those cohorts' total lifetime is slightly shorter. But we did have that unexpected rise from 2015 to 2016. So what if we want to analyze the actual drops between each of these LTVs? Well, we can use our LAG function for this. The table we need to start from is the cohort year versus the average cohort lifetime value. We basically calculated this back in lesson two of this chapter — I have the formula here, and what we got to inside that lesson was this final table, which consisted of one CTE and another query underneath it. Anyway, I took it a step further inside the notes, so for those who have access to the notes, you can go right to it; the other option is just to pause your screen and copy it in. Also, this next example isn't that long, so you can just watch along if you don't want to type any of it. We need to run window functions on this, and after inspecting it — it's a CTE inside another CTE, and then we have to do the SELECT DISTINCT because of the multiple duplicate rows — I'm going to put this one into a CTE as well. I'll give it the name cohort final, put a closing parenthesis on it, and then do a SELECT * FROM cohort final to make sure everything's appearing fine underneath — and it is. Okay, now we can work from here. We want to create our first column to look at the previous year's lifetime value, and for this we're using LAG. I'll do a SELECT * to show the existing columns, then LAG, and the value we want to appear in it is the average cohort lifetime value. Then OVER, and inside the parentheses an ORDER BY on that cohort year, and we'll give it the alias previous cohort lifetime value. Okay — we're seeing that it is, in fact, the previous cohort's lifetime value. All right, the final thing I want to do here is calculate the percent change — the year-over-year change, or more precisely, the cohort-year-over-cohort-year change. Remember, the ratio is final minus original, over original. So I'll take the final value — the average cohort lifetime value — and subtract our window function (copy, paste), and then divide by the original, so once again I paste in that previous cohort LTV, and we'll give it the alias lifetime value change. Okay, let's run this. Hmm, that's not the correct value — I'd expect more of a decimal number. I think this has to do with my order of operations: I have a parenthesis in the wrong place, since I want the subtraction done first and the division after. Okay, now it's giving it to us. I want to see these as a percentage, so I'll put 100 at the front and multiply — and bam, now we can see our lifetime value change year to year. As expected, we had a slight increase in 2016, and from there it went down. Now, if you visualize this — the average cohort lifetime value in the bars and the rate of change as the line — we can see that the rate of decline is actually picking up. Although I'd expect it to go down, I wouldn't expect it to go down at this high of a rate, so that's maybe something we'd need to dig into if this were a real-life scenario. All right, you've got some practice problems now — go through them to get more familiar with handling these functions inside window functions. In the next and final lesson of this chapter, we're going into further detail on the syntax you need to use things like that LAST_VALUE function, which wasn't working properly without the extra clause we demonstrated. We'll go over all of that there. All right, with that, I'll see you in the next one. Welcome to this fifth and final lesson on window functions, where we're going to be focusing on frame clauses.
now, if you visualize this, with the average cohort lifetime value in the bars and the rate of change as the line, we can see the rate of decline is actually accelerating. i'd expect it to go down, but not at this high of a rate, so in a real-life scenario this is something we'd want to dig into. all right, you've got some practice problems now to get more familiar with handling these types of window functions. in the next and final lesson of this chapter we're going into the further syntax you need; remember how that LAST_VALUE function wasn't working properly without extra syntax specified? we'll cover that there. all right, with that, i'll see you in the next one.

welcome to this fifth and final lesson on window functions, where we're focusing on frame clauses. up to this point we've only really covered two parts of the window definition, that's the portion after the OVER, and those are PARTITION BY and ORDER BY. but there's actually one more thing to cover, and that's the frame clause. this piece controls how much data actually goes into the window function. what do i mean by all this? well, say we have something like the monthly net revenue from 2023. as you can see it's highly volatile, going up in february, then back down, then back up. this is where frame clauses come to the rescue: we can look before and after certain rows and, in this case, average them, performing a three-month running average to basically smooth out our line. this is very common in business analytics, especially with seasonal data that has these ups and downs; you want to remove the noise so you can read the trend more clearly. postgres has documentation that goes into all of this, but it gets quite complex, so i've simplified it in our notes. this lesson focuses on the ROWS frame clause, literally choosing which rows go into the window function. as hinted, it comes right after our ORDER BY, and we can either include just a start frame or use the BETWEEN keyword to signify a start frame and an end frame. but what are these start and end frames? there are five main things we can put in here, and that's the majority of this lesson: we'll see how to use the current row, preceding rows, and also following rows. don't worry, we'll break each one down so you're more than familiar with them by the end. now, with postgres you can also use RANGE or GROUPS instead of ROWS, but i'd classify those as more advanced sql, so we're not really covering them in this course. additionally, i had chatgpt make this fancy dancy table, and i want to show with it that RANGE and GROUPS also aren't supported in a lot of other popular databases, specifically mysql and sql server, so i don't want to waste your time on keywords you may not have available. that's why we're focusing on ROWS, which you need to learn first anyway before applying RANGE and GROUPS later if you want. for this entire lesson we're analyzing our monthly net revenue, similar to the last lesson, because as you remember we still had some unanswered questions about how to use some of those functions; we'll get to that by the end. you should remember this query, or already have it in your system: it gets not only the month but also the net revenue for that month, pulls it from the sales table, and extracts 2023 only (i just want one year so there's not a lot of data to mess with). since we're doing an aggregation we need a GROUP BY, and finally an ORDER BY, because otherwise it all comes out of whack. you should see something like this, and if we graph our monthly net revenue it looks like what we saw at the beginning: up in february, a strong dip in april, then back to normal.
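here's a sketch of that base query, again assuming the same sales column names:

    SELECT
        DATE_TRUNC('month', orderdate) AS month,
        SUM(quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    WHERE EXTRACT(YEAR FROM orderdate) = 2023
    GROUP BY month
    ORDER BY month;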
we're going to work toward a running average with this, but first we need to understand CURRENT ROW, the keyword, before we move any further. i want to run a window function on this query, so i'll put it into a cte with the name monthly sales, wrap it in parentheses, select both the month and the net revenue, and pull it all from monthly sales. run this and it's exactly what we saw before. now we can actually run a window function here, rather than up in the original query, because like we said before you can't just mix window functions with that GROUP BY (well, you can, but it's really complicated, so we're not doing it). we'll do the window function below. all i want at first is the average net revenue for the month, basically repeating that same value, so we'll call AVG on net revenue, then OVER, and i'll start a new line and indent. before adding anything else, look at what it generates: the average across all 12 months. so next we'll order it by the month itself with an ORDER BY month. running this query, we're still getting an average, but it's slightly different now: january is still just january's average, but february is the average of january and february, and march is january through march. we want to control this average, so let's keep moving. i'll start by renaming this column with the alias net revenue current, because we're about to use CURRENT ROW. the syntax is ROWS followed by the start frame; we'll eventually get to ROWS BETWEEN, but let's start simple. and remember, our start frame or end frame can be any of those five options. so i'll insert ROWS, and then CURRENT ROW. this tells the window function to run on just the current row, so there should be no difference in any of these values; i promise we'll see more of an impact in a little bit. i also want to demonstrate the ROWS BETWEEN form: we give a start frame, CURRENT ROW, and an end frame, once again CURRENT ROW. as you'd expect, it's still only looking at the current row, so the values are the same across the board. let's get into our next keyword, N PRECEDING, which looks at preceding rows or preceding values. here's the final table we're aiming for: we still have our month and our net revenue, and in this case the average looks at just one row back plus the current row. for the first row we expect it to equal itself, but when we get to the second row, it averages the current value and the preceding one; the average of 3.6 million and 4.4 million is around 4 million. for this, the start frame is N PRECEDING, where you specify n as a number. getting back to our query, right now we have ROWS BETWEEN CURRENT ROW AND CURRENT ROW; we want one row back plus the current row, so i'll replace the start frame with 1 PRECEDING. and watch the spelling: it's preceding, not "proceeding".
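here's where the full query stands at this point, as a sketch with the same assumed column names:

    WITH monthly_sales AS (
        SELECT
            DATE_TRUNC('month', orderdate) AS month,
            SUM(quantity * netprice * exchangerate) AS net_revenue
        FROM sales
        WHERE EXTRACT(YEAR FROM orderdate) = 2023
        GROUP BY month
    )
    SELECT
        month,
        net_revenue,
        -- average of the previous row and the current row
        AVG(net_revenue) OVER (
            ORDER BY month
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS net_revenue_current
    FROM monthly_sales;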
okay, running this, and as we saw in that demo table, we're now taking the average of the current row and the previous row: 4 million in this case. we can use any number of values here; to double-check it's actually working, put a 0 in instead of the 1 (meaning just the current row), and we get all the same values as before. then change it back to 1. now, i'm going to take it a step further just for demonstration purposes; you don't have to do this part. i also tried 2 PRECEDING and 3 PRECEDING, got this fancy dancy table, and plugged it into chatgpt to visualize it. what we can see is that with each additional preceding row we include in the average, the line becomes smoother and smoother: it starts as the darker line of raw net revenue and gets lighter and lighter with each larger preceding amount. i know there's a lot of overlap, so i also graphed each one individually, showing how it gets smoother and smoother over time. so now you're probably like, luke, is this what you do in a real-world scenario? i'd say not typically with just preceding; i combine it with something like following, so we'd use values before and after the current month and get something like a three-month average. let's do that now. just as a reminder, the syntax is N FOLLOWING, where you specify the number and it takes that many rows after the current row. so back in our query, instead of 1 preceding and the current row, i'm going to change this to 1 PRECEDING AND 1 FOLLOWING. running this, all our values are smoothed out, even that first month, because it can take the following month; and a month like june takes not only may but also july, averaging them together to get this value. this, i feel, is more representative of what i'd see in the business world. i went ahead and visualized it, and it shows how this three-month average smooths out our net revenue line. you could take it up a notch and do something like a five-month or even seven-month running average, but at that point i think you're going to be removing a lot of key insights, so i'm stopping right there.

all right, the last start and end frames to cover: UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING. if you look carefully, all they're really doing is replacing the n with UNBOUNDED, which says: hey, we want all rows from the start, or all rows to the end. so let's do just that and put UNBOUNDED on both of these. what do you think's going to happen? well, with unbounded on both ends it takes the entire window frame into account, so we're basically getting the average of all 12 months. typically, though, when i see anything like UNBOUNDED being used, it's paired with CURRENT ROW: running that, the first row is equal to itself, and as it goes along it takes into account all the values behind it along with the current row, so the line just keeps getting smoother and smoother as it goes. so where am i typically seeing these unbounded parameters used? if you remember from that last lesson, when we were looking at the same chart of monthly revenue, we used the functions FIRST_VALUE, LAST_VALUE, and NTH_VALUE. running those, FIRST_VALUE actually did give us the first value, but LAST_VALUE didn't work out properly, it just gave us the current row, and NTH_VALUE with 3 specified gave null values for the first two rows and only then gave the third value for everything else. well, this is where UNBOUNDED comes in, so let's fix these functions. i'm going to indent down to a new line to make this a little more readable, and then insert our frame clause, specifying ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. running this, when we look at that last month revenue column, it actually does equal the last month now; we basically had to open up what the window function was allowed to look at by using this frame clause. similarly, we can do the same with NTH_VALUE: i'll just copy that frame clause and paste it right in. running this, that third month's revenue now appears on every single line, regardless of whether the row comes before or after it. so bam, we've now covered all of the major aspects of using window functions.
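as a recap, here are both patterns side by side in one sketch, reusing the monthly_sales cte from before:

    WITH monthly_sales AS (
        SELECT
            DATE_TRUNC('month', orderdate) AS month,
            SUM(quantity * netprice * exchangerate) AS net_revenue
        FROM sales
        WHERE EXTRACT(YEAR FROM orderdate) = 2023
        GROUP BY month
    )
    SELECT
        month,
        net_revenue,
        -- three-month average: previous month, current month, next month
        AVG(net_revenue) OVER (
            ORDER BY month
            ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
        ) AS net_revenue_3mo,
        -- LAST_VALUE only behaves as expected once the frame is opened up
        LAST_VALUE(net_revenue) OVER (
            ORDER BY month
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS last_month_revenue
    FROM monthly_sales;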
you now have some practice problems to go through and get more familiar with using these different frame clauses inside window functions. in the next chapter we're actually getting into how you can install this database locally and run it locally, so you have a workflow that's actually workable. with that, i'll see you in the next one.

all right, welcome to the second half of this course. don't know why i had to jump in like that; wanted a dramatic entrance for some reason. anyway, in this chapter i'm going to take you through all the steps necessary to install postgres locally onto your computer, get you set up with its editor, pgadmin, and then get you set up with an even better editor, dbeaver. so let's break this down. in this lesson we're installing postgres, the database itself, locally onto your computer. we'll download it from the internet, and the installer sets up not just postgres but also pgadmin. now pgadmin ("pg" being short for postgres) is the editor used to interact with postgres databases: anytime you need to start or stop the database, or even run a query against it, you can do that with pgadmin, which we're going to demonstrate throughout this lesson. if you already have postgres and pgadmin installed, you don't need to do it again, but in this lesson, after the install, we're going directly into loading the database, specifically our contoso database that we've been using in those jupyter notebooks. once the database is set up, we'll do a quick walkthrough of the entire pgadmin ui so you become more familiar with it. now, there's one major flaw with pgadmin: it only connects to postgres databases, and after this course you may well be proceeding on to learn other databases.
because of that, in the second lesson we're going to install dbeaver. this is a database management tool, so it can connect to many different databases, and you can also run queries on it and see the output. we'll be using the community edition, which is free, open source, and has everything we need to get started. this is the most popular database tool that i know of, so i'm super excited to use it, and everything you learn for using dbeaver can also be applied to the other databases dbeaver can connect to. one last note before we begin: some of you may run into installation errors or other errors along the way. i highly encourage you to use something like chatgpt to help you out; it's a lot quicker than posting a comment and hoping somebody else comes in to help. now, let's say you can't figure it out, or you're on something like a chromebook and can't install postgres. in that case you can continue to run all the different queries in our sql notebooks; they'll work exactly the same and have the same output, but as far as interacting with the gui goes, you'll have to figure that out yourself, because obviously it'll be different from what i'm showing.

so first, we navigate to the postgres download page, and from there you select the operating system you're currently on; i'm on a mac, so i'll select that. we're using the interactive installer by edb, so select "download the installer" right here. regardless of operating system, everybody gets navigated to a page where you select your os once again and download it for mac or windows. launch the installer, and if you get a warning message, it's okay, click open. we're now going to walk through the setup wizard, leaving all the defaults the same. the core things we want to make sure are installed, which they are by default, are our postgres server and also pgadmin, the gui interface. next is the password, and i'm actually going to set mine to a really easy one: password. my database isn't going to have any confidential material on it, and the dataset we're installing isn't secret at all, so i don't care if somebody else accesses it; and it's local anyway, so other people can't necessarily get to it unless i expose it to the internet. anyway, that's a long story. key thing: if you're only going to be doing this course with it, feel free to set it to password and you should be okay; if not, set it to something else and remember it. keep the port number the same, 5432, which is the standard for postgres databases, and keep the default locale. with the setup complete, i don't need to launch stack builder, so i'll exit and click finish. i'm going to verify it's installed by going to my applications folder: under postgres 17 i have the different options of what's installed, and we're opening up pgadmin, the gui interface for interacting with our postgres database. it'll start loading up, and with it open we have two main panes: the left-hand side pane is our object explorer, which shows all the different databases we're connected to. right now it's asking me to connect to the server, specifically postgres 17, the one we installed, so i'm going to put in that password of password.
i'm also going to click "save password". so as we can see, postgres 17 is a server; we have one server, and inside that server we have databases. right now there's only this one standard database that comes with all postgres servers, called postgres. we're not going to touch this bad boy; i don't really care about it, but we can see that, hey, there's only one database in here. we also have options to adjust login/group roles and also tablespaces; we won't be messing with any of that either. this dashboard over on the right-hand side i find pretty useless; it just tells me when there are interactions inside my server and when it's getting used. so now let's install the contoso dataset locally. for this we need the database file for it, which is right here, this contoso_100k.sql file, and it'll go ahead and start downloading. you don't need to do this, but i opened the file just to show you the contents: it walks through actually creating all the different tables we need inside our database, along with loading in all of our data. it's a pretty long file. so let's get this file into a database, and for that we need a database: we'll right-click databases, select create, then database. we'll name it contoso_100k, keeping all letters lowercase; the owner stays as the superuser, postgres, so we can use that same password to access it. we don't need to change any of the other settings here, they're all good enough, so we'll click save. now we can see two databases underneath, and it automatically drops everything down. if i go under schemas, it has a public schema, and if i click tables to drop it down, there are no tables inside yet. so this is where we actually load that sql file into the database. the first thing you need is the location of the file; i recommend just putting it on your desktop, since we don't need it after we're done and you can delete it. back inside pgadmin, i right-click that contoso_100k database and go to the psql tool; this is effectively like using a terminal to interact with our database. the command starts with a backslash and the letter i, which tells psql to execute the script we're about to point it at, and then, inside single quotes, i put the file location: my users folder, then my username, then desktop, then the sql file itself. make sure it's exactly right. on a mac, if you go to the file itself, hold option, and right-click it, you'll get the option to copy the sql file as a pathname; on windows, all you need to do is shift and right-click the file icon and select copy as path.
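so the full command looks something like this; your path will be different, mine is just an example:

    \i '/Users/lukebarousse/Desktop/contoso_100k.sql'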
okay, we've got it in, so go ahead and press enter. it says pgadmin would like to access my desktop; yes, i want to allow this. it goes through and actually creates all the different tables and alters them, and i can see from this that it put in what looks like six different tables, and it tells us the counts of the rows inserted into each of them. so i'm going to come over here and go to tables to try to see them: there are no tables showing yet, but all i need to do is right-click and select refresh. now, scrolling down, we have six tables inside here, and from this menu i can actually dive into one; in the case of sales, i can dive into its different columns and everything else associated with it. i can also do a quick check by right-clicking something like the sales table, going to count rows, and it tells me at the bottom there are over 199,000 rows in the sales table. but how can we actually query this contoso_100k database? well, first make sure it's actually selected, then come up here and select query tool; we can also see the shortcut of option+shift+q. it opens in a new tab right here. i see these other tabs, and if i don't want them i can select the x and close them out. up at the top it tells me which database i'm connected to; if i had any others i could switch right up here. we have our query window, where i'll put in a simple command to look at the sales table and the top 10 results. to run it, i come up here and select this play icon for execute script, or press f5, and all the results display below. i also have this scratch pad over on the right-hand side, so if i have queries i don't want to keep track of i can just put them over there; overall i don't use it that much. other key features of this area: you can open a sql file right inside this window, or if i want to save this query i can save the file. we also have options for explain, which we'll go into in more detail in some upcoming lessons. down at the bottom we have not only the data output but also any messages and notifications. the output area actually has a few unique capabilities, in that you can copy any of your different data exports out of it, if you want to paste them into chatgpt or something. now let's say we have a more complex query that actually does some analysis, such as this one right here that looks at the total yearly net revenue.
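as a sketch, with the same assumed sales columns, that query could look like:

    SELECT
        EXTRACT(YEAR FROM orderdate) AS year,
        SUM(quantity * netprice * exchangerate) AS total_net_revenue
    FROM sales
    GROUP BY year
    ORDER BY year;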
with this, we can not only save results to a file, but also graph and visualize it right here: i select line chart, set the year for the x-axis and the total yearly net revenue for the y-axis, then select generate. not too bad; we can get into visualizing queries pretty easily with this. the last thing to note with pgadmin is that i can also view the erd, the entity relationship diagram, for the database by right-clicking it and selecting erd for database. this shows our sales table; i can zoom in and actually see the sales table with all its different keys and columns, and how it's connected to all these other tables. a great way to visualize the database and tables you're working with. all right, we now have some practice problems for you to go through and get more familiar with this pgadmin gui. like i mentioned at the beginning, we'll be transitioning next to dbeaver, but i do find myself having to jump in and use pgadmin from time to time, so it pays off to understand the basics of this tool; that's why you've got those practice problems. all right, with that, i'll see you in the next one.

welcome to this lesson on dbeaver. in this one we're walking through setting up dbeaver and getting it connected to our contoso database. the first thing we'll do is download the community edition of dbeaver, which is free; then, from there, walk through the steps necessary in dbeaver to connect to our postgres database; and finally, once we have that set up, we'll do a walkthrough of the dbeaver ui, understanding how we can run different scripts and how we can set up our project inside it. all right, with that, let's get into it. if you navigate over to dbeaver.io, this is the homepage of dbeaver community. it covers a few details about the tool that you can read further: dbeaver community, the edition we're downloading, connects to a variety of databases and has all these different editing and viewing options. talking to all my data analyst friends and looking at the research, it's by far the most popular database editor; that's why we're using it. now, dbeaver needs to make money like any company, so they also have a pro offering. i'm going to click into it (you don't need to), and they have a few different editions you can get and use; some get pretty pricey if you're a business, but for the basic sql and coding that i run, i never need the features inside the lite, enterprise, or ultimate editions. i can get it all done with the community edition. that said, if you become a power user, i'd highly encourage you to buy a subscription, because you'd be supporting building dbeaver out further. all right, cool story, luke; let's actually get into downloading dbeaver. click download, and from there select what operating system you're on and install it. i'll go through it on mac; windows is going to be very similar, so i'm not covering it separately. after your installer file downloads, click it and open it up. on mac it's pretty easy: all you have to do is drag the beaver over into your applications folder, and now it's there. i'll go ahead and open it; if it asks whether you're comfortable opening this app, yeah, we know where we got it from. with dbeaver opened up and launched, you may notice first that mine may be dark and yours may be white; i have dark mode enabled on my mac, so i guess it automatically picked that up. anyway, it says, "hey, do you want to create a sample database that can be used as an example to explore basic dbeaver features?" we're not going to do that; we're going to install the contoso dataset instead, and i'll take you through it. it should have immediately popped open with this connect to database dialog, and that's where we get into installing the database; if this dialog didn't pop up, that's okay, there are a few different ways to get to it, and we need to go through them anyway. one other thing before that: it has this popup that says, "hey, do you want to share your data in order to improve performance?" i'll leave it up to you whether you want to do that or not.
so, to create a new database connection, you can either go up to the database menu and select new database connection, or you can just come right here to this fancy dancy icon and select new database connection. now, this is one of the reasons why i recommend dbeaver so much: it connects to a whole host of different databases, so you can connect all the different ones you're working on as a data analyst. in our case, contoso is a postgres database, so we'll select that. now we fill out the connection details. we're connecting by host, specifically localhost, since it's local on your computer. the database name is not postgres; it's contoso_100k, and make sure you spell it exactly the same as what appears in pgadmin. next is the username, which we kept as postgres, and then the password; if you set it like me, the password is just password, all lowercase. i'm going to leave save password enabled, because i don't want to have to log in every single time. from here it's already picked up that we're using postgres 17, and everything else looks good, so let's test the connection. in my case it says the postgres driver files are not installed and we need to install them. it's basically like attaching a printer to your computer: you have to install driver files to talk to it. similar here, nothing's wrong; go ahead and download them. with that, our test results come back and say we are connected. now, if you are not: one, check all those credentials and make sure they're correct; and two, what may happen to you is that your database may not be started, so you may need to open pgadmin, drill all the way down to the contoso dataset, and make sure it's actually running on your machine. typically, on both mac and windows, your postgres database should start when you start your computer, so you shouldn't have to do this, but if that feature got unintentionally disabled, you may have to start the database manually anytime you restart your computer. with all the credentials put in and the connection tested, we'll go ahead and select finish. so let's now walk through dbeaver, get to understand the ui, and run a few different files. okay, we have this pane right here on the left-hand side, and that's our database navigator; it also holds information on our projects, which we'll get to in a minute. anyway, it has all of our database information in it. if for some reason it disappears, like if i accidentally close out of it, you can go into the windows menu and show the database navigator view, and it pops right back up. so what's inside here? very much like what we saw in pgadmin, we can see all of our databases: we have our contoso_100k database, and we also have these folders, administer and system info, which i use less. i'm typically staying inside contoso_100k, going into schemas, then under public (because the public schema is the one we care about), and into viewing all our different tables. if i drop down something like the sales table, i can see all the different columns in it along with their data types, and i can also see a host of other information, like foreign keys and whatnot.
anyway, one thing you may have noticed is that there are numbers over here on the left-hand side. these aren't the number of rows; instead, if you hover over one, you can see it tells you how much disk space that specific table takes up, so you can get a general idea of how big these tables are on disk alone. i can see that sales and also the customer table are pretty big, relatively speaking; of course, this is actually a pretty small database. now, what i like about tools like dbeaver is how easy it is to dive into tables without having to write a sql query. specifically, if i want to see what's in the sales table, i can right-click it and just go to view table. this right side is the database editor, and it has a tabbed view; for example, i can also open the currency exchange table and cycle through multiple tabs. in here i can view a bunch of different things: under properties, all the different columns, foreign keys, constraints, and whatnot; under data, i can look at the different columns and scroll through a lot more easily, similar to an excel spreadsheet; and then there's the er diagram, the erd, which shows how your tables connect together. i actually feel that, compared to pgadmin, this one is more realistic about how everything is connected; i don't know if you remember pgadmin's version, but everything connected into a single line and went all over the place, a hot mess. dbeaver does a little better here. anyway, the view i'm typically looking at most is this data one, and i can look at it as a grid or as text; text i don't find very useful at all, except if i need to copy and paste, so grid is mostly where i stay. this gui has a lot of different options for interacting with tables and viewing them: you can enter a sql expression to filter them down, or put in custom filters; down at the bottom you can do things like add rows and remove rows, though i typically do that with sql rather than messing with it here in dbeaver; and you can also cycle through the different pages. one that i do find useful, however, is export data: anytime you have data you want to get out of here, you can send it to a variety of targets. typically i'm either exporting it to a csv or exporting it to sql, which turns it into a sql insert statement. all right, enough of that; let's actually get into setting up our project folder. i don't need these two table tabs open, and it asks if i want to save these changes to the database; i didn't really change anything, and i don't want to, so i'll click no. so, we're going to be creating a project folder; i'm going to click projects right here, so we have a place to save our sql files as we go along. right now we just have this general project, which has bookmarks, dashboards, and diagrams; we can also see it right below here. i don't really care about the general one; i want to create a project specific to this course we're working on. so i'll come up to the top right here and select create project, and i'm going to call this intermediate sql project. real original, i know. with this project, i'm going to leave it in the default location, which is inside the dbeaver folder; i could uncheck this and change it to wherever i want it to be.
just so you're aware, though, i'm leaving it in the default location, and i don't want to add the project to a working set, so i'll go ahead and select finish. so now i have this intermediate sql project, but general is still my active project, so i'm going to right-click intermediate sql and select set active project, and it shifts to bold. additionally, if i go back up to the windows menu and go over to project explorer, i can have that view appearing below (if you hadn't closed it out, it was probably showing general, and it should have switched). i like this type of view because now i can switch between projects. but notice that in the database navigator we no longer have our contoso database in there. so what we can do is go back to projects, and it actually makes this pretty easy: right here under general i have the contoso connection, but if i check the connections of the intermediate sql project, there's no database there. we loaded the database into general, so we want to move it over down here, and bam, now when i open this project folder (the active one, in this case), the database is inside it in the database navigator as well. this is pretty neat: i can keep everything grouped together in a single project. now, by default there are four different folders in here, and they should all be empty: bookmarks, dashboards, diagrams, and scripts. bookmarks are just what they imply: if you have something you frequently go to, you can put it there. in this case, let's say i frequently go to the sales table; i can stick it in bookmarks, and now, anytime i need to get to it, just click on it and bam, it appears right here. next we have dashboards, with nothing in it, but you could create a new project dashboard; if you remember, pgadmin had a default dashboard that shows all the different sessions, transactions, stuff like that. i'm not a database engineer, i don't really care about all that, so we're not using it. next are diagrams: if i wanted to, i could create a new erd and call it contoso erd. it does have our core tables in here, but also a host of other tables that just come natively inside a postgres database whenever you install it, so they're all going to show up; if you make it this way, you can filter them down, but we're not going into that. all right, the last one is scripts. so how the heck do we create a new sql script? you can do this from the sql editor menu, or just come up to the top here and select open sql script. there are a few different options that pop up now that we've done this: first, it tells us the active database, and it tells us the active database schema. this is especially important when you're working with multiple databases, to make sure you're running your queries on the correct one. now, notice that when we open this up, we have the script itself plus a new script entry underneath; i'm going to minimize those, we don't need them. with the script here, i can right-click it, go to rename, and name it appropriately: this is just a test script, and i'll make sure it's a .sql file and press okay, and it's been renamed. so let's run our first sql query: just a simple statement selecting all columns from the sales table, limited to 10 results.
if you noticed, i was typing in all caps as i went along, and it made everything lowercase afterward; we're going to fix that in a little bit. anyway, to run this single query: on a mac i press command+enter, on windows control+enter. also, if you forget, you can just hover over these icons and click them, and each one shows you what its shortcut is. now, similar to what we saw before with viewing tables and outputs, i get a tab here, and underneath it i can explore the result in different ways, with text or grid; i can also cycle through it or export the data. so, lots of options to manipulate this and dive into all of it. let's run a slightly more complex query to demonstrate the power of this: i'm going to make this into a cte and then run a query on that cte. i'll add a line up top, put in the WITH keyword, call it sales_copy (because that's what it's going to be), use the AS keyword, and open the parenthesis. now i want to clean this up; i like indentation and things like that, so i can highlight it all, right-click, and go into format, where there's this option for format sql; on a mac the shortcut is control+shift+f, so i can just do that instead. it makes it slightly more readable, although it didn't indent the part inside the parentheses; we'll fix that too. as always with any cte, i'm going to go down here and do a SELECT star FROM the sales_copy table above. now, notice it automatically gives me this error message saying, "hey, sales_copy can't be found", and if i even try to run it by pressing command+enter, it tells me sales_copy doesn't exist, even though i can clearly see it's up here. that's because dbeaver automatically treats any blank line as a statement delimiter; basically, it treats it like a semicolon at the end. so what we need to do is take that blank line out... and i'm still getting an error message; i don't know why... oh, it's running now, and it cleared. okay, i just had to run it once. anyway, i don't like those two things right now: i don't like how it automatically makes everything lowercase, and i don't like that it automatically throws an error just because there's a blank line. so we need to change some settings before we proceed. if you're on a mac, you select dbeaver up in the menu and select preferences; on windows, i think it's under file and then settings. this preferences window opens up and lets us control things inside the editor itself. i want the sql editor settings, specifically the formatting. right now the keyword case is set to default, which is lowercase; i want it upper, so i change it to that. you can also control your indent size, down to something like two; i like bigger indents, so i'll do four. also, remember how previously it wasn't indenting the things inside parentheses? whenever i click this option, it does indent them, so i'll have that selected as well. the last thing to update is under sql processing: "blank line is statement delimiter" is set to always, and i don't really like that, so i'm changing it to never (you can also change it to smart, but i can't guarantee how that behaves). okay, apply and close.
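for reference, here's roughly where that little test script ends up once it's formatted:

    WITH sales_copy AS (
        SELECT *
        FROM sales
        LIMIT 10
    )
    SELECT *
    FROM sales_copy;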
now i select all of it, press control+shift+f, and it formats exactly how i like: it changed all those keywords to uppercase and indented things in. not too bad. now, that was just one query; i can put a semicolon here, and then, say i want another query on top of this, i can keep it in the same script. something i didn't call out before: anytime you're typing, you get this autocomplete, and it also tells you what each suggestion is. similarly, if i'm typing a function, count in this case, it tells me, hey, it's a built-in function in the database; and after i type something like FROM, it automatically knows i probably want a table, so i can put in something like customer. now i have multiple queries inside this script. if i just want to run one of them, i put my cursor on it and press command+enter, and it runs only that one; it tells me there are 104,000 rows. the other thing i can do, say i want to run all of these queries, is use this icon for execute sql script; for me the shortcut is option+x, and on windows i believe it's alt+x. pressing option+x, it automatically opens each result in a different tab. now, one thing to note: okay, how do i keep track of which tab belongs to which query? well, you can put a brief comment up at the top using two dashes; we'll call this one sales copy and the one at the bottom customer count. running this again, pressing option+x, it prompts me: hey, there are three unpinned results tabs; do you want to close these tabs before executing the new query? i want that all the time; if i'm running a new query, i just want to see the new results. so i'm going to say, "hey, don't ask me again," and, "yes, i do want these closed."
and that didn't work, because i actually told you wrong: you need to specify that the comment is a title. you write title, then a colon, and then whatever comes after that is the title. i'll do that for both of these, press option+x, and now both tabs are named below. this is very convenient when you have multiple queries and, obviously, multiple different tabs. also, there's this statistics tab, which basically just tells you the statistics: that it ran two queries, how long they took, and whatnot. so now, with this test script we've created, i can see it's not saved because it has an asterisk; i can go into file and select save, or press command+s or control+s, and the asterisk goes away. now i can close it, and if i ever want to go back to that script, i can just pop it up here and run it as necessary. bam. so now, hopefully you followed along and installed dbeaver, because you'll need to, unless you plan on using jupyter notebooks or colab to run the future queries. anyway, we now have some practice problems for you to get even more familiar with dbeaver, all these different settings, and actually running sql queries. with that, we'll be jumping into the next chapter on building views, so i'll see you there.

welcome to this chapter on views. now, views themselves only take up a small portion of this; the majority of this chapter is an intro to the project, using views. we're going through three lessons. this lesson, specifically this video, is an intro to views: how to create views, how to delete them, how to manage them, and why they're so important. in the second lesson we'll use the view we create here to analyze further and answer the second of our project's questions. our project in total has three questions, which i'll showcase in a little bit, and you may be like, luke, what happened to question one? well, question one we actually answered, or started to answer, earlier; we're just going to build on it further in some future lessons. don't worry, i'll get you up to speed. and then the third lesson is actually installing vs code, a code editor that makes it super easy for us to build up our portfolio project and then share it onto the internet. now, before we get into views, i want to showcase what we're building in this project. specifically, we're going to be sharing this to your github profile, and it's going to detail everything you've done. if you're not familiar with github, it's a place where you can store, share, and collaborate on files. here in this menu area it shows all the different files in this repo: we have some sql files, then a readme, which i'll discuss more in a little bit, and then something like this images folder; if i click on it, i can see it has an image inside. anyway, getting to that readme: the readme on the front page of a project is displayed right below the file list, so here i can go through and document all the different analyses i've done. if an employer is interested in the analyses i've done, they can come to my github and view it all there.
now you may be like, luke, why the heck do i need to install vs code? you already had me install dbeaver; what am i doing with this? well, if you've taken my basics course, you know vs code is really powerful, not only for writing sql queries but also for other coding projects, like using python. its special use case in this project is actually building our readme. here i've typed out all the different portions of the readme, and i can view it all dressed up on the right-hand side, exactly how it's going to appear on something like github. unfortunately, dbeaver doesn't have these capabilities; and on top of that, from vs code i can also push this onto github right from the gui. so, loads of benefit. for those who came from the basics course, you've used vs code and you're familiar with it; there's not going to be a lot new that i'm covering, so you'll probably be able to skip that lesson. so, let's get into views. well, first of all, what the heck is a view? it's a virtual table that shows the results of a stored query. for example, we're going to create one in our next example, which you'll find under the views folder under the public schema, called cohort analysis. whenever i click on it, it's that virtual table: it has all the results of a certain query, and i can go under properties, look under source, and actually see the sql query used to generate it. so, with this virtual table, in this case cohort analysis, i can open up a script, clear it all out, and say i want to select all the rows from the view by name; cohort analysis even appears right here in the autocomplete, telling me it's a view. when i run it, i get all its results below. now, views are super important and necessary to level up your sql skills. they basically prevent you from having to write the same query time and time again, because you know what happens when you write the same query over and over: you're eventually going to make a mistake. a centralized view prevents that, and it also ensures that any other queries that depend on the view get updated if, for some reason, you have to update that view. anyway, i'm getting ahead of myself; what's the syntax? all it is is the keywords CREATE VIEW, a name, AS, and then all the sql below it that goes into the view. so let's create our first view. i'm actually going to delete the view we just looked at, because you don't have it yet; for this, just open up a blank script. we're going to write a simple query that gets the daily revenue: we'll use the order date, and then the sum of quantity times net price times exchange rate with the alias net revenue, pulled from the sales table, and we need a GROUP BY since we did that aggregation. all right, let's run this bad boy, and it looks like it's all correct. one thing to note: i didn't filter this or put it in any order, and you can actually do that right here in the grid. i can click one of these column filters and say, hey, order by descending, and it shows me we start in april 2024 and go backward through the total revenue. pretty neat.
all right, so this is the query we want behind our view, so let's create it using that syntax. specifically, i write CREATE VIEW, give it the name daily_revenue, and then just use AS; no need to put the query in parentheses. i'll run this, pressing command+enter, and you should get something like this at the bottom telling you the query has in fact finished. now, if i go to views, there's nothing there; i need to refresh, which you can do by right-clicking and clicking refresh, or with the shortcut you see right here, f5. i'll just press f5, and now daily_revenue is there. i can double-click to open it, and it has a few different tabs underneath: the properties, as i said, including the source, which gives us the query used to create the view (so we don't need to save our query separately; it's right there); the data; and finally the erd, which in this case doesn't connect to anything else, it's just its own table. bam, that's it. now, if i want to access this view, all i have to do is a SELECT star and specify i want it FROM daily_revenue, and since i put a semicolon after the last statement, it only runs this one on command+enter, with all the results appearing below. okay, let's say i'm done with this view, or i don't need it anymore. there are a couple of ways to get rid of it. i can right-click it and just come down here to delete; it then prompts me: are you sure you want to delete this view, daily_revenue? and it asks if i want to cascade delete: basically, if there are other views based on this view, it deletes all of those as well, so you need to decide whether that's applicable and answer yes or no. we're not actually going to delete it via that method, and i'll just confirm it's still there by refreshing. instead, what we can do, once again after a semicolon, is use the keyword DROP VIEW and then specify the view, daily_revenue. okay, run this with command+enter; it tells me underneath that it completed pretty quickly, and coming over here and pressing f5, we can see the view is no longer there. a very important note: deleting views is permanent. you can't recover one once you do this, so make sure you really do want to delete that view.
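here's the whole lifecycle as one sketch, with the same assumed sales column names as before:

    -- create the view (no parentheses needed around the query)
    CREATE VIEW daily_revenue AS
    SELECT
        orderdate,
        SUM(quantity * netprice * exchangerate) AS net_revenue
    FROM sales
    GROUP BY orderdate;

    -- query it like any other table
    SELECT *
    FROM daily_revenue;

    -- and remove it when you're done (this is permanent!)
    DROP VIEW daily_revenue;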
now that we have the basics of views, let's actually get into creating the view needed to answer the different questions in our project. once again, a reminder: we're only answering three questions for our project, and we'll be working on that second problem in the next lesson. anyway, you haven't created this yet, but this is what we're getting to eventually: like i said, we're going to have our view in here, our different sql files answering our three questions, and then our readme. this create view is what we start working on in this lesson; we're not necessarily going to finish it here (that happens in the text cleanup lesson), but we'll get a bit of a start. so what the heck does this view actually provide us? we're going to be diving shortly into a more advanced cohort analysis than what we've done previously, and we need a table, a view if you will, to help us out and speed up that analysis. specifically, this table is basically broken down and aggregated to provide the key things about a customer: when their orders were, how many orders they had, when their first purchase date was, what cohort they fall into, and additionally some customer information from the customer table. this is going to be super helpful, especially for something like total net revenue, which does that quantity times net price times exchange rate calculation; it's just already there, so i don't have to worry about it. so let's start building this view, and we'll do it by just sketching out our query. we're going to start with the sales table, bringing in only the information we need. first i'll do a SELECT star, pulling from sales; since we're actually using multiple tables in this, i'll go ahead and add the alias s (i can also press tab and autocomplete adds it). running this, we can start picking out what i want from the table below: i know i want the customer key along with the order date; as always, i want that total net revenue, so quantity times net price times exchange rate, with the alias total net revenue; and we'll do one more thing, a count of the number of orders, which we'll do off the order key. because we're doing that aggregation, we've got to do a GROUP BY, and i'm lazy, so i'll just copy those columns from up here and paste them below. all right, let's run this... looks like i have a typo over here; look at this syntax highlighting helping out to figure that out. running it now: bam, we've got our results below, and everything looks like it's aggregating correctly. now, with this table, i also want things like the first purchase date and the cohort year. those are going to take window functions, and remember, i don't want to put those into a GROUP BY, so we'd need to create a cte first. what i'm getting at is, instead, i'm going to move over to the customer information and extract some key customer details to put into our source table. we need those from our customer table, so i'm going to do a LEFT JOIN, which keeps all the information from the sales table and attaches any related things from our customer table. i'll give customer the alias c, and we'll link on the customer keys of both tables. i'll run this just to make sure there are no issues; okay, it runs just fine. so what information do we want to add? i'll do c.* so we can bring all of it in before refining it down, and it tells me i need c.customerkey in the GROUP BY. we won't keep it forever, but adding it clears up the error. let's try again: all right, scrolling over, the customer information is starting to come through. i want things like the full country, the customer's age, the customer's given name, and their surname; that should be it. and now those all need to go into the GROUP BY too, remember, since we're doing an aggregation, so i'll come down here, put them underneath, and clean this all up. and i don't believe we need that extra customer key from the join anymore, so i'm going to remove it. now let's run the query and make sure it all goes through.
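at this point the query looks roughly like this; the customer column names (countryfull, age, givenname, surname) are how i'd expect them in contoso_100k, so double-check against your own tables:

    SELECT
        s.customerkey,
        s.orderdate,
        SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
        COUNT(s.orderkey),  -- note: no alias on this count; more on that shortly
        c.countryfull,
        c.age,
        c.givenname,
        c.surname
    FROM sales s
    LEFT JOIN customer c ON s.customerkey = c.customerkey
    GROUP BY
        s.customerkey,
        s.orderdate,
        c.countryfull,
        c.age,
        c.givenname,
        c.surname;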
Everything looks good; we have all those different columns in it. Okay, now what we need to do is extract, for each of these customers, their cohort year — the year of their first purchase. So I'm going to put this all into a CTE: I'll indent it over and space it down so we can put that WITH in, give it the name customer_revenue, assign it with AS, an opening parenthesis, and finally a closing parenthesis. Then, to make sure this is all correct, I'll just do a SELECT * FROM customer_revenue. Running this, we can see it provides exactly the same information we had before, so we're on a good path. Now we need window functions in order to use that order_date to get the minimum order date per customer and assign the cohort year. For this I want everything from our customer_revenue table, so I'll give it the alias cr and do cr.*. Then I want that minimum order date: we'll do MIN on order_date, and since it's a window function we'll do OVER and PARTITION BY the customer_key — so for a customer like 180, we look across their rows and take the minimum — and give it the alias first_purchase_date. Okay, let's run this and see how it's doing. For rows two and three — let me expand this out — we can see order dates of 2018 and also 2023, so the minimum order date, the first purchase date, should be 2018, which it is. Now we can build another column for cohort_year, and all this is going to be is a copy, if you will, but wrapped in EXTRACT: we copy the contents of that minimum-order-date expression above, paste it in, and give it the alias cohort_year. Let's run this — I was rushing too fast, I realize: I have to extract something from that window function, right? I have to EXTRACT the YEAR from the window function. Now let's run it again, and we can see that for customer 180, rows two and three do in fact get the cohort year 2018. This is good. So now let's turn this into a view that we can reuse. We enter a line down here and use the keywords CREATE VIEW, name it cohort_analysis, and once again use AS — we don't need to put it all in parentheses. Okay, running this with Command+Enter, it ran super fast. If the view isn't appearing, remember we need to press F5, and now it's appearing underneath here, showing all of our different columns on the ERD diagram and also on the Data tab. So now, with this view, analysis becomes super simple. I'll create a new script — I need a new script, so I'll say New Script here — and say I want to analyze something like the total revenue per cohort. Super simple now with this view: I specify the cohort_year and the SUM of total_net_revenue, and then we want this FROM our actual view, cohort_analysis. We did an aggregation, so we need a GROUP BY, specifically on cohort_year. Running this, we can see our different results. I didn't do an ORDER BY, so I'll add one and order in descending order. And so now, with a super simple query, I can get this at lightning speed, because I don't have to redo all that other analysis — it's already captured in the view.
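Here's roughly what the finished view and the follow-up query look like, as a sketch pieced together from the steps above (column names are assumptions carried over from earlier):

```sql
-- Wrap the aggregation in a CTE, then layer window functions on top
CREATE VIEW cohort_analysis AS
WITH customer_revenue AS (
    SELECT
        s.customer_key,
        s.order_date,
        SUM(s.quantity * s.net_price * s.exchange_rate) AS total_net_revenue,
        COUNT(s.order_key) AS num_orders,
        c.country_full,
        c.age,
        c.given_name,
        c.surname
    FROM sales s
    LEFT JOIN customer c ON s.customer_key = c.customer_key
    GROUP BY s.customer_key, s.order_date,
             c.country_full, c.age, c.given_name, c.surname
)
SELECT
    cr.*,
    MIN(cr.order_date) OVER (PARTITION BY cr.customer_key) AS first_purchase_date,
    EXTRACT(YEAR FROM MIN(cr.order_date) OVER (PARTITION BY cr.customer_key)) AS cohort_year
FROM customer_revenue cr;

-- With the view in place, cohort-level revenue is a three-line query
SELECT cohort_year, SUM(total_net_revenue) AS total_revenue
FROM cohort_analysis
GROUP BY cohort_year
ORDER BY cohort_year DESC;
```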
Before we wrap up this lesson — this is future Luke, as you can tell by the different flannel — we made a little bit of a mistake in our view, specifically with naming the column that holds the number of orders: I didn't give it an alias. What do I mean by this? Going into cohort_analysis — and anytime I want to use it, I press F5 to make sure it's fully up to date — if we go into the Data tab, everything looks fine except this column here. It's the number of orders, but we unfortunately left it named count, and that's not a descriptive name; we really need to change it to something more descriptive. This problem comes up quite frequently, so it's a good use case to go through anyway. Remember, under Properties and then Source we can see all of the code. Notice this command right here, CREATE OR REPLACE VIEW — previously we just saw CREATE VIEW; CREATE OR REPLACE lets us replace the view if it already exists. Now, unfortunately, I can't just come in here and update count to num_orders — that's the alias I want — by writing the count down here as num_orders and clicking Save. It asks, hey, do you want to execute this, and then says: cannot change the view column count to num_orders; instead you should use something like ALTER VIEW with RENAME COLUMN to change the name of the view's column. Now, ALTER VIEW is a great thing to know about: with it you can add columns, remove columns, and, in our case, rename columns. So we're going to use that syntax to rename it — but that's only going to be a partial solution, as we'll see. We use the keyword ALTER VIEW, name the view itself, cohort_analysis, and then use RENAME COLUMN. The syntax highlighting is flagging this as wrong, saying a table reference is expected — don't worry about it, it's a false warning. So what column do we want to rename? That count column. And what do we want to rename it to? num_orders. Running this with Command+Enter, it looks like it ran fine. When we come back over here, we see a star next to cohort_analysis, which means it was updated, so we need to select it inside the Database Navigator and press F5 to make sure it refreshes. I'd actually recommend just closing out of the old tab, because we've changed some of its properties and we want to see what the newest version looks like. It asks, hey, do you want these changes persisted to the database — no, I don't. Clicking cohort_analysis again to get the newest version, we can see that it didn't change count up here, but it did change it down here, where it now carries the alias num_orders — so the rename only took effect in one of the two places. I'm a perfectionist, and it's also just good practice in general: I want to change it in both locations. To do that, we actually need to drop this view — like dropping a table — and then create the view again. So what I'm going to do is copy all of this code (we don't need the ALTER VIEW script anymore), paste it in here, make sure we have num_orders right here, and down here, since we're now aliasing it in the query itself, we can remove the rename. Before we run this CREATE OR REPLACE VIEW, though, we need to actually drop the view first.
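A sketch of that rename — ALTER VIEW with RENAME COLUMN is standard Postgres syntax; I've quoted the count column name here as a precaution, since count is also an SQL keyword:

```sql
-- Rename the unaliased aggregate column on the existing view
ALTER VIEW cohort_analysis RENAME COLUMN "count" TO num_orders;
```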
That one's simply written DROP VIEW, and then we list the view name, cohort_analysis. I'll put a semicolon after it, and then I want to execute this entire script, so I'll press Option+X. It says it ran those two queries and finished. Once again, I'll close out of cohort_analysis, make sure I have it selected in here, press F5, and open cohort_analysis back up. Looking inside our Source, we can see it's updated to num_orders in both locations. So, crisis averted: the column is up to date and the query stays concise. All right, now we have a few examples for you to go through to get more familiar with creating views; we'll be using some of the examples from previous lessons to build views from, so you can reuse them. In the next lesson we're going to build further on this view in order to answer that second question in our project and further analyze the cohorts. All right, with that, I'll see you there.

Welcome to the second lesson. In this one we're diving into a question for the project, specifically: how do customers in a particular group generate revenue? Regarding the "particular group": we've broken customers into groups before when doing cohort analysis, and that's what we're continuing from here. Now, spoiler alert: for this analysis we're going to look at the different cohort years and, at the customer level, see how they spend money — specifically, how their spending changes over time. Generally it's good for business when customers spend more over time, because it means more money, and we'd expect a company to learn over time how to extract more value from its customers. Unfortunately, we find out just the opposite. Let's quickly re-examine what we've previously done on cohort analysis. I'm not going to walk through this entire query — we did it in the window functions chapter — but with its results we were able to plot out the impact of each cohort on future years' total revenue. As expected, net revenue is going up, and every year there are contributions to that net revenue from previous years' cohorts, because your cohort year is based on your first purchase year. Honestly, though, this didn't uncover a lot for us — does it really tell us that much? We went further and also analyzed the number of total customers, and we saw that it went up as well. Once again, not a lot of insight. So what do we need to do? Well, using the view we created in the last lesson, we're going to take it a step further: we'll analyze the total revenue and the total customers, and then finally get each individual customer's revenue — on average, at least. So let's jump into building this query. We'll use that cohort_analysis view; I can dig into it and see that it contains all the same values we built in the last lesson. From it, I want to get, per cohort year, the total number of customers (using customer_key) and the total net revenue for that cohort. I'll start a new script and fill in the FROM first, specifically FROM cohort_analysis. I like to do this mainly because, when I then go to fill in something like cohort_year (which is what we want), I can see that the column does in fact exist.
It also does the correct syntax highlighting as I go along. Now, with this we want the total customers per year, so we'll do a distinct count: COUNT with DISTINCT inside it, specifically on customer_key, with the alias total_customers. Next we need the sum of the total revenue, so inside a SUM function I'll use total_net_revenue and give it the alias total_revenue. All right, we used aggregation functions, so we need a GROUP BY on cohort_year. Let's run this and see what we have: we get back the total customers and the total revenue. Now let's look at this visually, because I think it's important to understand why we're taking this a step further, down to the customer level. Here I've plotted it: the bars are revenue, on the left-hand axis, and the line is the count of total customers, on the right-hand axis. As expected, the line correlates reasonably well with the size of the bars. Simply put, more customers equals more revenue — and that's nothing new; it's not the kind of insight you'd take to your boss. We need to dive deeper into key characteristics of the customers to actually say something about their spending habits. So let's get this customer revenue: all we do is take our total_revenue up here and divide it by total_customers, and we'll give it the alias customer_revenue. I'll run the query, and now we have the customer revenue. Just a shout-out to that view: look how simple this query is now. Having the data already in the view makes this super quick to do.
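Pieced together, the cohort summary query from this lesson looks roughly like this:

```sql
-- Customers, revenue, and average revenue per customer, by cohort
SELECT
    cohort_year,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(total_net_revenue) AS total_revenue,
    SUM(total_net_revenue) / COUNT(DISTINCT customer_key) AS customer_revenue
FROM cohort_analysis
GROUP BY cohort_year;
```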
Anyway, back to exploring this customer revenue over time: we can see it's basically dropping. Let's look at it visually — with ChatGPT plotting this, I have it showing that early on these customers spend quite a bit, around $3,000 per customer, but then it starts to go down; that's an exponential trend line I had ChatGPT put on there. This is concerning: the per-customer revenue by cohort year is dropping year after year, and like I mentioned at the beginning, I'd expect it to either stay the same or go up over time. That's not a good sign. Now, I will say this: remember that older cohorts — say cohort 2016 — have all of these later years to contribute to their cohort. You could be part of cohort 2016 and also buy something in 2024, so I'd expect earlier cohorts to have higher customer revenue in general. We need to adjust our query to account for this. But you may be asking: how the heck do we do that? Do we use some sort of window function and limit the time window of each cohort? And what would that window be — a day, a year — during which purchases count toward the customer or the revenue for that cohort? Well, I actually did some further analysis on this — you don't need to run this query yourself — and we won't walk through it step by step, because the parts of the query aren't what's important; the output is. What I had it do is calculate what contributes to the total revenue based on the number of days since the first purchase. In total, about $127 million was spent on day zero, i.e. the day of the first purchase, and after that it drops off significantly, to something like 31,000, 51,000, and so on. I took this total revenue a step further and put it into percentages, and we can see it goes from 61% down to less than a percent; I also plotted it for the more visual types. What we can extract from this is that, in general or on average, a customer spends about 60% of their total revenue on the first day and minimal amounts after that. So what we'll do is go back in and adjust our query to account for this: for each cohort year, we'll only count revenue from purchases completed on the customer's first day, and we won't take anything else into account, since the majority of purchases happen on the first day. How can we do this? Conveniently, in the view we created we have not only the order_date but also the first_purchase_date, so we can match up where those two dates are equal to capture only those purchases. I'll put a WHERE clause in and set order_date equal to first_purchase_date, and that's really all we have to do for this.
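As a sketch, the adjusted query just adds the first-day filter:

```sql
-- Only count revenue from each customer's first purchase day
SELECT
    cohort_year,
    COUNT(DISTINCT customer_key) AS total_customers,
    SUM(total_net_revenue) AS total_revenue,
    SUM(total_net_revenue) / COUNT(DISTINCT customer_key) AS customer_revenue
FROM cohort_analysis
WHERE order_date = first_purchase_date
GROUP BY cohort_year;
```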
Now, pressing Command+Enter, we have some updated results, and it looks like our customer revenue dropped slightly. Plotting it, we can see it dips just below $3,000, whereas before it was right around $3,000. The main thing here is that with the exponential trend line — and you might have thought that removing those later-year purchases would soften the trend — it's actually more pronounced: the later cohorts, like 2022 and 2023, spend even less. So this is a pretty big breakthrough that we've uncovered, and given the trend we're on, it could have serious implications for the business; it would be a great insight to bring to our superiors or stakeholders. That's the end of what we're doing to answer this question on analyzing customer groups, as I feel we've found a pretty significant insight. In the practice problems, you'll analyze the revenue and the total customer count over time on a monthly basis, to find out why certain years are lower than others and to uncover other insights that I'll detail at the beginning of the next video. After you're done with those practice problems, we'll jump into installing VS Code, which we'll use to document the insights for our project; specifically, in the next lesson we'll document what we learned from this question. With that, see you there.

Welcome to this last lesson of the chapter introducing our project; in this one we're setting up VS Code. As a refresher on why we're doing this and not just using DBeaver: DBeaver is great for writing SQL queries, analyzing them, and improving them, but when it comes to actually sharing your work and collaborating with others using things like GitHub, or documentation tools like Markdown, it gets quite hard. So, for both my workflow and Kelly's, we like to use DBeaver in cooperation with VS Code, and this code editor will allow us to do two major things. First, we'll be able to build a README — a Markdown file — documenting all the analysis we've done, for whenever we want to share it. Second, it makes it super easy to push everything up to GitHub and share it with others so they can see the work from our README. Now, if you took my basic SQL tutorial, you probably already have VS Code, so you can skip that portion of this lesson, but we will be going on to how to build out that README, specifically for question two, which we answered in the last lesson. Before we jump into that, let's quickly go over the analysis you did in the practice problems, where we uncovered even further insights. In the last lesson we evaluated how different customer groups generate revenue: we broke it down by cohort year and found the average customer revenue per cohort year. At this macro level we were able to make some callouts, specifically that per-customer revenue trends downward in general, which is not good. So, as the practice problem assigned, you went further into analyzing why we have these dips: was there something deeper going on in the data set besides customers just spending less? To catch up anyone who didn't do the practice problem: we analyzed customer revenue and the total number of customers on a monthly basis, and got a final table with total revenue, total customers, and customer revenue. Let's start with customer revenue, since that's what we've been talking about: as we saw in the last lesson, it's slowly going down over time, so analyzing at the monthly level doesn't really help much there. But if we look at the total revenue and the total customers and plot them, we get something like this, where the blue bars are monthly revenue and the line is the total number of customers. Looking at general trends, if we plot a line of best fit, we'd think our net revenue is going up over time, with the exception of a pretty big dip in 2020 — probably due to some sort of pandemic during that period — after which it rose again, before dipping slightly in 2023. The major insight applicable to us from that analysis: as we said before, more customers equals more revenue, and it does match up — but in 2022 and 2023 we can see pretty large gaps, where there are a lot of customers but the revenue isn't keeping up. That helps explain what's going on in this graph: yes, we're getting a higher number of customers, but each customer is spending less. A pretty interesting insight. Now let's get into installing VS Code. If you navigate to the link on the screen, you'll be directed to the download page for Visual Studio Code. More recently, Microsoft has been advertising it alongside GitHub Copilot — it has this "redefined with AI" messaging and they're really pushing it — but we're not going to go much into the AI features; we're just downloading the code editor. Download the file and click it to get it launching; on a Mac it unzips into the Visual Studio Code app itself, which I can drag into my Applications folder so it's in a safer, more secure location.
If you're on a Windows machine, it's going to walk you through an installer — quite a few more steps — but it will direct you on where to put VS Code and whether you want an icon. Regardless, you'll get a system message asking whether you want to open an app installed from the internet; yes, we're fine with it, open it. Upon launch you'll get a welcome screen that can guide you through a step-by-step intro; you can do it if you want, but we'll be covering all the key features you need, so don't feel like you have to. Let's briefly explore Visual Studio Code before we set up the project folder we'll be working in. Over here on the left-hand side is the Activity Bar, and whenever we click an item in it, a sidebar slides in or out, depending on whether we want it there. The first item is the Explorer. We don't have a folder open yet, so I'll open a dummy folder — you don't need to do this — and anytime you open one, it asks if you trust the authors; I'm opening it from my own computer, and I trust myself, I think, so at least yes. This shows a file breakdown of what's inside the folder, and folders themselves have these carets you can expand if you want to see inside. If I want to see where the files live on disk, I can right-click; on Mac it's Reveal in Finder, on Windows it's Reveal in File Explorer, and you can see the structure matches what we're seeing over here in the Explorer. Other things in the Activity Bar: we have Search, so if I search "sql", all the occurrences pop up here and I can jump right to them. Next is Source Control, which controls how we get this onto GitHub; we'll cover that, and interactions with GitHub, in the last chapter, so don't worry about it too much. Oh, and one thing I forgot to mention earlier: over on the right-hand side — if I close this sidebar — this is the code editor itself. If we open back up that second query (sorry, I'll close the sidebar and zoom in by pressing Command+Plus, or Control+Plus on Windows), we can see the SQL query right in there. If I need to add anything — say I want an alias right here — I can just type it in, and you'll notice a white dot appears in the tab at the top: that means we have unsaved changes, which you can save by pressing Command+S or Control+S. Now, what we'll be doing later is building out the README, and what I really like about VS Code is that, with all this fancy-dancy Markdown language typed in here, if I want to see what it actually looks like, I can select the README and click this preview icon, and it shows the rendered README, with all the different images and whatnot, right next to me as I scroll — one of the main benefits of using VS Code. All right, I'm going to close all of this out and zoom back out so we can see everything. The last two items: there's a Run and Debug section, which we won't be using, and finally Extensions.
It's really popular, if you're using VS Code for a particular programming language like Python or whatnot, to have the appropriate extension installed — in that case, you'd install the Python extension to use Python. Overall, though, there aren't a lot of extensions — really any — that I think you need to install for this. If you do want to try one just to see what it's like, I recommend Code Spell Checker. When I click it, it opens right up next door; to install it, I just click Install, it asks whether I trust this publisher — yes, I do — and now I have this spell checker in here. If I go back to that SQL file from before, it will now flag some of these keywords that don't have an underscore and call them unknown words. You can try a quick fix with them, but those are the column names that came with the database, so we're not going to change them at all. Mainly I find it useful for things like creating a new alias: if I'm summing something like total customers and I assign an alias with a misspelling in it — butchering "customers", say — it calls it out, so I know I misspelled it there. Anyway, that's extensions; and no, I don't want to save any of this. The other two things in the Activity Bar to be aware of are your account, right here, and Settings. What I find myself gravitating toward a lot is the Command Palette, which has the shortcut Command+Shift+P (or Control+Shift+P) — the one VS Code shortcut I'd highly recommend memorizing. When I press it, a search bar comes up at the top, and I can search for any setting I want to change in VS Code. Say I wanted to change the color theme: I'd type in "color", see Preferences: Color Theme, and then pick from a menu with a host of different options. The last thing to note is the Status Bar down at the bottom. We won't use it too much — in my case right now I'm zoomed in, and I could go back and reset that — and you'll also see information in the left-hand corner if we're using Git, along with any issues as we go.

Let's now get into setting up the project folder that we're going to eventually push to GitHub: we'll set it up, build out our README, and add those SQL files — or at least the last SQL file from the last lesson. For this we want to open a folder that has our project in it, but first we need that folder, if you will. Back in DBeaver, if you remember, we already created a project folder, and it has Bookmarks, Dashboards, Diagrams, and even all of our Scripts in it. I'm not about reinventing the wheel; I think we should just use this project as our project folder itself. We can do this a couple of different ways. I want to find the location of this folder, so I'll right-click it and say, hey, show resources in Explorer. This is the project's folder; I'm going to back out just one level so we can see it: this is the folder for the intermediate SQL project itself, along with the folders underneath it. I want the file path to this — specifically, this is the folder location I want to go to when we open it in VS Code. So I'll open a folder in VS Code; on Macs, unfortunately, the folder this lives in is hidden, so I'll hit the shortcut Command+Shift+Period, and I know that in my home folder — lukebarousse — it's in the Library folder.
From there I can navigate to the DBeaver-specific folder, go into my workspace, and then I see the project itself and can open it. It's going to ask if you trust the authors of the files in this folder — I do, and I'm also going to enable trusting all files and folders within it. So now, inside our Explorer, we have a few different folders, if you will. I'll be honest: we're not going to use any of these at all. One thing to call out is that you may visually only see Bookmarks, Diagrams, and Scripts, but there are also these other items — they're called dot files — and if I press Command+Shift+Period we can see them; they're actually just hidden files. I'm going to keep them hidden by pressing Command+Shift+Period again. The key thing here is that those folders and files aren't important, and we can be selective about what we actually put on GitHub — and we will need to be, because we don't really want to push those up. Anyway, let's make our first file. We come up to the top and select New File. I'm going to start the name with 2 — I like doing that to designate, hey, this is for the second question — and call it 2_cohort_analysis.sql. As you notice, as soon as I name it a .sql file, I get this new icon right there showing me it's a SQL file, and when I press Enter it automatically opens in the text editor on the right-hand side. Now I can copy the query we did previously and, inside VS Code, paste it all in; I have the white dot saying it's unsaved, so I press Command+S or Control+S and it's saved. I'll go ahead and close this all out. One thing to note: inside DBeaver, underneath that project folder, we're not seeing the new files pop up — like the SQL file we just created — and that's because we haven't refreshed. If I right-click and select Refresh, the query now appears inside the project folder. If you're clicking Refresh and it's not refreshing or showing up, I actually had to restart DBeaver to get this to work, so just a word of warning. Anyway, this SQL file is now here, so I'm going to close out this script and this one. If I wanted to, I don't necessarily have to go back into VS Code to edit it — I could edit it right from here: say I add an alias, ca, and save by pressing Command+S; when I come back into VS Code and check the SQL file, I can see the alias got added. I don't want it, so I'll remove it and press Command+S again. So let's get into building our README file. As a reminder, it's going to be basically the front page of our project, detailing all the analysis we did and breaking down each of the three questions we've gone through or will go through. A key thing to note for GitHub: we want this file to be called README.md, because GitHub specifically picks up on that naming convention and displays the file below the repository. So we'll create the file — I'll call it README, in all caps — and the icon changes to the README icon; it's a Markdown file, so I give it the .md extension, press Enter, and it opens up right next to it. The first thing I'll do is give it a title. Remember, we can do different heading levels depending on how many hashtags we use; I'll title this one Intermediate SQL - Sales Analysis.
But what the heck does this actually look like? We can click this icon right here for the preview to appear on the right-hand side, so as we type different things we can see how they're formatted as we go. Now, what sections are we going to put in here? Really, it's up to you — you don't have to follow all, or even any, of what I'm about to put in — but I recommend these major sections. First, a short overview. Then our three business questions, each with a short description. Then the analysis approach, breaking down each one — we'll do question two, on the cohort analysis, and I'll walk you through that shortly; below the three questions in the analysis approach, I've only included the one example right here. Finally, our ending, which has things like our strategic recommendations — what we got out of this — and any technical details of what we used to build it. Let's start filling this in, beginning with the business questions: I'll put in one, two, and three. For the second question we did a cohort analysis, asking: how do different customer groups generate revenue? I'm not really liking how this is formatted, so I'll use some extra Markdown, putting double asterisks before and after "cohort analysis" — that bolds it and makes it stand out more. Now let's fill in the analysis approach for that second question. I'll copy this one right here, paste it below, and title this section Cohort Analysis. Next we need to describe the analysis approach we actually used, so I put in some short bullet points: we tracked revenue and customer counts per cohort; a cohort is defined by grouping on the year of first purchase; and we analyzed customer retention at the cohort level. The next thing I like to include is the query itself. Instead of a link — we'll get to links shortly — you could put in a code block: I'll do three backticks (you can find the backtick at the top of your keyboard), copy the query, and paste it into the README, and it's displayed right here. I can also format it as SQL by putting "sql" after the opening backticks, and then it gets color-coded like this. Oh, but this is all smushed — and I'll be honest, I'm a fan of DRY, don't repeat yourself: we already have this code somewhere, so I'm actually not going to paste it here. Instead, I want to put in a link to the SQL file. We do this with square brackets — what goes in the square brackets is the link text, and I'll just use the file's name — and then in parentheses goes the actual file location. When I start typing the path with a slash, all the files I have access to appear right here, and I select that first one, the SQL file. Now, over on the right-hand side, whenever I click the link, the file itself pops up, so I know the link is working properly — and these links are also going to work on GitHub when we get there.
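Put together, the README patterns from this walkthrough look roughly like the following. The section wording, file names, and the image extension are my approximations of what's built on screen (the image link is covered just below):

```markdown
# Intermediate SQL - Sales Analysis

## Overview
Short summary of the project and the dataset.

## Business Questions
1. ...
2. **Cohort Analysis:** How do different customer groups generate revenue?
3. ...

## Analysis Approach
### Cohort Analysis
- Tracked revenue and customer counts per cohort
- Cohorts were grouped by year of first purchase
- Analyzed customer retention at the cohort level

[2_cohort_analysis.sql](2_cohort_analysis.sql)

![Cohort analysis](images/2_cohort_analysis.png)
```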
All right, the next section I have is on visualizations, and that applies if you've generated any images. You don't have to do this per se, but in my case I really like doing it. So I'll come over here, create a new folder, and call it images — I like to organize all my images in one location. Then I'll take that image from my desktop and drag it into the folder. It's right here, conveniently just named "image", so I'll change that by right-clicking, selecting Rename, and calling it 2_cohort_analysis. Now, going back into the README itself: the image name in the square brackets is just the alt text; what we mainly need to get right is the actual image path. Once again I'll start typing the path, go into the images folder, and select 2_cohort_analysis — and oh, it pops up right next to it; no, I'm good. After this we dive into key findings, where I summarize the main points: that revenue per customer shows an alarming decreasing trend over time, and I call out specifically that 2022 and later years keep declining, while net revenue increasing is likely due to a larger customer base, which we found out when we did the deeper analysis. This finally brings us to the final section: the business insights. For this I have the following: the value extracted from customers is decreasing over time and needs further investigation — we need to find the root cause; and in 2023 we also saw a drop in the number of customers, and with it a drop in revenue. Because of these two facts alone, the company is facing a potential — or, as we actually saw in 2023, a real — revenue decline. Overall, this is a good step in the right direction on what we need to recommend and where we need to go. All right, it's your turn to go through and build out that README document, and hopefully you've been following along with installing VS Code and whatnot. We do have a few practice problems for you to get more familiar with VS Code if you want that practice, along with a template that will be available for building out this question two. In the next chapter we're getting into data cleaning, my favorite part of data analysis, so I'll see you there.

Welcome to this chapter on data cleaning. We have three lessons to cover. In the first two, we'll cover some core concepts you need to know about data cleaning: this lesson goes over conditional expressions for handling nulls — things like COALESCE and NULLIF — and the next lesson goes over strings, because from time to time you'll be dealing with strings and need to clean them up, maybe put them together or even separate them. At the end of that lesson, we'll apply all the concepts we've learned to further refine our cohort_analysis view. Finally, in the third lesson of this chapter, we'll answer question one of our project, which focuses on customer segmentation. Our project consists of three questions; in the previous chapter we focused on that second question, on cohort analysis, and in this one we'll use customer segmentation to find out who our most valuable customers are — using not only that cleaned-up cohort_analysis view but also some functions we learned earlier on statistics. So we're focused on two functions for this lesson, and you may be asking: Luke, how the heck did you pick those out?
Well, if we go into the Postgres documentation, under the SQL language section we can see that under Functions and Operators there's a host of different ones. If you've followed along since the basics course, we've covered and touched on a lot of these, including conditional expressions. Navigating into that section, we can see that Postgres has four main types. The main one is CASE, which we covered back in basics, but there are two more we need to cover: COALESCE and NULLIF. Postgres also has GREATEST and LEAST; those functions have different capabilities depending on which database you're working in, and Kelly and I don't really use them much, so we're not covering them. Anyway, let's get into how we can actually use COALESCE and NULLIF with a very simple example — you don't have to follow along with this; I'm just doing it for demo purposes. The easy way to demonstrate is with a fake table I'm creating here: I'm calling it a data_jobs table, and it has three columns — technically four, I guess, if you count the id. What does it contain? Let's just run this query to see: we get a table with a job title, a column on whether it is a real job, and a final column on salary. Notice that there are some null values in here; we'll use COALESCE and NULLIF to clean these values up, depending on what we want. Let's say for this is_real_job column we wanted to fill in the null values — let's just assume the database administrator treated all null values as no, but we need them to actually say no. We can use the COALESCE function: it returns the first non-null value from a list of expressions. Right now we'll use one expression plus a default value, and move on to two expressions after that; ultimately, in our case, it's used to replace a null value with a default value. So here's a query that returns our original table; let's modify it to fill in null with no. I call the COALESCE function, leave the is_real_job column as expression one, and for the default value — the last argument — put in 'no'. Let's go ahead and run this bad boy — oops, forgot to put a comma — run it again, and okay: we now have a column simply named coalesce, after the function, and it's filled in with yes and no. A better practice would be to assign it an alias when we're done, so we can actually read it — and bam, we have the updated column title. Now, what's going on with that second expression in COALESCE? Say we wanted to fill in the null value for salary, but rather than a default value, we wanted to fill it in with another column — say the job title — matched up row by row, filled in from the appropriate row it comes from. Let me demonstrate: we use that COALESCE function again, leave salary in there, and for the second expression specify job_title; I'll leave the default value blank for right now, and give it the alias salary. Pressing Command+Enter to run it — I'll make this a little bigger — it says: error, COALESCE types integer and character varying cannot be matched. The problem is that salary is an integer and job_title is a string; anytime you're using COALESCE to have one column stand in for another, they have to be the same data type.
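Here's a self-contained sketch of this demo — the table contents are my own stand-ins, since the exact rows aren't shown on screen:

```sql
-- A throwaway table for demoing COALESCE and NULLIF
CREATE TEMP TABLE data_jobs (
    id INT,
    job_title TEXT,
    is_real_job TEXT,
    salary INT
);

INSERT INTO data_jobs VALUES
    (1, 'Data Analyst',    'yes',     90000),
    (2, 'Data Engineer',   NULL,     110000),
    (3, 'Prompt Engineer', 'kind of', NULL);

SELECT
    job_title,
    -- Replace NULL with a default value
    COALESCE(is_real_job, 'no') AS is_real_job,
    -- Fall back to another column; cast so both types match
    COALESCE(salary::TEXT, job_title) AS salary
FROM data_jobs;
```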
In this case, we'd have to cast salary as text (or varchar) for it to match job_title's type. Now when I run it, it actually works, and we have in fact filled in the appropriate job title where salary was null. You could still put a default value at the end — I'll just name it default value — but in our case it never comes up when running it. So let's reset this back to the original table so we can get into NULLIF. Say we had a scenario where we knew certain values weren't correct, or we didn't want them in there, and we wanted to turn them into nulls — like in this is_real_job column, where "kind of" isn't really an answer; maybe we want to make that into a null. NULLIF returns null if two expressions are equal, and otherwise returns the first expression. It's even simpler, in that either expression one or expression two can be a column or a single value. Let's jump into it: say we wanted to replace that "kind of" with null. We call our NULLIF function; is_real_job is expression one, and expression two is that "kind of" string. As usual, I'll give it the alias is_real_job. Running this — okay, we did in fact replace "kind of" with null. You also don't have to compare against just a single value; like I said, I could use an expression, so I could use another column — in this case salary. Once again I get an error message, and it revolves around a mismatch between the data types, which I can fix by casting salary as text. Running it with Command+Enter — bam. The point with NULLIF here is that, as it compares row by row, none of these values match, so it doesn't convert any of them to null; the one null value was always null anyway.
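As a sketch, continuing with the assumed demo table from above:

```sql
-- Turn a specific unwanted value into NULL
SELECT
    job_title,
    NULLIF(is_real_job, 'kind of') AS is_real_job,
    -- Comparing two columns works too, once the types match
    NULLIF(salary::TEXT, job_title) AS salary_check
FROM data_jobs;
```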
Let's jump into some real-world practice now. Previously, whenever we've done any of our analysis, all of our customer keys have conveniently had some sort of purchase associated with them. What we're going to demonstrate is that not all the customers in the customer table necessarily have an associated purchase, and when we merge the tables together, those customers can end up with null values. Now, if we run an average to find the average net revenue per customer, those null values won't be counted — but say we do want to count them, because, hey, they are customers, and we want them treated as zero instead. That's going to affect the average overall, and we'll demonstrate just how much it changes the average revenue per customer — quite a bit. Let's get into combining our customer keys with net revenue to expose the customers that don't have any purchases. As before, in our sales table we get the customer_key and the net revenue by multiplying quantity times net_price times exchange_rate; we're doing an aggregation, so we GROUP BY customer_key. Running this, we have net revenue for all these different customers. If I try to filter for any net revenues that are null, we'll see there are no rows below — so we hadn't been seeing this problem previously. But what we can do, with all of these revenues right here, is merge this onto our customer table, and that will expose the customers that don't have a net revenue. So I'm going to convert this into a CTE: use a WITH statement, call it sales_data, and assign it in parentheses. As always, I like to make sure this works, so I'll just SELECT * FROM sales_data and run it — yep, working below. Now, looking at our ERD, what we want to do is take the customer table we have here and merge the sales data onto it: the customer table is our A table and the sales data is our B table, so, using our customer table, we use a LEFT JOIN to join on the sales data. We'll move sales_data down as the left join target with the alias s, and the FROM will be FROM customer with the alias c. Let's run this to see if it's working — and I get an error message: syntax error at end of input. Basically, I didn't say on what we're going to merge: specifically, we're merging on the customer_key of both tables. Let's try running it now. All right, we have the customer keys and all the information from the customer table. We don't need all this information per se; we just want all the customer keys along with the net revenues — and as you can see, there are now null values in here, because there are customers with no net revenue. So let's modify what we're bringing in: from the customer table, the customer_key, and from that sales_data CTE above, the net_revenue. Running this — boom, a simplified version that's much easier to view. First, let's fill those null values with a zero, just to demonstrate it in a new column: we call that COALESCE function on net_revenue, replacing nulls with 0, run Command+Enter, and bam, we've got it over here — not bad. Now, to show the difference, we'll run one average on net_revenue alone and another average on the net revenue with zeros filled in — basically, all customers. I'll remove the customer_key so we're averaging over all of it, then call AVG on that first column and AVG on the second, zero-filled column. Running this, we can see the averages are quite different: the first is around 4,000, and the second is less than 2,000. These column names aren't very descriptive, so I'll name the first spending_customers_avg_net_revenue, because only customers who've spent money feed into it, and the second all_customers_avg_net_revenue. Running it again, we get more descriptive titles, and viewing it visually, we can see that when we look at all customers, the average net revenue is actually less. This was mainly done for demonstration purposes, because there may be situations where you do want to consider all customers; in our case we're going to consider only the spending customers in our analysis, not necessarily all customers.
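Assembled from those steps, the comparison query looks roughly like this:

```sql
-- Compare the average with and without non-purchasing customers
WITH sales_data AS (
    SELECT
        customer_key,
        SUM(quantity * net_price * exchange_rate) AS net_revenue
    FROM sales
    GROUP BY customer_key
)
SELECT
    AVG(s.net_revenue) AS spending_customers_avg_net_revenue,
    AVG(COALESCE(s.net_revenue, 0)) AS all_customers_avg_net_revenue
FROM customer c
LEFT JOIN sales_data s
    ON c.customer_key = s.customer_key;
```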
So that's COALESCE. We're not going to do a real-world example of NULLIF, because it's frankly very similar, just the opposite, if you will — but I do have practice problems for you to go through now and get familiar with both of these options. In the next lesson we're jumping into all the different functions for formatting strings, so with that, see you there.

In this lesson we're going further on data cleanup, specifically around strings and how to format them. We'll cover four key functions that I find myself using from time to time, and then we'll modify the view that we created for cohort analysis: we have columns for a first name and a last name, and we're going to combine them into one. Let's get into it. In the last section we were looking at functions and operators, specifically the conditional expressions; this time we're going back up to the section on string functions and operators. There's a host of different functions and operators we can use on strings, and the first one we're jumping into is LOWER, which converts a string to lowercase. Like any of these functions, it takes a string value, so I'll call the LOWER function and just put a string in there — my name in all uppercase. Running this, we can see it outputs in all lowercase below. If we have LOWER, we probably also have something like UPPER: running that, we can see it's all uppercase — this is usually for the case where you have some lowercase values and want to raise them all up. Next is the TRIM function. We'll focus on the standard form up here — the one down below is non-standard syntax, so we're not using it. Per the docs, it removes the longest string containing only the specified characters, which by default is a space. Let's look at this real quick to understand what's going on. In our example, if we use TRIM right now and run Command+Enter, there's no real change whatsoever. Now let's say there was a space at the beginning and a space at the end: running Command+Enter, you can see there are no spaces in the output, whereas if we run it without the TRIM function — I'll just remove it — we can see the spaces do in fact stay in there when I check it. So this function is very important, especially when you're working with databases with very dirty data and you need to remove stray spaces. Now let's say there were symbols in there — dirty data with symbols surrounding the value that we wanted to remove; in this case I have two ampersands around each side. When I run this, we still have them, but we want them gone. Going back to the definition: we can specify whether to trim the LEADING characters, the TRAILING characters, or BOTH — the front or back of the string — and BOTH is the default. Then we specify the character text, which is what we want to remove, and then FROM the string text. So inside here I can say BOTH, the ampersand sign I want removed, FROM this string. Running this, it removes both of the ampersands on the front and the back — though I notice my "e" went missing in here; fixing that and running it again, boom, now that's what we have.
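A compact sketch of those string functions — the demo strings are my own placeholders:

```sql
-- Case conversion
SELECT LOWER('LUKE');                   -- 'luke'
SELECT UPPER('luke');                   -- 'LUKE'

-- Trimming: spaces by default, or a specific character from both ends
SELECT TRIM('   luke   ');              -- 'luke'
SELECT TRIM(BOTH '&' FROM '&&luke&&');  -- 'luke'
```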
So what are we cleaning up with this view? Well, if we open cohort_analysis back up, under the Data tab we specifically want to focus on given_name and surname — first name and last name, if you will. We want to combine these into just one column; we don't need them separated for our analysis. Because of this, we're going to have to actually update our view: we're basically removing two columns and adding a new one, and since we're altering columns we can't just run CREATE OR REPLACE VIEW. We could run something like ALTER VIEW, but even that gets complicated, so I recommend we just start over with the query and drop the view. I'll copy the query with Command+C and paste it in here. Remember, we want to combine given_name and surname. I'll run this query first to show what's going on — pressing Command+Enter, I have this AS in front here that doesn't need to be there, so I'll move it up top and run again. Okay, we have all of it as we saw before, including given_name and surname. To combine these two, going back to the documentation on string functions and operators, scroll down to the other string functions: they have the CONCAT function, which concatenates the text representations of all its arguments, and null arguments are ignored. So for given_name and surname, I'll type out CONCAT with opening and closing parentheses around the two columns, and name the result cleaned_name. Running this, we can see the names are now combined — okay, not too bad. One thing to note, though: we need spaces in here. Let's add the space I'm talking about: a single quote, a space, a single quote, and a comma between the two arguments. Running Command+Enter, we can now see there's a space in between. But like I said, sometimes — especially with text columns — the names themselves may have extra spaces around them, so as good measure I'm going to put TRIM around both given_name and surname. Running this with Command+Enter, we now have all the values cleaned up in here, with some protection built in. Now we need to actually update the cohort_analysis view. Remember, if we just try to run CREATE OR REPLACE VIEW cohort_analysis right now, pressing Command+Enter, it won't work because of the column issues we addressed before — we need to drop the view first. So we call DROP VIEW on cohort_analysis and then run everything underneath it; I'll run it all by pressing Option+X, and it looks like both queries completed. As always, I'll close out of the view to make sure I get the most up-to-date version, click in here, press F5 to refresh, and open cohort_analysis. Scrolling all the way over, we can see we now have that cleaned_name inside of here, so our view is good to go for the project.
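A trimmed-down sketch of the rebuild — the real view keeps all the aggregation and window-function columns from earlier; only the name handling is shown here:

```sql
DROP VIEW cohort_analysis;

CREATE VIEW cohort_analysis AS
SELECT
    customer_key,
    -- TRIM guards against stray spaces before the two parts are joined
    CONCAT(TRIM(given_name), ' ', TRIM(surname)) AS cleaned_name
    -- ...plus the rest of the view's columns from before
FROM customer;
```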
All right, you've got some practice problems: go through and get more familiar with these text formatting functions. In the next lesson we're jumping to another question for the project, on customer segmentation — looking forward to it, see you there.

Welcome to this third and final lesson in this chapter on data cleaning. For this we're focusing on question one of our project, which builds further on the analysis we did earlier on segmenting customers, and on some of our discussions of customer segmentation. Specifically, we're trying to find out who our most valuable customers are, and for this we're going to break our customers into tiers using percentiles: high-value, mid-value, and low-value customers. This is a very typical business process you'd find yourself doing in order to target certain customers and then distribute marketing that fits their needs — and shout-out to Kelly for coming up with this example, because I feel it's a really good demonstration of what you'll find yourself doing as a data analyst. As always, I like to start with the final data set we'll be producing. We'll be calculating, per customer_key and that cleaned_name we built in this chapter, their total LTV — lifetime value, or, if you will, their total net revenue — and then, based on those values, we'll use percentiles to categorize each customer as low-value, mid-value, or high-value. We'll also take this calculation a step further and analyze the segments themselves — not just producing the names so marketing can target these customers, but actually understanding the values: what are the percentages of these different segments, how much is each contributing, and so on. So let's start a new SQL script documenting this analysis. Similar to before, where we had our own script for the earlier question, I'll start a new SQL script, then go up here and rename it to 1_customer_segmentation.sql. Okay, that should be good. One thing to note: this lands inside our Scripts folder, but we don't necessarily want it there. If I right-click it and go to show resources in Explorer, under Scripts I can see it's right here, and I actually want it higher up, so I'll move it out (on Windows you'd do something similar with File Explorer). Down here it's not showing up yet — I can press F5, and that SQL script disappears from Scripts, but now I'm seeing this one, and the README is popping up too; I guess I hadn't refreshed before either. I can now open this back up; this is the SQL file we want to work with. So, the first three columns of interest: we want the customer_key, we want that cleaned_name, and from there we want the revenue for each of these customers — the total lifetime value — so we use a SUM of total_net_revenue and alias it total_ltv. Just as a reminder, going back to cohort_analysis, we can have multiple entries per customer in there — like customer 180, who made multiple purchases on different days, each with its own total_net_revenue — and that's why we're naming this new column total_ltv: for 180 we'll now have the total lifetime value. We want this FROM our cohort_analysis view, and since we're doing an aggregation, we need a GROUP BY on customer_key and also cleaned_name. Let's run this and see what we have — I have no active connection, so a reminder: if you were connected before, reconnect to your database; I just need to reselect it up here. Now everything's looking good; let's try to run this again. All right, bam: this is what we want — cleaned_name, customer_key, and the total_ltv for each. I can even do an ORDER BY — descending, or actually I want ascending — and see that, hey, customer 180 is now combined into one row. Looking good.
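The per-customer lifetime value query, as a sketch:

```sql
-- Lifetime value (total net revenue) per customer
SELECT
    customer_key,
    cleaned_name,
    SUM(total_net_revenue) AS total_ltv
FROM cohort_analysis
GROUP BY customer_key, cleaned_name
ORDER BY customer_key;
```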
Now that we have this total_ltv, we can bucket these customers into high-value, low-value, and (the other one) mid-value, and we're going to do this with percentiles, using the 25th percentile and the 75th percentile. Because we're running a percentile on top of that aggregation, I'm going to put the query into a CTE. We'll call it customer_ltv and wrap it in parentheses. From that CTE we run the PERCENTILE_CONT function; remember, we're doing the 25th and the 75th percentile, and basically everything between the 25th and 75th percentile is our middle tier. So I'll do 0.25, then the WITHIN GROUP nomenclature, because we want it computed over a particular ordering, and we need to ORDER BY the total_ltv column to make sure we're pulling out the correct value as it sorts through, and we'll give it the alias ltv_25th_percentile. Let's make sure this is right: FROM customer_ltv, press Command+Enter, boom. The 25th percentile is at 843; I've done this already, so I know that's correct. Now for the 75th percentile I'm going to copy the whole expression, since it only takes a couple of changes, change the 2 to a 7 in the places it appears, and run it. Boom. So just so you understand what's going on here: customers who've spent around $843 sit at the 25th percentile, so if you spend less than that you're below the 25th percentile, whereas the 75th percentile sits around $5,500, and if you spend more than that you're above the 75th percentile. Those will be our high-value customers, and those below the 25th will be our low-value customers.
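As a sketch, the two percentile checks look like this; PERCENTILE_CONT is Postgres's ordered-set aggregate, and customer_ltv is the CTE holding one total_ltv per customer (names assumed from the lesson):

```sql
SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS ltv_25th_percentile,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS ltv_75th_percentile
FROM customer_ltv;
```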
Now, because we have those percentiles, we can build on the customer_ltv CTE, convert the percentile query into a CTE as well, and categorize (bucket) each customer with a CASE WHEN as high-value, mid-value, or low-value. So I'll tab the percentile query over, add a comma, and name this second CTE customer_segments, once again with opening and closing parentheses. From there we SELECT: I want all the customer columns from customer_ltv, so I'll give it the alias c and do c.*, with FROM customer_ltv down below (aliased c), which clears that little syntax error. Let me confirm everything appears correctly down here: I have a stray comma after this to remove, and now it renders. Next we build the CASE WHEN to bucket everyone into their tiers (I'll move the panel down so it stops cutting things off). We do CASE, then WHEN, and the first bucket is everything below the 25th percentile as low value: WHEN total_ltv is less than ltv_25th_percentile, which, I'm realizing now, we haven't brought in. We don't actually need a JOIN for this: since we're not joining on any key, I can just list customer_segments after a comma (a cross join against a one-row result) and give it the alias cs, so cs.ltv_25th_percentile is now available. Then we assign the label: '1 - Low-Value'. We stick a 1, 2, 3 at the front purely to make sorting easier if we ever want to sort it; if you use only the plain names and try to sort alphabetically, the ordering doesn't come out the way you'd want. On to the next one, and I'll just copy this since a lot of it is repetitive. For mid-value we want everything that is less than or equal to the 75th percentile; that mid range encapsulates everything from the 25th percentile up to and including the 75th, and I'll change the label to '2 - Mid-Value'. And since we've now covered everything at or below the 75th percentile, everything else falls through the ELSE as high value. All right, I'll remove this extra space, run it, fingers crossed... and I see my issue: a syntax error at or near WHEN, because I don't have a comma before the CASE, and I also never gave the CASE statement an alias; we want to name it customer_segment. Run it again. OK, it's working. Now let's make sure it's categorizing correctly. I'll select just the percentile query and press Command+Enter... ah, it won't let me run it in isolation like this, so I'll copy it into a new script (you don't have to do this; the only purpose is to see the numbers), paste it in, and run it. And I'm silly, I need the other CTE too. Remember, it was $843 for the 25th percentile and $5,500 for the 75th. Running the complete query, do these labels make sense? Yes: this second entry is above $5,500, so high value checks out; these middle ones fall in between; and scrolling for a low value, OK, here's one down here at $23, falling in correctly. The data looks like it's calculating correctly. This table would be great to export and send to our business colleagues, so they can find the people we want to target and send campaigns tailored to each segment depending on the strategy. (Also, I don't know why my results window is rendering so wide; on Mac, let me know if you're having the same problem, because it shouldn't be this big.) Anyway, I'm not going to export it right now.
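Putting the whole thing together, here's a sketch of the segmentation query under the same assumed names; note the comma after customer_ltv c, which is a cross join against the one-row percentile CTE:

```sql
WITH customer_ltv AS (
    SELECT customerkey, cleaned_name, SUM(total_net_revenue) AS total_ltv
    FROM cohort_analysis
    GROUP BY customerkey, cleaned_name
), customer_segments AS (
    SELECT
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_ltv) AS ltv_25th_percentile,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_ltv) AS ltv_75th_percentile
    FROM customer_ltv
)
SELECT
    c.*,
    CASE
        WHEN c.total_ltv < cs.ltv_25th_percentile  THEN '1 - Low-Value'
        WHEN c.total_ltv <= cs.ltv_75th_percentile THEN '2 - Mid-Value'
        ELSE '3 - High-Value'
    END AS customer_segment
FROM customer_ltv c,
     customer_segments cs;  -- safe cross join: customer_segments is a single row
```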
Instead, let's dive a little further and analyze how much these different customer segments contribute: what their customer revenue is, and not only individual and average customer spending but total revenue too. So I'll convert this into a CTE as well (CTEs on CTEs on CTEs), named segment_values, once again with opening and closing parentheses. First we'll just pull the customer_segment column FROM that new segment_values CTE and make sure it's working; yes, it's listing the segments all the way down. The first thing to compute is a sum of all the revenues, so we run SUM on total_ltv and name it, conveniently, total_ltv, since it's a summation, and because we're aggregating we also need a GROUP BY on customer_segment. Run it. Not too bad: this tells us the total amounts, and our high-value segment is at 135 million, whereas our low-value customers have contributed only 4 million. If we plotted this on a pie chart to see how much each actually contributed (I only recommend pie charts for three or fewer values), we'd see that low value is... oh my gosh, we've got to do something here and target these low-value customers better. This is great analysis: mid-value is around 33%, which you'd expect to be about a third, and high-value is almost two thirds, which is really high. Based on this one data point alone, there's evidence enough that we need different marketing strategies, especially for our low-value customers. And because we put those numbers at the front of the customer_segment labels, we could also do an ORDER BY on customer_segment in descending order; running that puts high value on top, then mid, then low. That's exactly why we prefixed the numbers: so we can sort more easily. Let's calculate a couple of other things. Mainly I want the number of customers in each segment, so we can then work out the average LTV for a customer in a given segment. For the count we use the customer key, aliased customer_count. Running this, we get the counts, and as expected the low and high buckets should be roughly equal, since there's 25% on each side, while the middle one should be about double those, since it spans the 50% between the 25th and 75th percentiles. Now that we have this total_ltv and this customer_count, we can divide the two to get the average customer value: I take the SUM, divide it by the count, and alias it avg_ltv. Bam. Closing out of this, it's pretty interesting: our high-value customers are spending on average around $11,000, whereas our low-value customers are only around $350. Pretty substantial, and probably why those low-value customers contributed only around 2 or 3% of total revenue.
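With the segmentation query above wrapped as a third CTE named segment_values, the rollup is a sketch like this (same assumed names):

```sql
SELECT
    customer_segment,
    SUM(total_ltv)                      AS total_ltv,
    COUNT(customerkey)                  AS customer_count,
    SUM(total_ltv) / COUNT(customerkey) AS avg_ltv
FROM segment_values
GROUP BY customer_segment
ORDER BY customer_segment DESC;  -- the 1/2/3 prefixes make this sort high to low
```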
Now, there's a host of different marketing strategies you could apply here; feel free to pause the screen and look into each of these ideas. This isn't a master class on marketing strategies, it's on data analytics, so we won't spend too much time on it, but I did want you to understand some of the capabilities this powerful data gives you. There are no practice problems for this lesson, but I do expect you to go through and update your project README. Specifically, I added a short description of what we're doing with the different segments, a link to our SQL script, and the visualization I showed you earlier, since that was the key insight we got from this. I broke down the statistics, talking about the high-value, mid-value, and low-value segments, how much each contributes, their share of revenue, and the big disparity there, and then wrapped it up with business insights: what we could potentially do to target high-value, mid-value, and low-value customers. All right, it's your turn to go through and update all of that. Next we're jumping into the chapter on query optimization, where we'll not only cover query optimization but also answer our third and final project question. See you there.

Welcome to this chapter on query optimization. For this we have three lessons. The first two are focused on understanding how to use the EXPLAIN keyword along with query optimization techniques: the first covers the basics, the second jumps into more intermediate and advanced ones, and in the third lesson we'll wrap it all up with the final problem for our project. In this video and the next one we'll walk through query optimization techniques, and I have a list here of beginner, intermediate, and advanced techniques. You should already be familiar with the beginner ones, but we'll do a refresher in this video, then jump into intermediate and advanced in the next lesson, all while using EXPLAIN and EXPLAIN ANALYZE to break each one down. The third lesson is conveniently about the third and last question in our project: retention analysis, analyzing who hasn't purchased recently. We'll get a visualization similar to this one, break it down by the different cohort years, and see how many customers are active versus churned (didn't purchase anything recently). This is a super common business concept to understand, and coincidentally Kelly was just telling me she was actually implementing it at her job today. So let's get into breaking down these keywords: EXPLAIN and EXPLAIN ANALYZE. Each of them just goes at the beginning of a SQL command, whatever you're running. What's the difference between the two? EXPLAIN shows the execution plan without actually executing the query, whereas EXPLAIN ANALYZE does execute it, so we can see actual execution times. Say we have this simple query, SELECT * FROM sales. I can put EXPLAIN at the beginning, run it with Command+Enter, and it tells me the query plan (I'll break that down in a second); notice it's one row of output. When I run EXPLAIN ANALYZE we get two more rows, most importantly not only the planning time but also the execution time. Now, you may be thinking, "Luke, when the heck would I use plain EXPLAIN if it doesn't even tell me the execution time?" Well, say you're working with an extremely large database, millions or even billions of rows. Actually executing that query could be extremely cumbersome and cost a lot of money, not just time. There may be cases where you only want EXPLAIN. But since the database we're working with is small, we're going to always run EXPLAIN ANALYZE on our queries so we can also see execution time. So let's break down what this output actually provides. The first row says Seq Scan, which means a sequential scan, basically going row by row by row, and it specifies that it's doing this on sales. Then inside the parentheses there are three values. Cost is just an arbitrary unit assigned by Postgres; if we're being real, it's made up, but quantitatively it remains consistent. The one thing to remember is that cost doesn't scale linearly with time: if one query has a cost of 500 and another has a cost of 1,000, the second isn't necessarily going to take double the time; it's just going to take longer.
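Side by side, the two forms look like this; the only difference is that the second actually executes the query and reports planning and execution times:

```sql
-- plan only: nothing is executed, so it's cheap even on huge tables
EXPLAIN
SELECT * FROM sales;

-- plan plus a real run, with actual row counts and timings
EXPLAIN ANALYZE
SELECT * FROM sales;
```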
Within that cost, you can see the syntax of a starting value, then two dots, then a second value: that's the startup cost followed by the total cost of the query. Next are the rows, the estimated number of rows, and finally the width, which is the row size in bytes. Going back to our original query, we can see a total cost around 4,500, almost 200,000 rows, and a width of 68 bytes. Ultimately this query took about 30 milliseconds to run, and the planning time, i.e., how Postgres decided to execute the query, took less than a millisecond. All the times we'll deal with here are in milliseconds, so you may be thinking, "Luke, why does this even matter? We're talking about 30 milliseconds." Once again, this comes into play when you're dealing with databases of millions and billions of rows: those milliseconds stop being milliseconds, and I've had queries run as long as an hour, so query optimization is a must to understand. Expanding the output, there are a few other things to cover. Alongside the execution time we have the actual time, the rows, and loops. Rows remain the same in this case; loops is a more complicated topic that mostly matters for plans with repeated inner passes, especially when we do joins, so we won't worry about it too much. What I do want to focus on is the actual time: it tells us this step started producing output at 0.17 milliseconds and finished at 14.8 milliseconds, and the time after that in the execution was spent returning and displaying the results. It's also worth noting that when I run plain EXPLAIN, it doesn't include those extra actual-time figures, but we still get the same cost, rows, and width. OK, let's build on this by calculating the net revenue per customer and seeing how the query plan changes. For this we need the customer key, and then as usual we do a SUM, summing quantity times net price times the exchange rate, aliased net_revenue. Let's run it and... I got ahead of myself: any time we do an aggregation we need a GROUP BY, so I'll specify the customer key and run again. Now there are actually two steps in this plan, each denoted by an arrow. Somewhat counterintuitively, the first step is the most indented one: the sequential scan on sales, pulling all 199,000 rows we need, which took about 9 milliseconds. Then the next step up is the HashAggregate; it uses a hashing scheme to perform the aggregation (we're not going to go into hashing right now; the important thing is to understand that our SUM function is an aggregation), and it reports running from 54 to 56 milliseconds, so about 2 milliseconds of work.
And that step operates on only 49,000 rows, because the GROUP BY on customer key condenses the row count. Underneath it there's other information, like the group key and how much memory was used. Ultimately this query took slightly longer than the previous one; we're up to about 57 milliseconds total. Let me add just one more thing: a filter. We'll say we want orders only from 2024, so order date greater than or equal to January 1st, 2024. Run it and... we get an error. Why? Because I put the clause in the wrong place, and that ordering mirrors execution: even in the execution plan, we filter the sales table to 2024 first, and then perform the GROUP BY and aggregation. We can actually prove this with the execution plan: in the first step, the sequential scan itself applies a filter on those 2024 dates, and after filtering we're down to about 10,000 rows, which then feed into the aggregate for the GROUP BY and our SUM. That stage runs from 27 to 28, so well under a millisecond of work, and because of the WHERE clause the overall execution time is much shorter. So we're not just learning how to read query plans; we're also understanding why keywords like WHERE and GROUP BY go in the order they do.
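Here's a sketch of that filtered aggregation, assuming the Contoso-style column names used in the lessons (customerkey, orderdate, quantity, netprice, exchangerate):

```sql
-- the WHERE filter is applied during the sequential scan, before the
-- HashAggregate ever sees a row, which is why it must precede GROUP BY
EXPLAIN ANALYZE
SELECT
    customerkey,
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
WHERE orderdate >= '2024-01-01'
GROUP BY customerkey;
```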
It's worth noting that while we've been typing EXPLAIN and EXPLAIN ANALYZE, DBeaver also has a built-in Explain Execution Plan feature. If I try to use it while EXPLAIN ANALYZE is still at the top of my query and click OK, it errors, because EXPLAIN is already in there, so select just the query you want and then use Explain Execution Plan, or the shortcut Command+Shift+E. A dialog pops up asking what you want to do; if you don't want the ANALYZE behavior, leave that unchecked. I usually keep all of these checked, including Analyze, then click OK. I find this output slightly less descriptive but better organized: it's still in the same order, sequential scan as the first step and then the aggregate, reading from the bottom up, but the times and costs are in a more readable format than the raw plan. It's there if you want to use it that way; we're going to continue using EXPLAIN ANALYZE throughout the remainder of this video and the next, because I find it more useful. Last thing to note: if you encounter keywords in a plan you don't know, the best thing to do is copy the whole thing into your favorite chatbot, paste it in, and have it explain what's going on step by step. So, for the remainder of this video we're going over some beginner optimization techniques that we've touched on briefly throughout this course, but now using EXPLAIN ANALYZE to prove why you should be doing them. In the next lesson we'll go into more intermediate techniques and briefly cover some advanced ones to further level up your optimization skills. For the basics we're covering three techniques, with worked examples on the first two, LIMIT and avoiding SELECT *, and a brief discussion of the third, using WHERE instead of HAVING. We'll start with the easiest one to prove out: using LIMIT. If I just run SELECT * FROM sales on the entire sales table with Command+Enter, it takes around 18 milliseconds, and if I run it a few times you'll notice the time jumps around; on average it looks like around 24 or so. I have this in a "not optimized" SQL script, so I'll come over to the "optimized" script so we can compare before and after, and add a LIMIT, say 10. With that we get results in about 0.3 milliseconds, and running it a few times it stays pretty consistent around 0.3, compared to almost 21 before. So there's your actual proof that LIMIT statements are very helpful in minimizing the amount of data and saving you time. Next is SELECT *. You've heard me say time and time again, "I don't recommend using SELECT * to select all the columns of a table." So let's try to "optimize" this: I'll list just one column, customer key, and run it... actually, running it a few times, it's coming in over 30 milliseconds, whereas the not-optimized SELECT * version runs in under 30. What the heck is going on? Well, in some cases like this one, Postgres can retrieve data very efficiently with the SELECT * form, so yes, SELECT * can be more efficient in some cases. But I'm still sticking with my recommendation: especially as you get into bigger databases with millions and billions of rows, list only the column or columns you actually need for your analysis rather than using SELECT *. The last one to look at is using WHERE instead of HAVING, and unfortunately you may not always have control over just easily switching between them. Say we have this query getting the customer keys along with their net revenue. If we want to filter this data based on the aggregated net revenue, maybe values higher or lower than some threshold, we have to use a HAVING clause; say HAVING greater than 1,000. And if we remember the order of operations, the sequential scan runs first, pulling all 199,000 rows, and only in the aggregate step does the filter get applied to those 199,000 rows. Now, going back to the original query to demonstrate the WHERE clause: yes, HAVING's whole benefit is filtering on aggregations, but we may have a case where we can alter the query to filter on a raw column instead, say customer keys less than 100, so no aggregation involved. The main point with this one is that the filtering happens in the sequential scan right at the beginning, and as the plan shows, because it limits how many rows flow through, our execution time is a lot shorter. Now, you may be saying, "Luke, you're comparing WHERE on customer key less than 100 against HAVING on the aggregation greater than 1,000." Yes, I know these aren't directly comparable, but there may be a situation where, wanting net revenues above a certain value, you could instead filter by the corresponding customer key values. Mainly what I'm getting at is: if you have a choice where you could modify the query to filter with WHERE instead, you need to take advantage of it.
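As a sketch of those three basics under the same assumed column names (note the two filters below are not logically equivalent; the point is where each one runs):

```sql
-- LIMIT: cap the rows you pull back
EXPLAIN ANALYZE
SELECT customerkey FROM sales LIMIT 10;

-- HAVING: filters after the aggregate has scanned all rows
EXPLAIN ANALYZE
SELECT customerkey, SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY customerkey
HAVING SUM(quantity * netprice * exchangerate) > 1000;

-- WHERE: filters during the scan, so far fewer rows reach the aggregate
EXPLAIN ANALYZE
SELECT customerkey, SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
WHERE customerkey < 100
GROUP BY customerkey;
```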
All right, you now have some practice problems to go through to get more familiar with EXPLAIN, EXPLAIN ANALYZE, and the Explain feature inside DBeaver, along with testing out the basic techniques we just went over. In the next lesson we'll cover some intermediate and advanced techniques along with a real-world practice problem. All right, with that, see you there.

Welcome to this lesson on optimization techniques. We're picking back up where we left off and jumping into intermediate techniques; we'll also briefly cover advanced techniques, but overall those are outside the scope of this course, and you'll see why. Then at the end we'll optimize the query we built in the last chapter on data cleaning, basically applying all the techniques we've learned to make a query run faster. So what are we covering for these intermediate techniques? There are four, but the first one we've really been using already in the last lesson and this one: reading query execution plans, basically using EXPLAIN, EXPLAIN ANALYZE, or even DBeaver's built-in options. Beyond that there are three other scenarios: minimizing GROUP BY, reducing joins when possible, and optimizing ORDER BY. So say we have this query pulling things like the customer key, order date, order key, and line number, plus an aggregation of the net revenue, which I need to alias as net_revenue. Let's run it just to see what's going on. We can see two main steps, a sequential scan and then the aggregate with our GROUP BY, but the execution time, even over a few runs, is pretty high up there, sometimes as much as 100 milliseconds. Once again we're dealing with milliseconds, but on databases with millions and billions of rows this can easily turn into seconds, and as soon as queries take longer than one or two seconds, they get annoying. Looking at the query itself (I'll select and run just the query), we may not actually need every individual line number, so what if we remove line number? I'll copy this over to the optimized area and take it out of both the SELECT and the GROUP BY, then run it (oops, a little typo, try again). OK: we now have a much lower execution time, around 67 milliseconds, and consistently under 100. That came from removing just one GROUP BY column. So an important question to ask is whether every grouping column is really necessary; they can get costly, and in this case we saved almost half the time, if you will, just by removing one.
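A sketch of the trimmed version (the not-optimized one also selected and grouped by linenumber; column names assumed as before):

```sql
EXPLAIN ANALYZE
SELECT
    customerkey,
    orderdate,
    orderkey,            -- linenumber removed from here...
    SUM(quantity * netprice * exchangerate) AS net_revenue
FROM sales
GROUP BY
    customerkey,
    orderdate,
    orderkey;            -- ...and from here, cutting one grouping column
```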
The next concept is minimizing the number, and also the types, of joins in a query. In this case, say we're joining multiple tables into our sales table: the customer, product, and date tables. I'll run just the query itself, and we can see we get a lot of information back. Running the full query with EXPLAIN ANALYZE, it comes in over 100 milliseconds on the first run, settling around 80 over a few runs; pretty intensive either way. Looking back at the original query, we're only pulling in the year from the date table. What happens if we remove that inner join and just add a way to extract the year out of the order date instead? I'll copy everything, paste it into the optimized script with the inner join already removed, and write an EXTRACT function using YEAR FROM s.orderdate, aliased year. We can clearly see when we run it that this provides exactly the same year information. Running the complete query (after moving the EXPLAIN ANALYZE up), this version holds around 70 milliseconds, versus roughly 80 or 90 for the original, so we're getting something like a 10% gain in performance just by replacing an inner join with a function. It's also important to consider not just the number of joins but the type of joins, and we'll get into that with the practice problem coming up in a bit. The last major concept is optimizing your ORDER BYs. Sometimes an ORDER BY isn't something you can just drop, but when you can trim it, it will save you some time. Specifically here, let's print out the query with Command+Enter: I'm getting customer key, order date, order key, and net revenue, and the ORDER BY sorts by net revenue first, followed by customer key, then order date, then order key. Rarely do I find that ordering by every column is really necessary. There are a few ways to optimize an ORDER BY. The first and easiest is just limiting the number of columns in it. The second is avoiding sorting on computed columns or function calls. The third, probably the most intuitive, is placing the most selective columns first: basically, whatever narrows the rows down fastest should come first. And finally, sort on indexed columns to leverage existing database indexes; unfortunately we usually don't control indexing, a database administrator does, but if you have indexes, use them. Let's go with the recommendation about function calls. With EXPLAIN ANALYZE at the top I run it and see around 90 milliseconds. Now I'll take the exact query, put it in the optimized script, and remove the computed net revenue (the SUM) from the ORDER BY. Running this one, I'm seeing a lot less, usually in the 70 to 80 millisecond range, so that alone cut roughly 10 to 20%. Maybe we can trim even further, down to ordering by just the customer key; in this case I'm not finding much of a difference, it even gets as high as 90, so that change isn't as impactful as removing the aggregation, but overall the trimmed ORDER BY has an appreciable impact compared to our not-optimized query.
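A sketch of the join-free version with a trimmed ORDER BY, under the same assumed names:

```sql
EXPLAIN ANALYZE
SELECT
    s.customerkey,
    s.orderdate,
    EXTRACT(YEAR FROM s.orderdate) AS year,  -- replaces the inner join to the date table
    SUM(s.quantity * s.netprice * s.exchangerate) AS net_revenue
FROM sales s
GROUP BY s.customerkey, s.orderdate
ORDER BY s.customerkey;  -- one plain column instead of four, nothing computed
```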
So let's get into optimizing the view we previously built: the cohort analysis. We can find it under Databases, under our dataset, under the public schema, then Views, and there's cohort_analysis. We can look at the query itself under Source, and it's all there; I'll copy it and put it into this script under "not optimized." I'm not actually creating or replacing any view here, so I'll remove that line and put EXPLAIN ANALYZE at the top instead. I'm not liking how it's formatted, so I'll highlight it all and use Format > Format SQL to break it out more the way I like. Now let's run the EXPLAIN ANALYZE to see what we're working with for our current execution time... and I have a typo, because I didn't get rid of the AS at the front; run again. OK, boom. Wow, that's a pretty hefty query plan, and some of the highest execution times we've seen so far, around 200 milliseconds to get this done. It looks like six distinct steps in total, which seems about right given the CTEs and all the different GROUP BYs in here. Recalling what we just covered about improving a query: the CTE has a bunch of GROUP BYs and also a join, so that may be optimizable, and then looking down at the bottom, the main query itself is just a simple SELECT and FROM, so there aren't a lot of techniques I can apply there. So primarily we'll focus inside the customer_revenue CTE, and the first thing is the join. We've discussed minimizing joins, but just as important is understanding when you should be using which type of join. A lot of this course and the previous one used either LEFT JOINs or INNER JOINs. A LEFT JOIN is specifically for when you want to keep all the rows of table A, filling in NULLs wherever table B has no match, so none of A's rows get removed. But if we know the key we're matching on has matches in both the A table and the B table, so no NULLs would be produced, an INNER JOIN is slightly more efficient, because the engine no longer has to handle the unmatched-row case. So I'll copy the query into the optimized section and change LEFT to INNER. First I want to verify the output itself: pressing Command+Enter and checking the total row count down here, it's around 83,000; going back to the LEFT JOIN version and running it, the row count is the same. So both versions return identical results, one with a LEFT JOIN and one with an INNER JOIN. Now the question is whether the EXPLAIN ANALYZE is quicker. For the not-optimized version with the LEFT JOIN, it's around 200 milliseconds... and the optimized one is, well, around the same 200 milliseconds. So although it didn't further optimize our query in this case, and it basically breaks even, there are cases where using an INNER JOIN instead of a LEFT JOIN, when appropriate, can save you execution time.
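The swap itself is a one-word change; a sketch with hypothetical join columns:

```sql
-- before: LEFT JOIN keeps sales rows even when no customer matches (NULL-padded)
-- after:  INNER JOIN skips that NULL handling; row counts match here because
--         every sales row has a matching customer
SELECT s.customerkey, s.orderdate, c.givenname, c.surname
FROM sales s
INNER JOIN customer c ON c.customerkey = s.customerkey;
```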
So we talked about joins; the other thing we can do to optimize this query has to do with the GROUP BY. Look at this: we have a bunch of GROUP BYs, and based on what we're aggregating by, a lot of the values repeat, namely country full, age, given name, and surname. What do I mean? Look at the rows for, say, customer 180: yes, the order date changes, and yes, the total net revenue and number of orders change, but things like their country, their age, their cleaned name, their first purchase date, and even the cohort year do not change per customer. So why are we grouping by columns that never vary? We really only care about grouping by the customer key and the order date. What we can do, like we said, is minimize the GROUP BY, though the query won't run as-is anymore; for each constant column we instead wrap it in an aggregation function. We just need the MAX value of each (you could use MIN; it's very common to just use MAX), and I'll do this for country full, age, given name, and surname. It's also important to give each one its alias back, so I'll re-alias country full, age, given name, and surname. First, let's run the query and make sure it's working properly: scrolling over, everything has stayed the same for country full, age, and the given name, that final cleaned name we built. (I think I called out first purchase year and cohort year when I first talked about this; they actually have nothing to do with this GROUP BY, since they're computed lower down.) Now let's run the EXPLAIN ANALYZE and see how long it takes: we've got the execution time down to about 160 milliseconds, where previously it was around 200. Now that we have this optimized query that takes less time, let's update the view with CREATE OR REPLACE VIEW cohort_analysis AS. We're not changing any columns, so technically this should work without a drop... I'll run it, and it's telling me it's changing a data type, so in fact we do need to drop the cohort_analysis view first. We'll execute the entire SQL script so it runs both statements, press F5 to make sure it's refreshed, open up cohort_analysis, and I can see we have those MAX values, we've minimized the GROUP BY, and we've changed the join. Notice it actually saved it as a plain JOIN, which defaults to an inner join, so that makes sense.
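Here's a sketch of the reworked CTE body, with hypothetical column names; the idea is to group only by what actually varies and recover the per-customer constants with MAX():

```sql
SELECT
    s.customerkey,
    s.orderdate,
    SUM(s.quantity * s.netprice * s.exchangerate) AS total_net_revenue,
    MAX(c.countryfull) AS countryfull,  -- constant per customer, so MAX() is safe
    MAX(c.age)         AS age,
    MAX(c.givenname)   AS givenname,
    MAX(c.surname)     AS surname
FROM sales s
INNER JOIN customer c ON c.customerkey = s.customerkey
GROUP BY s.customerkey, s.orderdate;  -- only the columns that actually vary
-- reminder: CREATE OR REPLACE VIEW fails when a column's type changes,
-- so run DROP VIEW cohort_analysis; first, then recreate the view
```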
Now, the last things to cover are the advanced optimization techniques. We're not going to walk through any of these, but you should be aware of them. There are three major ones: using proper data types, basically storing numbers as integers or numerics rather than something like a string; using indexing to speed up your queries, basically relying on columns with indexes built in so they can be searched and sorted more quickly; and, for large tables, partitioning to improve performance. All three of these are controlled by database administrators: they control the data types, they control whether columns have indexes, and they control how data is partitioned. We're not going into them because it's really specific to whether your database administrator has set them up. If you ever run into a situation where queries are running excessively long, and you plug the plan into ChatGPT and can't find any answers except ones about indexing, partitioning, or data types, that's when you go to your database administrator and ask them to make changes to your table so you can get more efficient queries. Hopefully you have a good one; I have in the past, and being able to go directly to them and get what I needed saved me a lot of time in the long run. All right, you have some practice problems now to get more familiar with those intermediate techniques and with optimizing a query, along with using EXPLAIN and EXPLAIN ANALYZE again. With that, I'll see you in the next one, where we get into our third and final problem in this project. See you there.

Welcome to this last lesson in the chapter. For this we're tackling our third and final question: performing retention analysis, specifically looking at who hasn't purchased recently. We'll use the terms active and churned customers, look at it in total overall, and then break it down into the different cohort years to see how it's trending over the years across these cohorts. So we're trying to identify which customers haven't purchased recently; the technical business framing is that we're identifying active versus churned customers. For us, active customers are those who have made a purchase within the last 6 months, whereas churned customers haven't made a purchase in over 6 months. Now, 6 months isn't always the hard-and-fast line between active and churned; it really depends on your industry and maybe even other factors. As a general rule of thumb I have these four different areas, and Contoso falls into e-commerce, where we'd typically see a 6-to-12-month window since last purchase, whereas something like a mobile app turns over much more quickly, so they'd use something like 7 to 30 days since last session to identify active versus churned customers. And you may be wondering why this even matters. Well, we can send off the data we calculate about whether a customer is active or churned and run specifically targeted marketing campaigns to get them to re-engage. Also, when we look at it holistically toward the end, with those percentages per cohort, we can gauge the effectiveness of previous campaigns at maintaining activity and preventing churn. Overall this is about tracking our customer retention and engagement, which is necessary because we know we have customers who have bought from us before and are likely to do it again, and we need to take advantage of that. So what are we working toward? We want to build a table with information like our customer key and our cleaned name, along with each customer's last purchase date, whether that purchase fell within the last 6 months, and a resulting classification of churned or active.
To make this easier, we're going to use the cohort_analysis view we've been working with, since it has all the information we need to extract. Now, like our last two problems, I want to work in a script that we'll save as a final deliverable to upload into the project we're putting on GitHub; right now we have our files for questions one and two. So I'll go to VS Code and, inside the project, create a third file named retention_analysis; remember, this is a SQL file, so I add .sql and hit Enter. We're not going to edit it in there; I just wanted to create it. Going back into DBeaver and clicking inside here, I press F5 to refresh, and we now have our SQL file right here to work in. Now, what are we querying? Let's go back to cohort_analysis and open it up. Things we definitely want are the customer key and that cleaned name; additionally we'll use the order date, and we can also use the first purchase date, which will be used for some filtering I'll explain later. Let's start defining all this: a SELECT statement with the customer key, the cleaned name, the order date, and the first purchase date, all FROM cohort_analysis. Run it... looks like I have no active connection again, so we'll update it real quick to select the right data source, then run. All right. Now, let's target this customer 180 specifically: we have these two purchases. We already have a column for the first purchase date, but what we really want to know is when their last purchase was made, to understand whether they bought within that six-month period. So we need a way to identify, numerically, which purchase is the most recent, and we can do this using ROW_NUMBER and partitioning. Right after order date I'll add ROW_NUMBER(); we want a partition, so we use the keyword OVER, then inside the parentheses PARTITION BY, specifically partitioning by the customer key. And we don't want it assigning numbers willy-nilly; we want them ordered by the order date, so we add ORDER BY on order date. We'll name it row_number, which you'll typically see written as rn. Run it, and not bad. Looking at customer 180 again, their most recent purchase is currently numbered 2, counting up one, two; we actually want the opposite, with the most recent purchase numbered 1. So I change the ORDER BY to descending, run again with Command+Enter, and now it numbers in that manner and orders it. Double-checking another customer, 387: everything's looking good.
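A sketch of the numbering step, assuming the view columns customerkey, cleaned_name, orderdate, and first_purchase_date:

```sql
SELECT
    customerkey,
    cleaned_name,
    orderdate,
    ROW_NUMBER() OVER (
        PARTITION BY customerkey   -- restart the numbering for each customer
        ORDER BY orderdate DESC    -- so 1 = the most recent purchase
    ) AS rn,
    first_purchase_date
FROM cohort_analysis;
```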
We're almost to what we need. Mainly, I don't want any duplicate entries anymore; I just want each customer's most recent purchase, and that's done by filtering for row number equal to 1. So I'll put this all into a CTE and then pull out what I actually need: tab it over, name it customer_last_purchase, wrap it in opening and closing parentheses, and then do a SELECT of the customer key, cleaned name, and order date, FROM customer_last_purchase, remembering to filter WHERE that row number is equal to 1. Run it (I have a little typo, something I didn't need; rerun), and it's looking good. Now, this order date is technically still an order date, but it's actually the last purchase date now, so I'll rename it with the alias last_purchase_date. Looking good; next we need to classify each of these customers as active or churned. But first I need to show you something real quick. Underneath this I'll write another query looking at just the order date column from the sales table. Running just that query and sorting descending, we can see the most recent date in this data is April 20th, 2024. I can also just query it with a MAX of the order date, run Ctrl+Enter, and it's still 4/20/2024. So why am I telling you this? Because as of filming it's March 2025, almost a year later. If we went back six months from my time now, there wouldn't be any data in the system within that six-month window, and we'd compute a 0% active rate. The point is: we have to measure from our last data point, six months before that 4/20/2024. So here's what we do: back in the main query, we categorize with a CASE statement, looking back to see whether each last purchase falls within 6 months of April 20th, 2024, classifying it active if so and churned otherwise. In the main query at the bottom I put in CASE, then WHEN; we know the column is called order date, so: WHEN order date is less than April 20th, 2024 minus 6 months, which we express with an INTERVAL specifying '6 months'. In that case we classify them as churned; better said, doing the actual date math, any last purchase before October 20th, 2023 is classified as churned, ELSE we mark them as active, then we END the CASE statement and give it the alias customer_status. Let's check it out: I'll remove this extra line and run with Command+Enter... and we get this error message, "invalid input syntax for type interval." The issue is the comparison against the date: Postgres sees '2024-04-20' as a string, so we need to cast it as a date using the :: operator. Run Command+Enter again, and now it's working. We have the extra column, and spot-checking it, November 2023 is active, December 2023 is active; it looks like it's matching up pretty well.
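Assembled as a sketch, with the cutoff still hard-coded for now:

```sql
WITH customer_last_purchase AS (
    SELECT
        customerkey,
        cleaned_name,
        orderdate,
        first_purchase_date,
        ROW_NUMBER() OVER (PARTITION BY customerkey ORDER BY orderdate DESC) AS rn
    FROM cohort_analysis
)
SELECT
    customerkey,
    cleaned_name,
    orderdate AS last_purchase_date,
    CASE
        WHEN orderdate < '2024-04-20'::date - INTERVAL '6 months' THEN 'Churned'
        ELSE 'Active'
    END AS customer_status
FROM customer_last_purchase
WHERE rn = 1;  -- keep only each customer's most recent purchase
```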
Now, there's one other minor detail we need to filter correctly to make sure we're getting the right calculation; let me filter the data to show what I mean. Actually, the last purchase dates aren't quite what I want to demonstrate; I want the first purchase date, so I'll show it right next to this by adding first_purchase_date (remember, we have it up in the CTE above, so I can just reference it here), run, and now filter by it. Here's what I'm trying to show: take the customers whose first purchase was on, say, April 20th, 2024. They show as active, and if we keep scrolling back, every customer over a whole stretch is active, because they haven't even been customers for six months yet; they've never even qualified to become churned. I'd argue this causes a bias that inflates our numbers, especially in 2024, making it look like all of our customers are active. And that's exactly what happens in 2024: if I scroll all the way through to the beginning of the year, every 2024 customer remains active, so when I run a percentage on this it's going to say 100% active in 2024, which is completely useless. We need to remove everybody whose first purchase comes after October 20th, 2023, because that's the cutoff; before it, customers have actually been in the system longer than 6 months, which is when we genuinely start getting churned alongside active customers. So all I'll do is modify the query with an AND: we want first_purchase_date before that same October 20th, 2023 cutoff, so I'll paste the expression right in. Run it, filter by first purchase date again, and now there's no first purchase date after October 20th; we only keep the range where customers can actually be either churned or active, which gives us key statistics that actually match the data. With that cleaned up, we don't need the first purchase date column in the output anymore; that's not something marketing necessarily needs. We now have that final table I was getting at: customer key, cleaned name, last purchase date, and whether they're active or churned. One minor note on this query: I'm not a fan of hard-coding values, because if we got, say, a fresh data dump with more recent data, that date would change. Instead we can use a subquery: I'll wrap the MAX-of-order-date query in parentheses, cut it with Command+X, and place it right here, and also right here. Now when we run this we get exactly the same results as before, and it's much cleaner: if those dates ever change or new data lands in the system, it updates automatically.
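The final shape, as a sketch: the subquery replaces the hard-coded date in both places, and the extra AND drops customers too new to have had a chance to churn:

```sql
SELECT
    customerkey,
    cleaned_name,
    orderdate AS last_purchase_date,
    CASE
        WHEN orderdate < (SELECT MAX(orderdate) FROM sales) - INTERVAL '6 months'
            THEN 'Churned'
        ELSE 'Active'
    END AS customer_status
FROM customer_last_purchase
WHERE rn = 1
  -- only customers who have been around longer than the churn window
  AND first_purchase_date < (SELECT MAX(orderdate) FROM sales) - INTERVAL '6 months';
```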
So we have the data, the table we need for marketing, but now I want to take it a step further and actually perform an analysis on it: first holistically, what is the overall active and churned rate, and then broken down by cohort year to see how the cohorts trend over time. Let's get a percentage for active and churned. I'm going to make this into yet another CTE (I love me some CTEs) called churned_customers, wrapped in opening and closing parentheses. We'll start simple: I just want the customer_status and a count for active versus churned, so a simple COUNT of the customer keys, aliased num_customers, FROM that churned_customers CTE, and since we did an aggregation we also need a GROUP BY on customer_status. Run it. Not bad: around 4,400 active, with the rest, roughly 42,500, churned. I prefer percentages, so let's move on to calculating those. One quick thing: we don't need COUNT(DISTINCT ...) here. If I run it with Command+Enter it gives the same values, because upstream we filtered down to where row number equals 1, so there's technically only one row per customer key; the DISTINCT is just a little unnecessary, so we won't include it. Now we basically need another column, total customers; adding the groups together should come to around 46,800. But we're in a GROUP BY, so how do we see across groups? We can use window functions to expand out, because window functions run after the aggregation. We take that same COUNT of the keys and wrap it in a SUM as a window function with OVER, and we're not going to PARTITION BY anything, because we want the total over everything, aliased total_customers. Run it: I was a little off on the math, it's actually 46,913. Note that you need both layers: if I used just SUM, or just COUNT, either one errors; it has to be the SUM of the COUNT with OVER () for it to work. Now we can divide these two values to get our percentage: I'll take the count expression, paste it down here, and divide by the total-customers expression (Command+C, Command+V), aliased status_percentage. Run it, and the numbers work, but there are a lot of decimal places, so I'll wrap it all in a ROUND function, since I only really care about two decimal places. That lands us at 9% active and 91% churned. Comparing this to the industry, I used Perplexity, a chatbot that searches the internet, and asked what a typical churn rate is for an e-commerce company. It said a churn rate under 5% is considered good, and the average churn rate for the e-commerce industry is around 22%. So our numbers are far worse than industry standards: our churn is much higher, or put the other way, our retention is a lot lower.
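As a sketch, with churned_customers holding one status row per customer; note the window SUM wraps the aggregate COUNT, which works because window functions run after GROUP BY:

```sql
SELECT
    customer_status,
    COUNT(customerkey) AS num_customers,
    SUM(COUNT(customerkey)) OVER () AS total_customers,
    ROUND(COUNT(customerkey) / SUM(COUNT(customerkey)) OVER (), 2) AS status_percentage
FROM churned_customers
GROUP BY customer_status;
```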
All right, let's take this one step further and find the active-versus-churned rate for our cohort years, to see how it progresses over the years. All we need is to add a cohort_year column, but the catch is we have to import it higher up: it lives in cohort_analysis (if I look inside the view, cohort_year is right there). So after first purchase date I add cohort_year in the first CTE, and in our second subquery I also add it in; and because we added these extra columns, we need to add cohort_year to the GROUP BY to make sure it works, and I actually want cohort_year listed first. Let's run this; the calculations aren't correct just yet. We do have cohort year in here, we have active versus churned, and we have our number of customers, but total_customers is 46,000 the entire time; it's basically all of our customers across all years, and it's driving our percentages down. For 2015 we get 1% and 6%, when those two rows together should equal 100%; the problem is we're dividing that 237 by 46,000. We don't want the total customers overall, we want the total customers of 2015. Conveniently, all we have to do is add a PARTITION BY inside the window function, partitioning by cohort_year. Add it, run Command+Enter... and I need to learn how to spell "partition"; OK, good syntax highlighting now. And looking at this, it adds up to the correct amounts. Looking good; I'll also take that same PARTITION BY and throw it into the status_percentage calculation below, so the percentage is computed per cohort too, and now the two values for each year sum to 100%. As we can see, the active rate goes from around 8% in the early cohorts up to about 10% in more recent years, and graphing it visually we can see that slow upward trend from 8% toward 10%.
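The per-cohort version only changes the window frames; a sketch:

```sql
SELECT
    cohort_year,
    customer_status,
    COUNT(customerkey) AS num_customers,
    SUM(COUNT(customerkey)) OVER (PARTITION BY cohort_year) AS total_customers,
    ROUND(COUNT(customerkey)
          / SUM(COUNT(customerkey)) OVER (PARTITION BY cohort_year), 2) AS status_percentage
FROM churned_customers
GROUP BY cohort_year, customer_status
ORDER BY cohort_year, customer_status;  -- the two rows per year now sum to 100%
```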
all right so only one thing left to do now is update our readme we already have our third sql file in there and actually i need to make sure that it's saved so i'm going to command s it and now when i go inside of here i can see that okay we have our entire sql file next thing is our readme i'm going to close out of this and make it more viewable so what did i add to the readme first i attached the sql file which is linked apparently not correctly and it looks like i spelled analysis wrong so always double check your spelling when updating this now whenever i click on this it directs right to it so it's always good to go through and actually click any hyperlinks or links that you attach next thing i went in to do is attach a visualization this one i had generated by chatgpt i just copied and pasted it in and it puts in that graph that we previously had from there i moved into the key findings talking about how our churn rate stabilized around 90% for the last 2 to 3 years then noting that retention rates are consistently low at 8 to 10% way less than the industry norm and finally capping it off with the point that newer cohorts are showing similar churn trajectories and basically we need to take action now to start improving these churn rates so what can we do with this data well we can work in the future to target customers within their first year or two to improve that active rate versus churned we can also combine this with other analysis and re-engage not only our churned customers but specifically our high-value churned customers so we can be really precise with our targeting taking this a step further we could use this analysis to predict future churn rates and how a customer may act that goes more into data science and machine learning we're not going to go into that but it is something we could take away from this analysis now that we have our third and final question done we need to get into finalizing our readme packaging it all up putting it on github and then finally sharing it on linkedin which conveniently we'll be doing in the next chapter with that i'll see you there and we don't have any more practice problems for the remainder of this course so congratulations to everybody that's been doing those practice problems with that see you in the next one welcome to this final chapter where we're now going to go into sharing our project and first of all i want to congratulate you for making it this far and getting through this entire project it's been quite an accomplishment now this chapter only has two short lessons the first lesson which is this one right here is going to be about how we can create our github repo and then our next lesson will be about actually sharing this github repo on a platform like linkedin so dialing into this lesson we're going to be focusing on two core technologies that you may or may not be familiar with the first is git git is a version control system similar to track changes in microsoft word it tracks our changes you can install it on your computer and use it to track changes within files we'll go more into depth on this as we go through this video but we're going to be using git to create a repo or repository and we're going to be pushing it to github github is an online platform that allows you to share remote repositories remote meaning you can access them from anywhere and what's great about this is it allows us to then share our project so what are the steps we're going to go through in this video well the first thing we need to do is actually clean up that readme after that if you're unfamiliar with repos git and github we'll do a deeper dive and explain all of this thirdly we'll move into installing git on your computer and getting this repo set up to put onto github and then in the fourth and final step we'll get the two synced and i'll show you how you can manage it so if you've been keeping up with your readme so far with all of the different analysis that we've done since the beginning there's not a lot that we need to do to update it specifically we need to fill in an overview the business questions and then finally any strategic recommendations we have from this for the overview i just have this one sentence of hey it's an analysis of customer behavior retention and lifetime value for an e-commerce company to improve customer retention and maximize revenue for the business questions we bucket them into the three that we've gone into in-depth detail on i'm not going to rehash each one feel free to pause this video now and copy whatever you want off of it after this you should have the analysis approach that we've gone through after each of the questions updated to include everything we need and then finally our strategic recommendations i went through and bucketed these based on our three different questions and i've outlined a lot of the key tactics that we can take away within those questions so i'm not going to rehash it here again but i highly encourage you to brainstorm and think through what are some strategic recommendations that you would make and then from there put them into this section the final section on technical details just has information on what technologies were used so people are aware that yes i used postgres for this and i used chatgpt for the visualizations so this is looking good i'm going to go ahead and press command s or control s to save it
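if it helps to picture the structure here's a rough outline of how that readme is laid out the exact titles and placeholder text are just how i chose to organize mine

    # intermediate sql project

    ## overview
    analysis of customer behavior, retention, and lifetime value for an
    e-commerce company to improve customer retention and maximize revenue.

    ## business questions
    1. (question one)
    2. (question two)
    3. (question three)

    ## analysis approach
    the approach, sql, and visualization for each question.

    ## strategic recommendations
    key tactics, bucketed by the three questions above.

    ## technical details
    - postgresql for the analysis
    - chatgpt for the visualizations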
and now we can get into our next steps of initializing a repo and publishing to github but i want to cover some background knowledge first so the first concept to understand is what is a repository or as i'll call it going forward a repo this is a personal library for your project it basically allows you to keep manage and record every change to all the different files within it like i hinted at before it's like track changes in microsoft word except we're using a version control system like git to manage these changes and thus we can go back and revisit previous versions if we need to now in order to create this repo we need to use a technology like git git is a free and open-source distributed version control system and it's designed to handle everything from small to very large projects i use it all the time for my version control we'll be installing it here in a second so what exactly is going on and how does a folder become a git repository well here are the files that kelly and i worked on together to build this course it has all our different lesson plans in it we actually used git for version control while building this course anyway on the surface we can see that it just has some folders and some files in there nothing special but when i go to unhide the hidden files on a mac i can do that by pressing command shift period it shows that there are other files in here specifically this .git folder the dot at the front means the folder is hidden that's why you couldn't see it and why i had to unhide it anyway this folder tracks all the different changes going on inside of my project we don't even need to touch it to make any adjustments it does this automatically as we go through making changes i just wanted to show it so you understand what's going on there i'm going to go back and hide it so now that we understand that git is the version control system and is what's used to create a repository we should understand that there are actually two main types of repositories one is a local repository and the second is a remote repository local is as the name suggests stored on your computer what i just showed you right here is in fact a local repository because it's local you don't need any internet connection it's super fast and it's very common to use it for your initial development now remote repositories are stored on a server because they're on a server and not local you need internet but because they're in a remote location they allow you to collaborate with others so once again i have that local repository here but that same repo is also on github and that is a remote repository and github allows me to work with kelly as a contributor so we can work back and forth on different files so frankly it's more than just version control it's also great for collaborating with others anyway this brings us to github which is one of the most popular tools for using git specifically it allows you to store those git repos and then share them with the world so by the end of this video we'll be publishing your project to github so you can make it publicly accessible so if it's not clear git is the version control system it maintains that local repo and it also has some command line tools we can type into we're not going to go deep into those in this video that's out of scope and it's open source and free so that's why we're using it github hosts these repositories it's accessed via a web browser interface and we can access it remotely so that way we can also collaborate with others
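by the way we'll be doing all of this through vs code shortly but just so the local side is concrete here's a quick sketch of the standard git commands that sit underneath it purely illustrative

    # run inside your project folder
    git init             # creates that hidden .git folder and makes the folder a local repo
    git status           # shows which files have changed since the last commit
    git log --oneline    # lists the history of changes (commits) git has recorded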
let's now get into actually setting up this repo and sharing it with the world we're going to go through four different steps first we need to install git if you don't have it second we need to create the repo within our project folder third we need to set up a github profile if you don't have one already and then in the fourth and final step we're going to actually share it to github so let's see if we have git before installing it i'm going to open a terminal that's what you'll use on a mac and on windows you should have the terminal available or you can open up a command prompt inside of here all i'm going to type is git --version and press enter in my case i do have it installed so it prints the version and i don't need to install it on this machine now if it's not installed you'll likely get an error message and then you can go through actually installing it so navigate to git-scm.com and from there download it for your appropriate operating system for windows machines you're just going to go through the setup most modern computers are 64-bit so you should be fine installing this if you have a newer computer anyway it's going to walk you through a gui just leave all the default values the same and click okay all the way through for mac they have a couple of different options to install and all of them are through that command line or terminal my recommended option is through homebrew and if you don't have homebrew it's not a big deal you go to this link up here and all you have to do to install homebrew is copy this entire command paste it in and run it i'm not going to run it because i already have homebrew installed then all you need to do is copy this command of brew install git and execute it in the terminal and you'll have git installed once you have it installed you need to verify it is in fact installed all we have to do is run git --version again and it should output the version now some of y'all may get an error later when using git it's going to say something like please tell me who you are and it gives you instructions to run git config commands in order to provide your email and your name all you've got to do is copy them and paste them into your terminal you do need to go through though and update things like your name to what your name actually is along with that email address and then all you've got to do is press enter and you're done
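to recap those terminal steps in one place here's roughly what they look like these are standard git and homebrew commands just swap in your own name and email

    # check whether git is already installed (prints a version number if so)
    git --version

    # mac option: install git via homebrew
    brew install git

    # if git asks "please tell me who you are", set your identity
    git config --global user.name "Your Name"
    git config --global user.email "you@example.com"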
now we need to move on to the next steps of creating our repo creating our profile and sharing to github conveniently these steps specifically steps two and four can actually be done together if i navigate to vs code specifically to this source control tab right here there are two options for your project you can either initialize the repository or publish to github but publish to github actually initializes the repository and then publishes it to github the important thing though is you need a github profile before you click this so if you navigate over to github.com the homepage has a sign up where you enter your email and then go through the sign up process to create a github account once you have that account all set up we can proceed so now i can go back into vs code and remember i'm going to do publish to github because it's also going to initialize our repository at the same time i'm going to click publish to github it's going to prompt me that hey the extension github wants to sign in using github i want to allow this i select the account that i want it associated with and then i'm navigated back to vs code it asks me first do i want to create a private or public repository i want a public repository the next thing and this is very important to get right on the first try is selecting which files should be included in the repo i don't care about this ds store or any of these hidden files hidden files being anything with a dot at the front i do want my sql files in there i don't need that bookmarks folder that diagrams folder or that scripts folder i do need my images because they have the images for my analysis you could include the scripts if you want i don't really care about those they're just throwaway scripts i'm not maintaining in version control okay in my case i have five files selected i'm going to go ahead and select okay it's going to go through publishing and it says that hey this was successfully published to my github now what i'm going to do is navigate back to github and go to my repository i can get to my profile by clicking my name in the top right-hand corner one thing to note if you haven't done this already i recommend filling out the different social information along with your name and adding a picture so it looks like a legit profile anyway if i want to look at my repos i can go up here to the repositories tab and it has this project which i titled intermediate sql project i can go ahead and click on it and it includes that images folder which when i click into it has all three of my images next is my .gitignore which basically ignores all those different files that we said hey don't import into git so that's why you didn't see them there then our three sql files and finally our readme which we talked about before the readme actually renders down below in this section and this has all of our analysis right on the homepage this is really great that it's all there and this is the url that i would share with other people to showcase my work now there's one last core concept that i want to go over before we conclude and that's how to actually sync your project with github as you go along let's say that we do have a change inside of our readme file itself specifically i'm going to go under analysis tools and note that not only do i use postgres but i also use things like dbeaver and pgadmin okay so i have these new changes in here i'm going to go ahead and do command s and save it i can now see that we have an m next to this file which means that the file is now modified and on the version control tab i can see a notification for one file and it's telling me that the readme is in fact modified now just to show i'm going to refresh the github page real quick if i scroll on down these tools are not present here yet but we want them updated under the technical details up here so what we need to do is stage these changes as you can see this file is underneath changes we need to stage these changes in our local repository and give a message of what we did here you can leave it super short i'm going to say updated tools and i'm going to click commit and it's going to say hey there are no staged changes to commit would you like to stage all of your changes and commit them directly basically previously we'd have to go through staging all the changes and then committing them this combines the two steps together so i'm going to click yes
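for reference here's the command line equivalent of what vs code just did for us these are standard git commands and the commit message is the same one i typed in

    # stage the modified file, then record the staged change in the local repo
    git add README.md
    git commit -m "updated tools"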
now these changes are still not up on github this is only committed locally to our repo so what we're going to do now is push it to our github repo by clicking sync changes and it's going to say hey this action will pull and push commits from origin main which is the version we want updated in our repo we're going to click okay and i don't want to see this prompt every time so i'm going to click okay don't show again so now we can see we went from first commit to our second commit updated tools and if i navigate over to the readme and refresh the page we can see that it's now added so that was an example of pushing our changes we can also pull changes basically if there are changes in our remote repo like when kelly and i work together and she pushes changes up i want to pull those changes down but here we're going to just make a change on github and then pull it from there so inside of github i can actually edit this file oh and let me show you what i'm going to edit so these images i have them at around 50% width i'm going to update this one it's just too big i'm going to update it to 50% i'm going to click edit and then go to the code now this is html code that i'm using an image tag you don't necessarily need to do this all you need to understand is that i want to update the percentage and this is actually the image for question number three i want question number two oh which isn't in an image tag so what i can do is copy this code from down here and put it in and this pulls the image from our source specifically i want this image right here from this source so i'll put that in we want the alt tag or the name of it to be cohort analysis and then for the width 50% and height auto by the way you don't need to use this html formatting for images it's just something slightly fancy i'm using to get that 50% width anyway now that we've made these changes in the readme i'm going to go hey i want to commit changes and for the commit message i'm going to change it to something more descriptive update second image size if you want to add an extended description you can and i want to commit directly to the main branch i'm going to go ahead and commit changes
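here's a sketch of what that image tag looks like the file name is illustrative since i'm pulling it from my images folder and this is just standard html that github's readme rendering supports

    <!-- width 50% scales the image down, height auto keeps the proportions -->
    <img src="images/cohort_analysis.png" alt="cohort analysis" width="50%" height="auto">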
now scrolling on down in that readme i can see that we have the image formatted correctly so our remote repository is updated but if i view the readme from our local repository we can see the number three image is pretty small while the second image is still pretty big sorry it's still using our plain markdown format for the images so going over to version control we want to pull these changes down from github i'm going to select more actions and specifically pull and we can see down below that we not only have first commit and updated tools we now have update second image size and closing this out and looking at the readme we can see that all the image sizes are now formatted correctly and i have that updated code here so remember there are two main concepts that we went over here one is pushing changes and this is sending our local repository changes up to our remote repository on github and secondly we can pull changes if there are any changes on the remote repository i can pull them down into my local repository and then my local repository is up to date
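and once more for reference here are the command line equivalents of those two concepts again standard git commands with origin main being the remote and branch we saw in that sync prompt

    # push sends local commits up to the remote repo on github
    git push origin main
    # pull brings remote commits down into the local repo
    git pull origin main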
now that was just a brief intro into git and github if you're new to this and want additional resources i have an entire youtube video on it linked up above and you'll be able to go through even more detail to understand the ins and outs of git all right now that we have this public repository on github in the next lesson we're going to go forward with sharing it on linkedin so with that see you in the next one all right welcome to this very last lesson once again congratulations on finishing that project it's now time to get into sharing that github repo on linkedin and we're also going to go through how those that purchased the course perks can upload their course certificate which you're going to get after completing your end of course survey so let's navigate on over to linkedin you should have a profile and if you don't i highly recommend that you create one this is where employers are at and this is where they're checking your work anyway i'm here on my profile and it has sections on about featured and activity but what i care about is this section here on licenses and certifications this is where we're going to be uploading your course certificate remember once you complete the end of course survey i'll be emailing it to you you'll not only get a link but you can also download the certificate itself now if you're not seeing licenses and certifications you can come all the way up to the top click add profile section and underneath recommended click add licenses and certifications anyway let's add that certificate by clicking this plus icon from there you'll go through and fill this all in here i filled it in with intermediate sql for data analytics put me as the issuing organization i have the issue date of march 2025 the certificate never expires so you don't put an expiration date in there's a credential id located on the certificate so put that in along with the url underneath skills you can list up to five and i'd recommend these five postgres sql git github and dbeaver finally i also like including the image of the certificate itself so i select add media and then attach the file give it an appropriate title and from there click apply all right once you have everything in there all you've got to do is click save and it's there the next thing you'll do is actually add this project because we have our certificate but we also built a project as part of this course so underneath the projects section i'm going to click the add icon from there i give it a name of intermediate sql sales analysis a short description that i stole from our readme and the five core skills matching that certificate next is the media and i like to include a link specifically a link to the github repo so what i'm going to do is grab this url right here copy it paste it right into here and then click add i'm liking everything that it has for this so i'm going to go ahead and click apply now that we have our media in we can do our start date i started this bad boy back in october of 2024 and i just finished it this month in march of 2025 if you worked on this with somebody like kelly you could add her as a contributor and you can associate other projects i don't have that okay now i'll go through and click save and my project is updated on my linkedin the last thing i'm going to recommend is sharing a social media post on linkedin or making a post if you will about this project completion to let everybody know that you've done this in it i'd call out that you completed the course and also did the project don't forget to tag me and kelly in it as it's super awesome i love going through these posts being able to see all the different work and comment on it as well once again congratulations on all the work that you've put into completing this course and also this project what are the next steps well coming up shortly in the next few months we'll be releasing an advanced sql for data analytics course which i'll link somewhere on here so you can go and check it out if you're interested with that don't forget to follow me on linkedin and smash that like button see you in the next one