Get started with data science using Python. This course covers essential tools like pandas and NumPy, plus data visualization, data cleaning, and machine learning techniques. Perfect for beginners, it will give you the skills to analyze and interpret data effectively. Frank Andrade created this course.

This is my Python for data science course for beginners. In this course you'll learn everything you need: pandas, NumPy, visualizations, data cleaning, machine learning, and more. I not only teach you how to use Python for data science, but we also solve many exercises and work on four projects to put everything we learn into practice. I leave the link to the code and to the Python cheat sheet I created for this course in the description of this video. In my cheat sheet you'll find all the concepts we learn in this course, as well as code snippets that you can use for solving the exercises and projects. All right, now let's start with the course.

Anaconda is all about data science: it brings all the tools used in data science, like Python, Jupyter Notebook, and pandas, with just one install, and in this video I'll show you how to easily set up Anaconda on your computer. To download Anaconda, we go to anaconda.com and click on Get Started, then we choose the last option, Download Anaconda Installers, and here we have the different Anaconda installers for Windows, Mac, and Linux. In my case I'm going to choose Mac, and I'm going to choose the 64-bit graphical installer. Now I'm downloading Anaconda, and once it's downloaded I'm going to click on it. A message will pop up; you just have to click on Allow, as I'm going to do right now, and then click on Continue until the installation starts. So I click Continue, then Agree, then Continue again, and it starts installing Anaconda. In case you're on Windows and you're installing Python or Anaconda for the first time, make sure to check the first box you see now on screen. I'm going to speed up the video now.
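Once Anaconda finishes installing, a quick way to confirm that Python and pandas are set up correctly is to run a couple of lines in a new notebook, as we'll do in a moment. A minimal sketch, using placeholder names and numbers:

```python
# Confirm the install: import pandas and build a tiny DataFrame.
import pandas as pd

# Placeholder data, just for the check
df = pd.DataFrame({"name": ["Frank", "Ana"], "age": [26, 31]})
print(df.shape)  # (2, 2)
```

If the import and the DataFrame creation both run without errors, the installation was successful.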
Okay, the installation is almost done, and now it's showing me a message about PyCharm, an IDE that works with Anaconda. I'm just going to click on Continue to finish the installation. After clicking Continue we'll see a summary of what was installed, and now I'm going to close this window and open Anaconda. I locate the icon, a green icon, this one you see here, open Anaconda, wait a couple of seconds, and let's see what was installed.

Here we have JupyterLab and Jupyter Notebook, which are widely used in data science, so I'm going to launch Jupyter Notebook. Let's give it a second, and now we open a new notebook with Python 3, so Python 3 was installed too. Here I'm going to import pandas: I write import pandas as pd, and if the code runs fine, then the installation was successful. Let's wait a couple of seconds, and as you can see it's working, so I can create a DataFrame without any problem. Now let's go back to Anaconda to see if JupyterLab is working fine too. Here is JupyterLab; I press Launch, wait a couple of seconds, and now JupyterLab is open. Here is the file I was using before, it has the DataFrame, and everything is fine.

Finally, let's see how to install a new library. We go to Environments, and on the right side there is a search box where you can write any library you want to install. Here I'm going to check if pandas is installed, and you can see pandas, and there are also numpy and scikit-learn. If you want to install a new library, you just have to click on the dropdown and choose the option Not Installed. By the way, you can create a new environment using the Anaconda Navigator: you just click on this Create button, and then you will see this window, where you just have to write the name of the environment and choose the Python version you want to install in this environment. And that's it. Now you can start
working on your own data science project. In this video I will introduce you to the Jupyter Notebook interface. Jupyter Notebook is an open-source web application that allows us to create and share documents that contain live code, equations, visualizations, and text. This is the perfect editor for doing data cleaning and transformation, data visualization, and data analysis, which is why Jupyter Notebook is widely used in data science and also machine learning. As you might remember, we installed Jupyter Notebook and Python with the Anaconda Navigator, and this means that we already have some popular Python data analysis libraries installed. By the way, one of the alternatives to Jupyter Notebook is JupyterLab; both are similar, but we're going to use Jupyter Notebook in this course because of its simplicity.

So let's open Jupyter Notebook. To do that, we click here on the Launch button, wait a couple of seconds, and now we have the Jupyter Notebook interface, which I'm going to maximize. By default, Jupyter Notebook opens in the root directory of your computer. It's a good idea to create a folder where all your Python scripts will be located; in my case this folder is called anaconda scripts. I click here, and now I can navigate through the folders. The folder I'm going to use for this example is this one that says "my course", and here we're going to create our first Python script. To do that, we click here on the New button and choose the first option, Python 3. There are other options, like text file, folder, or terminal, but we're not going to use those in this course. So click on Python 3, and now we have a Python script powered by Jupyter Notebook. On the right you can see that it says Python 3, along with the Python logo, and on the left you can see the Jupyter Notebook logo and the name of this Jupyter Notebook file. We can change the name of the file by
clicking here on Untitled. I click here and change it to, let's say, example, then click on Rename, and now we've renamed this Jupyter Notebook file.

All right, now let's navigate through the menu bar of this Jupyter Notebook file. The first option is File. Here we can create a new notebook with Python 3: if we click here, we open a new Jupyter Notebook file from scratch, as we did before. Then we have Open, which lets us open a Jupyter Notebook we created before. We can also make a copy of a Jupyter Notebook and then change the name, save a Jupyter Notebook file, and rename the file as we did before. Then we can save all the progress we make in Jupyter Notebook: for example, after writing many lines of code, you can save your progress by pressing Ctrl+S (or Cmd+S on Mac), which creates a checkpoint, and later you can revert to a previous checkpoint using this option here, where you will see the checkpoints you can revert to. By the way, by default Jupyter Notebook autosaves every 30 seconds or so, so there is no need to press Ctrl+S every time; keep that in mind. Then there are other options that I don't use so much, like printing this Jupyter Notebook or exporting the Jupyter Notebook file to HTML, PDF, and so on.

Okay, now let's see the second option, Edit. Here we can edit the cells we have in this Jupyter Notebook; by the way, what you see here on the screen is a cell. With this Edit option we can, for example, cut cells, copy cells, paste cells above, and delete cells. On the right you can see the shortcuts, which we're going to see in detail in the next video, and here you can check all the edit operations you can perform in Jupyter Notebook. Then, in the View option, we can toggle the header, the toolbar, and also line numbers. Here, if I click on
Toggle Header, the header disappears, and if I click on Toggle Toolbar, this toolbar disappears too. Also, with Toggle Line Numbers we can show line numbers, so if I write anything we can see that it says 1, 2, 3, and so on. I'm not going to use this in this course; I'm going to leave the default options, so here I'm going to revert to the original settings, without line numbers and with the header and toolbar shown, but you can personalize it as you want.

Next, in the Insert options, we can insert cells above or below; we only have to click here, and we're going to see the shortcuts in the next video. Then we have the Cell options: we can run cells or run all the cells in this Jupyter Notebook file. Then we have the Kernel option. A kernel is a computational engine that executes the code contained in a notebook document. When we open Jupyter Notebook, a kernel is automatically launched, and we can interrupt this kernel by clicking here; by interrupting, we can pause the execution of code. We can also restart everything and do more things here. Sometimes, for example, I interrupt the kernel when a line of code or a cell takes too long to execute, and you can do the same here with Restart or Interrupt. Then we have the Navigate option, which doesn't actually have anything here, Widgets, which I don't use much, and Help, which will send you to the Jupyter Notebook documentation, which you can read if you want.

All right, then we have the toolbar, where you will find shortcuts for some of the menu bar options we've seen before. For example, here you can save and make a checkpoint: I click here, and as you can see it says "checkpoint created", along with the time it was created. Then, with this plus button, you can insert a cell below: I click here and, as you can see, we insert a cell below. You can also use keyboard shortcuts, but I'm going to show you those in the next
video. Then we can cut selected cells with this button, copy a cell with this button, and paste cells below. We can also move a cell up or down: for example, I'm going to write anything here, and I can move this cell up with this button, or down, as you can see here. Then we can run code: for example, I can write the number one and run the code, and as you can see the code ran and it shows the number one. Those are some of the frequently used buttons in the toolbar, and that's everything you need to know about this Jupyter Notebook file.

Okay, before finishing this video, I'm going to show you some other options you can find in the Jupyter Notebook interface. Right now we are in the Files tab, and we can change to the Running tab here, where you can see all the currently running Jupyter Notebook processes. For example, we can see the Jupyter Notebook file we created and opened. You can recognize that a Jupyter Notebook file is open, or running, because its icon will be green. If we go back to the Files tab, we can see that this Jupyter Notebook file, which by the way has the .ipynb extension, has a green icon, indicating that the file is running. We can shut down this file, which is different from closing it: for example, if I close the file here, we can see that it is still running, its icon is still green, and it still shows up in the Running tab. If we want to shut down this file, we click here; then it says there are no notebooks running, and we can see that the notebook has a gray icon.

All right, then we have the Clusters tab, which I don't use much, and it actually doesn't show anything here. Then we have the Nbextensions tab, where you can install extensions to
personalize Jupyter Notebook even more, and we're going to see some cool Jupyter Notebook extensions in the next videos. By the way, this Nbextensions tab doesn't show up in some versions of Jupyter Notebook, but we can easily install it, and we'll also see how to install this Nbextensions tab in the next videos. Finally, we have this box that shows our directory: this folder icon indicates the root directory, so if I click here, we go back to the root, and if I click on the folders anaconda scripts and then my course, I go back to the folder where I was before. And that's it; these are all the things you need to know about the Jupyter Notebook interface.

Okay, in this video we're going to see the cell types and cell modes in Jupyter Notebook. First we open the Jupyter Notebook file that we created in the previous video, which is this one, example.ipynb. We click on it, and here we have the Jupyter Notebook file opened. By default, this first cell is in command mode, and we can tell because this blue color indicates that the cell is in command mode. When we are in command mode, we can do things outside the scope of any individual cell: basically, all the tools we see here in the toolbar can be applied in command mode. Also, in command mode we can use some shortcuts that I'm going to show you later. For example, if we want to see the shortcut window, we press the letter H in command mode, and we see the keyboard shortcuts here, all the shortcuts you can apply in command mode. Now I'm going to close this one. You can also use other shortcuts: for example, if you press B in command mode, you will see a new cell appear, because B is the shortcut that inserts a new cell below. Now, if we press Enter, you're going to see that the color changes to green, and this green color indicates that we are in edit mode. The
edit mode is for all the actions you usually perform in the context of a single cell, for example introducing text or writing code. Here I can write, say, 1, 2, 3, and then click on this Run button to run this cell. As you can see, I ran this first cell, and after running the cell we are again in command mode. To go back to edit mode we press Enter again, and now we can edit the numbers we introduced: for example, I can add 4, 5, 6 and run again, and here you can see that the output shows 1, 2, 3, 4, 5, and 6. By the way, if you try to use a shortcut in edit mode, it won't work: here I press Enter, so now I'm in edit mode, and if I press the letter H, you can see that nothing happens; we don't get the shortcut window. And if I press the letter B, you can see that we don't insert any cell below. This happens because those shortcuts work only in command mode. To escape edit mode, we press the Escape key: I press Escape, and now I'm in command mode again, so if I press H we get the keyboard shortcuts, and if I press B we insert a new cell. That's it for command and edit mode.

Now we'll see the cell types in Jupyter Notebook. There are three main cell types, and we can see all of them in this dropdown here. Right now the type of this cell is Code, so here it says Code, but if we press here you can see other cell types, like Markdown and Raw NBConvert. We're going to see the code cell first, and it already has the check, so this one is a code cell. If I press Enter, I'm in edit mode, and here I can introduce any code I want: I can write any number, say 999, and if I press Ctrl+Enter, we can see the input here and the output of this code here. We're going to see how the code cell works throughout this course, but now it's time to see how the markdown cell
works in Jupyter Notebook. I go to this cell, and now I'm going to change the cell type: I press here on the dropdown and select Markdown. In a markdown cell we can introduce any type of text we want; for example, we can introduce titles. If I delete this and type the hash sign, we get a title: one hash means a title, so I press space and write Title. Now I press Ctrl+Enter, or this Run button, to run this cell, and here we get the title. By the way, you shouldn't see this number in front; I just modified the default behavior of Jupyter Notebook so that mine numbers the titles and subtitles, but in your case you will see only the word Title.

If you want, you can also introduce subtitles. For example, I'm going to insert a new cell with this plus button, and now I'm going to move this cell up with this button here. Now I change the cell type from code to markdown, so I go to the dropdown and select Markdown. By the way, you can also change the cell type with shortcuts: if you're in command mode, you can press the Y key to change to a code cell. I press Y, and as you can see, it says "In", and this "In" with the square brackets indicates that this is a code cell. Here I can press Enter and introduce any code; I enter some numbers and press the Run button, and you can see that we have an input and an output, so this is a code cell. But now we can press the M key to make this a markdown cell: we press M while in command mode, and we get this markdown cell, and you don't see the "In" with square brackets anymore. Now I go to edit mode; I just click here, or you can press Enter to go to edit mode. To introduce a subtitle, I type a double hash sign: I press the hash sign twice, then a space, and then I write Subtitle. I press Ctrl+Enter, or the Run button, to run this
cell, and we get the subtitle. We can also introduce plain text. I'm going to insert a new cell with the plus button, though you can also do it with the B shortcut, so I'm going to use the B shortcut now: I press B and get this new cell, and we can move it with this button here, and now we have this cell in the position we want. Here I can introduce text by converting this cell to markdown: I choose Markdown, press Enter to go to edit mode, and write any text, for example "hello". I press Ctrl+Enter, and now you can see we have this text.

Finally, the last type of cell is Raw NBConvert, and this type of cell is not evaluated by the notebook kernel. If we convert this code cell to a raw cell, it won't be evaluated by the notebook kernel, so let's try: I select Raw NBConvert, and we can see that it looks like a plain cell. This type of cell is not used that often; in fact, we're going to use only the code cell and the markdown cell in this course. And that's it; in this video you learned the cell types and cell modes in Jupyter Notebook.

Okay, in this video we're going to see some common shortcuts used in Jupyter Notebook, and we're going to start with the F shortcut. By the way, to use these shortcuts you have to make sure you're in command mode, and to verify you're in command mode, make sure that the cell has this blue color. Now that you're in command mode, you can press the letter F, and you're going to see this Find and Replace window. This first shortcut allows us to find a word in a cell and then replace it with another word. For example, I can write here the word "hello", and it finds the word "hello" inside this "hello world" sentence, and now I can replace this word with, say, the word "hi". I write "hi", and in red we can see the match, and in green we can see the word we're going to insert. Let's click on Replace All, and now you can see that
it doesn't say "hello world" anymore; now it says "hi world". Now I press Ctrl+Enter, which is another shortcut, to run the cell: you can press here on Run, or just press Ctrl+Enter. I press Ctrl+Enter, and now we've run this cell. Another way to run cells is to press Shift+Enter, but in that case we run the cell and insert a new cell below. Let's see: I press Shift+Enter, and it ran this cell, because now it says "In [3]", and here we can see that we have a new cell.

Okay, other shortcuts that are often used are the Y and M shortcuts. Right now this cell is a code cell, and if we want to make it a markdown cell, we only have to press the M key: we press M and it is converted to a markdown cell, and if we press the letter Y, it is converted back to a code cell. You can also change the heading level here: you can make the heading bigger or smaller. I'm going to locate this cell, and to make this one smaller we can press the number keys. If we press the number 2, we can see that it gets smaller, and if I press number 3, the title gets smaller still, and so on. As you can see, the more hash signs, the smaller the text. Here I'm going to delete these hash signs; one hash sign represents the biggest font size, which is the title. Now I press Ctrl+Enter and we have this as heading one, but if I press number 5 and then press Ctrl+Enter, we can see that this cell now has heading five and is smaller. Now I'm going to revert to heading one, so I press 1 and then Ctrl+Enter.

Okay, we can navigate through the cells by pressing the up or down keys on our keyboard, and as you can see, we can move through all the cells here, or we can also click with the mouse on the cells we want. We can insert a new cell above by pressing the A key, so if I press A we get a new cell above, and if I press B we get a new cell below. Now, if I press
X, we're going to cut this cell: I press X and you can see that the cell was cut. Now if we press V, we paste that cell below: I press V and we get the cell back. And if I press Shift+V, the cell is pasted above: I press Shift and V, and we get this cell above the one I have here. Okay, I can delete cells by pressing D twice: I press D two times and, as you can see, the title disappeared. I try it again, and we don't have the title anymore. But now, if we press the letter Z, we can undo those changes, so let's undo what we did before: I press Z and we get the title back. Okay, another useful shortcut is Ctrl+S, which allows us to save the changes we made in this Jupyter Notebook file. I press Ctrl+S and you can see that it says "checkpoint created"; I press Ctrl+S again, and it says "checkpoint created" along with the time. And that's it; these are some of the most common shortcuts used in Jupyter Notebook, but you can see other shortcuts by pressing the letter H. I press H and here you can see more keyboard shortcuts, or you can also go to Help and then Keyboard Shortcuts, and you get the same window. Here you can see a list of shortcuts for command mode and also for edit mode, the description of each shortcut, and how to do it on your operating system.

One of the typical ways to get started with a programming language like Python is printing a simple message. You can write any message you want, but it's traditional among coders to start with a "hello world", so let's try it. Let's print our first message using the print function. The print function prints a message to the screen, so I'm going to write print and then open parentheses; every time we use a function in Python, we have to open parentheses, in this case for the print function. As you can see, functions get a green color in Jupyter Notebook, so that's how you can identify them. Inside these
parentheses, I'm going to write the message, in this case "Hello World", so this is our first message. Now, to execute this first line of code, we have to press Ctrl+Enter, or Cmd+Enter if you're on Mac. I press this, and as you can see, we have our first "Hello World". Another way to run this first cell is pressing here on the Run button; it has the same effect, so I pressed it and it ran. As you can see, it says "In", which represents a code cell, and this is a markdown cell, as we've seen before. One of the advantages of Jupyter Notebook is that it prints the last object in a code cell without us having to call the print function. For example, here I can print this "Hello World" without writing the print function: I copy this "Hello World" message, which is inside quotes, and run the code with Ctrl+Enter, and as you can see, the message is printed. This is one of the advantages of Jupyter Notebook; if you do this in another Python IDE, it won't work. Here you can try it yourself: you can write any message you want. Apart from the first "Hello World", you can try with your name: we write print, then parentheses, and we open quotes because we need to define a string (I'll tell you about strings a little bit later, but just so you know for now). Here, for example, I can write my name, Frank, and print it. I can also print numbers, so I print my age, 26, and it works too. Besides writing code, you can also add comments. Comments are a useful way to describe what we're doing in our code. To write a comment, we just have to write the hash sign, which is this one, and then the comment; in this case, I'm going to write "printing my name", so we know what our code is doing. For the first message we wrote, we can also add a comment: we write the hash sign and
then we can say "printing my first message". As you can see, the comments also have a different color, so far we have three colors: this color for the comments, green for the functions, and red for the strings. This is just a useful feature most text editors have that lets us read code easily.

Now I'm going to write a message again, for example "Hello World" once more, and to verify its type we can use the type function: type, parentheses, run this code, and we get str, which represents a string. One cool thing a string has is methods: we can apply different functions to strings, as we would in Microsoft Excel; however, in Python we use methods. A method is a function that belongs to an object, and to call a method we use the dot sign after the object. Let's see some string methods to change the case of the text. Here I write "hello world" again, but now I'm going to use a string method: in this case, the upper method, to make this uppercase. I was going to use the print function, but actually we don't need it, because as I told you before, Jupyter Notebook automatically prints the last line of code, and since this is the only line of code in this cell block, it's going to print it automatically. We just run this cell, and we have "HELLO WORLD" in uppercase. We can also change the case of the text to lowercase or title case: I'm going to copy and paste this twice, and instead of upper I'm going to use lower, and then title, so you can see how the case changes. I run this and let's see what happens. As you can see, it only printed the last one, because, as I told you before, Jupyter Notebook prints only the last line, and if we want to print all three of them, we have two options: so here we can maybe
cut and paste each one into its own cell, or we can print each of them. So here, for example, I can call print on this one, and do the same for the others: instead of using more cells, we print all of them. We can print this last one too; actually, we don't need to, because Jupyter Notebook will print the last line anyway, but just for the sake of this video I'm going to print all three. I run this code and, as you can see, the first is in uppercase, the second in lowercase, and the third in title case. So that's how you do it in Python.

Another string method you can find in Python is the count method. I'm going to delete this, and this one too, and we'll see it now. First I copy this and paste it here, and I'm going to use the count method: I write count, open single quotes, and write the letter that we want to count. For example, I write the letter L, and what this string method does is count how many times this letter L appears in the string. As we can see there are two L's, so it should say two. I run this code, and it's actually three, because there are two in "hello" and one in "world", so I was wrong.

Another string method you can use is the replace method, which replaces one letter with another. Let me copy this and paste it here, and instead of count I write replace. The first argument is the letter we want to replace, in this case the O, and the second is the letter we want to put in its place; I'm going to use the U, so every time an O appears in this string, we replace it with the vowel U. Let's try: I run this code, and now it says "hello world" but with U's in place of the O's. These are some of the most common string methods in Python. Okay, now it's time to learn
something that you're going to see often in Python: variables. Variables help us store data values. In Python we often work with data, so variables are useful to manage this data properly. A variable contains a value, which is the information associated with that variable, and to assign a value to a variable we use the equal sign. So let's create a message that says "I'm learning Python" and store it in a variable called message_1. I write message_1 and set it to the string "I'm learning Python": I open double quotes and write I'm learning Python. This is a string, which we've seen before, this is the variable, and we assign the value to the variable using the equal sign. Now I run this, and as you can see nothing happens, but we just assigned that string to the variable message_1. If we want to obtain the message "I'm learning Python", we only have to type the variable name and then execute that code: I copy and paste it here, run this code, and by running this cell we obtain the content inside the variable message_1.

We can create as many variables as we want; just make sure to assign different names to new variables. Let's create a new message that says "and it's fun" and store it in a variable called message_2. First I write message, underscore, 2, and then we set this equal to, open double quotes, "and it's fun". This is my second variable, and I run this cell. As we can see, the string was assigned to this second variable, and if I copy and paste this variable here and run the code, we can see that the message is there. By the way, if you're using single quotes instead of the double quotes I'm using in this video, you probably had the following issue. I'm going to copy this one and paste it here so you can see what I'm talking about: let's say you're using single quotes instead of double quotes, then you get this. This is a problem that
you will have when using single quotes, because in the English language we use these apostrophes often. A simple way to deal with this is using double quotes: as you can see here, if I use double quotes, everything is okay and everything remains a string, but with single quotes that doesn't happen; only the "I" stays a string, and the rest doesn't get the string data type. So just make sure you use double quotes every time you have these apostrophes, and that's it.

Okay, now let's put these two messages together, message one with message two. This is called string concatenation: if we want to put message one and message two together, we can use the plus operator, like this. I copy the variable message_1, then the variable message_2, and I use the plus in the middle to concatenate the first message with the second. I run it, and let's see what happens: the two messages were concatenated, but there isn't a space between them, so this is the first message, this is the second, and there isn't any blank space in the middle. What we can do here is just add a blank space: I'm going to copy this one and paste it here to show you how. I add a new plus operator, and in the middle we open a string, with single quotes or double quotes; in this case I'm going to use single quotes, and to create the blank space I press space, so we have our blank space here. Then we run this code, and now, as we can see, there is a space between "Python" and "and". If we want, we can assign this new message to a new variable: I'm going to assign it to a variable called message, so I write message here, include the code below, and then I can print it. As you can see, if I run this, we can see that the message is there.
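The variable assignment and concatenation steps just described can be sketched in a single code cell (the variable names follow the ones used in the video):

```python
# Assign strings to variables with the equal sign.
message_1 = "I'm learning Python"
message_2 = "and it's fun"

# Concatenate with +, adding a one-space string so the words don't run together.
message = message_1 + " " + message_2
print(message)  # I'm learning Python and it's fun
```

Note that double quotes let us keep the apostrophe in "I'm" without ending the string early.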
okay, now let me show you an alternative way to join two strings. This is called an f-string, and it works like this: you write f and then open a string with single quotes. As you can see, the whole thing turns red, like everything in there is a string. Inside we can write a message, say a simple "Hello World": we run it and it's just a string, with this f in front. One of the advantages of an f-string is that it can have variables inside the string. We open curly braces, and those curly braces can contain a variable, so here I write message_1 inside them and print it, and we get the string "I'm learning Python". Now, to concatenate this first message with our second message, we just include curly braces again with message_2 inside, and between the two pairs of braces I press space, so we get "I'm learning Python and it's fun". The space we type also appears in the output; for example, if we add some random text like "abc" between them, that "abc" appears between the two messages too. So this is how an f-string works: you write the f, open single quotes, and inside you can write any message; to include a variable you just open curly braces and write the variable name. That's how you join strings. Okay, now it's time to see a data type that is used often in data analysis: I'm talking about lists. In Python, lists are used to store multiple items in a single variable. Lists are ordered and mutable containers; in Python we call an object mutable when its values can change, that is, the elements within a list can change their values. To create a list we write the elements inside square brackets, separated by commas. So let's create our first list.
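Before we build the list, here is the f-string join described above as one runnable sketch:

```python
message_1 = "I'm learning Python"
message_2 = "and it's fun!"

# Variables go inside curly braces; the space typed between them is kept
message = f'{message_1} {message_2}'
print(message)  # I'm learning Python and it's fun!
```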
first we have to set the name of the list, in this case countries, and to create the list we open square brackets, as I said before. Inside we write the elements; this countries list is going to contain just strings, the names of countries. The first element is "United States", and to write the second we use a comma, so: comma, then "India", then two more, "China" and finally "Brazil". So these are the four countries. As you can see, the square brackets represent the list and we have four strings inside, and this is how you create a list. Now I run this cell, and to see the content I paste the name of the list and run it. Here I included only strings, but keep in mind that lists can have elements of different types, for example one string, then an integer, then a float, and so on. Lists can also have duplicated elements: I could write "United States" twice here and that would be okay, but I don't want it that way, so I'm going to delete it and leave the list as it is. Okay, now if we want to get an element inside this list, we have to use something called indexing. By indexing we can obtain an element by its position: each item in a list has an index, which is its position in the list. Python uses zero-based indexing, that is, the first element, "United States", has index 0, the second, "India", has index 1, and so on. To access an element by its index we need to use the square brackets again, so let's see some examples. Let's start by getting the first element, "United States": we write the name of the list, in this case countries, then open square brackets, and inside
the square brackets we write the position of the element. Indexes start at zero, so we write 0 to get the first element; we run the code and we get "United States". If we write countries[1] we get "India", countries[2] gives us "China", and countries[3] gives us "Brazil". To verify this I'm going to print each of them, and when I run it we get each element of the list, "United States" first, then "India", then "China", then "Brazil", so it's correct. This is the most common way to use indexing, but there is also negative indexing, which helps us get elements starting from the last position of the list: instead of using indexes from 0 and above, we use indexes from -1 and below. Let's get the last element of the list, "Brazil", but now using a negative index. Before we did it with countries[3]; now I write countries[-1], and this -1 represents the first element starting from the last position. So "Brazil" is -1, "China" is -2, "India" is -3, and "United States" is -4; that's how it works. I run countries[-1] and we get "Brazil". Let's do this one more time: in this case I want to get "United States", and counting from the end that's -1, -2, -3, -4, so it's countries[-4]. We run it and we get "United States", but now using a negative index. Okay, now let's see something called slicing. Slicing means accessing parts of a list; a slice is a subset of the list's elements.
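The indexing rules we just walked through, gathered in one sketch:

```python
countries = ['United States', 'India', 'China', 'Brazil']

# Zero-based indexing from the front
print(countries[0])   # United States
print(countries[3])   # Brazil

# Negative indexing counts from the end
print(countries[-1])  # Brazil
print(countries[-4])  # United States
```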
The slice notation takes the form of the list name, then square brackets with the start, a colon, and the stop. The start represents the index of the first element, and the stop represents the element to stop at, without including it in the slice. Let's see some examples; I'm going to use this countries list again, so I copy it, paste it here, and open square brackets. Say we start at position 0 and want everything up to position 2: we have to write countries[0:3], because it stops at 3 without including the element at position 3. We run it and we get the elements at index 0, 1 and 2; it didn't include index 3. Now let's say we want just the first element: we write 0:1, which gives only index 0, because it stops at 1 without including it. I run it and we get only "United States". Now let's try something different: say we want the elements from index 1 to the last one, from "India" to "Brazil", which are indexes 1, 2 and 3. We have to write countries[1:4], because it stops at 4, and we get "India", "China" and "Brazil". That's one way to do it, but another way is to delete the stop and just leave countries[1:]; we run the code and we get the same result. So every time you want everything from one position to the last, you can omit the stop element. The same goes for the start: say we want from the first position, index 0, up to 2; we don't include the start element and write only the colon and the 2, countries[:2]. We run this and we get "United States" and "India", the first and the second elements. So every time we want a slice starting from the first element, or one that runs until the last element, we can omit the start or the stop element, as we did in these two examples.
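Here are the slicing examples from above in one runnable sketch:

```python
countries = ['United States', 'India', 'China', 'Brazil']

# start:stop – the element at the stop index is NOT included
print(countries[0:3])  # ['United States', 'India', 'China']
print(countries[0:1])  # ['United States']

# Omitting the stop slices to the end; omitting the start slices from the beginning
print(countries[1:])   # ['India', 'China', 'Brazil']
print(countries[:2])   # ['United States', 'India']
```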
Okay, now let's see how we can add elements to a list; there are different methods that help us do that. The first one is called append, and we're going to use the countries list as an example, the one with four elements. Say we want to add a new country to this countries list: we write countries.append(), and as you can see this is a method, so inside the parentheses we write the new country we want to add, in this case "Canada". We run this code and nothing is printed, but if we print the countries list again we see a new element: the append method adds the new element at the end of the list. That's the default behavior, but what happens if you want to add an element at a different position? Here you can use another method, called insert. Let me show you: I write countries.insert(), and this one accepts two arguments, the first is the index, the position where you want to insert the element, and the second is the new element you want to add. So let's say we want to add the element "Spain" at the first position, index 0. I run it, and again apparently nothing happens, but if I run the countries list again we can see there is a new element, "Spain", located at the first position, unlike "Canada", which was placed at the last position. This is one of the
differences between the append method and the insert method: with insert we can specify the position where we want to insert the new element, while with append the element is added at the last position. Another thing you can do is join two lists using the plus operator. We used the plus operator to concatenate strings before, but you can also join two lists with it. Let me show you: I'm going to create a new list, called countries_2, with different countries, in this case "UK", "Germany" and "Austria", so we have three countries in this new list, and I run it. Now, if we want to concatenate the first list, countries, with this second list, countries_2, we can use the plus operator: I write plus between them and run it, and as you can see I got the five elements from the first list and the three elements from the second list. Another cool thing you can do in Python is put these two lists inside another list, which is called a nested list. Let's try it out: I create a new list called nested_list, open square brackets to create it, and as elements I write countries, which is my first list, then a comma, then countries_2, my second list. So the elements inside this list are themselves lists: we have lists inside another list, which is called a nested list. I run it, then paste nested_list and run again, and we get the first list as the first element and the second list as the second element. You won't see nested lists that often, but you will encounter them a couple of times, so it's good for you to know. So now we're going to see the opposite of adding an element to a list, which is removing an element. Here I just pasted the countries list we had before, and that's what we're going to work with.
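Putting the adding and joining techniques above together in one sketch:

```python
countries = ['United States', 'India', 'China', 'Brazil']

countries.append('Canada')    # append always adds at the end
countries.insert(0, 'Spain')  # insert takes an index and the new element

countries_2 = ['UK', 'Germany', 'Austria']
joined = countries + countries_2        # one list with all nine elements
nested_list = [countries, countries_2]  # a list whose two elements are lists

print(joined)
print(nested_list)
```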
We're going to remove some of the elements of this list; there are different methods that help us remove an element from a list. One of them is the remove method: we first write the name of the list, then a dot, then remove with parentheses, and inside we write the element we want to get rid of. First it's "United States", so I write "United States" and run it, and apparently nothing happens, but if we paste countries here we have all the elements except "United States": as you can see, the first matching value was removed. You can also remove an element by its index, and this is accomplished with the pop method. I copy all of this and paste it here, and instead of .remove I write .pop, and here we don't use the name of the element but its index. Let's remove the last one, index -1: what pop is going to do is remove the element with index -1 and then return that element, which is "Canada". (I didn't run the earlier cell, so you can ignore it; I'm going to comment it out, and our reference is going to be this list.) To verify, we write countries and run it, and as you can see "Canada" isn't there anymore; that's how you remove an element using the pop method. But there's still another way to remove an item at a specific index, and it's del. Let me show you: del is a Python keyword, and we write del, then the countries list, then again open square brackets and write the index. Unlike the pop method, we're not going to get back the element we're getting rid of; we're just deleting it. I run this, we don't get anything back, and when I print countries we see that the element at index 0, "Spain", was removed, since that was the first element.
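The three removal techniques above, sketched on the same list:

```python
countries = ['Spain', 'United States', 'India', 'China', 'Brazil', 'Canada']

countries.remove('United States')  # remove by value (first match)
last = countries.pop(-1)           # remove by index AND return the element
del countries[0]                   # remove by index, nothing returned

print(last)       # Canada
print(countries)  # ['India', 'China', 'Brazil']
```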
So we deleted the first element, and we're left with only "India", "China" and "Brazil". And there you have it: three different ways to remove an element from a list. Okay, now let's see how to sort a list. We can easily sort a list using the sort method. Let's create a new list called numbers and then sort it from the smallest to the largest number. First I write numbers, then open square brackets and write some random numbers: 4, 3, 10, 7, 1 and 2. That's my list, and I run this code. Now, to sort it from the smallest to the largest number, we write numbers.sort() with parentheses; by default this sorts from the smallest to the largest number. I run numbers again and it starts with 1 and ends with 10, so it goes from the smallest to the largest number. That's the default behavior of the sort method, but we can control how it works: we can add the reverse argument to the sort method to control the order, and if we want it to be descending we set reverse to True. So I create the numbers list again, write numbers.sort(reverse=True), and then print numbers. Here I got an error because I wrote number instead of numbers, so I add the s in both places and run again, and now we see that the list is sorted from the largest number to the smallest. So the default behavior of the sort method is reverse=False, and you can control it by writing reverse=True, as we did here. Okay, now let's see how we can update values in a list. To update a value in a list, we use indexing to locate the element we want to update, and then we set it to a new value using the equal sign. Let's say we want to update the first element of the numbers list.
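The sorting behavior described above, as one sketch:

```python
numbers = [4, 3, 10, 7, 1, 2]

numbers.sort()              # ascending is the default (reverse=False)
print(numbers)              # [1, 2, 3, 4, 7, 10]

numbers.sort(reverse=True)  # reverse=True sorts descending
print(numbers)              # [10, 7, 4, 3, 2, 1]
```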
Right now the first element is 4, but we want it to be, let's say, 1000. We write numbers and use indexing: the first element has index 0, so we write numbers[0] and set it equal to the new value we want, in this case 1000. Now I print the numbers list to see the results. When I run it, note that the numbers list we get reflects the last change we made, so it's the version that starts with 10, not the earlier one, because that reversed sort was the last cell we ran. Instead of 10, the first element was replaced with 1000, because 10 was the element with index 0: we did numbers[0] = 1000 and updated that first element. Okay, finally, we can make copies of the lists we created. There are different options to create a copy of a list, and one of them is the slicing technique. As you might remember, to do a slice we first write the name of the list, which in this case is countries, and then open square brackets; we're supposed to write the start and the stop, but this time we write only the colon. If we don't write a start and we don't write a stop, it means we want the whole list. Let's try it out: I run this and we get the whole list. Note that the countries list doesn't have the original values anymore, because of the changes we made when we added and removed elements, so I'm going to paste the original countries list with the four original values, "United States", "India", "China" and "Brazil", and test it again: we get the whole list, from the first element "United States" to the last element "Brazil", because we're slicing the whole list. So if we write new_list and set it equal to countries with this slice, this new list is going to have the same values as the countries list. I write new_list here and run it.
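Updating by index and copying with a full slice, as just described:

```python
numbers = [10, 7, 4, 3, 2, 1]
numbers[0] = 1000        # indexing on the left side replaces that element
print(numbers)           # [1000, 7, 4, 3, 2, 1]

countries = ['United States', 'India', 'China', 'Brazil']
new_list = countries[:]  # a slice with no start and no stop copies the whole list
print(new_list)          # ['United States', 'India', 'China', 'Brazil']
```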
As you can see, it has the same values, so we created a copy of the countries list. That's one way to create a copy; the second way is more straightforward, or more explicit, and it's using the copy method. We write countries, the name of the list, and then the copy method, copy with parentheses, and with this we create a copy of the list. We run this code and it returns the list, and if we assign this to a new list we create a copy: I write new_list_2 and assign this copy to it, then I paste new_list_2 here, and as you can see it has the same values as the original countries list. And that's it, that's how you make a copy of a list. So now let's see how dictionaries work in Python. In Python, a dictionary is an unordered collection of items used to store data values, and a dictionary contains keys and values. Here, for example, the name of my dictionary is my_dict, and to create this dictionary we have to use curly braces: we open curly braces and inside we write our first item. The first item consists of a key on the left and then a value, separated by a colon, so we have the key, then the colon, then the value, and then we have the second item, the second key and the second value. So now let's create a dictionary that has some basic information about me. I'm going to name this dictionary my_data, and to create it I open curly braces. The first key is going to be "name", and it has a value that is my name, so I open single quotes and write "Frank". Then I'm going to add a new item: I write a comma, and the second key is going to be "age", with my age as the second value, which
is 26. As you can see, the first value is a string and the second is an integer, so we can mix different data types. Now I press Ctrl+Enter to run this code, and we've created the dictionary: I write my_data and here's the dictionary we created. We can get the keys of this dictionary: we only have to write my_data.keys(), the keys method. We run this and we get a dict_keys object whose values are "name" and "age", the keys of the dictionary we created, "name" the first key and "age" the second. We can also get the values, my name and my age, using the values method: I paste the same thing, but instead of .keys I write .values, run it, and we get my name and then my age. Next I'm going to get the items. As I said before, an item is a pair of key and value: this is the first item and this is the second item. We can get them using the items method, so instead of .values I write .items and run it. We got the first item, the pair of key and value "name" and "Frank", and then the second item, the key "age" and the age, 26. Now we can add a new key-value pair to the dictionary we created. Let's say we want to add my height: I write my_data, then square brackets with the new key name "height" inside, and then set it equal to the value, let's say 1.7. So it's my_data["height"] = 1.7; if I run this and then run the dictionary, we can see there is a new item, "height": 1.7. That's how you add a new item to a dictionary. Now we can update this height: let's say I'm not 1.7 but 1.8 m tall.
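The dictionary operations so far, gathered into one sketch:

```python
my_data = {'name': 'Frank', 'age': 26}

print(my_data.keys())    # dict_keys(['name', 'age'])
print(my_data.values())  # dict_values(['Frank', 26])
print(my_data.items())   # dict_items([('name', 'Frank'), ('age', 26)])

my_data['height'] = 1.7  # square brackets with a new key add an item
print(my_data)           # {'name': 'Frank', 'age': 26, 'height': 1.7}
```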
What we can do is use the update method to update this value. I write my_data.update(), and inside the parentheses we open curly braces with the item to update: I write the key, "height", and then set the new height, 1.8. Let's try it out: I run this, then check the values to see if it was updated, and we get the height 1.8, perfect. So now let's see how we can make a copy of a dictionary, the same way we did before for lists. To make a copy we just write the name of the dictionary, in this case my_data, and then, just as we did for the list, we use the copy method: we write copy with parentheses and we create a new copy. I can assign this to a new dictionary: I write new_dict, run it, then write new_dict and run again, and as you can see it has the values of the my_data dictionary. Something I didn't tell you when we made the copy of the list: if you change the data inside the my_data dictionary, the old dictionary, the effect is not going to be seen in the new dictionary. For example, if I write 1.9 and update this in the old dictionary, you can see "height": 1.9 there, but if we run new_dict we can see that its height keeps the same value, 1.8; it doesn't change to 1.9. This doesn't happen with the kind of "copy" most people make, so let me show you what I'm talking about. Most people just make a copy by doing new_dict_2 = my_data, where my_data is the old dictionary and new_dict_2 is the new one. If I run this and then show the values of this new dictionary, the height is 1.9, and if I update the old dictionary to, say, 1.95 and then run new_dict_2, we can see that its value was updated too, and this shouldn't happen.
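Here is the difference between a real copy and a plain assignment, sketched:

```python
my_data = {'name': 'Frank', 'age': 26, 'height': 1.7}
my_data.update({'height': 1.8})  # update replaces the value for an existing key

new_dict = my_data.copy()  # an independent copy
alias = my_data            # NOT a copy: both names point to the same dictionary

my_data['height'] = 1.9
print(new_dict['height'])  # 1.8 – the copy keeps its own value
print(alias['height'])     # 1.9 – the alias follows every change
```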
If you want to create a new dictionary that works independently from the old dictionary, you should use the copy method, and the same goes when you're making a copy of a list. Finally, let's see how to remove elements from a dictionary. Just like we did with lists, we can remove an item from a dictionary, and there are different options. First we have the pop method: I write my_data (I'm using the old dictionary we've been working with so far) and then .pop, the pop method, and inside the parentheses I write a key, in this case the key "name". As you might remember, the pop method returns something: before, with the list, it returned the list element, and in this case it returns the value of the key, so for the key "name" it returns the value. If we print the my_data dictionary, we see that this key-value pair isn't there anymore; we successfully removed this item. Another way to remove an item from a dictionary is using the del keyword: we write del, then the name of the dictionary, my_data, and then we have to specify the name of the key again, so we open square brackets and quotes, and let's say we want to remove the "age" key with its value, so we write "age". We run this, and if we print the dictionary again we see that the "age" key was removed along with its value. And finally, you can remove all the items in a dictionary with the clear method: we write my_data.clear() with parentheses, and if we print the dictionary now, you can see it's an empty dictionary, because we removed all its elements. Now let's see one of the most common statements used in Python: the if statement. The if statement is a conditional statement used to decide whether a certain
statement or block of statements will be executed or not. Here you can see the syntax of the if statement. It starts with the if keyword followed by the condition: if the condition is true, the code below it is executed, and if the condition is false, the elif block is tested. In the elif block a new condition is tested: if it's true, the code below it is executed, but if it's not, the else block runs. The else is the last block, and its code is executed automatically when none of the conditions above were true. One little detail most beginners forget to write is the colon; it's easy to forget it's there, but you have to include it. Another thing some people miss is the indentation: there is an indentation you have to include after the colon. Every time you write the colon and press Enter, most text editors will give you this indentation automatically, but if for some reason you don't get it and you get something unindented, you can indent the line by pressing the Tab key on your keyboard. So make sure you write the colon and include an indentation for each block of code that will be executed, here, here and here. Now let's look at some examples to see much better how the if statement works. First I'm going to create a new variable; as you might remember, to create a variable you write its name, in this case age, and then set it to a value, in this case the number 18. Now I write the if statement: if age is greater than or equal to 18, then a colon, and then the code below is executed. So if this is true, I'm going to print a message: if the age is greater than or equal to 18, I'm
going to write the message "You are an adult". As you can see, I was using single quotes and wrote the apostrophe, so I switch to double quotes and everything is fine now. If the condition isn't true, I write else, then a colon, and print a new message, "You are a kid". So let's see this again: if the age is greater than or equal to 18, we print "You are an adult", but if it's less than 18, we print "You are a kid". We run this code, and since 18 is equal to 18 we get the message "You are an adult". Now we can play with this and change the age value: I set it to 15 and run it, and since 15 is less than 18 the condition is false, so the else block is executed and we get "You are a kid". Let's try one more time with another age, 30: 30 is greater than 18, so the if block is executed and we get "You are an adult". Now let's add a new block using elif: I write elif, then age greater than or equal to, let's say, 13, then a colon, press Enter to get the indentation, and print another message. So if the age is greater than or equal to 13, we write the message "You are a teenager": if it's between 13 and 17, so less than 18, it's going to be "You are a teenager", but if it's less than 13, it's going to be "You are a kid". Let's try it out: first I write 10 and we get "You are a kid", because it's less than 13; then I change it to 14 and we get "You are a teenager", because 14 is greater than 13; and finally I write 20 and we get "You are an adult", because 20 is greater than 18. And that's it, that's how the if statement works. Now it's time to see one of the most common loops in Python: the for loop. Python for loops are used to loop through an iterable object and perform the same action for each entry.
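The complete age check from this section, as one runnable sketch:

```python
age = 14

if age >= 18:
    message = "You are an adult"
elif age >= 13:
    message = "You are a teenager"
else:
    message = "You are a kid"

print(message)  # You are a teenager
```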
One example of an iterable object is a list: we can loop through each element of a list and perform the same action on each element. Here you can see the syntax of the for loop: there's the for keyword, then a variable, then the in keyword, and then the iterable; as I told you before, the most common iterable is a list, so I'm going to write "list" here so you can see it better, for variable in list. Then we have to write the colon, and after the colon goes an indentation, and then the code that will be executed on each iteration of the for loop. To see this much better, I'm going to use the countries list we created before and loop through it. I write for, and then we have to set a variable that exists just temporarily; this variable is going to be called country. So: for country in, and then the name of the iterable, which in this case is the list countries. So for country in countries, then the colon, then Enter, and we get the indentation, and then we say print(country). What we're saying in this for loop is: for this variable in this iterable, print each element. We run this, and as you can see each element of the list is printed: we're looping through the countries list and printing each element, first "United States", then "India", then "China", then "Brazil". This is how the for loop works. Now let me show you a new function that you can use along with a for loop: it's called enumerate. I write enumerate here and put the countries list inside this function. What enumerate does is number each element of the countries list as we loop through it. I'm going to add here a new variable, and it's going
to be i, then a comma, then country. This enumerate will return two things: the first is the number of the loop and the second is the element itself. So apart from the country, I also print the i variable I just created temporarily here: I write print(i) and then print(country), so we print the number of the iteration and the element. I run it with Ctrl+Enter and there we have it: first "United States" in the first iteration, which is 0, then "India" in the second iteration, which is 1, and so on. As you can see, i starts at zero: this is how enumerate works, it starts with the number 0 and returns the number of the loop along with the element. And finally, let's loop through the elements of a dictionary. Let's use the dictionary we created before, my_data; well, that one is empty now, so I'm going to use the original dictionary, which I print here. Now we're going to loop through this dictionary, so let me show you. First we write for, and then we write key and value, because one item, as you might remember, is made of a key and a value. So we say for key, value in, and then the name of the dictionary, my_data, and to get the items of this dictionary we have to use the items method, so we write .items() with parentheses, then the colon, and press Enter. Here we can print the key and also the value, so print(key) and print(value), and then we run this code. We get the first key and its value, "name" and "Frank", and then the second key, "age", and the age, 26. This is how you loop through the items inside a dictionary. Okay, now let's see how functions work in Python. A function is a block of code which only runs when it is called; you can pass data, known as parameters, into a function.
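The three loop patterns from this section, side by side:

```python
countries = ['United States', 'India', 'China', 'Brazil']

# Plain for loop over a list
for country in countries:
    print(country)

# enumerate returns the loop number (starting at 0) and the element
for i, country in enumerate(countries):
    print(i, country)

# Looping over a dictionary's key/value pairs with .items()
my_data = {'name': 'Frank', 'age': 26}
for key, value in my_data.items():
    print(key, value)
```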
Here is the syntax of a function: first we use the keyword def to create the function, then we write the function's name, and inside parentheses we define its parameters. Then we write a colon, and below we write the code. A function should generally return something, so we use the return keyword and return something, like a variable for example. Now let's create a basic function. First we write def, then the name of the function. This function is going to do something really simple — it's going to sum the values we pass into it — so it's going to be named sum_values, and as parameters we set a, b, then a colon, and press Enter. What this function does is add a + b and set the result equal to x, so we write x = a + b, and as I told you before, the function should return something, so we write return x. And that's it — that's how you create a function. I ran this code and apparently nothing happens, but the function was created. To use this function we have to call it: we write the name of the function and pass some values in — when you call the function these are called arguments. I'm going to pass the arguments 1 and 3. When you call the function, it sets this 1 equal to a and this 3 equal to b, so you have 1 + 3, which is 4, so x is going to be equal to 4, and the function returns the value of x, which is 4. We run this, we get the value 4, and the function is working properly. Okay, now let's see some of the built-in functions that Python has — Python has lots of built-in functions.
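The sum_values walkthrough above, written out as one cell:

```python
def sum_values(a, b):
    # add the two parameters and return the result
    x = a + b
    return x

# calling the function: 1 is assigned to a, 3 to b
result = sum_values(1, 3)
print(result)  # 4
```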
These built-in functions can help us perform specific tasks; let's have a look at some of them. Let's start with the len function. We only have to write len and open parentheses — notice that Jupyter Notebook gives functions a green color. Now let's calculate the length of the countries list: I copy the list and paste it inside the parentheses. What the len function does is calculate the length of any iterable object, and the countries list is an iterable object. I run it and, as you can see, the length is 4 — that's how the len function works. Now let's see a different function, and for this I'm going to create a new list that contains only numbers — some random numbers: 10, 63, 81, then 1, then 99. I created this list with only numbers to try the max and min functions. The max function is written max with parentheses, and it returns the item with the highest value in an iterable. My iterable is this list, so we get the highest value among its elements: we run it and the maximum value is 99. We can also use the min function, which has the opposite effect — it gives us the minimum value of the list. We run it and we get 1. Okay, another common function used in Python is the type function, which gives us the type of an object. We only have to write type, and what this function does is return the type of the object. Let's copy and paste the countries object: if we run this, we see that the object is a list, and that's correct, because we created it with square brackets. That's what the type function does. Finally, the last function we're going to see is the range function, which returns a sequence of numbers.
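The len, max, min, and type calls above, collected into one cell:

```python
countries = ["United States", "India", "China", "Brazil"]
numbers = [10, 63, 81, 1, 99]

print(len(countries))   # 4 -> length of any iterable
print(max(numbers))     # 99 -> highest value in the iterable
print(min(numbers))     # 1 -> lowest value in the iterable
print(type(countries))  # <class 'list'>
```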
The sequence starts with one number and ends before another — let's see how it works. This function takes three arguments. First the start number — I'm going to write 1. Then the number where the sequence stops — let's say 10. And the last argument is the step, that is, by how much the sequence is going to grow — in this case I'm going to say it grows by 2, so I write 2. Now I run it and, as you can see, nothing much happens — we only get the range printed back. But if we make a loop — for i in range(1, 10, 2): and then print(i), a for loop like we saw before — and run it, we iterate over this range and get the elements inside it. The first element is 1, and each next one is incremented by 2: 1 + 2 is 3, then 3 + 2 is 5, then 7, then 9. Next we should get 11, but the stop value here is 10, so the sequence stops there and we only get up to 9. That's how the range function works in Python — and that's it, now you know the most common built-in functions in Python. Okay, in this video we're going to see what modules are in Python. In Python, modules are files that contain Python code: a module can have classes, functions, and variables, and even runnable code. To get access to a module we have to use the import keyword. To see a module in action, we're going to look at the os module, and this one comes with Python, so you don't need to install it. To get access to the os module we write import os — that's it, we only write this. Now let's see some of this module's functionality. The first thing we're going to see is the getcwd method: to access it we write os.getcwd() with parentheses. cwd stands for current working directory, so we're going to get the directory where our Jupyter Notebook file — the file I'm working with right now — is located. So let's run it.
Let's see what happens: as you can see, I get the path where the Jupyter Notebook is located — the complete path — and you can see it by using the getcwd method. Now let's see another method; in this case we're going to list all the elements in the folder where this Jupyter Notebook file is located. To do that we use the listdir method, which means list directory. I run it and, as you can see, I have this Jupyter Notebook file, named Untitled. The other elements you can ignore — they aren't files, just some hidden items in my folder, and they don't matter. Right now the only file I have in this folder is the Untitled file. That's what listdir does: it lists all the elements in the folder where this Jupyter Notebook file is located. Now let's see the last method, which helps us create a new folder. This method is called makedirs, so we write os.makedirs() and inside the parentheses the name of the folder we want to create — in this case I'm going to name it "new folder", simple as that. If we run this, it seems that nothing happens, but if we use the listdir method again to list all the elements in my folder, we can see there is a new folder. Comparing the result we got before with this new result, there is one new element, and it's the folder we created with the makedirs method. And that's it — those are some basic things you can do with the os module. In the following videos we're going to install different libraries, packages, and modules so we can do even more things in Python. In this first introduction to pandas we're going to learn what pandas is, compare pandas with Excel, and then learn what pandas data frames are. So first: pandas is probably the best tool for doing real-world data analysis in Python.
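The os walkthrough above, collected into one runnable cell (the folder name is the lesson's; exist_ok=True is a small addition so the cell can be re-run without raising an error):

```python
import os

print(os.getcwd())   # the current working directory
print(os.listdir())  # everything in that directory

# create a folder; exist_ok=True avoids an error if it already exists
os.makedirs("new folder", exist_ok=True)
print("new folder" in os.listdir())  # True
```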
It allows us to clean data, wrangle data, make visualizations, and more. You can think of pandas as a supercharged Microsoft Excel, because most of the tasks you can do in Excel you can also do in pandas, and vice versa. That said, there are many areas where pandas outperforms Excel, so before you learn pandas, let me show you why you should — especially if you already know Excel. There are some benefits that pandas has over Excel, or Python has over Excel, so before dedicating time to learning pandas and Python, let's see what these benefits are. First, limitation by size: Excel can handle around one million rows, while Python can handle many millions of rows. Another benefit that Python and pandas have over Excel is complex data transformation: in Excel, memory-intensive computations can crash a workbook, while in Python, when you work with pandas, you can handle complex computations without any major problem. Also, Python is good for automation, while Excel was not designed to automate tasks — you can create a macro or use VBA to simplify some tasks, but that's the limit; Python can go beyond that with its hundreds of free libraries. And finally, Python has cross-platform capabilities: Python code remains the same regardless of the operating system or language settings on your computer. Okay, before we start writing code, let me explain the core concepts of pandas. We're going to start with the concept of arrays. Arrays in Python are a data structure like lists: there are one-dimensional arrays and two-dimensional arrays, also known as 2D arrays. The two main data structures in pandas are Series and DataFrames: the first is a one-dimensional array, while the second, the data frame, is a two-dimensional array. In pandas we mainly work with data frames, but if the definition of a data frame in terms of arrays wasn't clear, let me show you another definition, this one using Excel.
A pandas data frame is the equivalent of an Excel spreadsheet. Data frames, just like Excel spreadsheets, have two dimensions, or axes: one is the rows and the other is the columns. A column is also known as a Series — the one-dimensional array we saw before, the Series, is a column; it's just another name for the columns of a pandas data frame. On top of the data frame you'll see the names of the columns, and on the left side there is the index; by default the index in pandas starts with zero. The intersection of a row with a column is called a data value, or simply data, and we can store different types of data there, such as integers, strings, booleans, and so on. Right now you see on screen a data frame that shows the US states ranked by population. I'm going to show you the code to create a data frame like this later, but for now let's analyze it. The column names are also known as features — our features here are states, population, and postal — while each row of values is known as an observation. We can say there are three features and four observations, because there are three columns and four rows. Keep in mind that a single column should hold a single type of data: in our example, the states and postal columns contain only strings, while the population column contains only integers. We might get errors when trying to insert different data types into a column, so avoid mixing types. So now let's see the terminology translation between Excel and pandas. As I mentioned before, in Excel we work with worksheets, while in pandas we work with data frames. The columns in Excel are known as Series in pandas, though we also often just say columns. In pandas we work with an index — those numbers on the left — and we also say rows; we could say observations too, but rows is fine. And finally, in pandas we often work with NaN, which stands for not a number and is the equivalent of an empty cell that you might find in Excel.
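The states-by-population frame described above can be built like this (the population figures here are illustrative placeholders, not the video's exact numbers):

```python
import pandas as pd

df = pd.DataFrame({
    "states": ["California", "Texas", "Florida", "New York"],
    "population": [39_500_000, 29_100_000, 21_500_000, 20_200_000],
    "postal": ["CA", "TX", "FL", "NY"],
})
print(df.shape)   # (4, 3): 4 observations (rows), 3 features (columns)
print(df.dtypes)  # states/postal are object (strings), population is int64
```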
And that's it for now — in the next video we're going to learn how to create a pandas data frame from scratch. Welcome back. In this video we're going to learn different ways to create a pandas data frame. As you might remember, a data frame looks like this: it has columns and rows, and the columns are Series. Series are 1D arrays, and arrays are how we create a data frame, so this is the first way to create a data frame: with arrays. These are arrays — we have 1D arrays and 2D arrays — and 1D arrays are basically columns, while 2D arrays are data frames. Usually, to work with arrays we use a library named NumPy, and NumPy is what's under the hood of pandas. To use NumPy we first have to import it — we'll do that a bit later when we write code — but just to give you an idea of what a NumPy array looks like, here I wrote a basic array: we use np.array to create the data frame you see on the right. Well, that's one way to do it; you can also use lists, as I'm showing you right now, and as you can see in this second option, when you create a data frame with lists you don't need NumPy arrays, because the lists play the role of the arrays. We're going to write the code to create a data frame with arrays, but first let's see the second option for creating a data frame: dictionaries. You can create a data frame from a dictionary, and as you might remember, a dictionary has keys and values, so we can use the key as the column name and the value as the data. The value can be a list, so the data will be the elements inside that list. A pair of key and value is known as an item in a dictionary, and here each item is going to be a Series, because each one is a column. That's the second way to create a data frame, with dictionaries, and we're going to see that with code a little later. Now let's see the third way, which is with CSV files — files that can be opened in spreadsheets like Excel.
This is the easiest way to create a data frame, because we only need to read the CSV file and the data frame is created — that's it. So now let's go to Jupyter Notebook and create a data frame by writing some code. Okay, we're in Jupyter Notebook, and here we're going to write the code to create a data frame using the three ways I showed you before. The first thing to do is import the libraries we're going to use, and that's the first line of code — I already wrote it, it's here: first we import pandas, then we import NumPy. We write import pandas as pd — pd is just a convention for naming pandas — and np is the conventional way to name NumPy. To run this code just press Ctrl+Enter, wait a moment, and we've imported pandas and NumPy. Let's see the first way to create a data frame: with arrays. To create an array we have to use NumPy — this is the first option — so we write np, the short name for NumPy, and then we use the array method: we write array, open parentheses, and inside we write the array we want to create. I'm going to write some random numbers just for the sake of this example: I open double square brackets, then write, let's say, 1 and 4, then 2 and 5, and the last one is going to be 3 and 6. Each pair — let's call them lists, because that's what they actually are — represents a row: this is going to be the first row in our data frame, this the second, and this the third. We can name this array, and I'm going to name it data, so data is equal to this NumPy array. I execute this code and now we have the data — we created the array using NumPy. Now let's create a data frame with pandas. To create a data frame with pandas we have to write pandas — in this case I can write pd, because that's how I named it in my first line of code.
So I write pd, and then to create a data frame we use the DataFrame method: we write .DataFrame, open parentheses, and fill in some arguments. The first argument — the one you always have to include in this DataFrame method — is the data, because you cannot create a data frame without data. So first we include the data: I copy our array and paste it here as the first argument. You can create the data frame as it is — I'll show you, just Ctrl+Enter — and here is my data frame, but as you can see it's full of numbers: the column names are numbers and the row names are numbers too. To make it more understandable we can rename these column names and row names — actually, the row names are called the index. First let's name the index: you only need to add the index argument, as I'm writing right now, and then specify the names you want to set, in the form of a list. The first element is going to be the first index — here it's 0, and in case you don't want it to be 0 you can set another name — so in my case I'm going to set it to row1, then a comma to set the second index to row2, and the third to row3. We can also modify the column names; for that we use the columns argument: we write columns and open square brackets, because it's a list we're going to edit, and in this case we only have two elements to name — I'm going to name the first col1 and the second col2. I write that, and I'm also going to name this data frame — I'm going to assign it to a variable, and it's going to be df. df is the common way to name a data frame, since it stands for data frame. I'm going to run this code now.
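The full call built up in this walkthrough looks like this:

```python
import numpy as np
import pandas as pd

data = np.array([[1, 4], [2, 5], [3, 6]])  # each inner list is a row

df = pd.DataFrame(data,
                  index=["row1", "row2", "row3"],
                  columns=["col1", "col2"])
print(df)
```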
As you can see, it ran, and now to show the data frame I can just write df. Here is the data frame: the first row, 1 and 4, is my first list, the second is the second row, and the first column — well, that's a Series, as we discussed before. We also have the column names that we modified and the row names. Now let's quickly see how to create a data frame with arrays but without NumPy. I copy this line of code and paste it here as option two, because this is the base of the array-of-lists shape, and I just delete the np.array part — I don't want the NumPy array anymore, just the double square brackets. I run this, and creating the data frame works the same way as before: I copy the DataFrame call, paste it here, run it, then write df and execute. As you can see, we get the same result — I'm just showing you this second way so you don't have to worry about learning NumPy right now. Okay, now let's create a data frame from a dictionary; we're going to use lists in this example, and we're going to use more meaningful data. In this case, to create the dictionary I'm going to use two lists: the first is going to be a list named states, and the second is going to be population, and it will contain the population of each state. The first list is states, so I write it here and open square brackets, because this is a list, and write some states in the US: the first is California, the second is going to be Texas, the third Florida, and the last one New York. I quickly write those, and now I create the population list — in this case I'm just going to paste in the population of each state. So I pasted the populations, and now I'm going to create a dictionary from these two lists.
I write the name of the dictionary — it's going to be dict_states. This is a dictionary, so I should use curly braces, and now I set the name of the first key, States, then a colon, and then the value, which is the states list. The second key and value pair is Population — I'm going to write it with a capital letter — and the population list we have here. With this we create our dictionary, so I run these two cells, and now we have the lists and the dictionary. Now we can easily create a data frame using the DataFrame method we used before, for the first option, when we created a data frame from an array. To do it, we write pd.DataFrame and, inside the parentheses, the name of the dictionary, so I copy dict_states, and I assign the result to a new variable that I'm going to name df_population — a data frame about population. I run this and I get an error, because I didn't write DataFrame correctly — it has to be in capital letters — so I run it again and now everything is okay. To show the data frame I just paste the variable name here and run. Here is the data frame, and as you can see, my first key, States, is the name of my first column, and the data inside the states list is here — that's my first column, or first Series — and the same goes for Population with its data. So here we created a data frame using a dictionary. Okay, finally, let's create a data frame from a CSV file. To create a data frame from a CSV file we have to use the read_csv method: first we write pd, which stands for pandas, as usual, and then we use the method — we write read_csv, open parentheses, and then the name of the CSV file. I'm going to paste the name here: it's students performance.csv, and to download this data you can check the notes of this video. Actually, we can have a look at this data before importing it into pandas.
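Option two (a dictionary of lists) and option three (a CSV file) side by side; the population numbers are placeholders, and the read_csv line assumes the lesson's CSV file sits next to the notebook, so it is left commented out:

```python
import pandas as pd

states = ["California", "Texas", "Florida", "New York"]
population = [39_500_000, 29_100_000, 21_500_000, 20_200_000]

# option two: a dictionary of lists — keys become column names
dict_states = {"States": states, "Population": population}
df_population = pd.DataFrame(dict_states)
print(df_population)

# option three: reading a CSV file (uncomment when the file is present)
# df_exams = pd.read_csv("students performance.csv")
```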
Here it is — I have it in Google Sheets, and as you can see we have the scores of some exams, math, reading, and writing, plus some other data. We can import all of this data, all 1,000 rows, into pandas — all of this is going to be here. We only have to define the name of this data frame: I'm going to name it df_exams. I run it, and to show the first five rows of this data frame we can use a method named head, which we're going to see in detail later; just to give you an idea, we write .head() and we get the first five rows. As you can see, these are the first five rows of this CSV file — for example, the first row says female, group B, and a math score of 72. Let's check that the data matches: in the sheet we have female, group B, and math score 72, so we have all this data here in this data frame. If we want to see all of the rows, we can drop the head call, and now we have all the rows — well, part of the rows are hidden; I'll show you how to see that part later in this course. For now, if we run df_exams we can see a summary of this dataset — or data frame, in this case. By the way, in pandas — or really whenever we work in Python — we usually call this kind of CSV file a dataset, and when we read a dataset with pandas, the result is a data frame, which is what we have here: the CSV file is a dataset, and once we read it with pandas it's a data frame. And that's it — these are the three ways to create a pandas data frame. Okay, now it's time to see how to display a data frame in pandas. Here I have the CSV file we used before to create a data frame, and a little detail I forgot to mention before is that this CSV file should be located in the same directory where your Jupyter Notebook script is located. What I mean by the Jupyter Notebook script is what we're looking at right now.
This is the file we're working in right now. What you have to do is download the CSV file and place it in the same folder where your Python or Jupyter Notebook script is located — the same folder — and that's how you're going to read the CSV file with the read_csv method. Just make sure the CSV file and the Jupyter Notebook script are in the same place, in the same folder. Okay, now I'm going to run the first two lines of code we've seen before: the first imports pandas and the second reads the CSV file. I run this, and now the CSV file is stored in df_exams — this is my data frame. Now let's see how we can view this data frame. The easiest way is to copy the variable name, paste it here, and execute: now we have the data frame. Actually, this is a summary of the data frame, because not all the rows are shown. If we scroll down a little we can see there are 1,000 rows and 8 columns. We can see the rows and the columns, but as you can see, in the middle the rows are hidden: it goes up to 4 and then continues at 995. Usually when we work with pandas we don't need to see the data row by row — that's not how we do it in pandas — but if for some reason you need to see all the data, as you would in Excel or Google Sheets, I'm going to show you a way to do it a bit later. First, though, I'm going to show you the ways we usually display a data frame in pandas. The first way is the head method: we write the name of the data frame, in this case df_exams, then .head() with parentheses, run it, and this is how we get the first five rows of the data frame — from row 0 to row 4 — and that's how we get these first five rows.
So that's the head method. In the same way we can get the last five rows of the data frame with the tail method: we write the name of the data frame again — the same df_exams — then tail with parentheses. And yes, it's tail, in the singular. We run it and we get the last five rows, from 995 to 999. In case you want more rows — not only the first five or the last five — you can add an argument to either the head or the tail method. I'll use head as an example: I copy the line and paste it here, and let's say we now want the first 10 rows, so inside the parentheses I write 10. I run it, scroll down, and we can see the first 10 rows. We can do the same with tail: I write tail(10) and the last 10 rows are displayed. So you can specify the number of rows you want to display, and that's how you do it. Now I'm going to show you how to display all the rows of this data frame, as you would in Excel or Google Sheets. To do so, we first have to know how many rows this data frame has, and an easy way to get the number of rows is the shape attribute. We write the name of the data frame, df_exams, and to get access to the attribute we use the dot and then the name of the attribute, shape. We run this and we get (1000, 8): the first number is the number of rows and the second is the number of columns, so we have 1,000 rows. Now, to display all the rows we have to use the set_option method: we write pd.set_option and, inside the parentheses, the first argument is going to be "display.max_rows". Then we have to specify one more argument, and that's the number of rows we want it to show.
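Putting the display methods and the shape attribute together — here on a small stand-in frame rather than the 1,000-row exams file:

```python
import pandas as pd

df = pd.DataFrame({"score": range(20)})  # stand-in for df_exams

print(df.head())    # first 5 rows
print(df.tail())    # last 5 rows
print(df.head(10))  # first 10 rows
print(df.shape)     # (20, 1): number of rows, number of columns

# raise the row-display limit so no rows are hidden in the output
pd.set_option("display.max_rows", 20)
```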
Here it's 1,000, because we have 1,000 rows. We run this and, as you can see, nothing happens, because we only modified the default behavior of pandas; if we want to see the data frame we just display it again and execute. I scroll down and, as you can see, all the rows of the data frame are there — I scroll all the way down and it says 999, so all the rows are displayed. And that's it for this video; in the next video I'm going to show you the different attributes, methods, and functions a data frame has in pandas. Welcome back. In this video we're going to see some basic attributes, methods, and functions that we can use in pandas, but first let's learn what each of them is. First, attributes are values associated with an object, and they are referenced by name using dot expressions — to get an attribute we have to use the dot sign. For example, below you can see that we have a data frame named df, and to get the columns we have to use .columns: columns is an attribute, and that's how we get this attribute of this data frame. Next we have functions: a function is a group of related statements that performs a specific task. We've seen functions before in Python — some built-in functions like max, which gets the maximum value of a list, min, which gets the minimum value, or len, which gets the length of a list. Those Python built-in functions can be used in pandas too. And finally, methods are functions which are defined inside a class body. We haven't talked about classes, because they aren't the main topic of this course, so just keep in mind that these functions live inside a class: when the creators of pandas built pandas, they used many classes, and the functions inside those classes are known as methods. For example, below you can see the head method, and we've also seen the tail method and some other methods so far.
As a rule of thumb, when we use methods we have to write the parentheses, but when we want to access attributes we only write the dot and the name of the attribute — so a method is the dot plus parentheses, and an attribute is only the dot plus the attribute's name. Enough talk — now let's write some code in Jupyter Notebook. Here we're going to use the same CSV file we used in the previous video: we import pandas as we did before, then we read the CSV file with the read_csv method, and now we show the data frame simply by writing its name. We've seen all this before — I'm just reminding you — and now we'll see some basic attributes, methods, and functions that we can use in pandas. First, let's check some attributes of this data frame. I copy the name of the data frame, and the first attribute is going to be shape — we've seen this before. To get the attribute we write the dot and then the name of the attribute, so df_exams.shape, and we get the shape: the first number is the number of rows and the second is the number of columns. On to the next attribute, which is index. As you might expect, we only have to write the name of the data frame, then the dot, then index, and this is how we get the index of this data frame. As you can see, it has the form of a range, and a range, as you might know, has three arguments, of which two are required. The first is the start — in this case it starts at 0 — and the second is the stop, the value it stops before — here 1,000. That's right, because my data frame starts at 0 and finishes at 999 — well, the stop is 1,000 because it stopped one before 1,000 — and it increases by one, so 0, 1, 2, 3, and so on: the step is 1. So that's my index attribute. Now let's continue, and let's get access to the columns attribute.
To do so, we write the name of the data frame and then the name of the attribute — in this case columns, and it has to be written with an s, in the plural. We run this and we get the names of the columns: as you can see, there are eight columns — gender, race/ethnicity, and so on. We can even use this attribute to modify the names of the columns, but we'll see that later. Now let's see how we can obtain the data type of each column. To do so we have to use the dtypes attribute: we write the name of the data frame again, then .dtypes, and this is going to give us the type of each column. The gender is object — in fact, everything from gender to test preparation course is object — while math score, reading score, and writing score are integers, so numbers. By default, anything that says object is some kind of string. Let me print the data frame so you can see it better: here is the data frame again, and as we've seen, everything from gender to test preparation has the type object, and as we can see, all of those are strings, so we can say objects are the same as strings here; also, anything that is a score represents some kind of number, which is why we get integers there, int64. So those are the most common attributes of a pandas data frame; now let's review some methods. First let's see the first five rows — as you might know, that's the head method: we write the name of the data frame, then the head method with parentheses, run it, and we obtain the first five rows. We can also obtain some summary info about the data frame by using the info method: we write the name of the data frame, then info(), and execute. Here we have some information about this data frame: the data types again, and also how many rows are non-null. As you can see, all the data we have in this data frame is non-null, so there isn't any empty data in this data frame.
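The attribute/method distinction above, demonstrated on a small stand-in frame (two columns instead of the exams file's eight):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["female", "male", "female"],
                   "math score": [72, 69, 90]})

# attributes: no parentheses
print(df.shape)    # (3, 2)
print(df.index)    # RangeIndex(start=0, stop=3, step=1)
print(df.columns)  # the column names
print(df.dtypes)   # gender is object (string), math score is int64

# methods: parentheses required
print(df.head())
df.info()  # dtypes plus non-null counts per column
```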
this data frame. Okay, now if we want some basic statistics of a data frame, we use the describe method: we write the name of the data frame, then .describe(), and run it. First we get the count, which indicates how many rows each column has — each of them has 1,000 rows. Then the mean, which is basically the sum of the numeric data in each column divided by 1,000, since there are 1,000 rows. Then the standard deviation and the minimum value — for example, in math score the minimum value was zero. The 25%, 50%, and 75% rows are the percentiles, so Q1 is 25%, Q2 is 50%, and Q3 is 75%. Finally we have the maximum value of each score, and we see that the maximum is 100 on each exam. So describe is a useful method whenever we want basic statistics of a data frame, especially of its numerical data. Okay, now let's see some built-in Python functions we can use with pandas. For example, to get the length of a data frame we write len and then the name of the data frame inside parentheses. We run this and obtain 1,000 — the length of a data frame is actually its number of rows, so this is how we obtain the number of rows. We can also use other built-in functions like max: we write max and the name of the data frame, run it, and in this case we don't get anything meaningful, because iterating over a data frame yields the column labels, so we just get a string. But if we apply max to the index attribute — which, as you might remember, gives us the index labels — we get the highest index: we run it and it's 999. We can also get the lowest index of a data frame; we
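A compact sketch of describe plus the built-in functions, again with a small hypothetical data frame instead of the real 1,000-row file:

```python
import pandas as pd

# Hypothetical stand-in data: four math scores
df_exams = pd.DataFrame({"math score": [0, 50, 100, 70]})

stats = df_exams.describe()          # count, mean, std, min, quartiles, max
n_rows = len(df_exams)               # len() counts rows
highest_index = max(df_exams.index)  # highest index label (3 here, 999 in the video)
lowest_index = min(df_exams.index)   # lowest index label (0)
```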
only have to copy this and write min instead of the max function, and in this case we get the minimum index, which is zero. We can also check the data type of the data frame itself — well, a data frame obviously has the DataFrame type, but we can verify that using the type function: we write type, then the name of the data frame inside parentheses, and run it. As you can see, the type of this object is DataFrame. Finally, we can use another common function, the round function. round takes two arguments: first the object we want to round, in this case our data frame, and second the number of decimal places we want — in this case I want two. We run this, and in this particular example nothing changes, because the numerical data here is integers, not floats, so rounding has no effect — but if you have a data frame with float numbers, you can round them with the round function. And that's it: these are the most basic attributes, methods, and functions that we'll see often in pandas. All right, now it's time to learn how to select a column from a data frame. Here I have the same CSV file we've been using in the previous videos, so let's import pandas, read the CSV into the same data frame, and show the first five rows. To select one of the columns of this data frame we have two options. The first option is using square brackets, and this is the preferred way to select a column in pandas. Let's see how to select the gender column, the first one here. The first thing we do is write the name of the data frame, in this case df_exams, and then open square brackets. Inside the brackets we write the name of the column in quotes, so I'm going to copy the name of this column and paste it
here. So we have the name of the data frame and then the name of the column we want to select. Now we press Ctrl+Enter to run this code, and as you can see, we get the first column of this data frame. As you might expect, this is a 1D array, and as we discussed in previous videos, 1D arrays in pandas are series. We can verify that with the type function: I copy this selection, write type, open parentheses, and put the selection inside. We run this, and we get a Series. Series, just like pandas data frames, have attributes and methods, and actually the attributes and methods of a series and a data frame are very similar. For example, to get the index attribute of this series we write the name of the series and then .index. We run this and get the index as a range that starts at 0 and stops at 1,000. Another method that data frames and series share is the head method, so we can also get the first five rows by writing .head() with parentheses — and as you can see, we get the first five rows of this series. That's it for the first syntax; it's my favorite, and most people use it because it's the most practical. Now it's time to see the second syntax for selecting a column from a data frame, which uses the dot sign. Let's say we want the same gender column: we write the name of the data frame, followed by a dot and the name of the column, gender. In this case we don't need quotes and we don't need the square brackets. We run this code and we get the same series, and probably now you might be
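Selecting a single column and inspecting the resulting series can be sketched like this, with hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-in for the students performance data
df_exams = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math score": [72, 69, 90],
})

gender = df_exams["gender"]   # one pair of square brackets -> a Series
print(type(gender))           # pandas Series
print(gender.index)           # RangeIndex(start=0, stop=3, step=1)
head_rows = gender.head()     # series share head() with data frames
```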
thinking that this is more practical than the first syntax, but the dot syntax has some pitfalls, so let me show you. What if you want to select a column whose name has two words — for example, this column named math score? Let's try to access it: I copy the column name, scroll down, write the name of the data frame, then the dot, then the column name. If I run this we get an error, because Python doesn't work like that: in Python, when a name has two words we usually join them with an underscore so Python understands it as one variable, but written like this, with a space, Python won't understand what you're trying to do. However, if you use the first syntax — square brackets — you won't have this problem. Let me show you: I copy the same column name and paste it, but instead of the dot notation I open square brackets and add quotes, so the column name is a string and Python knows it's a string. Now if we delete the dot sign and execute this, we get the column without any error. This is one of the advantages that square brackets have over the dot sign. And that's it: in this video we learned how to select one column from a data frame, and in the next one we're going to learn how to select two or more columns. Okay, in this video we're going to learn how to select two or more columns from a data frame. As usual, we start by importing pandas and reading the CSV file we've been using so far; we execute these two lines of code and get the data frame. What we're going to do in this video is select two columns from this data frame, so
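The space-in-the-name pitfall in a minimal sketch (the data frame here is hypothetical):

```python
import pandas as pd

df_exams = pd.DataFrame({"math score": [72, 69, 90]})

# df_exams.math score   -> SyntaxError: the dot syntax cannot handle a
# column name containing a space, but square brackets always work:
math_score = df_exams["math score"]
```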
first, let's pick some columns: I want to select the gender column and the math score column. To select these two columns we use square brackets again, but this time two pairs of square brackets. We write the name of the data frame, df_exams, and then open square brackets twice, so we have two pairs. Inside, we write the names of the columns we want: we said we wanted the gender column, so we write gender, and the second column we chose was math score, so I open quotes and write math score. By the way, the order in which we write these columns is the order we'll get in the resulting data frame — we can define the column order inside the square brackets. Here we're saying the gender column comes first and the math score column second. Now let's run this, and as you can see, we obtain first the gender column and second the math score column; it's a data frame with 1,000 rows, indexed from 0 to 999. We can verify that this selection really is a data frame using the type function: I copy the selection, paste it inside type with parentheses, and execute the code — and we get DataFrame. One little detail I want to point out: when we use two pairs of square brackets, we always get a data frame, but when we use a single pair of square brackets, as we did in the previous video, we get a series. So one pair of brackets gives a series, and two pairs give a data frame. Okay, to continue with the video, I'm going to select two or more
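The one-pair-versus-two-pairs rule in a minimal sketch, using hypothetical data:

```python
import pandas as pd

df_exams = pd.DataFrame({
    "gender": ["female", "male"],
    "math score": [72, 69],
})

as_series = df_exams["gender"]                 # single brackets -> Series
as_frame = df_exams[["math score", "gender"]]  # list in brackets -> DataFrame,
                                               # columns come back in this order
```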
columns using these two pairs of square brackets. Let's choose the columns we're going to get: in this case, the gender column and all the scores — math score, reading score, and writing score. First I copy the previous selection to have it as a reference and paste it here; so far we have two columns, so let's add the two remaining ones. An easy way to write these column names is to copy them from the data frame itself and paste them here, inside quotes, instead of typing them out — and there we have it. As I said before, we can change the order of the columns: for example, I cut writing score and, say we want it at the beginning, paste it first, so now we'll get the gender column, then the writing score column, and then the math score and reading score columns. We run this code, and we get the data frame in the order we defined. Now you might be wondering whether there's a way to select two or more columns using the dot sign, so let's check if that's possible. Say we want the gender and math score columns with the dot notation: as you can see, this doesn't look right — you have two names separated by a comma, but no list and no square brackets — so this is probably going to fail. I run the code and we get invalid syntax, a syntax error. So we cannot select two or more columns with the dot sign, and this is one of the disadvantages the dot sign has compared to square brackets — it's why most people prefer square brackets over the dot notation. And that's it for this video: we learned how
to select two or more columns from a data frame. Okay, in this video we'll see different ways to add a new column to a data frame. Here is the same students performance data frame, and as you can see, we have three columns with scores: math score, reading score, and writing score. Let's say we want to add a new score — in this case, a language score. In spreadsheets like Google Sheets or Microsoft Excel we'd simply insert a new column and that's it, but in pandas we have to use different methods that allow us to insert a new column, so let's see how to do it. First, let's add a new column with a scalar value. A scalar value is simply a single value, so in this case the column is going to hold one single value — all the rows will have the same value. To do this, we select this imaginary column — it doesn't exist yet — exactly as we would select any other column: first we write the name of the data frame, df_exams, then we open square brackets and quotes as with any column, but instead of an existing name like math score, we write the name of the column we want to create — let's write language score. Now we assign a value to this new column, a scalar value — in this case, 70. If we run this code, it looks like nothing happens, but if we now display the data frame we see a new column named language score, and it holds the same value, 70, in all its rows: 70 in row 0, and if we scroll down, 70 all the way to row 999. But it's a bit weird that in an exam all the
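Adding a scalar column in a minimal sketch (hypothetical stand-in data):

```python
import pandas as pd

df_exams = pd.DataFrame({"math score": [72, 69, 90]})

# Assigning a scalar broadcasts the same value into every row
df_exams["language score"] = 70
```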
students would have the same score, so what you'd usually do is give this column different values. To do this we use arrays, and to create arrays we use numpy. So here is the second way to add a new column: using arrays. First we need to know how many rows the data frame has — in this case 1,000 — and this is important because the number of rows has to match the length of the array we're going to create. Let's create the array: first we import numpy by writing import numpy as np, and run it. Now we create an array of 1,000 elements, using a numpy function called arange: np.arange gives us a range of numbers that starts at the first argument, which I'll write as 0, and stops at the second argument, in this case 1,000. I execute this, and as you can see, it starts at 0 and goes up to 1,000. To verify the length of this range we use the len function — the length is 1,000, so the array has 1,000 elements. Now I assign this to a new variable named language_score — language, underscore, score. We execute this, and I quickly check the length of the array with len, as we did before. Now we add this array as a new column of the data frame, the same way as before: we write the name of the data frame, then make the selection with the new column name — well, in this case it's not new because we already created it, but let's imagine it is — language score, and we assign the array to this column. Now, to see the results, we only
show the data frame, and as we can see, the new column now starts at 0 and ends at 999: it no longer holds a single value but a range of values. There's a little detail we have to take care of, though. Scores are supposed to be between 0 and 100, and here we have 0 to 999; also, this is a sequence that increases by one, while real student scores are usually random. So we need to create an array of random numbers, and for that we use numpy again, but with a different function: np.random.randint. Let's write it: np, random, randint. The first argument is the lowest of the random numbers — and these are random integers, because scores are usually integers — which I'm going to set to 1. The second argument is the highest value, which I set to 100. The third argument is the size: we want an array of 1,000 elements, so we set size to 1,000. We execute this, and instead of printing the array again I just check that it has the length we want using the len function — 1,000 elements. Now let's create a new variable and store this in it: int_language_score. One little detail you should know is that the first argument is inclusive and the second is exclusive. This means that if we take the minimum value of this new array, we get 1, because the first argument is inclusive — it can appear in the array. However, if we print the maximum value of the array, we see that 100 is not there, because it's exclusive,
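Both array-based approaches in one sketch — a sequential column with arange, then random scores with randint. The data frame is a small hypothetical stand-in, so n is 4 rather than 1,000:

```python
import numpy as np
import pandas as pd

df_exams = pd.DataFrame({"math score": [72, 69, 90, 55]})
n = len(df_exams)  # the array length must match the row count

df_exams["language score"] = np.arange(0, n)       # 0, 1, 2, 3 — sequential
random_scores = np.random.randint(1, 100, size=n)  # low inclusive, high exclusive
df_exams["language score"] = random_scores         # overwrite with random scores
```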
which means the second argument is never included in the array. Okay, finally let's put these random integers into the new column we created. We do it the same way as before: I copy the assignment, and instead of language_score I assign int_language_score to the column. I run this code, and now the same column holds this random data from row 0 to row 999. This new data looks much more like real scores, because these are random numbers between 1 and 99. And that's it — one more little detail I want to share is how to create random float numbers: before, we created random integers, but if for some reason you want random floats, there's a way to do it with numpy. We write np, then random, then uniform, and the arguments are the same — the minimum value, then the maximum value, then the size, which is 1,000. You run this, and it's similar to what we got before, but now with float numbers. And that's it: in this video we learned different ways to add a new column to a data frame. In this video I'm going to show you two more ways to add a new column — in this case we're going to use the assign and insert methods. First I import pandas and read the same CSV file we've been working with in this course, the students performance CSV file. Now let's add a new column, starting with the assign method — and first, let me explain when to use it. It's a good idea to use the assign method when you want to add multiple columns with a single line of code; it's also preferred when you need to override the values of an existing column. This is kind of a best practice. So now I'm going to add a
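The integer-versus-float contrast in a minimal sketch:

```python
import numpy as np

int_scores = np.random.randint(1, 100, size=1000)    # integers in 1..99
float_scores = np.random.uniform(1, 100, size=1000)  # floats in [1, 100)
```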
new column using the assign method, and first I'm going to create a series. To create the series, I first create some random numbers — I have the code ready so we don't waste time doing this again, and I'll quickly explain what it does. First we import numpy, then we create random numbers using the randint method: 1,000 numbers between 1 and 100, because those are the scores. We create two variables because we're going to add two columns, and then we wrap each in a series — we create series because, apart from the random numbers, we need an index — with index from 0 to 1,000, since 1,000 is the number of rows this data frame has. And that's it; I run this code and show you what series1 looks like: basically, this is one column, and series2 is the second column. So we're going to add these to the data frame. To do that, we write the name of the data frame, df_exams, and then use the assign method: .assign, open parentheses, and inside we write the names of the columns we want to create. Unlike the previous methods, we don't need to open quotes to create a new column — it's just the name itself. For example, I want a column named score1, so I write score1 and assign it to series1; then I create my second column, score2, and assign it to series2. So what we're doing here is creating two columns, score1 and score2, and assigning the two series to these new columns. We run this to see the results, and we obtain the data frame: we have the math score, reading score, writing score, and we have
two additional columns: score1 and score2. So we successfully assigned these two new columns to the data frame — but as you can see, this is only a copy: if we display df_exams, the values weren't updated and it's still the original data frame. To update the values we have to overwrite the data frame, writing df_exams equal to this expression. The assign method only creates a copy, so we have to overwrite the data frame to actually update it. I run this, then copy df_exams, paste it, and show you the results — and as you can see, the two new columns were added. Okay, now let's see the second method for adding columns to a data frame: the insert method. The insert method allows us to add a new column at a specific position or index. To use it, we write the name of the data frame: df_exams.insert, open parentheses, and now let's check its arguments. First we give the index of the new column — in this case I'm going to add it at index 1. Next we write the name of the column we want to create — I'm going to name it test. Then we give the values — in this case I'm going to use series1, which is here, as the values of this test column. Now, to insert this new column, we just run this, and apparently nothing happened — but actually the values were updated: the insert method updates the data frame directly and doesn't create a copy. So unlike the assign method, we don't get a copy with insert. The assign method returns a new object — a copy with all the original columns plus the new ones — which is why earlier we had to overwrite the data frame to update it, but that isn't the
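The copy-versus-in-place contrast between assign and insert in one sketch, with a small hypothetical data frame and series:

```python
import numpy as np
import pandas as pd

df_exams = pd.DataFrame({"math score": [72, 69, 90]})
series1 = pd.Series(np.random.randint(1, 100, size=3), index=range(3))

# assign() returns a copy, so overwrite the variable to keep the new column
df_exams = df_exams.assign(score1=series1)

# insert() modifies the data frame in place, at the given column position
df_exams.insert(1, "test", series1)
```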
case for the insert method: we ran it, and after that the values were simply updated — we didn't get any copy. Let's verify this: we display the data frame, and as you can see, we got a new column named test at index number 1 — and by the way, index 0 is where gender is located right now — and my new test column holds the values from the series1 variable, all the random numbers we created before. And that's it: that's how you add a new column with the assign and insert methods. All right, now it's time to see some operations we can perform on data frames. Here we have the same data frame, df_exams, and we can apply some common operations to the numerical columns: math score, reading score, and writing score. Let's see how to do this in pandas. First we're going to see how to do operations on columns, and our first task is to calculate the total sum of a column. Let's pick the math score column and calculate its sum. To do that, we first select the column: as you might remember, we write the name of the data frame, df_exams, open square brackets with either single or double quotes, and write the name of the column, in this case math score. Now, instead of just selecting, we perform an operation: to calculate the total sum of this column we use the sum method, so we write .sum() with parentheses. To verify, we run this code, and we get about 66,000 — the total sum of this math score column. Great — now we can do some other operations you'd do in Excel. For example, we can count the number of rows using the count method: I copy the previous line and, instead of the sum method, write count, so
here .count(), and let's see: we get 1,000 rows, which is correct because this data frame has 1,000 rows. Next we can calculate the mean of the math score column: copy, paste, and instead of count write mean — and we get the average value of the column. To get the average, you sum all the rows in the math score column and divide by the total number of rows, in this case 1,000; that's how you get this mean value. We can get other statistics with methods too: for example, the standard deviation by writing std — we execute this and the standard deviation of the math score column is about 15 — and also the maximum and minimum values, so let's quickly do max and then min. You could actually use the Python built-in functions for these, but we can also use the methods; I run this, and the minimum math score is 0 and the maximum is 100. Okay, now I'm going to show you a quick way to make the same calculations using the describe method. I think we saw describe in previous videos, but in case you don't remember it: here we only need the name of the data frame, not a specific column, and we write .describe() with parentheses. We get a summary table with some important statistical values: the count, the mean, the standard deviation, the minimum and maximum — and as you can see, we got all of these with one method. So far so good. Now, instead of operations on columns, we're going to learn how to do operations across rows. Let's calculate, say, the sum of the math score, reading score, and writing score. To do so, we have to make some
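The column aggregations above in one sketch, on a small hypothetical column:

```python
import pandas as pd

# Hypothetical stand-in data: four math scores
df_exams = pd.DataFrame({"math score": [0, 50, 100, 70]})

total = df_exams["math score"].sum()     # total of the column
rows = df_exams["math score"].count()    # number of rows
average = df_exams["math score"].mean()  # total / rows
spread = df_exams["math score"].std()    # standard deviation
low, high = df_exams["math score"].min(), df_exams["math score"].max()
```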
independent selections. To show you, I copy the names of these three columns and paste them here: math score, reading score, and writing score. Now we turn each into an independent selection: we write the name of the data frame, df_exams, then open square brackets and quotes around each column name — let me do this quickly for the others too. Now it's ready: we have three independent selections, and to calculate the sum across each row we use the plus sign, the plus operator, between them. Basically, we're summing within each row. To verify, we run this code, and we get the sum of the score columns. Let's quickly check the first row: 72 + 72 + 74 — 72 plus 72 is 144, plus 74 is 218 — and that's what we have here, so it's correct. Now let's do something else: instead of just summing these three columns, let's calculate the average, to get something like an average score. I copy this, and we calculate the average by taking the sum and dividing it by 3. Now let's assign this result to a new column. As you might remember from previous lessons, we add a new column by writing the name of the data frame, making a selection with square brackets and quotes, and writing the name of the column we want to create — the same as when we added new columns before. In this case I'm going to name the new column average, and I assign the result to it with the equals sign. I execute this, and to verify that the new column was created I display the data frame here below, and
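The row-wise arithmetic above can be sketched like this (hypothetical two-row data frame):

```python
import pandas as pd

df_exams = pd.DataFrame({
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

# Element-wise row sums: 72+72+74 = 218 for the first row
total = (df_exams["math score"]
         + df_exams["reading score"]
         + df_exams["writing score"])

# Average per row, rounded to two decimals, stored as a new column
df_exams["average"] = round(total / 3, 2)
```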
here is our data frame: in the last column you can see a column named average, holding the average of the math score, reading score, and writing score. We can also control the number of decimals using the round function, passing the number of decimals we want — in this case I want only two. I run this, and our data frame looks much better because we only have two decimals. And that's it: in this video we learned different ways to do operations on the columns and rows of a data frame. All right, now let's have a look at the value_counts method. So far we've seen how to count the number of rows in a data frame. For example, to count the rows of the gender column, we can use the len function — we write len, then the name of the data frame and the column — which, as you might remember, gives us the number of rows; or we can use the count method, writing .count(), and we get the number of rows. But what if we want to count the gender elements by category, female or male — how many female and how many male entries are in this gender column? This is where value_counts comes in handy: we can use this method to count each category of the column. To use it, we write the name of the data frame followed by the column we want to count — in this case the gender column — and then the value_counts method, as you can see here. We execute this, and now we have not just the total rows in the gender column but the count divided by category: there are 518 females and 482 males, which is how the data is spread in the gender column. We can do more with the value_counts method: we can get the percentage each category represents in the whole column. So I copy this, and now to calculate the
percentages, also known as relative frequencies, we add an argument named normalize: we write normalize equal to True and execute. As we can see, female represents about 51% of the total observations in the gender column, while male represents about 48%. So the value_counts method is useful when you want to look at the data by category. Okay, now let's see another example, with a different column: I'm going to choose the parental level of education column. I copy it, write the name of the data frame, df_exams, open square brackets and quotes, and paste the column name. To count the elements by category in this column, we use the value_counts method again. We run this code, and you can see how the data is divided in this column: most people have some college as their level of education, while just a few have a master's degree. If we want the percentage each category represents, we again use the normalize argument — normalize equal to True — and we get the percentages. To round these to two decimals we use the round method, writing .round(2) with parentheses — and as you can see, it's rounded to two decimals. And that's it: now you know how to use the value_counts method. Okay, in this video we're going to see how to sort a data frame using the sort_values method. First, let's import pandas and read the CSV file we've been working with in this tutorial, and now let's sort the data frame. Here is the data frame, and as you might remember, it has these three numerical columns; I'm going to sort by one of them. Let's use the sort_values method: first I write the name of the data frame, which is df_exams, and then write sort
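Counts and relative frequencies in one sketch, on a tiny hypothetical gender column:

```python
import pandas as pd

df_exams = pd.DataFrame({"gender": ["female", "male", "female", "female"]})

counts = df_exams["gender"].value_counts()                     # count per category
shares = df_exams["gender"].value_counts(normalize=True).round(2)  # share per category
```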
values. Now I open parentheses, and I can use the help here, and as you can see the only mandatory argument is by, so we use it and specify the name of the column we want to sort by. In this case I want to sort by the math score, so I'm choosing this numerical column to start with. I'm going to write math score, actually I'm going to copy this one and paste it here, so by math score, and sorting this data frame is as simple as that. Now we run this code, and as you can see the data frame was sorted ascending by default: it starts with 0 and ends with 100 in the math score. This is how sort_values behaves by default. One little detail: you don't actually need to specify the by keyword, we can omit it, and if we run this it still works. We can modify the default behavior of the sort_values method by adding a new argument, the ascending argument. Let me show you: I'm going to copy this one first, and in this case we're going to sort descending by the same column, so we only write a comma and then the ascending argument. Here I want to show you something in this little help: ascending is set to True by default, which means the sort is ascending by default, but we can change that by setting ascending equal to False, and that's what we're going to do here. Ascending equal to False means descending. Now I run this one, and as you can see it's sorted descending by the math score column: it starts with 100 and ends with 0. But that's not all, we can do much more with the sort_values method. First I'm going to show you how to sort by two different columns. Let's copy and paste this one: in this case we're going to sort descending by multiple columns, so instead of writing only math score we're going to add here one more column
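A minimal sketch of single-column sorting as described above (with made-up scores):

```python
import pandas as pd

# Hypothetical miniature of the exams data
df_exams = pd.DataFrame({"math score": [67, 100, 0, 88]})

asc = df_exams.sort_values("math score")                    # ascending by default
desc = df_exams.sort_values("math score", ascending=False)  # descending
```

Both calls return sorted copies; the original df_exams is left untouched, which is the detail the video comes back to shortly.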
and it's going to be the reading score column. I copy this one and paste it here, but first we have to add the square brackets, because as you might remember, when we write two or more columns we need square brackets. Now I write a comma and paste the reading score, add quotes, and that's everything you have to do to sort by multiple columns. Now I run this one, and as you can see it was sorted descending first by the math score column and then by the reading score column. The priorities are set by the order of the list we included here: the math score column is the first priority and the reading score column the second, and that's what you can see here. Now I'm going to show you a little detail. Let me copy df_exams, and if I print this one you can see that the changes we made weren't kept: the math score column still has the original order. This happens because the sort_values method, like many other pandas methods, only creates a copy of the data frame. Here we obtain a copy, but it doesn't update the data frame itself unless we add a new argument, which is the inplace argument. I'm going to show you, but first I'm going to delete this df_exams, and now I'm going to copy this one and show you how to update this data frame. I copy the same values, but now I add a new argument, the inplace argument: I write inplace equal to, and let me show you the default value. The default value of inplace is False, which means don't update the data frame, only create a copy, but if we set it to True, it means update this data frame. So I write True, and now I run this, and apparently nothing happens, but if we now print the df_exams data frame we're going to see that it's sorted. In case you don't want to
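The multi-column sort plus inplace update just walked through can be sketched like this (hypothetical data chosen so the first column has a tie):

```python
import pandas as pd

# Made-up data with a tie in math score, so reading score acts as tiebreaker
df_exams = pd.DataFrame({
    "math score": [90, 90, 80],
    "reading score": [60, 75, 99],
})

# Sort descending by math score first, then reading score,
# and update df_exams itself instead of returning a copy
df_exams.sort_values(["math score", "reading score"],
                     ascending=False, inplace=True)
```

After this call the two 90s are ordered by their reading scores (75 before 60), and the change is saved in df_exams.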
add the inplace argument and you still want to update the data frame, you have another option that we used before, which is overwriting the data frame. For example, you can delete that inplace argument and write df_exams equal to this, which overwrites the values, but in this case we're not going to do that, we're going to add the inplace argument as you can see here. Finally, we're going to see how to sort, but now not with numerical data but with text. As you can see, before we sorted this data frame by the math score column, which has numerical data, but in this case we're going to sort it by the race/ethnicity column, which has text. So first we're supposed to get group A, then group B, C, D, E, and so on. Let's do this: I scroll down, and first we write the name of the data frame followed by the sort_values method and specify the name of the column, so here I'm going to copy race/ethnicity. Let me copy it, and it's done, I have the name of the column. I'm going to set ascending to True, and now the new argument we have to add to sort this is key. I add key, then equal to, and in this case we're going to use a lambda function. I'm not sure if you're familiar with lambda functions, but a lambda works like a small anonymous version of the regular functions we've seen before in the Python crash course, although here it behaves a little differently. Let me show you: first we write the lambda keyword, then the object it receives, in this case col, which stands for column, and then we specify the operation we make on this variable. I write col, then access the str attribute, and then use the lower method. What we're saying here is: get the string values of the column and then transform them to
lowercase. So here we get the text data in lowercase, and with these three arguments we're saying: sort the values of the race/ethnicity column, sort ascending, and compare the text of this column in lowercase. The groups are written with an uppercase letter, but the key lowercases them before sorting. Now let's run this one and see the results. As you can see, the race/ethnicity column is ordered ascending: group A, then B, then C, then D, and so on. And that's it, these are the different ways to sort a data frame using the sort_values method. Okay, in this video we're going to learn the set_index and sort_index methods. The first one is going to help us set a column as a new index, and the second one is going to help us sort the index. Let's get started. As usual, we're going to import pandas and read the same CSV we've been working with, the students performance CSV. Now I'm going to create an index: we'll create a new column, set it as the index, and then sort those index labels with the sort_index method. First we're going to import numpy and random, so here I import them, and now I'm going to create non-repetitive values for the index: we want the labels to be unique, so the values have to be different from each other. To create these values we use the arange method from the numpy library: we write np, then arange, then open parentheses and specify the first and last element, 0 and 1000, and as you can see we get an array from 0 to 999. Now I'm going to assign this to a new variable called new_index, so it's here, and we have our new_index variable. Next I'm going to shuffle this index so we can sort it later. To shuffle it we use the random library: we write random here and use the
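Before moving on, the case-insensitive sort covered a moment ago can be sketched like this (with made-up category labels):

```python
import pandas as pd

df = pd.DataFrame({"race/ethnicity": ["group C", "group A", "group B"]})

# key receives the whole column; lowercasing it makes the comparison
# case-insensitive before sorting
result = df.sort_values("race/ethnicity", ascending=True,
                        key=lambda col: col.str.lower())
```

With uniformly cased labels like these the key changes nothing visible, but with mixed case (e.g. "Group a" vs "group B") it keeps the alphabetical order consistent.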
shuffle method. We write shuffle, then open parentheses, and include the object we want to shuffle, in this case the new_index variable. We write it here, run this, and apparently nothing happened, but new_index was shuffled. Let's verify: I write new_index, and as you can see it doesn't start with 0, 1, 2 anymore, it's shuffled. Now I'm going to delete this one, and next I'm going to assign this new_index variable to a new column of my original data frame. I copy the name of the data frame, and now I create a new column as we did in the previous videos, using the simplest way: I write the name of the data frame, then square brackets, and inside the name of the column we want to create, in this case new index, and I set it equal to the new_index variable, the one that contains my shuffled numbers from 0 to 999. I run this, and before that I show the new data frame, and as you can see the last column is new index and it has these shuffled numbers. Great. Now I'm going to show you how the set_index method works. What we're going to do is set this new_index column as the index of the data frame: instead of being a column like right now, new_index is going to be the index, so it's not going to say 0, 1, 2 and so on anymore, but the values of this column, like 342, are going to be the index. Let's do it: I scroll down, write the name of the data frame, df_exams, then use the set_index method, open parentheses, and write the name of the column we want to set as index, in this case new index. Then we run this, and as you can see my new index is the new_index
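The arange, shuffle, and set_index steps above can be sketched end to end (with a tiny hypothetical data frame instead of the 1000-row one):

```python
import random

import numpy as np
import pandas as pd

# Hypothetical small stand-in for the exams data
df_exams = pd.DataFrame({"math score": [70, 85, 90, 60]})

new_index = np.arange(0, 4)   # array([0, 1, 2, 3]) of unique labels
random.shuffle(new_index)     # shuffle the labels in place

df_exams["new index"] = new_index
df_exams = df_exams.set_index("new index")  # the column becomes the index
```

Here the copy returned by set_index is assigned back to df_exams, which has the same effect as the inplace=True argument used in the video.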
column, so it's here, 342 and all the random numbers we created. This is my new index, and as you can see the set_index method creates a copy, so if we want to update df_exams we have to use the inplace argument, as we've seen in previous videos. I write inplace equal to True and show this: now the df_exams data frame is updated. Next I'm going to show you how to sort the index of this data frame with the sort_index method, which works similarly to sort_values but for the index. First we write the name of the data frame, df_exams, then sort_index to sort the index, and that's it. We run this, and as you can see it's now sorted ascending, from 0 to 999. If for some reason you want to sort descending, you only have to add the ascending argument, as we did for the sort_values method: we write ascending equal to, and as the help shows, the default value is True, so we set it to False. I write False, and now it's sorted descending: it starts with 999 and ends with 0. We can even save the changes of this sorted index by adding the inplace argument: we write inplace equal to True, run this, and all the changes were saved, the index is sorted descending. And that's how the set_index and sort_index methods work in pandas. Okay, in this video we're going to learn how to rename columns and indexes with the rename method. First we import pandas and the students performance CSV file, and now I'm going to show you how to rename a column. The first column we're going to rename is the gender column: in this case I want the g to be a capital letter, and we can do that with the rename method. We write df_exams, which is my data frame, then rename, parentheses, and now we have to
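A compact sketch of the sort_index workflow just shown (hypothetical three-row data frame with a pre-shuffled index):

```python
import pandas as pd

# Data frame whose index is out of order, like the shuffled one in the video
df = pd.DataFrame({"score": [70, 85, 60]}, index=[2, 0, 1])

asc = df.sort_index()                          # copy, sorted ascending
df.sort_index(ascending=False, inplace=True)   # descending, saved in place
```

As with sort_values, the first call returns a copy, while inplace=True rewrites df itself.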
specify the columns argument. We write columns, then equals, and this has the shape of a dictionary, so we open curly braces. The first element is a key, because this is a dictionary, and the key is going to be the actual, current column name, in this case gender, while the value is going to be the new column name, in this case Gender with a capital G. That's all you have to do to change the name of a column. Now I run this one, and as you can see the column is named Gender with a capital G, but this is only a copy, so we can overwrite the data frame by writing df_exams equal to this. I run this one, and you can see the data frame was updated: gender now has a capital G. Next I'm going to show you how to rename two or more columns with the same method, but this time using the inplace argument. For this example I'm going to change the names of math score, reading score, and writing score. I think these are long names, so I'm going to use only the first letter of each word: for example, math score becomes MS and reading score becomes RS. Those are the new column names. To change them we write the same method, so I copy and paste it, and now I write one by one the names of the columns we want to change. First math score, here I made a mistake, I'm going to fix it fast, so math score, and the new name is MS. For the second element I'm going to copy and paste it twice: the second is reading and the third is writing, so here reading and here writing. Now I delete those blank spaces and it's ready. I add the comma here and write RS, and here WS, and I think it's ready. Now, to save the changes we're making here,
I'm going to add the inplace argument equal to True, and now everything is ready, so I run this. Let's see if the data frame was updated. Here I have the data frame, and as you can see the three last columns have the names MS, RS, and WS, and it actually looks much better because the numbers fit in these narrow columns. And that's it for the columns argument. Next I'm going to show you how to change the index using the same rename method, but this time with the index argument. I'm going to change the names of the indexes: for example, instead of 0 I want a, instead of 1, b, then c, and so on. I write the same syntax, df_exams, then rename, then parentheses, and here I use the index argument. By the way, you can see all the arguments by pressing Shift Tab, and there are the index and columns arguments that we've been using so far. Here I add the dictionary shape and set the keys and values: the key is going to be 0, the current index, and the value is going to be a, the new name of the index, then 1 to b, and then 2 to c. Now, to update the values, I set inplace equal to True, as I'm writing right now, and I run this one and check the result. Here is df_exams, and as you can see the first three rows have different indexes: a, b, and c. I'm going to use the head method so you can see better, head 3, and here are the first three rows with indexes a, b, and c, so we successfully renamed the first three indexes. And that's how you rename columns and indexes using the rename method in pandas. Welcome to our first pandas project. In this project we're going to learn how to do web scraping with pandas. Web scraping in Python usually involves learning libraries such as Beautiful Soup, Selenium, or Scrapy,
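The column and index renaming covered above can be condensed into one hedged sketch (miniature made-up data frame):

```python
import pandas as pd

df_exams = pd.DataFrame({"gender": ["female", "male"],
                         "math score": [72, 90]})

# Rename one column and the first two index labels in a single call;
# inplace=True saves the changes instead of returning a copy
df_exams.rename(columns={"math score": "MS"},
                index={0: "a", 1: "b"},
                inplace=True)
```

Keys are the current names, values the new ones; any label not listed in the dictionaries is left untouched.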
but you can also perform basic web scraping using pandas, and in this project I'm going to show you how to do it, and we're going to use some of the methods we've learned in this course so far. Let's start by importing pandas as pd, so I import pandas as pd, and here I'm going to read a CSV from a URL using pandas. Now let's perform some basic web scraping with pandas. Web scraping consists of extracting data from websites: instead of collecting the data manually, we can automate it with some web scraping techniques, and in this video we're going to extract CSV files from a URL using only pandas. Here is the target website we're going to scrape. This website contains data about football matches of different leagues, so you can see a lot of leagues, and I'm going to choose the first one that says England football results, and here we see some data about the Premier League and other leagues that England has. If I want to download one of these files, I have to click on any of those, and as you can see I downloaded the CSV file of the first one listed here: it corresponds to the season 21/22 and it's from the Premier League. Instead of manually downloading each file, we can use a specific pandas method to read these files from the internet, and by using a for loop we can automate this and download all the files that you can see here. Instead of clicking one by one, we can download everything listed here, and there are a lot of them, just with pandas and a for loop in Python. Let's do it. Now I'm going to show you how to extract data from a single CSV file from this website. To do that, we use the read_csv method: we write pd.read_csv and open parentheses. We've used this method before to read data that was in the folder where this Jupyter notebook file was located, but in this case we're not going to read anything
inside our computer, we're going to read data that is on a website. So instead of writing the path of a file on your computer, in this case we write the link of the file. Let me show you: this file has a link, and if we want to download it we have to make a request to that link to get the file. I right-click and copy the link address, then paste it here and press Enter, and let's see what happens. As you can see, instead of going to the website it downloaded the file, which means this link contains the data we want to extract. So we're going to use this link: I copy it again, making sure it's the address, copy link address, go back here, open quotes, and paste the link. This is the link we want to read, because it ends in .csv, which means it's a CSV file, and as you might remember we're reading a CSV because we're using the read_csv method. That's everything you have to do to read a CSV file that is stored on this website. Now I run this, and let's see the results: as you can see, all the data was read with read_csv and loaded successfully. Next I'm going to assign this to a new variable called df_premier21, because as you can see this belongs to the 21/22 season, since the dates say 2021, and it's the Premier League, because the teams belong to the Premier League, as you may know if you're familiar with this competition, but if you're not it doesn't matter. Let's continue. I assign this data frame to the variable and press Ctrl Enter, and now I show the data frame that we already saw. Next I'm going to rename some columns, because some column names aren't so obvious, so maybe you would struggle to understand what a column means, for example, and
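pd.read_csv accepts a URL in place of a local path. Here is a network-free sketch of the idea, using io.StringIO to stand in for the remote file and made-up match data (the real link is the one copied from the website):

```python
import io

import pandas as pd

# In the video this would be the copied link address, e.g.
# df_premier21 = pd.read_csv("https://…/2122/E0.csv")
fake_remote_csv = io.StringIO("HomeTeam,AwayTeam,FTHG,FTAG\nArsenal,Chelsea,2,1")
df_premier21 = pd.read_csv(fake_remote_csv)
```

read_csv treats a file path, an open file-like object, and an http(s) URL the same way, which is what makes this "scraping" possible without any extra library.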
I'm going to rename some of them, and that way we also practice the rename method. Let's do it fast. I copy this, write the name of the data frame, rename, open parentheses, and since we want to change the columns, I write columns equal to, then open the dictionary. The key is going to be the name of the column we want to change and the value the new name, then a comma and the second key/value pair. Let's say we want to change only these two columns, so I'll tell you what they mean: this first one, FTHG, stands for full time home goals, meaning all the goals scored by the home team, so I'm going to replace this name with home goals, and this one, FTAG, is full time away goals, so I'm going to write away goals. Those are my new names. To update the column names I write inplace equal to True, run this, and show the updated data frame: now this column is named home goals and this one away goals. And that's it, in this video we extracted a single CSV file from a URL with pandas. In the next video we're going to learn how to extract many CSV files from multiple URLs using pandas. Okay, in this video we're going to see how to extract CSV files from multiple URLs with pandas. In the previous video we extracted the Premier League CSV file that is listed here: we got the link address and then used the pd.read_csv method with that link. What we're going to do now is use this link again, but this time we have to concatenate strings. Let me show you what we're going to do. This is the link that we used before, and now, instead of using the same fixed link, we're going to insert variables. For example, here I have the link structure, and it's the same as before. To show you, I'm going to
run this, and as you can see I have the link. It's the same link, but I divided it into some elements because I want to show you how it is structured. First we have this chunk, which is the root of the link: this is what all the links here have in common. If I right-click the second one and click on copy link address, you're going to see that it follows a pattern. Here is the first one and here is the second one: the two links are almost identical, and the only difference is the league code, in this case E0 versus E1. We can also notice some important data, like this 2122 that represents the season: season 21/22 is represented by these two numbers. So I divided the link into its important parts: this one is the root, as I told you before, this 2122 is the season, then a slash that only separates the season from the name of the league, then E0, which is the code of the league, and then the .csv extension. That's the whole link, and now, instead of writing E0, we're going to create a variable, so we can dynamically extract all this data: instead of extracting only one link, we're going to extract all the links in a for loop. These are the links that belong to these five CSV files, and instead of extracting them one by one, we're going to do it in a for loop. Let's do it. First I'm going to create a root variable: the root is going to be this chunk, so I write root equal to, copy this, and paste it here. This is my root, because it's what all these links have in common, and it's kind of static, it's not going to change no matter what. I run this, and now my root is equal to this chunk of the link. Now, to make the for loop, we scroll down, and first I'm going to
create a list, and this is going to be a leagues list. I write leagues, then equal to, open square brackets, and inside we write the codes that these files use: instead of saying Premier League, Championship, League One, the names are E0, E1, E2, and so on, so I write E0, then E1, then E2, then E3, and then EC. We have the five elements, and they are in this leagues list. Now I'm going to loop through this list: as you might remember from the Python crash course, we loop by writing for league in leagues, as you see here. I press Enter, and under this indent we write the code that will be performed on every iteration. We have the code already, it's the same code we used in the previous video, the pd.read_csv with a link, so I copy it and paste it inside the for loop, and now I replace some parts of the link with the variables we created. Instead of this first chunk, I write the root variable: I press plus, write root, and the root replaces this part of the link. I delete this slash, because the root already includes the trailing slash, as you can see. Then I separate the season from the slash: plus, quotes, plus, quotes, so now I have the season, the slash, and, I almost forgot, the league. Quotes, then plus, then quotes: I have the season and the .csv extension. Now I delete this league code, because we have the league variable, so I copy it and put it here. What we're doing is extracting the CSV files that belong to the elements in the leagues list: for example, we start with E0 because it's the first element, and we get the root, the season, and the league E0, and we're going to extract this one. In
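The link building inside the loop can be sketched like this (the root below is a placeholder, since only the structure of the link matters here):

```python
# Hypothetical root; the real one is the common prefix copied from the website
root = "https://example.com/data/"
season = "2122"
leagues = ["E0", "E1", "E2", "E3", "EC"]

# One download link per league code: root + season + "/" + league + ".csv"
links = [root + season + "/" + league + ".csv" for league in leagues]
```

Each element of links is what gets passed to pd.read_csv on that iteration of the loop.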
that way we're going to get the CSV files from E0 to EC. Now we have to assign a name to this data frame, so I write equal to and df: df is going to be the name of this data frame. Next we have to create an empty list to store all this data: I write frames, the name of my list, equal to square brackets. These square brackets represent an empty list, and I'm creating it because I want to store all these data frames inside it. To do that we write frames.append: we use the append method to store each data frame inside the list, so append, and inside we write df. What we're doing is storing each data frame in the frames list, and we do this because in each iteration we would otherwise lose the values inside df: for example, on E0 we get that CSV file and it's stored in df, but in the next iteration, when we get the CSV file of E1, it's stored in df again, which means we'd lose the data we extracted for E0, replaced with the data of E1, and so on. To keep the data of each iteration, we append the data frames to a list, and that's what this line of code does. Okay, now I run this code and wait a couple of seconds until it's done. If you get an error here, you have to change the encoding in pd.read_csv: I'm going to do it, because I got the error, and I have to set the encoding to unicode_escape for this to work properly. Now I run it again and everything was extracted, so just add this encoding argument in case you get the error that I showed before. Now let's see these data frames. As you might remember, the data frames were stored in this frames list, so I copy the list that was initially empty, and it's not empty anymore, because we stored the CSV files of these five leagues. Now
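Putting the loop together (again with io.StringIO standing in for the remote files, so nothing here touches the network; team names and scores are made up):

```python
import io

import pandas as pd

# Three fake "remote" CSV files, one per league code
fake_files = {
    "E0": "HomeTeam,FTHG\nArsenal,2",
    "E1": "HomeTeam,FTHG\nFulham,1",
    "E2": "HomeTeam,FTHG\nWigan,3",
}

frames = []  # empty list that will collect one data frame per iteration
for league in ["E0", "E1", "E2"]:
    # In the video this line is roughly:
    # df = pd.read_csv(root + "2122" + "/" + league + ".csv",
    #                  encoding="unicode_escape")
    df = pd.read_csv(io.StringIO(fake_files[league]))
    frames.append(df)  # keep this iteration's data, otherwise df is overwritten
```

Without the append, only the last league's data would survive the loop.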
let's check the length. We use the len function, and we're supposed to get five elements, because we expect five CSV files from these five links. I run this, and yes, we got five. Now, to check each data frame, we have to index this list: as you might remember, we write the name of the list followed by square brackets and then the index. If we want the first element we write the number 0, so I press Ctrl Enter, and this is the first element. As you can see it's E0, the code of the league, and yes, it's supposed to be E0 because that's the first element here. Now let's get the last element, which should be index 4, counting 0, 1, 2, 3, 4, so we write 4 here, press Ctrl Enter, and we get the data of EC, the element we're supposed to get. Great. So far we extracted the CSV files of these five leagues, but as you can see on the website, we only extracted them for this season, 21/22. If we want to get the CSV files of the other seasons, for example 20/21 and 19/20, we have to make another for loop, in this case for the seasons. I'm going to show you how, and to do that I'm going to use the same code: I copy it, and now let's make the for loop to extract multiple seasons. I paste the code here and organize it, and now it's ready. Okay, let's write the for loop that helps us extract the CSV files from different seasons. I'm going to write this new for loop around the original one we had: I press Enter and write for season in, and here, to get the seasons, I'm going to use a range. I write range, parentheses, the range function, and we write the minimum and maximum values of this range. In this case I want to extract the seasons from 15
to 20, so I write 15, comma, and instead of 20 we have to write 21, because range excludes the last element, so it's like 21 minus 1. I write the colon, and now I move this block of code inside the new for loop by pressing Tab, so now it's inside. Let me show you what I'm doing: I copy this so you can see it in detail, and I print season. I run this, and here I didn't type it correctly, so now season, and as you can see all the seasons are printed: we get 15, 16, up to 20. This is great, because we want to build the season format we saw in the link before, and we can build it using this season variable, because we already have the first year. For example, for 2021 until 2022, the link only says 2122, so we already have the first year and we only have to create the second one, and we can get year one and year two together by concatenating the seasons. Let me show you how. First we have to make this a string, because right now it's a number, and to concatenate it the object has to be a string, not a number. Then, to concatenate, we use the plus operator and write str again, parentheses, and season plus one, because the first part is the first year and the second part should be that year plus one, for example 21 and 22. I run this and we see the result: we got 1516, 1617, and so on. Now, instead of writing 2122, which only allows us to extract the CSV files from that one season, we copy this and replace it: we delete this and paste that, so we have the format we wanted, but now the values are going to change with the for loop, first 1516, then 1617, and so on, and we're going to get the CSV files from the past six seasons, which is what we wanted. So now we have to add
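The season-string concatenation just described boils down to this:

```python
# Build the two-digit season pairs used in the links, e.g. 15 -> "1516";
# range(15, 21) covers 15 through 20 because the upper bound is excluded
seasons = [str(season) + str(season + 1) for season in range(15, 21)]
```

Each pair is the chunk that replaces the fixed "2122" in the link, so the loop walks through six seasons instead of one.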
one more little detail, and this is a new column, because right now the data frames don't have a column that helps us recognize which season a data frame belongs to. For example, in this data frame we don't have a column that says season or year, we only have the date. So, to easily recognize which season corresponds to a data frame, we're going to add a season column, and I'm going to do that with the insert method. We've seen different ways to add a new column, and we're using the insert method here because I want this new column to be at index 1, and the insert method allows us to put a column at a specific index. I write df.insert, then the index where we want to put this new column, in this case 1, then the name of the new column, which I'm going to name season, and then we introduce the values. I only want this season column to have the value of the first year, or well, the first season, let's call it: the first year, which in this case is 2015, or only 15, not the second one as we had before. So here I write season, and I get this value because we're looping through this range, so it goes into my season column. Finally, to avoid making the video too long, I'm going to delete some of the elements so we get fewer CSV files. For example, I remove E1 and EC: before there were five leagues, and with six seasons that's 6 times 5, so we would get around 30 CSV files and it would take some time, so with three leagues it's going to be faster. You can also remove the encoding argument, because it works fine with these three leagues; it's only there for the sake of this video. Now I run this code and let's wait a
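The insert step can be sketched on a tiny made-up data frame:

```python
import pandas as pd

# Hypothetical match data without any season information
df = pd.DataFrame({"HomeTeam": ["Arsenal"], "FTHG": [2]})

season = 15  # first year of the season, as in the loop variable
# Insert a "season" column at position 1, between HomeTeam and FTHG
df.insert(1, "season", season)
```

Unlike plain column assignment, insert lets you choose exactly where the new column lands, and a scalar value is broadcast to every row.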
couple of seconds until this is done. All right, the execution finished, and now we can check the content of the frames list. We're supposed to get 18 data frames, because there are three league codes and six seasons, so 18 in total. Let's verify this with the len function: I run len and we get 18, so it's correct. Now let's look at the first data frame: I write frames[0], and we're supposed to get the data frame of league E0 from season 15/16. I run it, and we have season 15 and league E0, so it's working fine. Now let's quickly check the last element: there are 18 elements in the frames list, so the last one should have index 17. I write frames[17] and we get season 20 and league E3, and E3 is indeed the last league code and 20 the last season, so it's correct. And that's it: in this video we learned how to extract multiple CSV files using pandas. Okay, in this video we're going to store all the data frames extracted with pandas in a dictionary. Dictionaries help us manage data properly, and in this case I created a dictionary called dict_countries, where each key is the actual name of a league and the value is the code this website uses for that league. For example, Premier League has the code E0, and Championship I think has E1. If we want to access one code, say SP1, we just write the name of the key, and since it's a dictionary we get back the value. I run the dictionary, then this lookup, and I get SP1, because that's the value of the Spanish La Liga key. In the same way I created this dictionary, I'm going to
create another dictionary, in this case called dict_historical_data, to store all the data frames we extracted in the previous video, where we pulled multiple seasons of CSV files for multiple leagues. With a dictionary we won't have to use list indexes as we did before; we can just use the key that holds the actual name of the league. The first thing to do is copy the code we used to extract CSV files from multiple seasons and leagues, so I copy what we created in the previous video and paste it here. Now I'm going to modify it so we loop through the dictionary: before, we looped through a list of codes, but now we loop through the dict_countries dictionary I created earlier. So I write: for league in dict_countries. This dictionary contains leagues like Spanish La Liga, Bundesliga, and English Premier League, along with the code of each league; for example, English Premier League has the code E0. I collected these codes manually, and you can get them yourself by going to the website and copying the link address, which contains SP1, D1, or whichever league code you want. Now, a quick recap of what we get when we loop through a dictionary: for league in dict_countries, then print(league). If we run this we get the keys, Spanish La Liga, German Bundesliga, and so on. If we want the values instead, we have to use the dictionary's values syntax; I press Ctrl+Enter and we get the values. So now we can use that syntax to replace the league code, and I'm going to show you.
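As a minimal sketch of the keys-versus-values point just described, assuming the league names and codes quoted in the video:

```python
# Looping over a dict yields its keys; indexing with a key (or calling
# .values()) yields the codes stored as values.
dict_countries = {"Spanish La Liga": "SP1",
                  "German Bundesliga": "D1",
                  "English Premier League": "E0"}

for league in dict_countries:
    print(league)                            # prints the keys (league names)

print(list(dict_countries.values()))         # ['SP1', 'D1', 'E0']
print(dict_countries["Spanish La Liga"])     # SP1
```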
This league variable represented the code of the league. Before, we used the leagues list, which I'll uncomment, containing codes like E0, E2, and E3, and we looped through that list. But now we're using a dictionary, so to get the same codes we have to use the dictionary-value syntax: I double-click the league variable, delete it, and paste the new dictionary syntax in its place, so this link is ready. The next step is to concatenate all the seasons that belong to the same league; for example, for Spanish La Liga we have to concatenate all the seasons from 15 to 20. To do that we use the concat method, which we haven't used so far, but it's really simple, so let me explain how it works. We write pd.concat, open parentheses, and pass the object we want to concatenate, in this case the frames list, so I copy that and paste it inside. As you might remember, the frames list contains the data frames of all the seasons in this range, 15/16 up to 20, so with pd.concat we can combine all of them into a single data frame. I assign this to a new variable, df_concat, which represents all the concatenated data frames, all the seasons from 15 to 20 in one data frame. By the way, we put this line outside the seasons for loop, because if it's inside, we don't let the frames list finish appending the data frames from season 15 to season 20; so it goes outside the loop that iterates over the seasons. And now that we have
this df_concat that represents all the seasons of a league, we can store it in a dictionary, in this case the dict_historical_data dictionary I mentioned before, which will contain all the data frames we want. I assign it by writing the name of the dictionary, opening square brackets, writing the name of the league, which is stored in the league variable, and setting that equal to df_concat. What we're doing is creating a key with the name of a league whose value is df_concat: the league is the key, df_concat is the value, and df_concat contains all the data frames from season 15 to season 20. Great, this is almost done, but we have to make one little change: we have to move the empty frames list inside the loop, and let me explain why. If the empty frames list is created outside the loop, as it was before, then every time we append a new data frame we accumulate all the data of the previous league. For example, we start with Spanish La Liga, iterate from season 15 to season 20, get all that data, and concatenate it; that's fine. But when we move to the next key, say German Bundesliga, the Spanish La Liga data is still in the list, and when we call concat we'd get a data frame containing both Spanish La Liga and German Bundesliga. We don't want that; we want each league's data frame to be independent. So we have to discard the previous league's data, and we do that by creating the empty frames list inside the loop: every time we iterate over the dictionary, we create a fresh, empty frames list.
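Putting the pieces together, here is a minimal sketch of the loop structure just described. Tiny stand-in DataFrames replace the downloaded CSVs (the real code builds each frame with pd.read_csv from the football-data URL), so the column names here are illustrative assumptions:

```python
import pandas as pd

dict_countries = {"Spanish La Liga": "SP1", "German Bundesliga": "D1"}
dict_historical_data = {}

for league in dict_countries:
    frames = []                                   # reset INSIDE the league loop
    for season in range(15, 21):                  # seasons 15..20 (21 excluded)
        season_code = str(season) + str(season + 1)   # "1516", "1617", ...
        df = pd.DataFrame({"Div": [dict_countries[league]],
                           "code": [season_code]})
        df.insert(1, "season", season)            # first year only, at index 1
        frames.append(df)
    dict_historical_data[league] = pd.concat(frames)  # one frame per league

print(dict_historical_data["Spanish La Liga"]["season"].tolist())
# [15, 16, 17, 18, 19, 20]
```

Because frames is re-created at the top of the outer loop, each league's concatenated frame contains only its own seasons.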
Now when we get to the second iteration, for example German Bundesliga, by the time we reach this line of code the frames object is an empty list again. That means the data stored for Spanish La Liga is gone from the list, which is what we want; we're not actually losing it, because it's already stored in the dictionary. Great, everything is ready, so I press Ctrl+Enter and wait a few seconds, maybe a minute. Okay, the execution finished, so let's check the data stored inside the dictionary I created. If I display it directly it's not very readable, so first I'll show the keys with the keys method, and as you can see, the keys are the actual names of each competition: Spanish La Liga, English Premier League, and so on. To access one of the data frames we just use the usual syntax for getting a value from a dictionary. For example, to get the English Premier League data frame, we write the name of that key, run it, and we get a data frame containing the data of seasons 15, 16, 17, 18, 19, and 20; we can also see it's the English Premier League because of the E0 code for that competition. This is an easier syntax than indexing, because an index doesn't tell you much about the data. In most cases it's more convenient to store data frames in a dictionary, because it's easy to memorize a key name and not so easy to memorize an index into a list. And that's it: in this video we learned how to manage all
the CSV files we extracted using dictionaries. Okay, in this video we're going to learn how to filter a data frame based on one condition. In pandas we can filter a data frame based on conditions, just like in Microsoft Excel, but here we use methods. Let's start by importing pandas as usual, import pandas as pd, and this time we'll use a different dataset: the laptop-price data frame I have here, which you can download from the notes in the description. We read it with pd.read_csv and assign it to a variable named df_laptops. I run this code and look at the first three rows, and now let's check the columns this data frame has. First there's an ID; IDs are usually unique, so you won't find duplicate values. Then the company the product belongs to, then the name of the product, MacBook Pro and other laptops, then the type of laptop, ultrabook, notebook, and so on, then the screen size in inches, and then some other columns we won't use as much, like screen resolution, CPU, RAM, memory, and GPU. The price in euros, though, we will use: some of the columns we'll filter on in this course are the price in euros, the company, the product, and the screen size in inches. That was a quick review of the data frame, and now let's filter it based on one condition. To see how filtering works in pandas, let's solve this task: find which rows have the word Apple in the Company column. So we have to make a comparison between two objects: we have to compare the
Company column with the string Apple. In Python, as you might know, we compare with the equal sign written twice, which means comparison. For example, if I write 1 == 2 and run it, we get False because the values are different, but 1 == 1 gives True because they're the same. The same way we compare numbers, we can compare other types of objects, like Series. So I write the name of the data frame, df_laptops, and then, to select a Series, also known as a column, I open square brackets with quotes and write the name of the column. That gives us the Company Series, and we compare it to the string Apple: we're asking whether the values inside the Company column are equal to Apple. For example, in the first row it compares Apple == Apple, which is True; the second row, Apple == Apple, True; the third row, HP == Apple, False. That's how we build a new Series of True and False values. Let's verify: I run it, and the first three values are True, True, False, and that's how all the rows were filled. Now that we have this Series, we can filter the data frame based on it. To do that, first I copy this code and paste it here, and now to filter the
data frame we follow this syntax: we write the name of the data frame and put the condition inside square brackets. The condition goes inside the brackets, the name of the data frame goes outside; that's the syntax. Now I press Ctrl+Enter and check whether the data frame was filtered: in the result, all the values in the Company column say Apple, and scrolling down they're all Apple, so we successfully filtered the data frame. Now one tip: sometimes, with larger data frames, it's not so easy to verify that every value matches. Here we could easily check because it's a small data frame, but for bigger ones we can use the value_counts method. We already covered what value_counts does, but in case you don't remember, it counts all the values inside a column. We write value_counts, open parentheses, with the Company column, run it, and we see there are 21 rows with the word Apple in the Company column, so we can easily see the categories in that column. And that's how you filter a data frame based on one condition. All right, now it's your turn to filter a data frame based on one condition. I made two exercises to help you practice what we learned in this video. The first exercise consists of finding which rows do not have HP in the Company column; this is similar to the example in the video, but instead of equal to, it's not equal to: the != symbol.
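The filter just described can be sketched on a tiny stand-in DataFrame, since the laptop CSV isn't included here (the real file's column names may differ slightly):

```python
import pandas as pd

# Stand-in for the laptop-price dataset used in the video.
df_laptops = pd.DataFrame({"Company": ["Apple", "Apple", "HP"],
                           "Price_euros": [1339.7, 898.9, 575.0]})

condition = df_laptops["Company"] == "Apple"   # boolean Series: True, True, False
apple_laptops = df_laptops[condition]          # keep only the True rows

print(apple_laptops["Company"].value_counts())
```

The boolean Series acts as a row mask: rows aligned with True survive, rows aligned with False are dropped.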
Once you build this condition, you filter the data frame as we did in the video. In the second exercise you have to find the laptops with a price over 2,000 euros, using a different column: Price_euros for exercise two, and the Company column for exercise one. Now you can pause the video to solve these two exercises on your own, and then continue watching to solve them together. Okay, let's start with exercise number one: find which rows don't have HP in the Company column. This is similar to what we've seen before, but instead of the equal-to sign we use the not-equal-to sign, an exclamation mark followed by the equal sign, !=. Let's test it: 1 != 2 gives True because the values are different, but 1 != 1 gives False because they're equal. Now we compare the Series with the string: I write df_laptops, open square brackets, write the column name Company, and on the right the name of the brand, HP. I almost forgot the exclamation mark, so I add it, and now it reads: this Series is not equal to HP. I press Ctrl+Enter to compare these two values, and we get the resulting Series. Let's quickly verify against the original data frame, showing only the first three rows: Apple != HP, True; then
again Apple != HP, True, and then HP != HP, False, so everything is correct. Now we put this condition inside square brackets to filter the data frame, as we've seen: the name of the data frame, square brackets, and the condition inside. We run it and get a new data frame where the Company column shows every company except HP: Apple, Acer, Lenovo, and others, but no HP. To verify which companies are in this column we can use value_counts again, with the name of the column inside the parentheses. I press Ctrl+Enter and we see how many rows have Lenovo, Dell, and so on; there are a lot of Lenovo laptops but none from HP, because our condition filtered HP out. Okay, now the second exercise: find laptops with a price over 2,000 euros. In this case we compare with the greater-than sign, which works like this: 1 > 2 gives False because 1 is not greater than 2, but 3 > 2 gives True. On the left we write the Series; I'll just copy the earlier syntax and paste it, and instead of Company I change the column to Price_euros, copying that column name and pasting it in. On the right we write the number 2000, so we're comparing the Price_euros column with the value 2000, to see which values inside this Series are
greater than 2,000. Let's press Ctrl+Enter and find out: the first three values are False, so they must be less than 2,000, and the fourth is True, so that one is greater than 2,000. Now we filter the data frame with the same syntax, run it, and quickly verify the Price_euros values: all the values in this column are greater than 2,000, and that's it; I hope you successfully solved these exercises. All right, in this video we're going to learn how to create a column based on one condition using the where method. First we start by importing pandas as usual and reading the same CSV file we worked with in the previous video. To create a column based on one condition we also have to import numpy, because the where method belongs to numpy, so we write import numpy as np and run it. To show you how the where method works, we're going to create price tiers based on the Price_euros column: we'll compare the prices against another price to label each laptop as either cheap or expensive. Let's use the number 2,000 and say that a laptop costing more than 2,000 is considered expensive, and one costing less is considered cheap. We can express this with a condition: as you might remember from the previous video, we build conditions by comparing two objects. I write df_laptops, then the name of the Price_euros column inside quotes, which gives us the Series; I run it, and now we compare it with the value 2,000. If a price is greater than 2,000 the result is True, and the values are compared one by one, so for example the first one is less than 2,000 and should be False. So here I run
this, and the first value is indeed False. That's what we learned in the previous video, but now we use np.where to decide which value replaces False and which replaces True. I write np.where, open parentheses, and the first argument is the condition we just built. The second argument is the value to set when the condition is True: here I want "expensive" when the price is greater than 2,000. The third argument is the value to set when the condition is False: "cheap" when the price is less than 2,000. So the second argument is the value for True and the third argument the value for False. I run it, and we no longer have True or False values; now we have "cheap" or "expensive". The preview only shows "cheap", but if we inspect the array in detail we'll find many "expensive" values too. Now we can assign this array to a new column, as we learned earlier in the course: I write the name of the data frame, df_laptops, then square brackets with the name of the column I want to create, Price_tier, and set it equal to the array by copying and pasting it. I run the cell, then show the first five rows of df_laptops with the head method, and in the last column we now have Price_tier with the values cheap or expensive. Let's quickly verify the condition.
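The np.where call described above can be sketched like this on a stand-in DataFrame (the column names are assumptions based on the video):

```python
import numpy as np
import pandas as pd

df_laptops = pd.DataFrame({"Price_euros": [898.9, 2537.5, 575.0]})

# np.where(condition, value_if_True, value_if_False)
df_laptops["Price_tier"] = np.where(df_laptops["Price_euros"] > 2000,
                                    "expensive",   # price above 2,000
                                    "cheap")       # price 2,000 or below

print(df_laptops["Price_tier"].tolist())   # ['cheap', 'expensive', 'cheap']
```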
The first one is less than 2,000, so it got the value cheap, since the condition is False; that's correct. The next one is greater than 2,000, so it got the value expensive, and the column was successfully created. Now we can count the values inside this new column with the value_counts method: write the name of the data frame, then value_counts, open parentheses, and the Price_tier column. Running it, we see that almost 1,100 laptops are considered cheap and only 137 are expensive. And that's how you create a column based on one condition with the where method. Okay, now it's your turn to create a column based on one condition with the where method by solving this exercise. You have to create an array based on the screen size, using the Inches column: let's say that if the screen is greater than 15 inches the screen is big, and if it's less than 15 inches the screen is small. So the condition is: screen size greater than 15, and your first task is to create an array based on it. Then you set that array as a new column, as we did in this video, show the first five rows, and count the values inside this new column, which I'm going to name screen_size, but you can name it whatever you want. Pause the video and try to solve the exercise on your own, then continue watching to see my solution. Okay, to solve this exercise I first create the condition that says the screen size is greater than 15: I write the name of the data frame, df_laptops, open square brackets, and write the column name. So here I
write Inches and compare it with 15, and that's my condition; it gives us False and True values. Now with the where method we can replace those False and True values with the categories we want: the first argument is the condition, then the value when the condition is True, "big" if the screen is larger than 15 inches, and the value when it's False, "small" if it's less than 15. That gives me an array containing the values small and big. Now I assign this to a new column: I write the name of the column I want to create, screen_size, and set it equal to the array by copying and pasting it. I want to show the first five rows of the data frame, but first I run this code, and I get an error because I missed an "s"; I add the "s", run again, and everything is okay. Now I show the first five rows with the head method, and in the last column we have the screen_size column. Let's quickly verify it: the first value of Inches is less than 15, so the condition is False and we should get small, which is correct; the third one is greater than 15, so it's True and we should get big, and we do. So the screen_size column was successfully created. Finally, let's count the values inside it with the value_counts method: I write value_counts, open parentheses, copy the column name, and paste it in. I run it, and as
we can see, there are 835 laptops considered big and 468 considered small, because their screens are less than 15 inches. And that's it; I hope you successfully solved this exercise. All right, in this video we're going to learn how to filter a data frame based on two or more conditions. Let's import pandas and read the CSV file we used in the previous videos, stored in the df_laptops data frame. Before filtering on two or more conditions, a quick recap of filtering on a single condition: in the previous video we learned how to find, for example, Apple laptops, by comparing the Company Series with the Apple string using the double equal sign; running it gives True or False values. The same goes for laptops that cost more than 2,000 euros, comparing the Price_euros Series with the number 2,000 using the greater-than sign; running it gives another array of True or False. Now, what if we want Apple laptops that cost more than 2,000, condition one and condition two together? In that case we build a multiple condition, using either the AND or the OR operator. Let's find Apple laptops that cost more than 2,000 euros: first I copy the single condition that finds Apple laptops and paste it here, then I copy the second condition, laptops costing more than 2,000 euros, and paste it next to it. So now we have condition one and condition two, and to filter a data frame on multiple conditions we first have to separate them with parentheses: I write parentheses, and inside the parentheses is
my first condition, and I do the same for the second condition: parentheses, with the second condition inside. Now we use either the AND or the OR operator. We want Apple laptops that cost more than 2,000 euros, so both must hold: it has to be an Apple laptop AND cost more than 2,000, which means the AND operator, written as &. So the pattern is: each condition wrapped in parentheses, with the logical operator in the middle. I run it and we see the result, an array of True and False values. To filter the data frame, we copy this multiple condition, write the name of the data frame, df_laptops, followed by square brackets, and paste the multiple condition inside. I run it and get an error because I misspelled laptops, missing the p; I fix it, run again, and everything is fine. Let's verify the result: the Company column should contain only Apple, and it says Apple and only Apple, so the first condition was evaluated correctly; the second condition says the price should be greater than 2,000, and checking the price column, all the values are greater than 2,000 euros. So we successfully built this multiple condition and filtered the data frame. Before we continue with the OR operator, you should know that plain Python has the and and or keywords as logical operators, but pandas has special operators: instead of writing and, pandas uses the & sign we used
in this example, and instead of or, pandas uses the | sign. That's just the syntax pandas requires, so try to memorize it. Scrolling down, here is the OR operator: let's find laptops that are Apple or Dell. As you might expect, we have to use the OR operator. I have the individual conditions: the first is the Company Series equal to Apple, the second is the Company Series equal to Dell. I build the multiple condition from these simple ones: the first in parentheses, then the second in parentheses, grouped, and between them the | operator, because we want either Apple laptops or Dell laptops. I run it and we've built the multiple condition; now let's filter the data frame with the same syntax, copying the code, writing df_laptops, opening square brackets, and putting the multiple condition inside. I run it and check the result: the Company column should contain only Apple or Dell, and we see Apple at the top and Dell at the bottom; the data frame is a bit long so we can't see the middle, but apparently the column only has Apple and Dell. Let's verify with the value_counts method on the Company column: there are 297 Dell laptops and 21 Apple laptops, so we successfully filtered this data frame based on two conditions. Okay, finally, let's challenge ourselves with one last example.
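The & and | filters from this video can be sketched together on stand-in data (a tiny hypothetical frame in place of the laptop CSV):

```python
import pandas as pd

df_laptops = pd.DataFrame({"Company": ["Apple", "Dell", "HP", "Apple"],
                           "Price_euros": [2537.5, 800.0, 2100.0, 1300.0]})

# AND: each condition in its own parentheses, joined with &
apple_over_2000 = df_laptops[(df_laptops["Company"] == "Apple") &
                             (df_laptops["Price_euros"] > 2000)]

# OR: each condition in its own parentheses, joined with |
apple_or_dell = df_laptops[(df_laptops["Company"] == "Apple") |
                           (df_laptops["Company"] == "Dell")]

print(len(apple_over_2000), len(apple_or_dell))   # 1 3
```

The parentheses around each condition are required because & and | bind more tightly than == and > in Python.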
In this case we have to find laptops from Apple or Dell that cost over €2,000. There are three simple conditions: Company equal to Apple, Company equal to Dell, and price greater than €2,000. First we build the Apple-or-Dell part: each condition goes in parentheses, joined by the or operator, since we want either Apple or Dell. Then we attach the third condition, price over €2,000, also in parentheses, using the and operator, because the laptops also have to cost over €2,000. This is almost finished, but if we leave it as it is we can get unexpected behavior, because Python would group the conditions in an undesired way. To give clear instructions, we add one more pair of parentheses around the Apple-or-Dell condition so that it's evaluated first. A quick recap: the first multiple condition gives laptops from Apple or Dell; grouping it in parentheses makes it evaluate first; the third condition then restricts the result to laptops that cost over €2,000. Hopefully that's clear.
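Here is a minimal sketch of the grouped condition (toy data again; Company and Price_euros are the assumed column names). The extra parentheses matter because in pandas & binds more tightly than |:

```python
import pandas as pd

# Toy data standing in for the course's laptop dataset.
df_laptops = pd.DataFrame({
    "Company": ["Apple", "Dell", "HP", "Dell"],
    "Price_euros": [2899.0, 499.0, 2500.0, 2456.3],
})

is_apple = df_laptops["Company"] == "Apple"
is_dell = df_laptops["Company"] == "Dell"
over_2000 = df_laptops["Price_euros"] > 2000

# Group (Apple or Dell) first, then require the price condition as well.
# Without the grouping parentheses this would parse as
# is_apple | (is_dell & over_2000), which is a different filter.
result = df_laptops[(is_apple | is_dell) & over_2000]
```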
Now I copy this and filter the data frame based on that condition: the name of the data frame, square brackets, and the condition pasted inside, then Ctrl+Enter. Let's quickly verify the results: the Company column contains only Apple and Dell, and every value in the Price_euros column is over €2,000, so we successfully built the multiple condition and filtered the data frame. And that's how you filter a data frame based on two or more conditions in pandas. Welcome back. In this video we're going to create a conditional column from more than two choices using the select method. As usual, we import pandas and read the laptop price CSV file we've been using in this course. This is similar to a previous lecture where we created a conditional column from only two choices, cheap and expensive, using the where method; this time we have more choices, like affordable, expensive, and too expensive, and when you have more than two choices you have to use the select method, which I'll teach you in this video. First we import numpy as np. Then, to create an array based on multiple choices, we create two variables. The first is the conditions variable, which is a list, and the second
is the values variable, also a list. The values list holds all the labels we're going to assign based on the conditions: too expensive, expensive, affordable, and cheap. Now we create one condition for each of them, using the df_laptops data frame and the Price_euros column, since we'll decide whether a laptop is too expensive, expensive, affordable, or cheap based on its price. The first condition: if a laptop costs more than €3,000 it's too expensive; this corresponds to the first element in the values list. The second condition is for expensive: more than €2,000 and at most €3,000. That's two comparisons, so as you might remember we separate them with parentheses, and because the price has to satisfy both at the same time, we join them with the and operator to form the range. Now I add a comma, press Enter, and copy the line so I can write the third range faster: a laptop is affordable if it
costs more than €800 but at most €2,000, and the last one is cheap, when the price is €800 or less. That's it, we have the conditions and the values. Running the cell first gives an error because I forgot a comma; after adding it, both variables are ready. To create the conditional column we write np.select, open parentheses, and pass two arguments: the conditions variable and the values variable. Running just that returns an array, which will be our new column; in it we can see values like affordable and cheap matching the prices in the Price_euros column. To attach it, we write the name of the data frame, square brackets, the new column name in quotes, price tiers, and assign the array to it. Scrolling to the last column, the price tiers column is there, with values like affordable, cheap, expensive, and so on. Let's check that the values were assigned correctly: the first laptop costs €1,300, which falls in the third range we created, between €800 and €2,000.
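The whole np.select workflow above can be condensed like this, with a toy price column standing in for the real Price_euros data (the tier labels and ranges come from the video):

```python
import numpy as np
import pandas as pd

# Toy prices in place of the real laptop dataset.
df_laptops = pd.DataFrame({"Price_euros": [3200.0, 2500.0, 1300.0, 600.0]})

price = df_laptops["Price_euros"]
conditions = [
    price > 3000,                       # too expensive
    (price > 2000) & (price <= 3000),   # expensive
    (price > 800) & (price <= 2000),    # affordable
    price <= 800,                       # cheap
]
values = ["too expensive", "expensive", "affordable", "cheap"]

# np.select pairs each condition with the value at the same position.
df_laptops["price tiers"] = np.select(conditions, values)
```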
The third range corresponds to affordable, and indeed the row says affordable. One more: this laptop costs €2,500, which belongs to the second range because it's greater than €2,000 and at most €3,000, and the second range corresponds to expensive; the row says expensive. So we successfully created a conditional column from more than two choices using the select method. Now let's count the elements inside the price tiers column: the name of the data frame, square brackets, the column name, then the value_counts method with parentheses. By the way, you can either pass the column name inside the parentheses of value_counts or select the column first as I did before; both have the same effect. The result shows that the majority of laptops are considered affordable. Now let's do an exercise so you can understand what we've learned much better. This exercise is similar to the one you solved before with the where method, but this time we use the select method to create four categories based on the screen size, using the Inches column. You're given the four ranges to follow; create the conditions and values variables, assign the result to a new column as we did in this video, show the data frame, and count the values inside the new column. Pause the video to try solving the exercise yourself, then continue watching to see my solution. All right, let's start by creating the conditions variable: conditions, equals, square brackets, and inside will go the conditions we're
going to create, plus the values list. The values are too big, big, small, and too small. For the conditions, we evaluate the Inches column of df_laptops. First condition: greater than 16 inches is too big. Second: greater than 14 but at most 16 is big; both comparisons go in parentheses, joined by the and operator. Third, copying the previous line and adjusting it: greater than 12 and at most 14 is small. And the last one: 12 inches or less is too small. With the conditions and values ready, I run the cell. Then we write np.select with the conditions variable as the first argument and the values variable as the second, which gives us an array, and we assign it to a new column named screen size: df_laptops, square brackets, screen size, equals the array. Showing the data frame, the new screen size column appears as the last one. Finally, let's count the elements inside this new column.
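The solution to this exercise can be sketched in the same shape as the price-tiers example, with toy Inches values and the ranges stated in the exercise:

```python
import numpy as np
import pandas as pd

# Toy screen sizes standing in for the real Inches column.
df_laptops = pd.DataFrame({"Inches": [17.3, 15.6, 13.3, 11.6]})

inches = df_laptops["Inches"]
conditions = [
    inches > 16,                      # too big
    (inches > 14) & (inches <= 16),   # big
    (inches > 12) & (inches <= 14),   # small
    inches <= 12,                     # too small
]
values = ["too big", "big", "small", "too small"]
df_laptops["screen size"] = np.select(conditions, values)
```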
For that I use the value_counts method: df_laptops, square brackets, the column name screen size, then value_counts with parentheses, and Ctrl+Enter shows that the majority of laptops are considered big. That's it, I hope you solved this exercise successfully. All right, in this video we're going to learn how the isin method works in pandas. We start by importing pandas and reading the laptop price CSV file we've been working with into the df_laptops data frame. Our first task is to select Apple or HP laptops; we've done this with other methods before, but now we'll use isin. We write the name of the data frame, df_laptops, then the column we want to filter on, Company, since it holds values like Apple and HP. Then we call the isin method: dot isin, parentheses, and inside we pass the data as a list, so square brackets with Apple as the first element and HP as the second. What we're saying is: select the Apple and HP elements from the Company column, filtering out everything else. Running it with Ctrl+Enter returns a series of True and False values, the result of comparing Apple and HP against the Company column; for example, the first three rows hold Apple, Apple, and HP, so all three come back True. That's basically how the isin method works.
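The isin step can be sketched as follows; the toy Company column stands in for the real dataset:

```python
import pandas as pd

# Toy stand-in for the course's laptop dataset.
df_laptops = pd.DataFrame({
    "Company": ["Apple", "Apple", "HP", "Dell", "HP"],
})

# isin builds a boolean mask: True wherever Company is one of the listed values.
mask = df_laptops["Company"].isin(["Apple", "HP"])

# Using the mask inside square brackets filters out every other company.
filtered = df_laptops[mask]
```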
Now let's filter the data frame based on this condition: the name of the data frame, df_laptops, square brackets, and the isin condition pasted inside as the filter. Running it with Ctrl+Enter, the Company column now only shows Apple and HP. To verify, we use value_counts with the Company column name inside the parentheses, and indeed only HP and Apple remain, so we successfully filtered on our first column. Now let's continue with multiple filtering: our task is to find notebooks or ultrabooks from Apple or HP. There are two conditions, notebooks or ultrabooks on one side and Apple or HP on the other, and we'll mix them with the operators we used in previous videos. We write the code for each condition, starting with filter one: notebooks and ultrabooks live in the TypeName column, so we write df_laptops, the TypeName column, then isin with a list containing Notebook and Ultrabook, copying the values from the data frame since they're written with capital letters there.
That gives us filter one, selecting Notebook and Ultrabook from the TypeName column. The second filter is Apple or HP, and we already built it: Company, isin, Apple or HP; I copy that code and assign it to filter two. Now we link filter one and filter two with a logical operator, choosing between or and and. Here we need the and operator, because we have to find notebooks or ultrabooks that are also from Apple or HP; both conditions must hold. So we write filter one, the and operator, then filter two. I forgot to run the filter definitions first, so I run those, then run the combined condition and get a boolean series. To filter the data frame, we write its name, open square brackets with the combined condition inside, and run it; the resulting data frame should have only Apple and HP in the Company column and only Notebook and Ultrabook in the TypeName column, which you can verify with value_counts as before. And that's it: in this video we learned how the isin method works in pandas. All right, in this video we're going to learn how to find duplicate rows with the duplicated method. We start by importing pandas and reading the laptop price CSV file. The column for our first example is laptop_ID; as you might know, IDs are unique, so we shouldn't have any duplicate ID in this data frame. Let's see whether this column has duplicated values.
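Before moving on, the two-filter task (notebooks or ultrabooks from Apple or HP) can be sketched in miniature, with TypeName and Company as the assumed column names:

```python
import pandas as pd

# Toy rows standing in for the real laptop dataset.
df_laptops = pd.DataFrame({
    "Company": ["Apple", "HP", "Dell", "Apple"],
    "TypeName": ["Ultrabook", "Notebook", "Notebook", "Gaming"],
})

filter_1 = df_laptops["TypeName"].isin(["Notebook", "Ultrabook"])
filter_2 = df_laptops["Company"].isin(["Apple", "HP"])

# Both conditions must hold at once, so the filters are linked with &.
result = df_laptops[filter_1 & filter_2]
```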
We're going to use the duplicated method, starting with duplicates in a single column or series, laptop_ID. We write the name of the data frame, df_laptops, then the duplicated method, with the column name pasted inside the parentheses; that's all we need. Running it returns a series of True and False values, and we can filter the data frame with it: df_laptops, square brackets, and the condition inside, just as in previous videos. This shows whether there is any duplicated value in the laptop_ID column. Running it, we get a data frame with only the column headers and no rows, which means laptop_ID has no duplicated elements; if there were any, they would appear here. Great. Now let's find duplicates in two or more columns, specifically Product, TypeName, and Inches. We write df_laptops, then duplicated, open parentheses, and inside a list of the three column names, each in quotes and separated by commas. What this does is find rows that
have duplicated elements in all three of these columns at the same time. Running it gives another series of True and False values, and we filter the data frame the usual way: df_laptops, square brackets, and the series inside, call it a condition or a filter as you like. Pressing Ctrl+Enter, the filtered data frame shows some duplicated elements; for example, the first row has MacBook Pro in Product, Ultrabook in TypeName, and 13.3 in Inches, and another row has exactly the same values. To see this more clearly, I first store the filtered result in a variable named duplicated, then sort it with the sort_values method we learned in this course, passing a list of the columns to sort by, TypeName and Product. Sorted this way, the duplicated rows line up next to each other: for example, a 15-AC notebook at 15.6 inches appears twice with identical values, and the same goes for several other rows. That's how you find duplicates in two or more columns. One little detail I almost forgot: the duplicated method has a keep argument, set to first by default.
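Finding duplicates across several columns at once can be sketched like this (toy rows modeled on the MacBook Pro example from the video):

```python
import pandas as pd

# Toy rows; the third row repeats the first across all three columns.
df_laptops = pd.DataFrame({
    "Product": ["MacBook Pro", "250 G6", "MacBook Pro", "XPS 13"],
    "TypeName": ["Ultrabook", "Notebook", "Ultrabook", "Ultrabook"],
    "Inches": [13.3, 15.6, 13.3, 13.3],
})

# A row is marked True only when all three columns match an earlier row.
dupes_mask = df_laptops.duplicated(["Product", "TypeName", "Inches"])
duplicated_rows = df_laptops[dupes_mask]
```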
By default, then, the duplicated method leaves the first occurrence of a duplicated element unmarked, and we're going to see the keep argument in more detail in the following example. In this second example our task is to find the cheapest and the most expensive laptop of each company using the sort_values and duplicated methods, and here I'll explain much better how the keep argument works. We start by sorting the data frame: df_laptops, then sort_values, with parentheses, square brackets, and the columns to sort by, Company first and then Price_euros. We sort by price because we want the cheapest and most expensive laptops; this will make more sense when we add the keep argument to the duplicated method later. I overwrite df_laptops with the sorted data frame, so now it's sorted by company and by price: the cheapest laptop of each company comes first and the most expensive comes last. Keep that in mind, because I'll use it to explain the keep argument. Next, let's check the values in the Company column with the value_counts method: df_laptops, value_counts, parentheses, and the column name Company. Every company appears at least twice; for example, Dell has 297 rows, and scrolling down, even Huawei has two
rows, so we can say that every value in the Company column is duplicated. Okay, now let's find the duplicated values in the Company column: df_laptops, then duplicated, with Company inside the parentheses. Running this gives a series, and as you might remember, the keep argument defaults to first, which means only the first occurrence is left unmarked while the others are flagged as duplicates. Just to remind you of it, I write keep='first' explicitly; I don't need to, since it's the default, but this way you won't forget about it. I assign the result to a new variable named duplicated_first and filter the data frame with it: df_laptops, square brackets, duplicated_first. Running that shows the rows whose Company value is a duplicate. But I want the rows with non-duplicated values in the Company column, and for that we use the not operator, the tilde, which gives the opposite of the filter: duplicated_first marks the duplicated rows, and with the not operator we keep the non-duplicated ones. Running it, the Company column now shows distinct values, first Acer, then Apple, then Asus, and none of them is
repeated; all of them are different. To check this more closely, I select two columns, Company and Price_euros; I choose the price because our task is to find the cheapest and most expensive laptops. In this more compact data frame, every element in the Company column is unique, and these are also the lowest laptop prices per company, so every row is the cheapest laptop of its company. We didn't get this by accident: it comes from sorting with sort_values and from setting keep='first' in the duplicated method. To recap what happened: sorting by company and price put the cheapest laptops first and the most expensive last; then keep='first' means only the first duplicate per company survives the not-filter, which is exactly the laptop with the cheapest price. That's why every row in this data frame is the cheapest laptop of its company; for example, the cheapest Apple laptop costs €898. We can even check that this data frame has no duplicated values using value_counts. This step is optional, but running it shows that every company appears exactly once, Acer once, Lenovo once, and so on, so there are no duplicated values. That's how the keep argument works when it's set to first.
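The sort-then-deduplicate trick for the cheapest laptop per company, in miniature (prices are made up for the toy frame):

```python
import pandas as pd

# Toy rows standing in for the real laptop dataset.
df_laptops = pd.DataFrame({
    "Company": ["Dell", "Apple", "Dell", "Apple"],
    "Price_euros": [2456.3, 898.94, 499.0, 2858.0],
})

# Sort ascending so the cheapest laptop of each company comes first...
df_laptops = df_laptops.sort_values(["Company", "Price_euros"])

# ...then keep='first' flags everything except that first (cheapest) row,
# and ~ inverts the mask so only the non-duplicated rows remain.
duplicated_first = df_laptops.duplicated("Company", keep="first")
cheapest = df_laptops[~duplicated_first][["Company", "Price_euros"]]
```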
Now let's see how it works when keep is set to last. I copy the same code and only change the keep argument from 'first' to 'last', assigning the result to a variable named duplicated_last. With keep='last' we expect the most expensive laptop per company, because we sorted the data frame ascending, so the cheapest laptops come first and the most expensive come last, and keep='last' leaves only the last occurrence unmarked. Let's check whether that's true: I write df_laptops, open brackets, the not operator, and duplicated_last, to keep the non-duplicated rows, and press Ctrl+Enter. Then I select the Company and Price_euros columns to make the data frame smaller. In the result, every value in the Company column is unique, and each row is the most expensive laptop of its company; for example, the most expensive Apple laptop costs €2,858. We got this by setting keep='last' and by sorting with sort_values first. All right, now let's see how the keep argument works when it's set to False.
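Both behaviors, keep='last' and keep=False, can be sketched together on a toy frame (values made up, pre-sorted ascending as in the video):

```python
import pandas as pd

# Toy frame, already sorted ascending by Company and Price_euros.
df_laptops = pd.DataFrame({
    "Company": ["Apple", "Apple", "Dell", "Dell"],
    "Price_euros": [898.94, 2858.0, 499.0, 2456.3],
})

# keep='last' flags every duplicate except the last one per company,
# so ~ keeps the last (most expensive) row of each company.
most_expensive = df_laptops[~df_laptops.duplicated("Company", keep="last")]

# keep=False flags every occurrence of a duplicated value; since every
# company here appears at least twice, nothing survives the not-filter.
never_kept = df_laptops[~df_laptops.duplicated("Company", keep=False)]
```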
I copy the code again, replace 'last' with False, and change the variable name to duplicated_false. I run it, scroll down, and build the filter: df_laptops, square brackets, the variable duplicated_false, with the not operator in front to keep the non-duplicated values. Pressing Ctrl+Enter, the resulting data frame is empty, and this happens because keep=False keeps neither the first duplicated value nor the last one, and as you might remember, the Company column contains only duplicated values. In case you forgot, I run df_laptops with value_counts on the Company column again: every company is repeated at least twice, even Huawei, so every row's Company value is duplicated, which is why we got an empty data frame. And that's it: in this video we learned how to find duplicate rows with the duplicated method. Okay, in this video we're going to learn how to drop duplicate elements with the drop_duplicates method. I already have pandas imported and the CSV file we've been working with loaded into the data frame. The drop_duplicates method helps us remove duplicated elements in one or more columns; it's similar to what we did before, using the not operator to get the non-duplicated values, but accomplished in one call. To use it, we write the name of the data frame as usual:
df_laptops, then drop_duplicates, parentheses, and inside, square brackets with the name of the column to evaluate, Company, so the duplicated elements in the Company column get removed. After running it there shouldn't be any duplicated element left in that column. To verify, I select only the Company and Price_euros columns: no value repeats, Apple, HP, Acer, and so on, and we can confirm with the value_counts method on Company, which shows exactly one row per company, so no duplicated elements. Now I bring back the full data frame, because our next task is to find the cheapest and most expensive laptops, and drop_duplicates also has the keep argument; let's see how it works. First we sort the data frame ascending by Company and Price_euros, which, as you might remember, puts the cheapest laptops first and the most expensive last: df_laptops, sort_values, and inside parentheses the column names Company and Price_euros in quotes. We run this and overwrite df_laptops with the sorted result. Now we can get the cheapest laptops by setting keep='first' in drop_duplicates: df_laptops, drop_duplicates, the column name Company, and keep='first'. Actually, we don't need to write keep='first' because it's the
default, but I add it here so you don't forget about it. Now let's run it and select only the columns we're interested in: I write square brackets and paste the company and price_euros selection. I run this, and here we can see the cheapest laptop per company. If we want to find the most expensive laptop per company instead, we only have to set the keep argument to 'last': I copy the code, paste it, and write keep='last'. I run it, and we see the most expensive laptop per company, which is the same result we got when we used the not operator with the duplicated method in the previous video. All right, now I want to show you two more arguments. The first one is the inplace argument, which I think you're quite familiar with: it saves the changes we make to the original data frame. When it's True, the changes we make to the df_laptops data frame are saved; when it's False, the default, only a copy is returned. We want to save our changes, so I set it to True. The other argument is ignore_index, which is False by default, but we're going to set it to True so all the original indexes of the data frame are discarded. For example, the original index of this row is 1,189, but with ignore_index=True it becomes 0, the next row 1, then 2, 3, and so on. Now let's run this to try the two new arguments, but first I'm going to delete the column selection, because with inplace=True we're not getting a copy; we're saving the changes and updating the data frame itself. So I delete the selection, and now I only have this; I run it.
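The whole workflow described above can be sketched on a tiny made-up data frame (the company and price_euros column names follow the course's laptop data; the values are invented):

```python
import pandas as pd

# Toy stand-in for the laptops data; column names follow the
# laptop_price.csv file used in the course, values are made up.
df_laptops = pd.DataFrame({
    "company": ["Apple", "HP", "Apple", "HP"],
    "price_euros": [1339.0, 575.0, 898.0, 400.0],
})

# Sort ascending so the cheapest laptop per company comes first.
df_laptops = df_laptops.sort_values(["company", "price_euros"])

# keep="first" (the default) keeps the cheapest row per company;
# keep="last" would keep the most expensive one instead.
cheapest = df_laptops.drop_duplicates("company", keep="first")

# inplace=True saves the changes to df_laptops itself, and
# ignore_index=True discards the original index labels.
df_laptops.drop_duplicates("company", keep="last", inplace=True, ignore_index=True)

print(cheapest["price_euros"].tolist())    # cheapest per company
print(df_laptops["price_euros"].tolist())  # most expensive per company
print(list(df_laptops.index))              # fresh 0-based index
```

With keep='first' on the ascending sort we get the cheapest laptop per company, and with keep='last' the most expensive one, which matches the ~duplicated approach from the previous video.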
Now the data frame should be updated, so I'm going to show it: I write df_laptops, display it, and as you can see it starts with 0, 1, 2, and so on. Now I select only the columns we're interested in, company and price_euros, so we can clearly see what happened: we kept the last duplicated elements, and we also ignored the original indexes. And that's it; now it's time to solve an exercise to put all these concepts into practice. All right, in this exercise our task is to find the biggest and smallest screen size per company using the sort_values and drop_duplicates methods, and we also have to use the keep argument so we can get the smallest and biggest screen size. You can pause the video, try solving the exercise on your own, and then continue watching to see my solution. Okay, to solve this exercise, let's start by sorting the data frame: I write df_laptops, then sort_values, and the names of the columns to sort by, first company and then inches, because we want to analyze the screen size. I sort by these two columns and overwrite the original data frame, so now the smallest laptops are placed first and the biggest at the end. Next, let's choose the smallest laptop per company by setting keep='first': I write the name of the data frame, then drop_duplicates, open parentheses, write the name of the column to analyze, company, and add keep='first'. I run this code, and in the resulting data frame we shouldn't have any duplicated element in the company column; to clearly see the laptops with the smallest screen size per company, we only select these two
columns, so I copy the selection, open square brackets, and paste it. Before we look at the results, let's first read the CSV file again so we're sure we have the original data frame; we made some changes in the previous examples, so we want to start from scratch. I have the code that reads the laptop price CSV, so I read it again, giving us the original data frame, and then I rerun the sorting and the drop_duplicates cells. Now let's see the smallest laptops, and here they are. In case we want the laptops with the biggest screens, we copy this and only modify the keep argument: I paste it below and change it to keep='last'. I run it, and here we have the laptop with the biggest screen per company; for example, the biggest Apple laptop is 15 inches, and above we saw that the smallest Apple laptop is 11 inches. And that's it; I hope you solved this exercise successfully. All right, in this video we're going to learn how to get and count unique values using the unique and nunique methods. First we start by importing pandas as pd and reading the laptop price CSV file, which gives us this data frame, and now we're going to count the unique values of some columns in this data frame. Let me first explain what the unique method does: it returns the unique values of a Series object, and they are returned in order of appearance. Let's get the unique elements in the company column. To do that, first we write the name of the data frame, in this case df_laptops, then square brackets with the name of the column, company; this is the syntax of a series. Now we use the unique method: we write .unique(), and that's it, it's as simple as that. I run this, and here we get an array, and this array contains the
unique elements of the company column. Now let's see another example with the inches column, which, as you might remember, represents the screen size. We write the name of the data frame again, followed by square brackets, now with the column inches, then .unique() with parentheses. We run this and get an array with all the screen sizes: 13.3, 15.6, and so on. To count how many unique elements are in a column, we only have to use the len function: I write len, open parentheses, and inside put the code we created before. If we run this, we get the number of unique elements in the inches column, which is 18. And if we write company instead of inches, we get how many unique elements are in the company column, which is 19. That's it for the unique method; now let's check how the nunique method works. This one returns the number of unique elements in an object, excluding null values by default, so it's basically the same as the len approach, but as a pandas method. Let's use nunique: we write the name of the data frame, followed by square brackets and the column name, company, since first we want the number of unique elements in the company column; that gives us the company series, and then we write .nunique() with parentheses. We run this and get 19, so there are 19 unique elements in the company column, the same result we got with the len function. Now let's get the number of unique elements in the inches column: I copy the code, paste it, write inches instead of company, and run it. We get 18 unique elements in the inches column, the same as before with the len function.
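As a compact recap, here is a sketch of unique, len, and nunique on a small invented frame that reuses the course's column names:

```python
import pandas as pd

# Small stand-in for the laptops data (values are made up).
df_laptops = pd.DataFrame({
    "company": ["Apple", "HP", "Apple", "Acer"],
    "inches": [13.3, 15.6, 15.4, 15.6],
})

# unique() returns an array of unique values, in order of appearance.
companies = df_laptops["company"].unique()

# Counting them with len() ...
n_with_len = len(df_laptops["company"].unique())

# ... gives the same number as the dedicated nunique() method,
# which excludes null values by default.
n_with_nunique = df_laptops["company"].nunique()
```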
And that's it: in this video we learned how to get and count unique values using the unique and nunique methods. Welcome back. In this video we're going to learn the difference between selecting data with the loc and iloc methods. In pandas we can select data using loc and iloc; both have some similarities but also some differences, and in both methods, to get the data we want, we only need to pass the indexes and the columns. Before showing you how these two methods work in code, we're going to learn some core concepts and see some examples. First, the loc method. The loc method is label-based, so we specify rows and columns by their index and column labels. This means we write the name of the index or the name of the column to get the data we want with loc: the first argument is the index label and the second argument is the column label. Now let's check the iloc method. This method is integer position-based, so we specify rows and columns by their integer position values, and it uses zero-based integer positions, which means counting always starts at zero. So the main difference between loc and iloc is that with loc we only need the column and index labels, but with iloc we need to specify the integer positions: we have to locate the data inside the data frame and see in which position the row is and in which position the column is. To understand this much better, let's see an example. Here we have a data frame that I named df, and it has three columns, foo, bar, and baz, and six indexes, the first being 0 and the last 5. The 0, 1, 2, 3, 4, and 5 on the left are the indexes, and those are the index labels or index names, but on the right we can see the index position of each of them, so the
first element has index position zero and the last element has index position five. I wrote the names of the indexes this way on purpose, because I wanted to show the difference between the index label and the index position: the index label is the actual name you see in the index, and it can be words or numbers, while the index position can only be numbers and always starts at zero. Okay, now let's see some differences between the loc and iloc methods. In the first one you can introduce the label, the name of the index or the column, but in the second one you can only introduce the position of the index or the position of the column. For example, if you want to select a single element, you will see some differences between these two methods. With loc you write the name of the data frame, df, followed by .loc with square brackets, and inside the square brackets you write the index label; in this case I want the value at index 0, so that's why I specify that index label and write df.loc[0]. With the iloc method we write df.iloc, and inside square brackets we write the position of the index, in this case position zero, so we write the number 0. Now let's continue: selecting elements with a list is similar. We write df.loc, open square brackets, and inside write the list; in this case the elements are the index labels, 0 and 2. For the iloc method we write df.iloc, then square brackets, and inside them the list, but now the list holds the index positions, 0 and 2 as numbers. Okay, now to select elements with slicing we write df.loc, and inside square brackets the syntax start:stop; in this case the start element is 0, then we have the colon, and then the stop element, 2,
so those two elements are my start and stop. With the iloc method we write df.iloc, and inside square brackets we write the positions of the indexes, in this case the numbers 0 and 2. But here there is a big difference between these two methods: with loc, both the start and the stop element are included in the slice, whereas with iloc the start element is included, so 0 is in the slice, but the stop element is excluded. This means that with iloc we only get the data at index 0 and index 1, while with loc we get the data at indexes 0, 1, and 2, because both the start and stop elements are included. This is a big difference you have to keep in mind when you use either loc or iloc. Now, to finish this video, let's see how the following lectures will work. As you've seen, the loc and iloc methods are quite similar; the big difference is between the labels that loc uses and the integer positions that iloc requires. In the loc lecture we'll have a regular lesson, meaning I'll teach you how the method works with examples and we'll do it together. When we get to the iloc lecture, though, we're going to do something different: that lecture is going to be an exercise for you. The examples in the loc and iloc lectures are the same, so you kind of have to translate the code we use with loc into selections with iloc. After you try to solve those exercises on your own, we'll check my solution and compare the results. And that's it: in this video we learned the differences between the loc method and the iloc method.
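The inclusivity difference described above can be illustrated with the foo/bar/baz example frame:

```python
import pandas as pd

# The illustrative frame from the explanation: columns foo, bar, baz
# and six rows labelled 0 through 5.
df = pd.DataFrame({"foo": range(6), "bar": range(6, 12), "baz": range(12, 18)})

by_label = df.loc[0:2, "foo"]   # loc slices by label: rows 0, 1 AND 2
by_position = df.iloc[0:2, 0]   # iloc slices by position: rows 0 and 1 only
```

Even though both slices are written 0:2, loc returns three rows and iloc returns two, which is exactly the difference to keep in mind.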
Okay, in this video we're going to have a look at the data set we'll work with in this section. First we import pandas as pd as usual, so I run this first line of code, and then we read the CSV file we'll use in this section. To do that we use the read_csv method: I write pd.read_csv, open parentheses and quotes, and start typing the name of the CSV file, players, completing it with Tab, so the name is players_20.csv. I press Ctrl+Enter to see the values, but I added an extra s by mistake, so I delete it, run again, and now everything is fine. Here is the data frame, and we can see it contains information about soccer players, with details like their age, height, weight, nationality, and so on. Now I'm going to name this data frame: I write df = pd.read_csv(...), run it, show the data frame, and I'll also show you the columns we'll use in the next videos. Okay, now that we have the data frame, the first thing we're going to do is set short_name as the index. short_name is a column that contains the names of the soccer players, and if we set it as the index, then instead of 0, 1, 2, and 3 we'll see the players' names in the index, so we can identify the rows easily, which is going to help us a lot in the following videos. To set a column as the index we use the set_index method: we write set_index, open parentheses, and write the name of the column inside quotes, so I copy short_name and paste it here. If I run this, we can see that the new index is the short_name column, showing the players' names, so we can easily
identify each row, because it's not a number anymore but a player. Okay, now, this set_index method created a copy of the df data frame, and if we want to save the changes and update the data frame, we have to overwrite it: we can write df = df.set_index(...), or we can use the inplace argument, so I write inplace=True. Now if I run it, all the changes are saved, so we should get the data frame with short_name as the index; let's have a look. I run this, and we can see the index is named short_name and contains the names of the soccer players. Finally, we're going to select some columns that we'll use in the following videos. If I scroll down, you'll see the data frame has around 100 columns, and those are a lot of columns, so we're going to select about seven: the columns from long_name up to club. I'm going to copy those seven column names, and to select them we use double square brackets, so I write double square brackets and paste the names, adding quotes and commas to separate each column; after this I'll explain what each column means so you can understand the data set much better. I only have to add quotes around club and it's ready, so I run the selection and you can see there are only seven columns. Now let's have a quick look at each column. First we have long_name, the full name of the player, so this is the short name and this is the long name; then we have age; then dob, the date of birth of each player; then height, in centimeters, and weight, in kilograms; then nationality; and the club each player played for as of the year 2020.
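A minimal sketch of this setup, using a hypothetical two-row stand-in for players_20.csv (only the column names come from the description above; the values are illustrative):

```python
import pandas as pd

# Hypothetical two-row stand-in for the players data set.
df = pd.DataFrame({
    "short_name": ["L. Messi", "Cristiano Ronaldo"],
    "long_name": ["Lionel Messi", "Cristiano Ronaldo dos Santos Aveiro"],
    "age": [32, 34],
    "height_cm": [170, 187],
    "club": ["FC Barcelona", "Juventus"],
})

# set_index with inplace=True saves the change without reassigning df.
df.set_index("short_name", inplace=True)

# Double square brackets select a subset of columns as a DataFrame.
df = df[["long_name", "age", "height_cm", "club"]]
```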
And that's it. Now I'm just going to overwrite the value of this data frame so we only have these columns; here I overwrote the value, and now we have the data frame. Okay, now your task is to check this data frame in detail so you can easily understand the following videos. Okay, in this video we're going to learn how to select elements by index label with the loc method, and we'll start by importing pandas and doing everything we did in the previous video. I have that code here, so I run these three cells, and now we have the data frame we created in the previous video, with the seven columns we selected. We're going to start by selecting with a single value, following the syntax of the loc method, which includes square brackets with the row label, or index label, and the column label. Our first example is to get all the data about the player Lionel Messi. To do this we write the name of the data frame, df, then .loc, then open square brackets as the syntax says, and write the name of the player; I'm going to copy the name because it's written in a particular way, so I make sure everything is correct, and I paste 'L. Messi'. I run this and we get all the data about the player: his full name, his age, and so on. Now let's get a particular value of this data. For example, we can get the height of the player by writing df.loc and the index name, in this case the name of the player, 'L. Messi', and then the column label; we want the height, which is inside the height_cm column, so I paste that column name. We follow the syntax, run this, and we get the value 170, and that's
the height of this player in centimeters. Let's continue with the next example: in this case we have to get the weight of the player Cristiano Ronaldo. I'm going to copy the previous code and paste it as a reference, and in the index we have to write the name of the player. By the way, we set the short_name column as the index on purpose, so we can use those names instead of the numbers 0, 1, and so on; short_name helps us easily find data with the loc method. So in this case the index name is 'Cristiano Ronaldo', which I copy and paste, and then we want the weight in kilograms, so I write the weight_kg column, and now it's ready. I run it, and we can see that the weight of this player is 83 kilograms. Okay, now let's get all the rows of the height column. So far we got a specific value of the height and also the weight, but now we want all the rows of that column, and we can do that with a special symbol: the colon, which allows us to select all the elements in a row or in a column. Let's see how it works. First we write the name of the data frame followed by .loc; we want the height in the column part, so we write a comma and the name of the column, height_cm, and then in the index part we want all the rows, so we write the colon, which indicates that we're going to get all the elements, in this case all the elements of the index. We run this, and as you can see we get a series with the height of the players, with information about the height of every player listed here: for example, L. Messi is 170 cm, Neymar is 175, and so on.
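The selections so far can be sketched like this on a tiny frame with short_name as the index (heights and weights are illustrative, not taken from the real data set):

```python
import pandas as pd

# Toy frame mirroring the course data frame: short_name as index.
df = pd.DataFrame(
    {"height_cm": [170, 187, 175], "weight_kg": [72, 83, 68]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)
df.index.name = "short_name"

messi_row = df.loc["L. Messi"]                    # every column for one index label
messi_height = df.loc["L. Messi", "height_cm"]    # one row label, one column label
ronaldo_weight = df.loc["Cristiano Ronaldo", "weight_kg"]
all_heights = df.loc[:, "height_cm"]              # the colon selects all rows
```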
We can also use this symbol in the column labels, and I'm going to show you here: we have to get all the columns that correspond to the index L. Messi. I write df.loc, open square brackets, and write the name of the index, which is the name of the player, 'L. Messi'; then we have to write the column part, and in this case we want all the columns, so we need the colon, and I write it here. Now it's ready; we run it and get all the columns that correspond to the index L. Messi. This looks similar to what we obtained before by writing only the name L. Messi, and yes, it's the same, but in this case we used the colon, which gets us all the columns for this index. Okay, now let's see how to select elements with a list of values. In the previous part we selected elements with single values, but we can also use lists. In this case we want to get all the data about Lionel Messi and Cristiano Ronaldo, so we have two elements, and we can put these two elements inside a list. We create a list with square brackets, open the quotes, and introduce the first element, 'L. Messi', then a comma, and then the second element, 'Cristiano Ronaldo'. Now we have to put this inside the loc method: I write df.loc, open the square brackets that belong to loc, and paste the list. With this we're going to get all the elements inside the list; I run it, and we can see a data frame with the data about Messi and Ronaldo and all the columns. That's how you select elements with a list of values. Now let's get the height of Messi and Ronaldo; in this case we can create the list and also specify a column. Here we had seven columns, and we can specify just one; I'm going to show you. First I copy the code we created, and now comes the name of the
index; now we only have to write the name of the column we want, which is the height column, height_cm, so I paste it, and instead of getting a data frame with all the columns, we're going to get only one column. I run this, and now we only have the height of these two players. Next, let's get the height and the weight of Lionel Messi. Here we have to create a list of the column names, in this case a list with two of them: I open square brackets, write the name of the height column, height_cm, inside quotes, and then the second element, weight_kg. That's my list, so now I write df.loc and put it inside the loc method, but first I write the name of the index, which in this case is also the name of the player, 'L. Messi', then a comma, and then the list of column names. If I run this, we get the height and the weight of the player. All right, now let's get the height and weight of both Messi and Ronaldo, which is going to be like a mix of the previous examples. I write df.loc, square brackets, and inside write first the list of the index names, in this case Messi and Ronaldo; I have that list from the previous example, so I copy it and introduce it as the index names. Then we have to write the column names, here the height and the weight, which I also have inside a list, so I copy and paste it. Now we have these two lists inside the loc method; I run it, and we can see a little data frame with the height and weight of Messi and Ronaldo.
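The list-based selections above, sketched on the same kind of toy frame (values illustrative):

```python
import pandas as pd

# Toy frame with short_name as the index, illustrative values.
df = pd.DataFrame(
    {"height_cm": [170, 187, 175], "weight_kg": [72, 83, 68]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

players = ["L. Messi", "Cristiano Ronaldo"]

rows = df.loc[players]                               # a list of index labels
heights = df.loc[players, "height_cm"]               # list of labels + one column
messi_stats = df.loc["L. Messi", ["height_cm", "weight_kg"]]  # one label + column list
both = df.loc[players, ["height_cm", "weight_kg"]]   # two lists -> small data frame
```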
Okay, now we're going to learn how to select a range of data with a slice, using the syntax start:stop:step. As a side note, this is contrary to usual Python slices: in Python slices the stop element is excluded, but here in pandas both the start and the stop are included, so this is kind of an exception; just keep in mind that the start and the stop are included. That said, let's continue with the first example, where we have to make a slice over the column labels. I'm going to show you the data frame so you get a better idea of what we're going to do, and I'll use this little data frame. We're going to make a slice between the age column and the club column, so I want to select those six columns, and I can do that by slicing. I'll show you how to do it: we use the syntax, first the start element, the start column, which is age, then the colon, and then the stop column, club, so we have start:stop, 'age':'club'. Now we have to put this inside the loc method: I write df.loc, square brackets, and first introduce the names of the indexes we want to select; this time I have a list with Messi and Ronaldo as elements, named players, so I write the name of the list, players. Then we have to write the columns, so I cut the slice and paste it, giving us the slice from age to club. We run this, and as you can see we have the slice between age and club: only the columns between these two were selected, and the club column is included in the slice. That's how you make a slice with column labels, and we can also make a slice with index labels; I'm going to show you how. Okay, in this example we want to slice the index labels that
are between the number one and the number ten player. This data set sorts the players by their scores in the FIFA game, so the first player, the first row in the data set, is the best player according to the game, and the tenth player in the data frame is the tenth best. We can get the slice that represents the top ten players by selecting from index 0 to index 9, but first we have to find out which player is at index 0 and which is at index 9. For that we use the index attribute: I write df.index, and to get the first ten elements I add square brackets with :10, which represents the first ten elements of the index. If I run df.index you can see all the indexes, and if I add the square brackets we get only the first ten. So the best player in the game is Lionel Messi and the tenth best is Mohamed Salah. Now we can make a slice by writing both names: the start element will be 'L. Messi' and the stop element will be 'M. Salah', so I write 'L. Messi':'M. Salah'. What we have to do now is put this inside the loc method: I write df.loc, then square brackets, cut the slice, and paste it as the index part; then we have to write the name of the columns, and I created a list here in a variable named columns, so I copy it and paste it. If I run this, you can see that we get the first ten players in the data frame, from Lionel Messi at index position 0 to Mohamed Salah at index position 9, and that's how you select a range of data with a slice over the columns or the index.
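A sketch of label slicing with .loc, again on invented data that mimics the frame's column order:

```python
import pandas as pd

# Toy frame ordered like the FIFA ranking described above (values illustrative).
df = pd.DataFrame(
    {"age": [32, 34, 27],
     "height_cm": [170, 187, 175],
     "club": ["FC Barcelona", "Juventus", "PSG"]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

# With .loc, label slices include BOTH endpoints,
# unlike ordinary Python slicing.
top_rows = df.loc["L. Messi":"Neymar Jr"]   # all three rows
cols = df.loc[:, "age":"club"]              # every column from age to club, inclusive
```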
Finally, let's see how to select elements based on a condition. In the previous videos we learned how to write conditions in pandas, and now let's build conditions and select elements with them. Here we have a condition that says: select players with height above 180 cm. To create the condition, we write the name of the data frame, df, then open square brackets and write the name of the column we want to evaluate, in this case the height column, height_cm, which I copy and paste inside the square brackets. Then we have to compare this with the 180 cm, writing only the number: greater than 180. So we have the condition, and we have to put it inside the loc method: I cut it and place it as the first argument, where it's going to filter the index; then we have to write the names of the columns we want to select, and in this case they're inside my columns variable, so I copy and paste it. Now we're selecting the elements that satisfy the condition on the index and that belong to the columns inside the columns variable. I run this, and we can see a data frame with only the players above 180 cm; Lionel Messi, for example, is not here because he is 170 cm. Okay, now let's select elements based on multiple conditions. Here our task is to select players with height above 180 cm who are from Argentina, so condition one is the height and the second condition is the nationality. We write both conditions, and the first one we actually built already, height above 180, so I paste it; the first condition is done. Now the second condition: nationality equal to Argentina. I copy the pattern and write the name of the column, which is nationality, checking how it's written, and paste it; this has to be equal to the country 'Argentina', so I write it, and here we
have condition one and condition two. To satisfy both conditions together, we first wrap each one in parentheses to separate them, and then we use the and operator, which is represented by the & symbol. With condition one and condition two joined this way, we satisfy both conditions, and now we only have to insert this multiple condition inside the loc method, so I cut it and paste it there; this is going to help us select elements based on the condition. Now I only have to write the name of the columns, and instead of writing them one by one, I'm going to write only the colon symbol, which indicates that I want all the columns. I write the colon and press Ctrl+Enter to execute the code, and here we have the players with height above 180 cm from Argentina. As you can see, we don't have Cristiano Ronaldo anymore, because he's from Portugal, but we have other players from Argentina with height above 180 cm. And that's it: in this video we learned different ways to select elements with the loc method.
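The conditional selections from this lecture can be sketched as follows (the rows are invented; E. Martinez is a hypothetical example of a tall Argentine player, not a row from the real file):

```python
import pandas as pd

# Toy frame for the conditional selections (values illustrative).
df = pd.DataFrame(
    {"height_cm": [170, 187, 175, 189],
     "nationality": ["Argentina", "Portugal", "Brazil", "Argentina"]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr", "E. Martinez"],
)

# Single condition: players taller than 180 cm, all columns (the colon).
tall = df.loc[df["height_cm"] > 180, :]

# Multiple conditions combined with &; each condition needs its own parentheses.
tall_argentines = df.loc[(df["height_cm"] > 180) & (df["nationality"] == "Argentina"), :]
```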
Get the height of Lionel Messi: we write df.iloc, then square brackets, and inside, instead of the index and column names, we write positions. Lionel Messi is in the first row, position zero in the index, so we write 0, and for the height we count the columns: zero, one, two... height is position three, so we write 3. We run df.iloc[0, 3] and get the height of Lionel Messi. Next, the weight of Cristiano Ronaldo: I copy the previous code so we can go faster; Ronaldo's index is 1, and since height is column three, weight is column four, so instead of 3 we write 4, run it, and get that Ronaldo's weight is 83 kilograms. Then we have to get all the rows inside the height column, and here I use the colon symbol, which represents all the elements, either in the rows or in the columns. We write df.iloc, square brackets, the colon as the first argument so we get all the rows, and then the position of the height column, which we already know is 3 from the earlier exercise; we run it and get every row of the height column. Next we have to get all the columns that correspond to the index Lionel Messi, so we write df.iloc, then the index that corresponds to Messi, which is 0, and then the colon symbol, because writing the colon there gets us all the elements inside the columns.
So we press Ctrl+Enter, and we get all the elements, all the columns, that correspond to the index Lionel Messi. Okay, now let's select elements with a list of values using the iloc method. The first task is to get all the data about Messi and Ronaldo. We use df.iloc, open square brackets, and since we want the data of two players at the same time we create a list with the index of Messi, 0, and the index of Ronaldo, 1, and put that list inside the square brackets: df.iloc[[0, 1]]. We press Ctrl+Enter, and here we have all the data for these two players. Next we have to get the height of Messi and Ronaldo, so we write df.iloc, paste the list of indexes from the previous exercise, and then write the position of the height column, which is 3; Ctrl+Enter, and we get the height of Messi and Ronaldo. Next, the height and weight of only Lionel Messi: df.iloc again, Messi's index 0, and then the positions of height and weight, 3 and 4, as a list; Ctrl+Enter, and we get the height and weight of Messi. Finally, we have to get the height and weight of both Messi and Ronaldo, which is like a mix of the previous exercises: df.iloc, square brackets, first the list for Messi and Ronaldo, then the list with the positions of height and weight. Ctrl+Enter, and we have a little data frame showing the height and weight of both Messi and Ronaldo. All right, now let's select a range of data with a slice.
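The positional selections above can be sketched on a small stand-in data frame. Note that the column positions below differ from the full FIFA dataset (where height is column 3 and weight column 4); in this made-up frame they are 1 and 2, which is exactly why it pays to check df.columns before using iloc.

```python
import pandas as pd

# Stand-in data frame with assumed values; only the column names
# mirror the course's dataset
df = pd.DataFrame(
    {"age": [35, 37, 30],
     "height_cm": [170, 187, 175],
     "weight_kg": [72, 83, 68],
     "nationality": ["Argentina", "Portugal", "Brazil"]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

messi_height = df.iloc[0, 1]      # row 0 (Messi), column 1 (height_cm)
heights = df.iloc[:, 1]           # every row of one column
messi_row = df.iloc[0, :]         # every column of one row
pair = df.iloc[[0, 1], [1, 2]]    # lists of positions: two rows, two columns
first_two = df.iloc[0:2, 1:3]     # slice: start included, stop excluded
```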
Here you have to remember that slices use the syntax start:stop, where the start is included but the stop is excluded, so keep that in mind. The task is to get the slice of columns between age and club, using the index given here. First we translate this players variable: instead of the players' names we write their positions, 0 and 1. Then we write df.iloc, square brackets, and paste the players list as the first argument. For the slice, let's see which positions correspond to the age and club columns: showing the data frame and counting from zero, age is position 1 and club is position 6. But as you might remember, the stop element is excluded, so we have to write 6 + 1 = 7 so that column 6 is actually included, which gives us the slice 1:7, from age to club. Now it's ready: we run df.iloc[players, 1:7] and get the slice from age to club for the indexes of Messi and Ronaldo. Okay, now let's get a slice of the first indexes. First we translate the columns variable into positions: age is 1, dob is next to it at 2, then height at 3 and weight at 4. Now the syntax: df.iloc, square brackets, and the slice we want, which is the first 10 indexes, because we want the top 10 players. So we write 0 as the start, a colon, and 10 as the stop, since the stop is excluded and we'll get rows 0 through 9, the top 10 players. That's my slice, and then I write the columns variable as the second argument and run it, and we get this slice from 0 to 9: the top player down to the tenth, with all the data inside the columns we specified. Okay, now let's select elements with conditions, and here's a quick note: because the iloc method cannot accept a boolean Series, only a boolean list, we have to use the list function to convert the Series into a boolean list. We'll see that in this first example, where the task is to select players with height above 180 cm. First I translate the columns variable into positions, and it's the same one as before: 1, 2, 3 and 4. Then we write the condition: df, square brackets, the name of the column we want to evaluate, height_cm, compared with 180. As I mentioned, this is a boolean Series, so we wrap it in the list function to convert it into a boolean list. Now we write df.iloc, open square brackets, insert the list as the first argument and the column positions as the second, press Ctrl+Enter, and we get a data frame with only players above 180 cm. Okay, now for the final task we have to select players with height above 180 cm
who are also from Argentina, which is condition number two. Condition number one, height above 180 cm, we already have from before, so I copy and paste it. Condition number two is the nationality equal to "Argentina": I write nationality, and instead of greater than I write the equality comparison with Argentina. Now we wrap condition one in parentheses, condition two in parentheses, and write the & operator in the middle, then pass the whole thing to the list function to make it a boolean list. Now we write df.iloc as usual, open square brackets, insert the list as the first argument, and since I want all the columns I write the colon sign. Ctrl+Enter, and here we have all the players with height above 180 cm who are from Argentina, as you can see. And that's it; I hope you successfully solved these exercises. Okay, in this video we're going to learn how to set a new value for a cell in a data frame. We start by importing the pandas library, reading the CSV file, and redoing the changes from the previous videos, so I run the first three cells and we have the data frame with the seven columns we selected before and short_name as the index. Now I scroll down to the first example: how to set a value for one cell. Let's say we want to update the height of the player Lionel Messi. First I have to locate that cell, so I write df.loc, open square brackets, the name of the player, and then the name of the height column, and we press Ctrl+Enter.
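Before moving on, the iloc-with-conditions pattern from these exercises can be sketched like this, again on a made-up stand-in frame:

```python
import pandas as pd

# Stand-in data frame (assumed values); iloc rejects a boolean Series,
# so every condition must be wrapped in list() first
df = pd.DataFrame(
    {"age": [35, 37, 30, 31],
     "height_cm": [170, 187, 175, 195],
     "nationality": ["Argentina", "Portugal", "Brazil", "Argentina"]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr", "E. Martinez"],
)

# Single condition converted to a boolean list, columns by position
tall = df.iloc[list(df["height_cm"] > 180), [0, 1]]

# Multiple conditions: combine with & first, then convert to a list
cond = list((df["height_cm"] > 180) & (df["nationality"] == "Argentina"))
tall_argentina = df.iloc[cond, :]
```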
The current height is 170, and let's say we want to update it to 175, so we write equals and then 175 and press Ctrl+Enter. If we show the data frame now, we can see the new value at the index L. Messi and the height_cm column: it now says 175, which means the data was updated. Okay, now let's see how to set a value for an entire column. Say we want to update the height of all the players to 190 cm; we can do that using things we saw in the previous video. First we locate the whole column with the loc method: df.loc, open square brackets, then the colon to get all the rows, a comma, and the name of the column we want, height_cm, because we have to set the height to 190. First I'll show you what this returns on its own: every row of the height_cm column, because of the colon. Now we set it equal to 190 and run, and if I show the data frame you can see that all the data inside height_cm is 190, so we updated the whole column. Okay, now let's see how to set a value for an entire row; this is similar to setting a whole column, because we're going to use the colon sign again, so it follows the same idea. The task is to get all the columns that correspond to the player ranked last in FIFA. As I mentioned before, the player listed first in this data frame is the best player according to this FIFA game and the last one is the worst, in this case the player shown at the bottom of the data frame.
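The two updates just shown, one cell and then a whole column, can be sketched as follows (stand-in frame, assumed values):

```python
import pandas as pd

# Stand-in data frame (assumed values)
df = pd.DataFrame(
    {"age": [35, 37, 30],
     "height_cm": [170, 187, 175]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

# Set a single cell: locate it with loc, then assign
df.loc["L. Messi", "height_cm"] = 175
messi_after = df.loc["L. Messi", "height_cm"]

# Set an entire column: the colon selects every row
df.loc[:, "height_cm"] = 190
```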
We have to get this element, and we can do that with the iloc method. We write df.iloc, and we could write the position of that last index, around 18,000, but we can also use a negative index: -1 represents the last element of a list, or in this case of the index. Then we write a comma and the columns we want; since I want all the columns, I write the colon symbol. I press Ctrl+Enter, and this is all the data about that player. Now we're going to update this data with null values, and to create null values we have to use the numpy library. First I import numpy as np, and then write np.nan, where nan represents a null value; if I run just that, it only prints nan. But if I now set the row we selected with iloc equal to np.nan, we update those values with nulls. I run it, show the data frame, and scroll down to the last row: it only has null values, because with iloc we located that last row and filled it with np.nan. Okay, now let's see how to set a value for multiple cells, that is, for items matching a list of labels. Let's say we want to update the height of two players. As you might remember, we first write a list with these two players inside: square brackets, then the first element, L. Messi, whose height I'm going to update, and the second player, Cristiano Ronaldo, which I write there too.
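Here is a runnable sketch of that row update; the stand-in frame uses float columns so they can hold np.nan directly (assigning NaN into integer columns would force a dtype change):

```python
import pandas as pd
import numpy as np

# Stand-in data frame (assumed values); float columns hold np.nan cleanly
df = pd.DataFrame(
    {"age": [35.0, 37.0, 30.0],
     "height_cm": [170.0, 187.0, 175.0],
     "nationality": ["Argentina", "Portugal", "Brazil"]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

# iloc[-1] always points at the last row, however many rows there are;
# assigning np.nan fills the whole row with null values
df.iloc[-1, :] = np.nan
```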
This is my list with the two players inside, and now, to update their height, we use the loc method: we insert this list as the first argument, and then we have to indicate the height column, which I write inside square brackets, to indicate it's a list, and in quotes: height_cm. I run this and we get a little data frame with just those cells. Now I want to update the height to 175, so I set these two values equal to 175, press Ctrl+Enter, show the data frame, and we can see that the height of Messi and Ronaldo is now 175, while the rest is 190 because of the changes we made before. Finally, let's see how to set a value for rows matching a condition: here we want to change the height only for players with height above 180 cm. First we write the condition: df, then the column we want to evaluate, height_cm, greater than 180. Then, to change values based on this condition, we write df.loc with the condition as the first argument and, as the second, the names of the columns we want, which I have inside the columns variable, so I copy and paste it. I press Ctrl+Enter to see what we get: a data frame with only the players taller than 180 cm. Now we can set this data equal to zero, so everything inside that selection becomes zero. I press Ctrl+Enter and show the data frame: the data for almost all the players is now zero, because their height was above 180, but Lionel Messi and Cristiano Ronaldo are the exception, because we changed their height to 175, and 175 is less than 180, so the condition was false for them and their values weren't overwritten. And that's it: in this video we learned how to set a new value for a cell in a data frame.
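Both assignment patterns from this section, a list of labels and a boolean condition, can be sketched together (stand-in frame, assumed values):

```python
import pandas as pd

# Stand-in data frame (assumed values)
df = pd.DataFrame(
    {"age": [35.0, 37.0, 30.0],
     "height_cm": [170.0, 187.0, 195.0]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

# Update one column for a list of labels at once
df.loc[["L. Messi", "Cristiano Ronaldo"], ["height_cm"]] = 175.0

# Set values for rows matching a condition: zero out both columns
# for every player still taller than 180 cm
df.loc[df["height_cm"] > 180, ["age", "height_cm"]] = 0
```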
Okay, in this video we're going to learn how to drop rows and columns from a data frame using the drop method. First we run the three cells that transform our data frame as we did in the previous videos, then scroll down, and I'll show you how to drop rows with the drop method. The drop method has different parameters, which means there are different ways to use it, and the first way I'll show you uses the axis parameter. I start by writing df.drop, which is the syntax of the drop method, and then the index, the row we want to get rid of; in this case let's take the first one, L. Messi, so I want to delete that whole row, and I write the name of the index as the first argument. Then I write a comma, because the drop method doesn't know whether this first element is a row or a column, so we have to specify it in the second argument with the axis parameter: we write axis equal to 0, because zero represents rows and one represents columns, so axis=0 means that L. Messi is a row, an index, and we want to drop it. We press Ctrl+Enter and get the data frame back, but the row with the index L. Messi is not there anymore: we dropped it. Okay, the second option is the index parameter: we write df.drop, then index= and, inside square brackets, the name of the element, L. Messi. That tells the drop method to drop the L. Messi index. I comment the previous version out and press Ctrl+Enter, and as you can see, we get the same result
that we got before with the axis parameter, so you can choose whichever option you prefer. Now let's see the second example: dropping two or more rows and updating the data, which is what the inplace parameter is for. We write df.drop, open parentheses, and the names of the elements; in this case I want to get rid of the first and second rows, L. Messi and Cristiano Ronaldo. I open square brackets, because with two or more elements we have to pass a list, and write both names. Then we specify that these are rows, or indexes, with axis=0, and to update the data we write inplace=True. Before, we weren't really updating the data, because the drop method only creates a copy; to save the changes we make, we have to use the inplace parameter. I run this and we don't get any output, but if we now print the data frame we can see that the Messi and Ronaldo rows are gone, and the first row is Neymar Jr. Okay, now let's see how to drop columns with the drop method, which is similar to dropping rows. First with the axis parameter: we write df.drop, then parentheses, and the column I want to delete, long_name; then we tell the drop method this is a column by writing axis=1, because one represents columns. We run it and check the result.
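The three ways of dropping rows shown here can be sketched as follows (stand-in frame, assumed values):

```python
import pandas as pd

# Stand-in data frame (assumed values)
df = pd.DataFrame(
    {"age": [35, 37, 30]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

# Option 1: axis=0 tells drop that "L. Messi" is a row label
without_messi = df.drop("L. Messi", axis=0)

# Option 2: the index parameter makes that explicit
same_thing = df.drop(index=["L. Messi"])

# Drop two rows at once and update df itself with inplace=True
df.drop(["L. Messi", "Cristiano Ronaldo"], axis=0, inplace=True)
```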
The resulting data frame starts with the age column, which means the long_name column is gone. Now let's do the same thing using the columns parameter: I write df.drop, open parentheses, then columns= and the name of the column, long_name. That's all we need, because when we use the columns parameter the drop method already knows that long_name is a column. I comment the previous line out and run, and as you can see, we don't have the long_name column. Okay, now let's see how to drop a column by position, in this case the last column. The first thing to do is locate it, and we can do that with the columns attribute: I write df.columns and square brackets. Let's find which position the club column is in: counting from zero, it's one, two, three, four, five, six, so club is position six and df.columns[6] gives us the club column. That approach works, but what we can do instead is use a negative number: if we write -1, it means we want the last element, and this is much better than writing 6, because depending on how many columns the data frame has, the last position might be 5 or 7, while df.columns[-1] is always the last column. I run this and we get the same result, so it's working fine. Now, to drop this column we have to use the drop method, so we write
df.drop, open parentheses as usual, insert df.columns[-1], and add axis=1 to tell the drop method this element is a column. I press Ctrl+Enter, and the club column is gone: we deleted it. Now let's check the last example: dropping two or more columns and then updating the data. This one is simple, we already did it with rows: we write df.drop, open parentheses and then square brackets, because two or more columns means a list, and write the two columns, in this case long_name and dob, the date of birth. We add axis=1, which indicates these are columns, and then the inplace parameter, so we update the data instead of creating a copy. We press Ctrl+Enter, get no output, and if we now print df we can see the long_name and dob columns are gone. And that's it: in this video we learned how to drop rows and columns using the drop method in pandas. Okay, in this video we're going to learn how to create random samples using the sample method. First I import pandas and run the first three cells, so the data frame is ready, and now I'll show you how the sample method works. The sample method helps us extract random elements from a data frame, or from a Series, which is useful when we want to analyze a small portion of our data. So let's look at the first example.
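The column-dropping variants can be sketched the same way; the values in this stand-in frame are made up, and only the column names follow the course's dataset:

```python
import pandas as pd

# Stand-in data frame (assumed values)
df = pd.DataFrame(
    {"long_name": ["Lionel Messi", "Cristiano Ronaldo"],
     "age": [35, 37],
     "dob": ["1987-06-24", "1985-02-05"],
     "club": ["Club A", "Club B"]},
    index=["L. Messi", "Cristiano Ronaldo"],
)

no_name = df.drop("long_name", axis=1)        # axis=1 marks it as a column
same_thing = df.drop(columns=["long_name"])   # columns= needs no axis
no_last = df.drop(df.columns[-1], axis=1)     # -1 is always the last column

# Drop several columns and update df in place
df.drop(["long_name", "dob"], axis=1, inplace=True)
```

Using columns= reads a little more clearly than axis=1, but both do the same thing.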
Our first task is to extract 10 random elements from the nationality column, which holds the countries where these players were born. We write the name of the data frame, df, then square brackets with "nationality", then .sample and parentheses, and inside we specify the number of elements we want to extract from this data frame, or in this case this Series: 10. If I run it, you can see I get 10 random elements: 10 soccer players in the index and their nationalities as the values. If I run it again, we generate another 10 random elements, and again, and so on: every time we run this cell we get a different sample. Okay, now we can control the random number generation with the random_state parameter: I write random_state and then any number we want, for example 99. If I run this we get 10 new random elements, but if I keep running it, those 10 elements remain the same, no matter how many times I execute the cell. That's how random_state works; if I change the number, say to 1, that's another random state and we get another 10 random elements, so I'll leave it at 99. Now let's see the second example: extract a random 20% sample of the data frame. The original data frame has about 18,000 rows, so for 20% we could just take the number of rows divided by five
so dividing by five would give us our sample, but we can do something different: add a parameter that specifies the portion, the fraction, we want to get. We need the frac parameter, which stands for fraction: I write frac=0.2, so 20%, and run it, and here we get a 20% portion of the original data frame, about 3,600 rows, which is indeed 18,000 divided by five. And again we can control the random number generation by adding random_state, the parameter that controls the randomness of the sample method: I write random_state=12, and now if I keep running the cell we get the same 20% sample every time. Finally, let's see how to increase the sampling rate, which is known as upsampling. To increase the sampling rate we set the frac parameter greater than one; for example, frac=2 means we want 200% of the original data frame, so we want to double the number of rows. But keep this in mind: whenever frac is greater than one, we also have to set the replace parameter equal to True; that's the rule. So now I write df.sample, open parentheses, and frac=2, and since 2 is greater than 1 we have to follow the rule: the replace parameter has to be set to True.
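The sampling calls from this section can be sketched on a small 10-row stand-in frame (values are made up):

```python
import pandas as pd

# Stand-in frame with 10 rows (assumed values)
df = pd.DataFrame(
    {"nationality": ["Argentina", "Portugal", "Brazil", "France", "Spain",
                     "Italy", "Germany", "England", "Uruguay", "Chile"]}
)

# n random rows; random_state pins the randomness so reruns match
five = df["nationality"].sample(5, random_state=99)

# frac=0.2 keeps a 20% portion of the rows
fifth = df.sample(frac=0.2, random_state=12)

# Upsampling: frac > 1 requires replace=True, so rows can repeat
double = df.sample(frac=2, replace=True, random_state=99)
```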
So what we have to do is add this new parameter, replace=True, and that's it. I run it, and here I got an error because I hadn't typed sample correctly; I fix it and run again, and we get a data frame that should have double the rows of the original. Let's check it: this one has 36,000 rows, and the original had 18,000, so yes, it doubled the original number of rows. And again we can add a random_state in case you want to control this; I'll write 99 in this case, press Ctrl+Enter, and now the sample we get remains the same across runs, as you can see. And that's it: in this video we learned how to create random samples with the sample method. Okay, in this video we're going to learn how to filter a data frame using the query method. We start by running the three cells, and now I have the data frame we created in the previous video. The query method helps us filter a data frame the same way we did in previous videos, where we used something called boolean slicing, but with a different syntax, which I'll show you now. Let's solve this first task, which says: select players older than 34. With the query method we write df.query, open parentheses, then open quotes, and inside we write the condition: the name of the column, age, compared with 34, so "age > 34". As you can see, this syntax is different from the one we've been using with boolean slicing, but let's try it out: I run it, and in the data frame we got, the ages are all greater than 34; here we
have 37, 36, and so on, so we successfully filtered the data frame with the query method. Now I'll show you how we'd do it with boolean slicing, in case you don't remember how it works. First we create the condition: df, square brackets, the age column, greater than 34. Then we put that condition inside df with square brackets, so df[df["age"] > 34]. This is called boolean slicing, and it returns the same data frame as the query method: I comment the query out, run this one, and it's the same data frame we got before, just with a different syntax, so you can choose the one you find most practical or easiest to remember. Okay, in this second example we have to select players older than 34 from Italy, a kind of multiple condition, and we'll do it with the query method. I write df.query, parentheses, open the quotes, and inside write the two conditions: first age > 34, and then the second condition, nationality equal to Italy. Italy should be inside quotes, but since we already used double quotes around the whole expression, we have to use single quotes for 'Italy' to avoid a conflict between the two. So here we have condition number one and condition number two, and now we have to add an operator, in this case the and operator, because we want to satisfy both conditions. As you might remember, the and operator has been represented by the & symbol so far in this course, but with the query syntax we don't use that symbol.
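A runnable sketch of the query syntax next to its boolean-slicing equivalent; my tiny stand-in frame has no Italian player, so the multiple-condition example uses Portugal instead (values are made up):

```python
import pandas as pd

# Stand-in data frame (assumed values)
df = pd.DataFrame(
    {"age": [35, 37, 30],
     "nationality": ["Argentina", "Portugal", "Brazil"]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

# The condition goes inside a string
older = df.query("age > 34")

# The equivalent boolean slicing
older_slice = df[df["age"] > 34]

# Multiple conditions: the word "and" replaces &, and the string
# value takes single quotes inside the double-quoted expression
older_pt = df.query("age > 34 and nationality == 'Portugal'")
```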
Instead, we simply write the word and between the two conditions, so "age > 34 and nationality == 'Italy'". That's how it works with the query method. Now it's ready, so let's run it: we get a data frame where the players' ages are greater than 34 and all the players are from Italy, as you can see, so we successfully filtered with this multiple condition. Now, as an exercise, you have to write the equivalent boolean slicing of this query, like I did before; you can pause the video and then check my solution by downloading my script from the notes of this video. Okay, in this third example we have to add a not operator to the first example, so let me copy the code of the first example and paste it. So far in this course the not operator we've used is the ~ symbol, but in the query method we use a different operator: instead of that symbol, we add parentheses and write the word not in front, so "not (age > 34)", which is applied to age > 34 and gives us the players aged 34 or below. Let's check it out: I run it, and in this data frame all the players are 34 or younger, so this is how we do it with the query method. And again, your exercise is to write the equivalent boolean slicing of this query. Okay, in the next task we have to convert the height to meters and then select players with height above 1.8. We can do that with the query method by writing df.query and, inside the quotes, an operation: the height is currently in centimeters, so we have to convert it to meters.
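The not operator and the in-expression arithmetic can be sketched together (stand-in frame, assumed values):

```python
import pandas as pd

# Stand-in data frame (assumed values)
df = pd.DataFrame(
    {"age": [35, 37, 30],
     "height_cm": [170, 187, 175]},
    index=["L. Messi", "Cristiano Ronaldo", "Neymar Jr"],
)

# The word "not" replaces the ~ operator inside a query string
younger = df.query("not (age > 34)")

# Arithmetic works inside the expression: centimeters to meters
tall = df.query("height_cm / 100 > 1.8")
```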
To do the conversion, we only have to write the name of the column and then divide it by 100: the height is in centimeters, and dividing by 100 gives meters. Then we compare this with 1.8, so I write the 1.8, and now the query is ready. We run it, and as you can see, all the players in the resulting data frame have a height above 1.8. All right, now your task is to write the equivalent boolean slicing of the query we created here. Okay, our last task is to select the players born after 1990. To do that, first we have to check the data types of the columns, so I write df and use an attribute we learned in this course, dtypes, which gives us the data type of every column in the data frame. I run this, and we can see that the date of birth column, dob, has the data type "object". That isn't good, because a date of birth has a year, a month, and a day, so this column should have a datetime type. We can convert the data type of a column using a method called astype, and I'll show you how it works: first we write df, then the name of the column, "dob", then .astype, and inside the parentheses we write the type we want to convert to. Here I write "datetime64", which is the standard datetime type in pandas. Now I overwrite the dob column with the result, updating its data type. We do this because we want to extract the year from this column, and we can only do that once the column has a datetime type. And why do we want the year? Because in this task we have to select the players that were
born after 1990, and by comparing just the years we can know which players qualify. Once we've converted the data type, we have to access the year attribute of the column, and I'll show you in this cell: I write df, square brackets, "dob", and then, to get to the year, we first write the dt attribute, the datetime accessor, followed by .year, the year attribute. Before running this, I convert the column to the datetime type, and now when I run it I get only the year of each date; for example, Cristiano Ronaldo was born in 1985, and there is that year. So we extracted only the year using the year attribute. We can do the same with the query method: we write df.query, open parentheses, then quotes, and inside we write just the column name, dob, followed by the dt attribute and .year, the same as before but with the syntax the query method needs, and then we compare this with 1990. The query says the year is greater than 1990, so I run it to verify, and as you can see, all the players in the data frame were born after 1990: Neymar Jr in 1992, this one in 1991, and this one in 1997. We filtered the data frame successfully. Your task now is to write the equivalent boolean slicing of this query, and you can use the code I wrote here to help you. And that's it: in this video we learned how to filter a data frame using the query method. Next, in this video we're going to learn how the apply method works in pandas, and I've already imported pandas.
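The dtype conversion and year-based filter walked through above can be sketched like this; the birth dates below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Messi", "Ronaldo", "Neymar"],
    "dob": ["1987-06-24", "1985-02-05", "1992-02-05"],
})

# read as strings, the column's dtype is "object"; convert it to datetime
df["dob"] = df["dob"].astype("datetime64[ns]")

# the .dt accessor exposes the year, and query can use it directly
born_after_1990 = df.query("dob.dt.year > 1990")

# boolean-slicing equivalent
born_after_1990_bool = df[df["dob"].dt.year > 1990]
```

Note that newer pandas versions expect the unit in the type string, hence "datetime64[ns]"; `pd.to_datetime(df["dob"])` is another common way to do the same conversion.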
I also read the data frame that we're going to work with, so now I scroll down and I'll show you how the apply method works in pandas. The apply method helps us apply functions or operations to a column or to a data frame, and we can apply built-in functions as well as our own functions; I'm going to show you how to do both. First we're going to use a NumPy function and apply it to a series, in this case the age column, so the age series. To do that, we write df, then the name of the series, "age", then the method, .apply, and open parentheses. Inside the parentheses we have to pass a function, and in this case I'm going to use a NumPy function, so first I import numpy as np. The function I'll use in this example is the square root function, so we write np.sqrt, where sqrt stands for square root. Before applying it, let me show you what this looks like without apply: I run it, and here we have the ages of the players. Now, if we apply the function, we get the square root of each age, so I press Ctrl + Enter and we get the result: this is the square root of Messi's age, for example, and this is the square root of Ronaldo's age. And that's how you apply a function to a series. Now I'm going to show you how to create your own function and apply it to a data frame. I scroll down, and we're going to create a function that calculates the BMI of a soccer player: I write def calculate_bmi, then the parameter, which is going to be row, then a colon, press Enter, and write return followed by the operation I want to perform. To calculate the BMI, we divide the weight in kilograms by the square of the height in meters, so first we have to
convert the height into meters and then take the square. I scroll down and build the formula: the weight divided by the square of the height in meters, with the number 2 as the exponent. In the formula I write row, which represents one row of the data frame; I named the parameter row instead of df to avoid any confusion with the df data frame. After row we open square brackets to select a column, and I select the weight column, so I copy its name and paste it here. Then we bring in the other column, the height; I copy it too, and since it's in cm, we divide it by 100 to convert it to meters. To get the square, we open parentheses and write the multiplication symbol twice, which gives us the power operator, so with the number 2 it means we take the square. Then I write the division operator, and now we have the weight divided by the square of the height in meters, as you can see here. This is ready, so I delete the leftover bit and press Ctrl + Enter, and now I can apply this function using the apply method. I write the name of the data frame, df, then .apply, open parentheses, and inside I pass the name of the function, as I did with the NumPy function before; in this case it's my own function, calculate_bmi, so I write it here. Now I have to add one more parameter, axis, which tells pandas how we want to apply this function. In our function we make operations with multiple columns, column number one and column number two, so we want to operate across two or more
columns, and this is why we have to write axis=1; this way the function is applied row by row and can combine multiple columns, as we're specifying here. Now it's ready, and when I run the cell you can see that we get the BMI of each soccer player: 24.9 for Messi and 23.7 for Ronaldo. And that's it, in this video we learned how to apply functions to a series and to a data frame. Okay, in this video we're going to learn how the lambda function works and how we can use it inside the apply method. First, I imported pandas, and we have the same data frame as in the previous videos, so I scroll down and I'll explain how to use the lambda function. To give you a simple explanation, let's start by creating a regular function like the ones we learned in the Python crash course: a function that sums two values, a and b. I write def, then sum_values, the name of my function, then the parameters a and b, then a colon. I want to sum these two values, so I write a + b, assign it to a variable that I name x, and then return that x variable. This is the kind of regular function we've been using so far in this course. If I want to call it, I write the name of the function and pass in two arguments; here I want to sum 2 and 3, so I should get 5. I run it, and we get 5, because 2 + 3 is 5. That's how a regular function works, and now I'm going to create the same function using a lambda. First we write the name we'll assign it to, sum_values_lambda, so we know it's the lambda version, then an equals sign, and then the lambda keyword, so we write lambda.
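Both apply patterns from the walkthrough above, a built-in NumPy function on a series and a custom row-wise function with axis=1, can be sketched as follows; the ages, weights, and heights are invented for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [36, 38],
    "weight_kg": [72, 83],
    "height_cm": [170, 187],
})

# a built-in function applied element by element to one series
sqrt_ages = df["age"].apply(np.sqrt)

# a custom function applied row by row: each call receives one row
def calculate_bmi(row):
    return row["weight_kg"] / (row["height_cm"] / 100) ** 2

# axis=1 makes apply pass rows (not columns) to the function
bmi = df.apply(calculate_bmi, axis=1)
```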
You'll see that lambda shows up in green, which tells us it's a keyword. After it we write the input, then the output. First the input: my inputs are a and b, as before, followed by a colon, which separates the input from the output. Then the output, which is the operation we want to perform, in this case a + b, the sum of the two values. I run this, and we've created a lambda function that takes two values and returns their sum. Now I want to show you what this lambda function looks like, so I copy and paste it here, and I'll also show what the regular function looks like. For the regular function it says "function" and then its name, while for the lambda it says "function" with "lambda", so this one is a lambda function and the other is the regular function we learned. Now I'm going to call the lambda function: I write parentheses and pass in two arguments, 2 and 3, so we should get their sum, 5. I run it, and yes, we get the number 5. So we created a lambda function that is the equivalent of the regular function. Just a recap: this part is the input, the same as the parameters of the regular function; this part is the output, what we return; and the name we assign it to is the equivalent of the function name shown in blue. Now you might be wondering why we need the lambda function at all. A lambda function is useful when we want a temporary function, one we're going to use only once and then never again; that's one of the cases where we use a lambda over a regular function. We also use it when we want to simplify things: it's a one-liner.
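The def/lambda equivalence described above boils down to this:

```python
# regular function: several lines, a name, an explicit return
def sum_values(a, b):
    x = a + b
    return x

# equivalent lambda: inputs before the colon, output after it
sum_values_lambda = lambda a, b: a + b
```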
Compared to a regular function that has more lines, a lambda is much simpler. Okay, now I'm going to show you how to use the lambda function with the apply method. I scroll down, and our first task is to use a lambda function to convert the height series to meters. First we write the name of the column, so df, square brackets, and I copy and paste "height_cm". Then we write .apply, and inside the parentheses we write the lambda keyword to create the lambda function, then the input; we can use any variable name here, and I'm going to write the letter x. Then a colon, and then the output, which in this case is x divided by 100, because that's how we convert centimeters to meters. This x represents one value of the series, and the lambda function is applied to each value in the series. Let me show you how this works: I run it, and we have the height, but now in meters; before, Messi was 170 cm, and now it's 1.7 m. That's how to use the lambda function with the apply method, and note that in this case we don't need to give the function a name as we did before; when we combine the apply method with a lambda, we can leave it anonymous. Now I'll show you an alternative that achieves the same task without the apply method, and it's really simple: we only write the name of the series and divide it by 100. I comment the previous line out and test it, and as you can see we get the same result. So you might be asking yourself why we need the apply method if there's a simpler alternative; in this example the apply method does look unnecessary, but there are cases where it's really useful.
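The centimeters-to-meters task above, both the apply-with-lambda version and the simpler vectorized one, looks like this in code (sample heights made up):

```python
import pandas as pd

height_cm = pd.Series([170, 187, 175])

# apply + lambda: x is one value of the series at a time
height_m = height_cm.apply(lambda x: x / 100)

# vectorized alternative: divide the whole series at once
height_m_alt = height_cm / 100
```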
The apply method with a lambda function is really useful, and I'm going to show you that in the following example. Our second task is to use a lambda function to convert the long_name series to uppercase. Let's have a look at this series: I print the data frame, and the long_name column has the full names of the soccer players in a mix of uppercase and lowercase, and we want the full names in uppercase. We can do that with the apply method and a lambda function. We write the name of the column, long_name, which I paste here, then .apply, then parentheses, then lambda, then the variable x as my input, and the output is x with the upper method, which we learned before, so x.upper(). We can use the upper method here because this x is a string: the long_name series contains the full name of each soccer player, which is a string, and x represents one element of that series, so x is a string, and strings have the upper method. I run it, and as you can see in the result, the names are in uppercase. Okay, now I'll show you an alternative using the str attribute. I write the name of the series, but we cannot call the upper method on it right away: if we paste this, and I comment the previous line out, we get an error, because this is not a string, it's a series. When we use apply with lambda, x represents a string because it is a single element inside the series; that's one of the benefits.
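The uppercase task can be sketched with a couple of sample names; inside apply, x is a plain string, while the series itself needs the .str accessor:

```python
import pandas as pd

names = pd.Series(["Lionel Messi", "Cristiano Ronaldo"])

# inside apply, x is a single string, so string methods work directly
upper_apply = names.apply(lambda x: x.upper())

# on the series itself we go through the .str accessor first
upper_str = names.str.upper()
```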
When we use lambda and apply together, x represents one element of the series, and one element of this series is a string, because these are names. The series itself, however, is not a string, so we cannot use the upper method on it directly; the upper method only works with strings. What we have to do is get access to the string attribute, which we do by writing .str, and then we can use the upper method, because we went through the string accessor. Now if I run this, we get the full names in uppercase. Here you can see one of the benefits of the apply method used with a lambda function: with apply and lambda we didn't need to specify any attribute, while here we had to, and that means remembering all the attributes an object has, which can be complicated when you're a beginner. I remember many times I didn't remember all the attributes, did something like this, and got errors; with apply and a lambda it's easier, at least easier than remembering all those attributes. Let me show you another example: in this case we have to use a lambda function to get the year of the dob series. I show the data frame again, and the dob column, or dob series, is this one: the date of birth, and we want to extract only the year. To do that, first we have to convert the data type of this column into a date. If we check the data types here, we'll see that dob is an object, when it should be a datetime type, and we can convert it with the astype method. I write df, then the name of the column, "dob", then .astype, which allows me to convert the data type, and in this case I want "datetime64", the standard datetime in pandas. Now I'm
going to overwrite the dob column and change its data type. I run this, and if I show the types again we'll see that it changed: the date of birth column now has a datetime data type. This is ready, so I delete the check, and now that the column is a datetime type we can extract the year. What we have to do is write the name of the column, df["dob"], then .apply, then lambda, then the input, which is x, then a colon, then x.year to get the year; since each element is now a datetime, we can access its year attribute just by writing .year, and that's it. We run this, and as you can see in the result we get only the year: for Messi, the date of birth is represented only by its year, Ronaldo was born in 1985, and so on. We did this with the apply method and a lambda function, but there's an alternative using attributes: we only have to write the name of the column followed by the dt attribute, which represents the datetime, d for date and t for time, and then we can access the year. I comment the previous line out and run it, and we get only the years, the same result. Here again you can see one of the advantages of apply with a lambda function: we didn't have to use the dt attribute. But if you wanted to use the alternative, you'd need to know that the dt attribute exists, because if you don't know about it and just write .year directly, you'll get an error: this is a series, and you first have to go through dt and then .year. If you had no idea about dt, you wouldn't be able to do it this way, only with apply and a lambda function. So that's another benefit of the lambda function with the apply
method. Finally, in this task we have to apply a lambda function to a data frame in order to calculate the BMI. We did this before, calculating the body mass index with a named function, but this time we'll do it with a lambda. First we create the lambda function: the input is x, and for the output we need two columns, the weight and the height. I print the data frame to see the columns, copy the weight column name, and paste it here, so we're selecting the weight column; then I do the same with the height column. The height is in cm, so I divide it by 100 to convert it to meters, then use parentheses with the multiplication symbol written twice and the number 2 to get the square, and finally I divide the weight by that, adding a parenthesis where needed. Now the lambda function is ready, and we have to insert it inside the apply method: we write df.apply, parentheses, the lambda as the first argument, and as a second argument the axis parameter, axis=1. Now it's ready, and I run it to see the results: we get the BMI of each player, Messi has 24.9 and Ronaldo has 23.7. And that's it, in this video we learned how to use the lambda function with the apply method. All right, in this video we're going to learn how to make a copy of a data frame using the copy method. Here I have the data frame we're going to make a copy of, and I'll show you how to do it. To make a copy of this data frame, which is named df, we only have to write the name of the data frame, df, and then .copy;
now we open the parentheses. By default the deep parameter of the copy method is set to True, which means that modifications we make to the data in the original data frame are not reflected in the copy, and vice versa. Let me show you with an example: I make a copy, assign it to df_copy, and run it. Now let's update a value in the original data frame df, say the height of Lionel Messi: we write df.loc, then square brackets, then "L. Messi", then the name of the column, the height column, which I copy and paste here. Right now Messi has 170, so let's set this to 180 and see how it affects the copy and the original data frame. If we print df, the original data frame, we see that Messi's new height is 180, which makes sense because we just updated it. But if we print the copy data frame, we see that the height is still 170. This means we created an independent copy: df_copy is independent from the original df data frame, and any update we make to df is not reflected in df_copy, because the two are independent. That's usually the behavior we expect from copies, but if for some reason you want the copy and the original data frame to be dependent, you have to set the deep parameter to False, and I'll show you below. I copy the code; we're going to make a copy, but this time we modify the deep parameter and set it to False, so I write deep=False. This creates something called a shallow copy, which means that the changes we make to the original data frame will be reflected in the copy, so the two are
dependent on each other. Now I'm going to update a value, in this case the height of Cristiano Ronaldo: I write df.loc, then "Cristiano Ronaldo", then the name of the column we want to update, which is the height column, so I copy and paste it. As we can see, Ronaldo is 187, and we want to change this to, say, 200, so 2 meters. Before setting the new height, I first make the copy: I create the shallow copy, and I'm going to name the variable df_shallow_copy, then run it again, so the copy is created. Now I set Cristiano Ronaldo's height to 200, run it, and let's see how the df data frame and the df_shallow_copy data frame look. First df: in the original data frame the height of Cristiano Ronaldo is now 200, which makes sense because we just set it. Now if we print the shallow copy, we see that the height of Cristiano Ronaldo in the shallow copy is also 200. This means the changes we made in the original data frame df were also reflected in the shallow copy, so df and df_shallow_copy are dependent, and this happened because we set the deep parameter to False. Okay, now I'll show you another way to make a copy, this time with a simple assignment. It works like this: we want a copy named df_new_copy, and we set it equal to df; this is the simplest way to make a "copy", a plain assignment. We run it, and now let's update another height, in this case Neymar Jr's: we write df.loc, then the name "Neymar Jr", then the height column, which I quickly copy and paste here. Neymar Jr's height is 175, so let's update it to 190; we update the original data frame.
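The difference between a real (deep) copy and a plain assignment can be sketched as below. Note that the shallow-copy (deep=False) behavior described in the video depends on your pandas version, since copy-on-write in pandas 2.x and later changes it, so this sketch only demonstrates the two cases that behave the same everywhere; the heights are sample values:

```python
import pandas as pd

df = pd.DataFrame(
    {"height_cm": [170, 187]},
    index=["L. Messi", "Cristiano Ronaldo"],
)

deep_copy = df.copy()  # deep=True is the default: an independent copy
alias = df             # plain assignment: both names point to the same object

# update a value in the original data frame
df.loc["L. Messi", "height_cm"] = 180

# the deep copy keeps the old value; the alias sees the new one
```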
Now let's compare the copy and the original data frame. I print df, and we can see that the height of Neymar Jr is 190, which is correct. But if we print the copy, df_new_copy, we see that the copy was also updated, with the new value 190. This means this copy behaves like a shallow copy: all the changes we made in the original data frame were reflected in it, which is the equivalent of making a copy using the copy method with deep=False. So keep that in mind when you do something like this. And that's it, in this video we learned how to make a copy of a data frame using the copy method. Welcome back. In this video we're going to see different ways to make pivot tables. If you're an Excel user, you've probably made many pivot tables in the past; in pandas we can also make pivot tables, and in this case we use two different methods, the pivot method and the pivot_table method, and we're going to see the difference between the two of them. First, what is the pivot method? The pivot method reshapes data based on column values, and it doesn't support data aggregation. This means it's not the regular pivot table you'd see in Excel, because with the pivot method you can only reshape data and nothing else. To explain better what the pivot method does, I'm going to show you an example. Here we have a little data frame with six rows and four columns, and as you can see there are many duplicate values: in the foo column the value "one" is repeated at least twice, and the same goes for the value "two"; in the bar column you can see that A, B, and C are duplicated. When we have this type of data frame, we can reshape it to get a different view and make a better analysis, and in this case we can use the pivot method, as I'm going to show you right now: you only have to write the name of the data frame followed by the pivot method, and then specify three
arguments. The first one is the index: in this case I'm going to reshape this data frame choosing the foo column as the index, which means the foo values will be in the position where the numbers from 0 to 5 are right now, on the left. Next we define the columns, the new columns we'll see in the reshaped data frame; in this case I'm selecting the data inside the bar column as the new columns, which means A, B, and C will be the columns of the new data frame. And finally we choose the values we wish to show in the new data frame; in this case I'm choosing the baz column, so the values inside it will be shown in the new data frame. Now let me show you the result of this pivot method. Here it is: we have foo in the index, as I told you before, and A, B, and C, the data from the bar column, are now columns in this new data frame; also, the data from the baz column is the only data displayed in this reshaped data frame. Now let's see why the values are arranged this way, why 1 is here, 2 is here, 3 is here, and so on. Each value is defined by its index, or row, and its column: at index "one" and column A, the value is 1, and that's because if we go to the original data frame we can find the row with "one" and A, and the value that corresponds to that pair is 1. Let's pick another one, for example 5: here we have "two" and B, and if we go to the original data frame, the value that corresponds to the pair "two" and B is 5, so that's why this value is here. That's how the new data frame was reshaped. And finally we have the pivot_table method, which creates a spreadsheet-style pivot table, similar to the pivot tables you'd find in Microsoft Excel, and this one supports data aggregation.
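Both methods just described can be sketched on the small foo/bar/baz frame: pivot only reshapes, while pivot_table can aggregate the duplicated foo groups (here with a sum):

```python
import pandas as pd

df = pd.DataFrame({
    "foo": ["one", "one", "one", "two", "two", "two"],
    "bar": ["A", "B", "C", "A", "B", "C"],
    "baz": [1, 2, 3, 4, 5, 6],
})

# pivot: pure reshape, no aggregation (each foo/bar pair must be unique)
reshaped = df.pivot(index="foo", columns="bar", values="baz")

# pivot_table: aggregates the duplicated index values, here with a sum
totals = df.pivot_table(index="foo", values="baz", aggfunc="sum")
```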
To explain more about the pivot_table method, as well as the pivot method, we're going to see some examples in the next video, and this time we're going to write some code so you can understand much better what we're doing. All right, now it's time to see how the pivot method works in action in pandas. First, as usual, we import pandas as pd, so I import the library, and then we're going to use a different data set to work with the pivot method. To read it we use the pd.read_csv method, and inside the parentheses we write the name of the data set, in this case gdp.csv, which you can find in the notes of this video. This is the new data set, so let's have a look: I run this, and as you can see we have data about GDP per capita, which is in this column, and basically this is how the GDP grew over the years for each country. Let me point out the columns we're going to use for this example: first the country column, which contains the different countries; then the year column, which contains the different years; and the GDP per capita, which is in this column. Basically, what we want to do in this exercise is obtain a different view of our original data set: the data set we're reading here with pandas has this view, but we want a different view to make a better analysis. The goal of this exercise is to see the evolution of the GDP per capita over the years for each country, and we're going to put the country names in the columns, so the only data shown in our new data frame will be the GDP per capita, which is here. I'm going to show you this with code, but first let's assign a variable to this data frame: I name it df_gdp, and I show it here. Now I copy this data frame, and to use the pivot method
I paste it and write .pivot, then open parentheses. As you might remember from the previous video, we have to introduce three different arguments, and if you don't remember them, you can just press the Shift and Tab keys on your keyboard, and you'll see the three arguments I'm talking about. First we write the index argument: as I told you before, I want the year column to be the index of my new reshaped data frame, so I set it as the index and write "year" here. Next we write a comma and press Shift + Tab again to show the signature; the second argument is columns, so we write columns, then equals, and open quotes. As I told you before, I want each country listed in the country column to be an independent column: let's say column number one is the United States, column number two is China, then Australia, then Spain, and so on; each country should have its own independent column. To get that, we set the country column as the columns argument, so I write "country" here. Shift + Tab once more for the third argument, which is values: I write values=, open quotes, and the only data I want to show in my new data frame is the GDP per capita, which is here, so I copy that column name and paste it. Remember our goal: to see the evolution of the GDP per capita over the years for all the countries listed in this column. We execute this code with Ctrl + Enter, and as you can see I have the new view of this data frame, and it looks much better, more readable, because we can see the GDP evolution over the years for each country.
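The GDP reshape above can be sketched with a tiny stand-in data frame; the countries, years, and figures below are invented for illustration, not the real gdp.csv values:

```python
import pandas as pd

# small stand-in for the GDP data set used in the video
df_gdp = pd.DataFrame({
    "country": ["United States", "United States", "China", "China"],
    "year": [2000, 2001, 2000, 2001],
    "gdp_per_capita": [36330, 37134, 959, 1053],
})

# years become the index, countries become the columns
reshaped = df_gdp.pivot(index="year", columns="country", values="gdp_per_capita")
```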
So now let's verify that everything is correct. We set the index to year, and here we have the year as the index, so that's fine. The columns should be country, and we have each country in the columns, so that's correct. Next, the values are the GDP per capita, and yes, the intersection between a row and a column is the value that corresponds to the GDP per capita of that country in that year. Everything is working fine, and there you have it: this is how the pivot method works in pandas. Okay, now let's see how the pivot_table method works in pandas. In this case we're going to work with a different dataset, and to read it we're going to use the pd.read_excel method, because this time the dataset is not a CSV file but an Excel file; we use read_excel for Excel files. The name of the dataset is supermarket_sales.xlsx, and this is what we see after running this: different columns about what a specific person bought in a supermarket, the branch, the city, the gender, and other data. To make a pivot table, first we name this DataFrame; I'm going to call it df_sales, and now I show it, and here it is. The goal of this task is to see how much female and male customers spend in this supermarket, and to do that we're going to use the pivot_table method in pandas. First I copy this DataFrame and paste it here, and now we make a pivot table with an aggregate function, because, remember, the pivot_table method allows us to add an aggregate function and the pivot method doesn't support that. So we're going to use pivot_table this time and introduce some important arguments. The first one is the index: since we want to see how much male and female customers spend, the index is going to be the gender, so I write index='Gender'. That's the first necessary argument, and the second one is the aggregate function: we write aggfunc= and then the aggregation we want to perform, in this case a sum, so we write 'sum'. Now everything is ready: what we expect to get is the information about the sales in this DataFrame, but divided by gender, with a female category and a male category. Let's verify this. I run it, and as you can see we get this summary table, or pivot table, divided by gender: we can see how much female customers spent in the total column, how much male customers spent, and, in the quantity column, how many products female and male customers bought in this supermarket. One detail you might have noticed is that only the columns that contain numerical data are displayed here. For example, branch and city, which contain only text, aren't in this pivot table, because in the aggfunc argument we indicated that we want to sum, and we can't sum text, only numerical data, so only the numeric columns show up. Okay, that's our first pivot table, and we can do even more. For example, we can select just the columns we're interested in; let's say we only care about the quantity and the total columns. I copy this, and to get only those two columns I add a new argument named values. I write values= and select the quantity and total columns: I open square brackets, because I'm selecting two or more columns, and inside I write the column names, first 'Quantity' and then 'Total'. So we get the same pivot table, but only the quantity and total columns will be shown. I execute this and get an error because I didn't include a comma, so I add it, and now everything is fine: we get the same pivot table with only the quantity and total columns, and we can clearly see that female customers spend more than male customers in this supermarket. But we can get even more detail. So far we know that female customers spent around 167,000 in this supermarket, but with pivot tables we can even find out on which product lines this money is spent. Let me show you: we can break the money down by this product line column, and we only have to add one new argument to the pivot_table method. First we copy this, paste it here, and make a pivot table that shows how much male and female customers spent in each category, or rather product line. We add the columns argument: I write columns=, add the comma, and write the name of the column, 'Product line'; I scroll up, copy the column name, scroll down, and paste it, so we can see in which category the money is spent, health and beauty, sports, and so on. Before running this, since we only want to see where the money goes, not the quantity, I keep only 'Total' in values and delete the square brackets too. I run it, since it's ready, and as you can see we get how much female and male customers spend in each product line. We can quickly see, for example, that female customers spend more money on fashion accessories than male customers, which kind of makes sense, and also that in sports, female customers spend more than male customers.
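The pivot_table call described above can be sketched like this; the DataFrame is a tiny made-up stand-in for supermarket_sales.xlsx, and the column names ("Gender", "Product line", "Quantity", "Total") are assumptions based on how they're described in the video:

```python
import pandas as pd

# Small stand-in for the supermarket sales dataset.
df_sales = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Product line": ["Sports", "Sports", "Fashion", "Fashion"],
    "Quantity": [2, 1, 3, 1],
    "Total": [50.0, 20.0, 90.0, 30.0],
})

# Sum of Total per gender and product line, like the grouped pivot
# table at the end of this section.
table = df_sales.pivot_table(index="Gender", columns="Product line",
                             values="Total", aggfunc="sum")
print(table)
```

Dropping the columns= argument gives the simpler by-gender table, and passing values=["Quantity", "Total"] keeps both numeric columns.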
So we could easily see all of that by using the pivot_table method in pandas, which is similar to the pivot tables you'll find in Excel. And that's it, that's how you make a pivot table in pandas. All right, before showing you how to make visualizations with pandas, first we have to check the dataset and make a pivot table, so we can easily make the plots with pandas later. First we have to import pandas to read this CSV file; I have import pandas as pd here, so we just run this code. Now let's read this new dataset. As you might remember, to read a CSV file we use the read_csv method, so we write pd.read_csv and then the name of the CSV file, which in this case is population_total.csv; I pressed Tab to autocomplete the name. Now I assign this to a new variable, which is going to be df_population_raw, as in raw data, and we have a first look at this dataset: I paste this, run these two cells, and we get this DataFrame. As you can see, we have the population of many countries throughout the years; for example, we have China here, the United States, and India. I used the name raw because this dataset was extracted using some web scraping techniques and then wasn't modified, so now we have to make some changes to reshape this DataFrame and make it easy to build visualizations with pandas later. What we have to do is make a pivot table to reshape this DataFrame, and that's what we're going to do below. We're going to use the pivot method because, as you might remember, pivot returns a reshaped DataFrame organized by given index and column values, but it's a pivot without aggregation, and that's what we want: we only want to reshape this DataFrame. We start by dropping null values. We do that by writing the name of the DataFrame (I'll just copy the name and paste it here) and then using the dropna method, so I write .dropna() and run it. As you can see we get the result, but it's a copy of the DataFrame; if we want to save the changes we make to the DataFrame, we have two options. The first option is to use the inplace argument: if we write inplace=True and run it, all the changes we make to the DataFrame are saved. The second option is to overwrite the content of the DataFrame: we write df_population_raw = the same DataFrame with .dropna(), so we're overwriting the content of this DataFrame. I'm going to choose the first option just to reduce some code, so I write inplace=True and run it, and this DataFrame shouldn't have any null values now.
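Both ways of persisting the dropna result can be sketched as follows, with a hypothetical raw population DataFrame standing in for the scraped population_total.csv:

```python
import pandas as pd
import numpy as np

# Hypothetical raw population data with one missing value.
df_population_raw = pd.DataFrame({
    "country": ["China", "United States", "India"],
    "year": [2020, 2020, 2020],
    "population": [1.44e9, np.nan, 1.38e9],
})

# Option 1: drop rows with null values in place.
df_population_raw.dropna(inplace=True)

# Option 2 (equivalent): reassign the result instead of using inplace.
# df_population_raw = df_population_raw.dropna()

print(df_population_raw)
```

Either option works; plain dropna() without reassignment only returns a copy and leaves the original untouched.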
Okay, now it's time to make this pivot table. First, let me show you what we're going to do, so we have a better idea before writing the code. Here we have the original DataFrame, and we're going to reshape it: I want the year column to be in the index, instead of 0, 1, and so on; I want the values inside the country column to become the columns, so, for example, China in one column, the United States in another, and India in another; and I want the population data to be the only data inside the table. To do that we use the pivot method, so let's do it below: we write the name of the DataFrame, then .pivot, open the parentheses, and check the arguments this method accepts. I press Shift + Tab to get this helpful, let's call it cheat sheet, and we can see the arguments: first the index, then the columns, and then the values. As I told you, I want the index to be the year column, so index='year', then a comma. The next argument is columns, and I want the columns to be the countries, that is, the data inside the country column, so columns='country'. The last one, I think, is values, and yes, it's values: I want the values to be the population data, so I write values='population', and I press Enter here so it reads better. Now I have the three arguments, the index, the columns, and the values, and I reshape my original DataFrame by pressing Ctrl + Enter. As you can see, we now have the countries in the columns, many countries, from Afghanistan and Andorra through Argentina to Uruguay and beyond, and the years in the index, from 1955 to 2020, so we can see the evolution of the population over the years for every country in this dataset. But since there are so many countries, what we can do is select just a few, to simplify our visualizations in pandas later. I'm going to select some columns, but first I name this new DataFrame: I'll call it df_pivot. I rearrange this so it looks better, run it, and now let's select some countries. I copy the pivot DataFrame and open double square brackets, to select two or more columns, and write some countries: first the United States, then, let's say, India, then China, and two more, Indonesia and, last but not least, Brazil. So we have these five countries; I run this and we get the five countries with their population from 1955 to 2020. Great, we've simplified the DataFrame, and now I overwrite the content of df_pivot: I write df_pivot = df_pivot with this selection, press Ctrl + Enter, and our new df_pivot is here. I show it to you, and this is our new df_pivot DataFrame. That's it, our data is ready, so we can use it to make great visualizations with pandas, and that's what we're going to do in the next video. Okay, now it's time to make some visualizations with pandas. Here I have the DataFrame we created, the pivot table from the previous video, and as you can see we have five countries in the columns and the years in the index, from 1955 to 2020. What we're going to do now is make our first visualization, so I scroll down, and the first one is going to be line plots. To make this visualization I copy the name of the DataFrame and paste it here. To make plots with pandas we use the plot method: we write .plot, open the parentheses, and one necessary argument we have to introduce is the kind argument. I write kind= and then the kind of plot we want to make, in this case a line plot, so we write 'line', and this is really the one mandatory argument here. Now we can run this code, so I press Ctrl + Enter, and as you can see we have the line plot. In it we can quickly see the evolution of the population: for example, China and India, the green and orange lines, had fast-growing populations, while the United States, Indonesia, and Brazil have lower populations that didn't change much in the past 50 years. We can add more arguments to the plot method to customize this line plot. We can introduce the xlabel argument, which controls the label you see here: when we created this line plot, the 'year' label was assigned by default, but we can change it, so let's write Year, now with a capital letter. Let's also add a label on the y-axis: we write ylabel=, open quotes, and write the name we want, in this case Population. We can also add a title, any title we want: I write the argument name, title=, and then the title, say, Population from 1955 to 2020. Let's run this, and as you can see we get the title and the modified x and y labels. Finally, we can add one more argument, the size of the figure. To change it we add the figsize argument, which is a tuple, so we open parentheses and pass two values: the first is the size along the x-axis and the second the size along the y-axis. In this case I set it to 8 and 4, which means the x-axis is going to be long while the y-axis is going to be short. I run this code, and the figure now has a different size. And that's how you can customize a line plot.
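The customized line plot can be sketched like this; the DataFrame is a tiny made-up stand-in for df_pivot, and figsize=(8, 4) is my reading of the numbers mentioned in the video:

```python
import pandas as pd

# Small stand-in for df_pivot: years in the index, one column per country.
df_pivot = pd.DataFrame(
    {"China": [6.1e8, 1.4e9], "Brazil": [6.2e7, 2.1e8]},
    index=[1955, 2020],
)
df_pivot.index.name = "year"

# Line plot with the customizations from the video.
ax = df_pivot.plot(kind="line", xlabel="Year", ylabel="Population",
                   title="Population from 1955 to 2020", figsize=(8, 4))
```

The xlabel and ylabel arguments require pandas 1.1 or newer; on older versions you'd set them on the returned Axes instead.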
Okay, now let's make a bar plot with pandas. The first thing we have to do is select only one year, because this bar plot will show a single year, with the population of the different countries. So let's select one year from the DataFrame we had before; I copy the name of the DataFrame so you can check it out again. This is the DataFrame, and to select one year we use the index attribute and then the isin method. First, the index attribute, in case you don't remember: it lets us see all the index values of this DataFrame, so we have the years from 1955 to 2020. Then, with the isin method, we can filter on the index; let's say we want only 2020, so I copy 2020, make the selection, and show you the result. I press Ctrl + Enter, and the result is this little DataFrame that only contains the population in the year 2020. This matters because the bar plot is supposed to show only the population of that year. Now we name this DataFrame: I write equal to and call it df_pivot_2020, press Ctrl + Enter, and show the new DataFrame again here. One little detail I have to tell you: when we make bar plots, the text data should be in the index, so the country names should be in the index. To do that we use the transpose operation, which switches rows and columns. We can easily do that by writing the name of the DataFrame followed by .T, and if we run this code we see that the year 2020 is now in the columns, not the index anymore, and the country names are in the index. This is the format we need before making the bar plot. Now I overwrite the content of this DataFrame: I write df_pivot_2020 = the same DataFrame with .T, run it, and it's time to make the bar plot. I copy the name of the DataFrame and use the plot method again: .plot, open parentheses, and the first argument is kind, so I open quotes and write 'bar'. Now it's ready and we can run it, and as you can see we get a basic bar plot with some default values, like the name of the x label and the default blue color. We can customize this bar plot a bit more: for example, I want a different color, so I add the color argument and set it to, let's say, 'orange'. We can also change the x and y labels; actually, I can copy them from before to save some time, so I paste the xlabel and ylabel here. Finally, I can also add the title, which I copy and paste as well, but in this case the title is a bit different, because it's not from 1955 to 2020 but only 2020, so I keep just 2020. Now let's run this and see the results: we have a title, the x and y labels, and the bar plot in orange. That's how you customize the bar plot. All right, so far so good; now let's go one step further by making grouped bar plots, where we select a group of years. I copy the code we used before to select only the year 2020, but this time I'm going to select several years instead of one. Let me show you (I'll display the pivot table again so you can follow easily): instead of choosing only 2020, I delete it and write, let's say, 1980, 1990, then 2000, 2010, and finally 2020. So we have a group of years, selected using the index attribute and the isin method, and I give it a different name; since it's a sample, I'm going to call it df_pivot_sample.
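The single-year bar plot from above can be sketched like this, again with a small made-up stand-in for df_pivot; the label and title strings are the ones used in the video, while "Country" for the x label is my own choice:

```python
import pandas as pd

# Stand-in for df_pivot: years in the index, countries in the columns.
df_pivot = pd.DataFrame(
    {"China": [6.1e8, 1.4e9], "Brazil": [6.2e7, 2.1e8]},
    index=[1955, 2020],
)

# Keep only the 2020 row, then transpose so countries end up in the index.
df_pivot_2020 = df_pivot[df_pivot.index.isin([2020])]
df_pivot_2020 = df_pivot_2020.T

ax = df_pivot_2020.plot(kind="bar", color="orange",
                        xlabel="Country", ylabel="Population",
                        title="Population in 2020")
```

Passing a list of years to isin, e.g. isin([1980, 1990, 2000, 2010, 2020]), gives the sample used later for the grouped bar plot.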
First I show you this selection so you can see what it looks like: we now have five countries and five years. Then I assign this to my DataFrame, df_pivot_sample, run it, and we have this new DataFrame, so it's time to make the grouped bar plot. We write the name of the DataFrame, then the plot method, .plot, and add the first argument, kind='bar'. We run this, and as you can see we have the bar plots grouped by year: here's 1980, then 1990, and so on. You can also add the same arguments as before; for example, I can add the x and y labels, which I'll do quickly. I run it, and as you can see we've modified the x and y labels. And that's it, that's how you make bar plots with pandas. Okay, in this video we're going to learn one of the most common charts we can make in pandas, and actually in any other visualization tool: pie charts. Before we make this pie chart, let's take a look at the DataFrame we're going to use. To make a pie chart we're going to use the same DataFrame we used for the bar plot, because it follows the same logic, so I copy the DataFrame we created for the bar plot, df_pivot_2020, the one we built using the index attribute and the isin method, and show it here so you can remember what's inside. As you can see, we have the column 2020, and the countries are in the index, so everything is fine; that's the format we need for making the pie chart. But there is one little thing we have to modify, and that's the column name, because 2020 is a number, an integer I think, and it's not good practice to have numbers as column names. What we have to do is make it a string, and for that we use the rename method. We write .rename, open parentheses, and use the columns argument: columns=, then curly braces, and inside we map the column we want to change, the integer 2020, to a string, so we open quotes and write '2020'. They look the same, but the green one is an integer and the red one is a string. To save these changes I write inplace=True and run it. Now we can make the plot: I write the name of the DataFrame and use the plot method, .plot, with the first argument kind='pie'. I run this, and here I forgot to include the y argument, so I add it: the y argument points to the data, and if I show you the DataFrame again, the data is in the '2020' column, so in the y argument you write the column that has the data, y='2020'. Now I run this, and we finally have our pie chart. If you want, you can even add another argument, like the title; for example, here I can say that this is the population in 2020, but in percentages, so I write that, and now we have this title. And that's how you make a pie chart in pandas.
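The rename-then-plot sequence for the pie chart can be sketched like this, with a small hypothetical stand-in for df_pivot_2020; the title wording is an approximation of the one in the video:

```python
import pandas as pd

# Stand-in for df_pivot_2020 after the transpose: countries in the index,
# a single population column named with the integer 2020.
df_pivot_2020 = pd.DataFrame({2020: [1.4e9, 2.1e8]},
                             index=["China", "Brazil"])

# Rename the integer column to a string, then plot the pie chart.
df_pivot_2020.rename(columns={2020: "2020"}, inplace=True)
ax = df_pivot_2020.plot(kind="pie", y="2020",
                        title="Population in 2020 (in %)")
```

The y argument is required for a DataFrame pie chart; it names the column whose values become the slices.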
Okay, in this video we're going to learn how to make box plots with pandas. A box plot is a graph that shows us how the data is spread out through five important numbers: the minimum value of the data, the first quartile, also known as Q1, the median, the third quartile, also known as Q3, and finally the maximum value. Now let's make this plot in pandas, using the DataFrame we created before, named df_pivot; I have it here and I run it so you can remember what it looks like. Let's start with a single box plot and then make multiple box plots. Since we only want a single box plot, I pick one country: I write df_pivot, open square brackets, quotes, and write the country name, in this case United States. To draw the box plot we only need the plot method: we write .plot, open parentheses, and, as usual, the kind argument, kind='box', and that's all. I run this, and as you can see I have the box plot. We can add more arguments to customize it, for example the color argument, which I'll set to green, and a ylabel, which is going to be Population. I run this, and now we have the label on the y-axis and the box plot in green. As I mentioned, the box plot contains five important values: this one here is the minimum value, approximately 1.7; the one I'm showing now is Q1, the first quartile, which should be around 2.2; this one in the middle is the median, around 2.7; this one is the third quartile, Q3; and this one is the maximum value of the data. All right, now let's make multiple box plots. I write df_pivot, and this time I don't select a specific country, because we want one box plot per country: I just write df_pivot, then .plot, parentheses, and the first argument, kind='box'. I run this, and I get five box plots, each representing one country. If we want, we can again add the color argument set to green and the ylabel argument set to Population; as you can see, the box plots are in green and we have the population label. And that's it, in this video we learned how to make box plots with pandas. All right, in this video we're going to make a histogram with pandas, but first let's see what a histogram is: a histogram is a graph that organizes a group of data points into ranges; in the graph those ranges are represented by vertical bars, and they show the frequency distribution of the data. Now let's make this histogram, using the df_pivot DataFrame. Here it is, and we can choose any country to make its histogram; I'll choose Indonesia, so I open square brackets and write the name. Then I write .plot to make the histogram, add the kind argument set to 'hist', and run it, and I get the histogram for Indonesia. But that's not all: we can draw multiple histograms in one plot. I add square brackets and one more country, United States, and run it; I typed it incorrectly, so I fix it and run again, and we get histograms for Indonesia and the United States. And that's how you make a histogram using pandas. In this video I'm going to show you how to make a scatter plot in pandas. Scatter plots are used to plot data points on a horizontal and a vertical axis, in an attempt to show how much one variable is affected by another. In this example we will plot year versus population, and we'll see how much the population is affected by the year. To do that I'm going to use the DataFrame we had at the beginning, the one we got after reading the CSV file, with country, year, and population in separate columns. This format is going to help us make scatter plots easily, because each row will represent a dot in the scatter plot.
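Before moving on, the box plot and histogram calls can be sketched like this; the numbers are made up and just stand in for the df_pivot columns:

```python
import pandas as pd

# Stand-in for df_pivot: a few population values per country.
df_pivot = pd.DataFrame({
    "United States": [1.7, 2.2, 2.7, 3.0, 3.3],
    "Indonesia": [0.8, 1.2, 1.8, 2.3, 2.7],
})

# One box plot per column, styled as in the video.
ax_box = df_pivot.plot(kind="box", color="green", ylabel="Population")

# Two overlaid histograms in a single plot.
ax_hist = df_pivot[["Indonesia", "United States"]].plot(kind="hist")
```

Selecting a single column (a Series) before calling .plot(kind="box") or .plot(kind="hist") gives the single-country versions instead.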
So here I paste the value of that DataFrame and make some little changes. The first thing I do is select some countries. I select the country column, writing country inside square brackets, and then use the isin method: isin, parentheses, and inside square brackets I choose some countries: the United States, then India, China, Indonesia, and finally Brazil. I have these five countries, and I assign this to a new DataFrame that I'll name df_sample. I run this, and here I forgot to make the selection, so I select from the DataFrame: I open square brackets around the condition on df_population_raw, and now we've created the df_sample DataFrame. Let's see how it looks: I print it, and we have the same DataFrame but with only five countries. We reduced the number of countries because we don't want the scatter plot to be overpopulated; as I told you, each row represents one dot in the scatter plot, and with too many dots we wouldn't be able to see them clearly, but five countries is fine. Now that we have this DataFrame, I plot it with .plot and the kind parameter, kind='scatter'. The two other parameters we have to add are x and y, which name the values that go on the x-axis and the y-axis: on the x-axis I want the year, so x='year', and on the y-axis the population, so y='population'. And that's it, now we can draw the scatter plot. I run this, and we get the scatter plot, where each dot represents a row of the df_sample DataFrame we have here. We can see that the dots follow a pattern, which isn't ideal,
because scatter plots are much better when the dots don't follow a pattern; but I'm making this scatter plot just so you know how to make one with the plot method. Now, we can control the size of the dots: I add the s parameter and set it to, let's say, 80, to make the dots bigger, and now they are. We can also control the color, so I write color and set it to green, and let's see the result: all the dots are green. This isn't so good, though, because it's hard to recognize which category each dot belongs to. For example, we don't know whether this dot belongs to India, the United States, or Brazil; we only know that the year is between 1980 and 1990, but not the exact year, and that the population is between 1 and 1.2 billion, but not exactly how much it is for this dot. I could show you how to set a different color for each dot based on its category, but that would involve importing matplotlib, writing a for loop, and more lines of code. Fortunately, there is an easier approach that lets us color the dots by category and also see the values behind each dot, and that's making interactive visualizations, which I'll show you in the next video. All right, so far we've made a pivot table and many plots using pandas, and in this video we're going to learn how to export the pivot table and the plots we made. Let's start by exporting the plots. To do that, first we have to import matplotlib: we write import matplotlib.pyplot as plt, where plt stands for matplotlib.pyplot. We run this, and now we can use plt to save the plot: we write plt.savefig, open parentheses, and write the name of the file we want to export. Here I'm going to write my_test.png, where .png is the extension and my_test is the file name. Before exporting, I'm going to show you something: you've probably noticed that when we make plots with pandas we get these words that say AxesSubplot and so on. We can get rid of them with the show method: we write plt.show() with parentheses, and if we run this we export the figure and the words disappear. Let's try: I run this, and as you can see those words are gone and we exported the figure to a PNG file, which should now be located in the same folder as this Jupyter Notebook file. Okay, I'm going to open that file, but first let's export the pivot table. I copy df_pivot, paste it here, and use the to_excel method: we write .to_excel, open parentheses, and write the name of the file to export the pivot table to. I'll name it pivot_table.xlsx, where .xlsx is the Excel extension and pivot_table is the file name. I run this, and the pivot table should be exported. All right, now I open the Excel file and the PNG file we created: here we have the plot we exported and the pivot table. As you can see, the plot looks exactly the same as the one we created with pandas, and the pivot table is the same too; I open it in Google Sheets and it looks identical. And that's it, in this video you learned how to export DataFrames as well as plots. Okay, in the previous videos we learned how to make visualizations such as line plots, bar plots, pie charts, box plots, and more with pandas, but all of them were static visualizations. This is good, but sometimes you want to interact with your visualizations, for example to see the data behind a line plot or the values behind a bar plot.
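The two export steps can be sketched together like this; the file names follow the ones used in the video, and the DataFrame is a small stand-in for df_pivot:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small stand-in for df_pivot.
df_pivot = pd.DataFrame({"China": [6.1e8, 1.4e9]}, index=[1955, 2020])

# Draw a plot, save it to a PNG next to the notebook, and suppress the
# "AxesSubplot" text output.
df_pivot.plot(kind="line")
plt.savefig("my_test.png")
plt.show()

# Export the DataFrame itself to an Excel file; this needs the openpyxl
# package, which may not be installed in every environment.
try:
    df_pivot.to_excel("pivot_table.xlsx")
except ImportError:
    pass
```

Note that plt.savefig must run before plt.show(), since show can clear the current figure.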
In the previous videos we learned how to make visualizations such as line plots, bar plots, pie charts, box plots, and more with pandas, but all of them were static. This is fine, but sometimes you want to interact with your visualizations, to know the data behind a line plot, for example, or the values behind a bar plot. This is not possible with the traditional visualizations we have in pandas, but there is a library named Plotly that helps us make interactive visualizations in Python. Fortunately, this library works well with pandas, and in this video I'm going to show you how to make interactive visualizations with pandas using Plotly under the hood. To easily create interactive visualizations we need to install cufflinks, a library that connects pandas and Plotly so we can create visualizations directly from pandas. In the past you had to learn workarounds to make them work together, but now it's simpler, and you don't even need to learn Plotly's own syntax: you can use the same syntax we've learned so far for pandas visualizations, with some small changes. First we have to install the Plotly library, so we write !pip install plotly. Remember that you need the exclamation mark to run commands from a Jupyter notebook; this is a command, and we're running it to install Plotly. I press Ctrl+Enter to run it and wait a couple of seconds until Plotly is installed. Okay, I got the message that Plotly was successfully installed, and now I'm going to install another library, cufflinks, which, as I told you before, connects pandas and Plotly. I write the exclamation mark, then pip install cufflinks. I run this, wait a couple of seconds, and it was successfully installed, so now we can start making interactive visualizations with pandas. I'm going to delete this cell, and let's import the libraries we need. First we import pandas, so import pandas as pd; then we import cufflinks, so import cufflinks as cf; then we have to import display and HTML from
IPython.display, which will help us display the interactive visualizations in the Jupyter notebook, so I write from IPython.display import display, HTML. Finally, I'm going to edit cufflinks' default configuration by writing cf.set_config_file and changing some parameters inside the parentheses: first I set sharing to 'public', then theme to 'ggplot', and finally offline to True. By the way, I'm using the ggplot theme here, but you can choose any theme you want; you only have to run cf.get_themes() with parentheses to see all the themes. I got an error here because I hadn't imported cufflinks as cf yet, so I import it now, which might take a few seconds since it's the first time. Now that the libraries are imported, I run this and we get all the themes. Here is ggplot, the theme I'm going to use for this video, but you can choose any theme you want; just set it, test it out, and see if you like it. Now, before we make interactive visualizations, I'm going to show you the syntax, which is similar to the static visualizations but with some small changes. Let's say we have a data frame named df. To make an interactive visualization, we call the iplot method: df.iplot() with parentheses. This looks similar to the plots we made before for static visualizations, but the method is iplot rather than plot; the i stands for interactive. Inside, you add parameters such as the kind parameter, which works like it does for static visualizations. Overall, iplot behaves much like the plot method, with some changes in the parameters and in the way it behaves, so let's review each visualization and see how the iplot method works.
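The setup described above can be sketched as follows; it assumes plotly and cufflinks have been installed with pip, and the import guard is my addition so the sketch degrades gracefully where they are missing:

```python
# Setup sketch for pandas + Plotly via cufflinks (pip install plotly cufflinks first).
# Guarded so the snippet still runs where cufflinks/plotly are not installed.
try:
    import cufflinks as cf
    from IPython.display import display, HTML

    # sharing='public', theme='ggplot', offline=True as in the video;
    # offline=True renders plots locally instead of on Plotly's servers
    cf.set_config_file(sharing="public", theme="ggplot", offline=True)
    print(cf.get_themes())  # list the available themes, e.g. 'ggplot'
except ImportError as exc:
    print("cufflinks/plotly not available:", exc)
```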
To make these visualizations, we're going to use the same dataset we used for the static visualizations, and I have the code here so I don't have to write it again. This is the name of the dataset, the population_total CSV file, and here we have the same data frame. I run it and show you the data frame so you remember how it looks: we have country, year, and population, and we're going to make visualizations again, but this time interactive ones, so we can see the data behind the plots. Now I'm going to copy and paste all the code we used in the previous videos to make the pivot table, so we can focus only on making interactive visualizations. Here is the pasted code with all the operations we did to get this df_pivot data frame: we read the CSV file, drop some null values, make a pivot table with the pivot method, and select some countries. I run it and show df_pivot, and here is the data frame that will help us easily make the interactive visualizations. Before we start, I'm going to delete these two cells, and here I'm going to start with the line plot, but this time it's going to be an interactive line plot, so we'll be able to see the data behind the lines. Let's start by writing the name of the data frame, df_pivot, then the name of the method, .iplot, then kind set to the type of plot we want to make, in this case 'line'. That's everything we need for this basic line plot, so let's have a look. I run it, and we get a plot that is a bit big, but you can see it clearly, and the cool thing about this interactive visualization is that if we hover over the plot, we can see the data year by year.
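The pivot-table prep described above can be sketched on a tiny made-up stand-in for the population data (the real course uses the population_total CSV file):

```python
import pandas as pd

# Made-up miniature of the population data: one row per country per year
df = pd.DataFrame({
    "country": ["China", "China", "India", "India"],
    "year": [2019, 2020, 2019, 2020],
    "population": [1.40e9, 1.41e9, 1.36e9, 1.38e9],
})

df = df.dropna()  # drop null values, as in the course

# pivot: years become the index, countries become the columns,
# population fills the cells -- the shape the (i)plot methods expect
df_pivot = df.pivot(index="year", columns="country", values="population")
print(df_pivot)
```

On this shape, `df_pivot.plot()` draws one static line per country, and `df_pivot.iplot(kind='line')` would draw the interactive version.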
For example, on this blue line you can see the population of China in 1970 much more clearly, and here we see the population of India in 1975, so we can see how it grew over the years, and here the population in 2019. This is cool because you not only see the plot but also the data behind it, so you get more detail about this line plot. Now let's say we want to add some labels, the x label and the y label. We can do this in interactive visualizations too by adding parameters to the iplot method. First the x label: in the static line plot we made previously, the parameter was called xlabel, but in the interactive visualization it's called xTitle, so keep that in mind. I write xTitle and name it 'Years', and the other one is going to be population, so yTitle='Population'. Now I'm going to add a title; here the parameter is simply named title, and it's going to be named 'Population' plus the years, so 1955 to 2020. I run this, and now we see the title, the years on the x-axis, and the population on the y-axis, and that's how you make an interactive line plot with pandas and Plotly. Before we move on to the next visualization, I want to show you this toolbar that comes with Plotly, which has different options. For example, we can download the plot as a PNG with the camera icon, or we can zoom: I click here and make a selection, and you see that we get more detail on those years for these two countries. Let's make another selection and zoom in again. To zoom out, I can use the zoom out button, so I zoom out and we get the line plot a bit smaller, and here again I zoom out and we can see the whole line plot, but now we have more years, so here we have
1940. Let's zoom in now, and you can see that it fits much better, though now it's too close, but you can control this with the zoom in and zoom out options. You can also autoscale with this option here, so you get the line plot as it was before. Then we have this reset axes button, which I believe resets the axis ranges and tick positions. Let's check another one: the pan option lets us move around the plot so we can see other parts of it, as you can see here. Finally, we have this 'Produced with Plotly' link; if you click on it, it takes you to the Plotly website, as we can see here. Those are all the options in this Plotly toolbar, and now let's make our next plot, a bar plot. As you might remember, we have to make some changes to the original data frame for this bar plot, and to avoid repetition I'm going to copy and paste those lines of code so we can focus only on making the interactive bar plot. Now we have the code that builds the df_pivot_2020 data frame we used for the bar plot, and I run it: the first line of code selects only the year 2020, and the second one transposes the data frame, which means we switch the rows and the columns. I print the data frame so you can see it, and with this data we're going to make the bar plot, in this case an interactive one. First we write the iplot method, and inside we write the kind parameter, equal to 'bar'. Then I'm going to give it a color, so I set color to 'lightgreen'. Then I change the names of the labels, the x and y labels, so I write xTitle, and then also the yTitle; remember that in static visualizations with pandas this is called x
label, but in interactive visualizations it's called a title, so here I write yTitle, and in this case it's going to be the population. With this, I run the code and we see the visualization we got. This is the bar plot, but now if we hover over one bar we get its value. For example, in the traditional static visualization we can only say that the population of India in 2020 is approximately 1.4 billion; we don't know exactly how much it is. But if we hover over the India bar in this interactive visualization, we get that the population is 1.38 billion, and we can see the other countries too: for example, the United States with 331 million, China with 1.43 billion, and so on. In this case it doesn't make much sense to zoom in or out, so now let's move on to the next visualization, bar plots grouped by two or more variables. Here I've pasted code that selects some years, between 1980 and 2020, and we're going to use these years to make multiple bar plots. I run it and show you df_pivot_sample: now we have multiple years, and this is going to help us make multiple bar plots. Now I write the iplot method, with the first parameter, kind, set to 'bar', and let's see how it looks. Here we have multiple bar plots, grouped by five variables, and we can see in detail the population of each country in the year we want. Let's pick this one: India in 2020, at 1.38 billion. Let's pick one more, this little one: the United States in 1990, with a population of 252 million. So far we've made interactive bar plots and interactive line plots, and you could argue that interactive visualizations aren't that useful for bar plots, because there isn't much extra detail, I mean you only get one value, but maybe in line
plots it's more useful because you get more values behind each line, and the next visualization we're going to see is even more relevant when it's interactive. The next visualization is the pie chart, and here I have a first line of code that renames the column 2020, so I run it. Now let's make the interactive pie chart, and then I'll explain why I think this pie chart is more relevant when it's interactive than when it's static. First I copy this data frame, and then I use the iplot method again. I add the kind parameter and set it to 'pie', then I set the values parameter to the 2020 column, so I write '2020', and now the pie chart is ready. I run it and get an error, because I forgot to add the labels, so I add the labels parameter: labels equal to, and to see what it should be, I show the df_pivot_2020 data frame. The labels come from the country column, because those are the names of the elements in the pie chart, so I write 'country'. This is also our index, and I forgot to mention that we have to turn this country index into a regular column. Unlike the plot method, where we could make the pie chart with country as the index, in the iplot method we need country to be a column, and to do that we use the reset_index method. I write reset_index() with parentheses, and now you can see that country is no longer an index but a column. To save the changes, I overwrite the data frame with this result, and then I run the iplot method to make the pie chart. I run it, and now we've verified that the labels parameter only accepts columns, not indexes, so keep that in mind.
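The reset_index step described above looks like this on a made-up miniature of df_pivot_2020, where country starts out as the index:

```python
import pandas as pd

# Made-up miniature of df_pivot_2020: country is the index, 2020 is a column
df_pivot_2020 = pd.DataFrame(
    {"2020": [1.43e9, 1.38e9, 3.31e8]},
    index=pd.Index(["China", "India", "United States"], name="country"),
)

# iplot's labels= parameter needs a real column, so promote the index
# to a column; the old index is replaced by a default 0..n-1 range
df_pivot_2020 = df_pivot_2020.reset_index()
print(df_pivot_2020.columns.tolist())  # ['country', '2020']
```

After this, `labels='country'` refers to an actual column, which is what iplot requires.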
This is a difference between the iplot method and the plot method. Now we have the pie chart, and we can see in detail the percentage each element takes: for example, India takes 37.9% and the US takes 9.1%. This is helpful because we can see the exact percentages, which wasn't shown with the plot method; there we could add a parameter that shows the percentage on each slice, but that would require more lines of code, while in this interactive visualization we got it with the same amount of code. To make this even better, we can play with the visualization and hide some elements. For example, if I hide India, we only see the United States, China, Indonesia, and Brazil, and the percentages are recalculated so that they still sum to 100%, because we're no longer considering India as an element. Let's show India again, and now let's hide China, and India as well, and now we see in detail the proportions between these three countries without the countries with the largest populations. You can play however you want with this interactive pie chart, and since this is something you cannot do with a static pie chart, it's one of the advantages the interactive pie chart has over the static one. If you like this interactive pie chart, I think you're going to like the next visualization even more. The next interactive visualization is the box plot, and the interactive box plot is way better than the static one, because we can see in detail the statistics the box plot encodes, for example Q1, Q3, the minimum value, the maximum value, and the median. We can't read those off the static visualization, but on an interactive box plot we can easily see them by just pointing at those values. Now let me show you how to make a box plot. First we write the name of the data frame, which is df_pivot, let me check it here so I don't make a mistake, yes, it's df_pivot, but unlike the previous plots we're going to select only one country so we
make only a single box plot. I write df_pivot, choose the United States column, then add a dot and use the iplot method, with the kind parameter equal to 'box'. I plot it, and let's see the result: here we have the box plot, and if we point at it, we automatically get all of its values. We can see that Q1 is 219 million and Q3 is 323 million, and we also see the maximum and minimum values. This box plot has no outliers, so we don't see any, but if we had outliers we would be able to see their values too. Now, before moving on to multiple box plots, I'm going to customize this one a bit more. Here I write color and change it to 'green', and I'll also add one more parameter, a label for the y-axis, so yTitle equal to 'Population'. We run it, and now we get the box plot in green with Population as the y label. Great, now let's make multiple box plots, below this one. I write df_pivot, then .iplot, then the kind parameter, again 'box', then the yTitle to add a name for the y-axis, which is going to be 'Population'. Let's run this and see the results: here we see the five box plots, and each of them represents one country, so we have the United States, India, and all the other countries, and we can see in detail all the values that correspond to these box plots. This is cool, because for a static box plot you would have to calculate those values on your own with some formulas, as I'll show you in a few videos, but with an interactive visualization you don't have to do that: you only have to point at the box plot and you immediately get the values you want.
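The statistics the interactive box plot shows on hover (Q1, median, Q3, min, max) can also be computed directly with pandas; the yearly values here are made up:

```python
import pandas as pd

# Made-up yearly population values for one country (in millions)
us = pd.Series([219.0, 240.0, 263.0, 282.0, 309.0, 323.0])

# These are the numbers an interactive box plot displays when you hover
q1 = us.quantile(0.25)      # first quartile (bottom of the box)
median = us.quantile(0.50)  # median (line inside the box)
q3 = us.quantile(0.75)      # third quartile (top of the box)
print(q1, median, q3, us.min(), us.max())
```

This is the same information the hover tooltip gives you for free, which is why the interactive box plot saves you the manual calculation.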
Now let's make interactive histograms. To do that, first I write the name of the data frame, df_pivot, then I select one country, in this case the United States again, then I use the iplot method with parentheses, and the kind is going to be 'hist' for histogram. Now I add a label for the x-axis, xTitle equal to 'Population'. I run this and we get the histogram, and we can see the number of values inside each bin: the first one has three values, as you can see there, the second has four, the third has four, and the last one has seven. We can also control the number of bins; we only have to add the bins parameter and set it to the number we want. Let's say we want three bins, so I write 3 and run, and instead of the four bins we had before, now we have three: the first one has three values, the second has eight, and the third has seven, so the bins were rearranged. Now we can add one more histogram, so we make multiple histograms in this one plot. The only thing we have to do is add one more element to the selection, so I add square brackets and one more country, in this case Indonesia, and with this we should get two histograms in the plot. Let's run it, and you can see that we have two histograms. The cool thing about interactive visualizations is that we can play with this plot too: I can hide Indonesia, and I can also hide the United States, so we see only the histogram we want. This is something you could also do with the multiple box plots, hiding any series you want so you focus only on the country you want, and this is very cool because it's not possible with a static visualization. Finally, we're going to see the plot that is most relevant when it's interactive: the scatter plot. It's more relevant because in a static scatter plot you don't see the values behind all the dots.
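What the bins parameter does can be checked with NumPy's histogram function, which returns the per-bin counts that the interactive histogram shows on hover; the values here are made up:

```python
import numpy as np

# Made-up sample values; in the course these would be yearly populations
values = np.array([1, 2, 2, 3, 5, 6, 7, 7, 8, 9, 9, 9,
                   10, 10, 10, 10, 10, 10])

# bins=3 splits the range [min, max] into 3 equal-width intervals
# and counts how many values fall into each one
counts, edges = np.histogram(values, bins=3)
print(counts)  # how many values per bin
print(edges)   # the 4 bin boundaries for 3 bins
```

Changing `bins` rearranges the same data into a different number of intervals, which is exactly what happens when you change the bins parameter in iplot.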
In an interactive visualization you can see the values of all the dots in the scatter plot, and now I'm going to show you, so you understand this much better. First df_pivot, this data frame, then the iplot method, with the kind parameter, as usual, equal to 'scatter' (with a double t), and then I add the mode parameter, which in this case is going to be 'markers'. I run it, scroll down, and now we have the scatter plot. It looks like a static scatter plot, but if we point at any dot we see the value behind it in detail: for example, China in 1985 had a population of 1.07 billion, and we can see the values behind all the dots in this scatter plot. Again, you can hide any country you want; for example, if I only want to show Brazil, here I have the scatter plot of Brazil. This scatter plot of population doesn't make much sense, because the data follows a pattern, and a scatter plot is usually useful when the data doesn't have a pattern, but I think you get the idea of the usefulness of interactive scatter plots. And that's it: in this video we learned how to make interactive visualizations with pandas, the advantages interactive visualizations have over static ones, and the differences and similarities between the iplot and plot methods. Welcome to this section. In this section we're going to learn how the groupby and aggregate functions work: we're going to group the data frame by some categories and then perform operations on the groups, like counting the elements, calculating the mean, calculating the sum, and so on. But first, let's check the data frame we're going to use in these videos, so we can better understand the concepts we'll see in the following videos. I start by importing pandas, so I write import pandas as pd and run it, and now we have pandas imported. Next I'm going to read the CSV file we're going to work with, and in this case
it's going to be a car sales CSV file, which contains information about car sales in different columns that we're going to analyze. I'm going to use the pd.read_csv method, which helps me read the CSV file: I open quotes and write the name of the file, which is car_sales. Now I run it, and we have this data frame with different columns about car sales. I'm going to select the columns we'll use in the following videos, and I'll explain what type of information each column holds as I select them. First I write the name of the data frame; I haven't assigned one yet, so I'm going to name it df_cars and run. Now the data frame is named df_cars, and I'm going to select the columns I want by writing double square brackets and listing the columns inside. The first column I choose is manufacturer, so I open quotes and paste it; this column holds the company that created the car, for example Audi, Volvo, and so on. The next column is sales_in_thousands, so I copy and paste it; this column has the sales of the car in thousands, and I suppose that's in dollars. Next is vehicle_type, so I copy and paste it; this column has information about the type of vehicle, and there are two types, passenger and car. To see which values are inside this vehicle_type column, we can use the value_counts method, and in case you forgot about
this method: we write the name of the data frame, then .value_counts() with parentheses after selecting the column, so I copy and paste the column name here. We run it and get the two types, passenger and car; most vehicles are passenger type and a few are car type. Now that we know this, I delete that cell and go back to the column list. Next I choose the price_in_thousands column, which has the price of each vehicle, so I copy and paste it here to add this new column. Then I choose another column, engine_size; each vehicle has a different engine size, and we're going to see that when we group by engine size later, so I write a comma, then quotes, and paste it. The next column is horsepower, so I add it here, and the last column is fuel_capacity, so that's the last column we're going to choose, and with that we're done. Now I'm going to arrange this, pressing Enter so you can see all the columns better; we have seven columns in total. Next I overwrite the data frame, so I write df_cars equal to this selection, tidy it up a bit, and run it, and let's see how the data frame looks. Here is the data frame, and we only have seven columns now. It's always a good idea to select the columns you're going to work with, because sometimes a data frame has too many columns, and if you keep all of them it's hard to focus on the goal of your project, so select the columns you want before you start working, as I did.
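Column selection with double square brackets and value_counts can be sketched with a made-up miniature of the car sales data:

```python
import pandas as pd

# Made-up stand-in for the car sales data frame
df_cars = pd.DataFrame({
    "manufacturer": ["Audi", "Volvo", "Audi", "Ford"],
    "vehicle_type": ["Passenger", "Passenger", "Car", "Passenger"],
    "sales_in_thousands": [16.9, 20.0, 39.4, 25.1],
})

# value_counts: how many rows fall into each category of a column
print(df_cars["vehicle_type"].value_counts())

# Double square brackets select a subset of columns (inner brackets = a list)
subset = df_cars[["manufacturer", "sales_in_thousands"]]
print(subset.columns.tolist())
```

The inner brackets are just a Python list of column names, which is why the selection uses two pairs of brackets.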
Before we finish this video, I'm going to show you the number of unique values in all the columns we have, and to do that I'm going to use a method we learned in this course, nunique. I write .nunique() with parentheses, and this is going to show me the number of unique values in each column. I run it, and we see all the unique value counts: for example, vehicle_type has only two unique values, as we've seen before, passenger and car, and manufacturer has 30 unique values, so probably 30 companies. This is important, because when we group values in a data frame we perform operations on those groups, and more unique values means more categories, so more groups, and the more groups we have, the more difficult the analysis. That's it for this video: we reviewed this data frame and got familiar with some of its columns, so now explore the data frame on your own to help yourself better understand the following videos. All right, in this video we're going to learn how the agg method works, and we're going to use some aggregate functions with it. We want to start by calculating the sum of the columns of the data frame we had before. In case you didn't watch the previous video, we created this df_cars data frame with car sales information, where we selected seven columns, and now I'm going to use that data frame. To calculate the sum of the columns, I only have to write the agg method: I write .agg, open parentheses, and inside write the function we want to apply to the data frame. I want to sum the values inside the columns, so I write 'sum', run the code, and we see the result. In the first column, manufacturer, we got this weird result, because the manufacturer column holds text data, so instead of summing values we concatenated the strings.
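Both nunique and the string-concatenation behavior of agg('sum') can be seen on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame mixing a text column and a numeric column (made-up values)
df_cars = pd.DataFrame({
    "manufacturer": ["Audi", "Volvo", "Audi"],
    "sales_in_thousands": [16.9, 20.0, 39.4],
})

# nunique: number of distinct values per column
print(df_cars.nunique())

totals = df_cars.agg("sum")
# Summing a text column concatenates the strings; numbers are added normally
print(totals["manufacturer"])
print(totals["sales_in_thousands"])
```

The odd concatenated string is exactly the "weird result" described above: Python's `+` on strings is concatenation, and sum is repeated `+`.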
We didn't want to do this; we only wanted to sum the values, but the agg method applied sum to the text columns as well and concatenated the strings. It doesn't matter; let's check the next column. This column has the sales in thousands, and here we got the sum of the values inside the column, which is fine. The following column is vehicle_type, which again has text data, which is why the values were concatenated here, but it's only text data that we can exclude later. In the next columns we have the total sums as well. So that's how you sum the values inside the columns, and now I'm going to calculate the mean of the numeric columns; maybe we'll get text data again like we did in the manufacturer column, so let's see. I write the name of the data frame, df_cars, then .agg with parentheses, and then the function we want to apply to the data frame, in this case 'mean', and I run it. There is a difference between these two functions: with sum we got text data in the manufacturer and vehicle_type columns, but with mean we didn't, because the mean involves division, and division isn't possible on text data, so those columns were omitted. Now we only get the columns with numeric data: for example, the sales in thousands, where we see that the average sales figure is greater than the average price in thousands, then the average engine size, which is three, then the horsepower and the average fuel capacity. Now let's say you want to apply not only the sum function or the mean function, but both functions to this data frame. In that case we can put them in a list and pass the list to the agg method. I'm going to do that here, calculating the mean and the count: I write df_cars, then .agg, then parentheses.
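A sketch of agg with a single function and with a list of functions, on made-up numbers. Note that recent pandas versions raise an error instead of silently dropping text columns when taking a mean, so this sketch selects the numeric columns first:

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns
df_cars = pd.DataFrame({
    "manufacturer": ["Audi", "Volvo", "Audi", "Ford"],
    "price_in_thousands": [20.0, 30.0, 40.0, 50.0],
    "horsepower": [150, 200, 250, 300],
})

# Keep only numeric columns so mean is well-defined on every column
num = df_cars.select_dtypes("number")

print(num.agg("mean"))             # one mean per numeric column
print(num.agg(["mean", "count"]))  # a list of functions -> one row per function
```

With the list form, the result is a small data frame whose index is the function names, which matches the two-row output described above.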
Inside, we open square brackets to indicate that we're starting a list, and we write the two functions we want to apply, in this case 'mean' and 'count'. I run it, and we get two rows and the seven columns we selected before. In the first row, which has count as its index, all the columns have values, and in the second row, which has mean as its index, we have some null data, because we tried to calculate the mean of text data, for example in the manufacturer column, and that's not possible, which is why we got this NaN, which stands for a null value. If you want, you can modify this and use any other function, for example sum instead of count; if I write 'sum', we get those values and the strings are concatenated again, but anyway, you can include any function you want. Now let's apply different aggregations per column: we're going to calculate the sum and the mean of the sales_in_thousands column, and the sum and max of the price_in_thousands column. I write the name of the data frame, df_cars, then .agg and parentheses. Now the argument has to be in the shape of a dictionary: we're going to use lists again, but to indicate the column we want to work with, we open a dictionary and write the column name as a key. I open quotes and write the name of the column I want to work with, in this case sales_in_thousands; I don't have it at hand, so I print df_cars, copy the sales_in_thousands name, and paste it. This is my first key; now I write a colon, and the value is going to be in list format, because we want to apply multiple functions, and inside the list I write the functions we want to
apply. In this case, for sales_in_thousands we apply the sum function first, and then the mean, so 'sum' and 'mean'. Now we add the second key-value pair, so comma, Enter, and the second key is going to be the second column, price_in_thousands, which I paste here, and then we specify the aggregate functions we want to apply to this column: square brackets to indicate a list, then the first function, 'sum', then a comma and the second one, 'max'. Now this is done, and what we're doing here is applying the sum and mean functions to the sales_in_thousands column, and the sum and max functions to the price_in_thousands column, so we're applying different aggregation functions per column. I run it and we see the results: three rows, with some null values, because there's no mean for price_in_thousands, where we only asked for sum and max, and no max for sales_in_thousands, where we only asked for sum and mean. The two columns have one function in common, sum, and in that row we have values for both of them. That's how you apply different aggregation functions per column. Now let's see how to aggregate over the columns, meaning we're going to perform operations along the columns. Here I'm going to sum two columns, and in this case I'm going to select those two columns first, because if we don't select them, the aggregation function would be applied to all the columns, and we only want it applied to two. So I write df_cars and select the two columns: the first one is sales_in_thousands and the second is price_in_thousands, which I copy and paste.
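The per-column aggregation dictionary can be sketched like this on toy numbers; note the NaN cells where a function wasn't requested for a column:

```python
import pandas as pd

# Made-up numbers for the two columns of interest
df_cars = pd.DataFrame({
    "sales_in_thousands": [10.0, 20.0, 30.0],
    "price_in_thousands": [15.0, 25.0, 35.0],
})

# Dictionary argument: keys are column names, values are lists of the
# functions to apply to that column only
result = df_cars.agg({
    "sales_in_thousands": ["sum", "mean"],
    "price_in_thousands": ["sum", "max"],
})
print(result)
```

The result index is the union of all requested functions (sum, mean, max); a cell is NaN when that function wasn't requested for that column, which explains the null values described above.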
Now let's see how to aggregate over the columns, meaning operations along the rows of selected columns. I'm going to sum two columns, and first I have to select them, because otherwise the aggregation would be applied to every column and we only want these two. So I select sales_in_thousands and price_in_thousands, and then I apply agg with the sum function. To specify that I want to operate along the columns I add the axis parameter set to 1, which means columns. When I run this we get a Series with 157 rows (the index goes up to 156, but it starts at zero), the same number of rows as the original DataFrame, which I print to confirm. So we summed the values across these two columns, row by row.

Now you might wonder why I didn't just select each column and add them with the plus operator, as we did in previous videos: df_cars["sales_in_thousands"] + df_cars["price_in_thousands"]. If I run that we get almost the same result. I used agg partly to show you how to aggregate over the columns with the agg method, but there's another reason: the two approaches behave differently. In the first approach we have a numeric value at index 2, but in the second approach the same row is NaN, which looks odd, because we'd expect the same result. Looking at the DataFrame, at index 2 the sales_in_thousands column has numeric data but the price_in_thousands column has a NaN. The agg method treats that null value as if it were zero, so we still got 14, the sales value plus nothing. But the plus operator propagates nulls: if one value is NaN, the sum is NaN too, even when the other value is numeric. That's how both behave, and you have to keep that in mind.
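Here's a small sketch of that difference, again with made-up numbers: `agg("sum", axis=1)` skips the NaN, while `+` propagates it.

```python
import pandas as pd

# One row has a missing price (made-up values)
df_cars = pd.DataFrame({
    "sales_in_thousands": [14.1, 20.0],
    "price_in_thousands": [None, 30.0],
})

cols = ["sales_in_thousands", "price_in_thousands"]

# agg skips NaN, treating the missing price like zero
row_sum_agg = df_cars[cols].agg("sum", axis=1)

# The + operator propagates NaN: NaN plus anything is NaN
row_sum_plus = df_cars["sales_in_thousands"] + df_cars["price_in_thousands"]

print(row_sum_agg)
print(row_sum_plus)
```

The first row sums to 14.1 with agg but is NaN with the plus operator; rows with no nulls come out identical in both.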
Okay, before we finish this video, let me show you how to aggregate with different functions and rename the index of the resulting DataFrame. I write df_cars.agg() again, but now with a different format: named aggregation, using tuples. Inside each tuple, the first element is the name of the column and the second is the aggregate function, and outside the tuple we write the name we want for the resulting index. So I open a tuple, write sales_in_thousands (copied from the printed DataFrame) as the first element, and sum as the second. Then, outside the tuple, I assign it to the index name I want: if I set it equal to x, I'll get x in the index instead of sum. I add a second pair, y, equal to a tuple with price_in_thousands and the same sum function. When I run this we get a little DataFrame whose index labels are x and y, so we renamed the indexes. You'll also see null values in both the sales_in_thousands and price_in_thousands columns, which looks odd: x and y both represent the sum function, so it seems they should share a row, but pandas doesn't recognize that x and y belong to the same function, so each keyword gets its own row. This may not look very useful yet, but it becomes much more relevant when we combine it with the groupby method, as we'll do in the following videos. And that's it: in this video we learned how the agg method works in pandas and how to use some aggregate functions.
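A short sketch of this named-aggregation form on a whole DataFrame (made-up numbers); each keyword becomes a row label, and cells for the other column come out as NaN:

```python
import pandas as pd

df_cars = pd.DataFrame({
    "sales_in_thousands": [16.9, 39.4, 19.7],
    "price_in_thousands": [21.5, 28.4, 26.9],
})

# Named aggregation: keyword = (column, function).
# The keyword ("x", "y") becomes the index label of that result row.
result = df_cars.agg(
    x=("sales_in_thousands", "sum"),
    y=("price_in_thousands", "sum"),
)
print(result)
```

Even though x and y apply the same function, each keyword gets its own row, so the off-diagonal cells are NaN.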
All right, in this video we're going to learn how the split-apply-combine strategy works. This strategy helps us group the values of a DataFrame into categories so we can apply aggregate functions to them. Let's start with the split step, where we separate the data into groups. We'll work with the vehicle_type column, and first let's check its values with the value_counts method; we saw this in a previous video, but in case you forgot, vehicle_type has two values, Passenger and Car, so we can split the DataFrame into those two categories.

For the passenger category I build a filter: df_cars, square brackets, the column name, then the condition equal to "Passenger". I assign this to a variable called passenger_filter. I do the same for the car type: car_filter is the same expression, but with the condition equal to "Car". If I display one of these variables, say car_filter, you can see we got a Series of boolean values, which we'll use in the next step.

The next step is apply, where we apply an operation, an aggregate function, to each group; this is where we compute the value of each category that we'll show in a DataFrame later. I filter the DataFrame with each condition: df_cars with passenger_filter inside the square brackets keeps only the rows where vehicle_type is Passenger, and you can check the vehicle_type column to confirm they're all passengers. Then I choose a column for the aggregate function, sales_in_thousands, select it, and apply the mean with parentheses. Running this gives the mean of sales_in_thousands for the passenger filter only, and I assign it to a variable called passenger_average. I build the second one the same way: df_cars with car_filter, the same sales_in_thousands column (the operation has to be on the same column), and the mean, assigned to a variable called car_average. I forgot the quotes around the column name at first, so I add them, and now this is ready to run.

The last step is combine, where we put together the results of the split and apply steps into a DataFrame. I use pd.DataFrame with a dictionary: the first key is vehicle_type, with the list of the categories, car and passenger, and the second key is sales_in_thousands, with the list of the two variables we created, in the same order, so first car_average and then passenger_average. What this DataFrame shows is the categories we selected for the groups, car and passenger, together with the values of the aggregate function we applied to them, in this case the
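The three steps above can be sketched like this, with a tiny made-up DataFrame (the averages quoted in the video, 80.62 and 43.23, come from the real course dataset):

```python
import pandas as pd

# Made-up stand-in for the course's car-sales data
df_cars = pd.DataFrame({
    "vehicle_type": ["Passenger", "Car", "Passenger", "Car"],
    "sales_in_thousands": [16.9, 39.4, 19.7, 8.6],
})

# Split: one boolean filter per category
passenger_filter = df_cars["vehicle_type"] == "Passenger"
car_filter = df_cars["vehicle_type"] == "Car"

# Apply: aggregate each group separately
passenger_average = df_cars[passenger_filter]["sales_in_thousands"].mean()
car_average = df_cars[car_filter]["sales_in_thousands"].mean()

# Combine: assemble the per-group results into a new DataFrame
result = pd.DataFrame({
    "vehicle_type": ["car", "passenger"],
    "sales_in_thousands": [car_average, passenger_average],
})
print(result)
```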
All right, in this video we're going to learn how the groupby method works. groupby lets us group data into categories and then apply aggregate functions, such as the sum, the average, or a count of the values inside each category. We'll start by grouping by vehicle_type and calculating the mean, the same thing we did in the previous video with the split-apply-combine strategy, except that took many lines of code and groupby needs only one. I write the name of the DataFrame, df_cars, then groupby with the column name vehicle_type inside, and then the aggregate function, mean, with parentheses. So the syntax is: the DataFrame, the groupby method with the column we want to group on, and then the function we apply. When we run this we get a DataFrame with the two categories, car and passenger, and only the columns with numeric data, with the mean of each column per category. Comparing it with the little DataFrame from the split-apply-combine strategy, where we only calculated sales_in_thousands: the value for car is 80.62 in both, and for passenger it's 43.23 in both. Same result, much shorter code. There's one small difference: here vehicle_type is in the index, while before it was a column. You can call set_index on the earlier result to match, or you can add the as_index parameter to groupby and set it to False, so that vehicle_type stays as a regular column instead of becoming the index; when I run it with as_index=False you can see it isn't an index anymore. And that's how the groupby method works.
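A compact sketch of the one-line version, using the same kind of made-up data as before:

```python
import pandas as pd

df_cars = pd.DataFrame({
    "vehicle_type": ["Passenger", "Car", "Passenger", "Car"],
    "sales_in_thousands": [16.9, 39.4, 19.7, 8.6],
})

# One line replaces the whole split-apply-combine sequence;
# the grouping column becomes the index by default
by_type = df_cars.groupby("vehicle_type").mean()
print(by_type)

# as_index=False keeps vehicle_type as a regular column instead
by_type_flat = df_cars.groupby("vehicle_type", as_index=False).mean()
print(by_type_flat)
```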
Now let me show you a different example so we understand this method better: let's group by manufacturer. I write df_cars.groupby("manufacturer"), but this time without applying an aggregate function, to see what happens. The output says this is a DataFrameGroupBy object. So when we call groupby on its own we get a GroupBy object, which isn't the final result we'll use, but we can use it to get other values we want, for example the names of the groups. First I assign it to a variable, gb_object, and print it to show it's the same object. Now I access its groups attribute, which gives us all the categories inside this GroupBy object. The result has the form of a dictionary, you can see the curly braces: the keys are the categories, the names of the groups, and the values are the row indexes that belong to each one. For example, the Audi category has indexes 4, 5, and 6, meaning those rows, and only those rows, have Audi as the manufacturer. Since this behaves like a dictionary, we can call the keys method on it to get only the group names, the manufacturers: Audi, Ford, Volvo, and so on.

Now let's use the get_group method to select a specific group within the manufacturer column. I reuse the object and write gb_object.get_group("Ford"), and we get the selection of rows where the manufacturer is Ford; all the values in the manufacturer column are Ford. If you check the groups attribute, you'll see the indexes match: for Ford they run from 46 to 56, the same rows that get_group returned, so that's the selection corresponding to the Ford category.

Okay, that's enough about the GroupBy object; now let's apply some aggregate functions. I write df_cars.groupby("manufacturer").mean(), and we get a DataFrame grouped by the manufacturer names, with the mean of every numeric column calculated for each group. We could turn the index into a column with as_index=False, as I showed you before, but this is enough. The next example is similar: group by manufacturer and calculate the sum. You can try it yourself; it's df_cars.groupby("manufacturer").sum(), and we get a result similar to the one with the average, except now we summed the values inside each category.
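Here's a sketch of inspecting a GroupBy object and then aggregating, with a few made-up rows:

```python
import pandas as pd

df_cars = pd.DataFrame({
    "manufacturer": ["Acura", "Ford", "Ford", "BMW"],
    "sales_in_thousands": [16.9, 39.4, 19.7, 8.6],
})

gb_object = df_cars.groupby("manufacturer")

# .groups maps each group name to the row indexes that belong to it
print(gb_object.groups)
print(list(gb_object.groups.keys()))  # just the group names

# get_group returns the sub-DataFrame for one category
ford = gb_object.get_group("Ford")
print(ford)

# Aggregate functions collapse each group to one row
print(df_cars.groupby("manufacturer").mean())
print(df_cars.groupby("manufacturer").sum())
```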
Okay, finally let's see three last examples, all of them involving the count function. First we group by vehicle_type and count the values: df_cars.groupby("vehicle_type").count(). We get only two categories, because as we've seen before vehicle_type only has the car type and the passenger type, and we counted the values inside each column. For example, there are 41 manufacturer values in the car type, and the same number for sales_in_thousands, but for price_in_thousands and engine_size we get 40, one element less. This happens because some columns have null values, and count skips them: a NaN isn't actually a value, so it isn't counted. If we wanted those cells counted we'd have to fill the nulls with a number or a text first; as long as they're NaN, the count method won't include them, so keep that in mind.

Now let's verify which columns have null values, using the isnull method together with sum: df_cars.isnull().sum(). We see that only price_in_thousands, engine_size, horsepower, and fuel_capacity have nulls, which makes sense, because those are the columns with fewer counted values, while manufacturer and sales_in_thousands have no nulls at all.

But what if you want to group and count and still show the null values? For that you only have to add the dropna parameter inside the groupby method, and I'll show you in this last example. We group by engine_size: df_cars.groupby("engine_size", dropna=False).count(). The dropna parameter controls whether we drop the null values from the grouping: set to True, the nulls are dropped and won't be counted when we apply the count method, but set to False, the null values are included in the groups we make, so we can count them. Let's have a look at how it works: running this, we see many engine-size groups, 1, 1.5, 1.6, and so on, and if we scroll all the way down, the last row is NaN, the group of the null values, with the counts of the columns that have them. This dropna parameter comes in handy whenever you want the null values to be considered in whatever operation you perform on the groups you created. And that's it: in this video we learned how the groupby method works in pandas.
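A sketch of count skipping NaN, and of dropna=False giving the nulls their own group (made-up data):

```python
import pandas as pd

# engine_size has one missing value (made-up data)
df_cars = pd.DataFrame({
    "vehicle_type": ["Passenger", "Car", "Passenger", "Car"],
    "engine_size": [1.8, None, 3.2, 3.5],
})

# count() skips NaN: the Car group has two rows but only one engine_size
counts = df_cars.groupby("vehicle_type").count()
print(counts)

# Which columns contain nulls, and how many
print(df_cars.isnull().sum())

# dropna=False keeps NaN as its own group when the grouping column has nulls
nan_group = df_cars.groupby("engine_size", dropna=False).count()
print(nan_group)
```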
Okay, in the previous videos we learned the groupby method and the agg method, and in this video we're going to use them together to perform more complex operations. In this first example we'll find the minimum and maximum value of each column, grouped by vehicle_type. We write df_cars.groupby("vehicle_type"), and as we learned before, to get the minimum value we use the min method, so .min() gives us the minimum of each column, and changing it to .max() gives the maximum. That's the common approach, but now let me show you how to calculate both at the same time with the agg method: instead of .max() I write agg and pass a list, inside square brackets, of all the functions I want to apply to the groups, "min" and "max" in quotes. Running this gives the minimum and maximum of every column, and we can add as many aggregate functions as we want.

We can also customize the column names and apply different functions to different columns, which I'll show you in the next example: finding the minimum engine size and the maximum horsepower. First we group by vehicle_type as before, then we use the agg method, but instead of the square-bracket list we use tuples this time, written with parentheses. Inside each tuple, the first element is the column and the second is the aggregate function. So the first tuple is ("engine_size", "min"), and I assign it to the keyword min_engine_size: that's the customized column name we'll get instead of the default one. The second tuple is ("horsepower", "max"), assigned to the keyword max_horsepower. Let's review the syntax one more time: we group by vehicle_type, then call agg with keyword arguments, where each keyword is the new column name and its value is a tuple of the source column and the aggregate function. When we run it, we get vehicle_type in the index, because we grouped by it, and two columns: min_engine_size, holding the minimum values, 2 for car and 1 for passenger, which we can verify against the engine_size column of the previous DataFrame, and max_horsepower, holding the maximums, 300 for car and 450 for passenger, which also match. So what we did is perform a specific aggregate function on a particular column, the minimum for engine_size and the maximum for horsepower, which is much better than applying every aggregate function to every column: we focus only on the columns we want and set customized column names.
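A sketch of this named aggregation on a GroupBy (made-up numbers):

```python
import pandas as pd

df_cars = pd.DataFrame({
    "vehicle_type": ["Passenger", "Car", "Passenger", "Car"],
    "engine_size": [1.8, 2.0, 3.2, 3.5],
    "horsepower": [140, 200, 225, 300],
})

# new_column_name=(source_column, function)
result = df_cars.groupby("vehicle_type").agg(
    min_engine_size=("engine_size", "min"),
    max_horsepower=("horsepower", "max"),
)
print(result)
```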
Okay, now to practice all the concepts we learned, let's solve this example: calculate the sum of sales_in_thousands and the mean of price_in_thousands. It's basically something similar to the previous example, but with new columns and new aggregate functions. This time we group by manufacturer; it isn't written in the exercise, but that's the grouping we'll use. To spell the column correctly I check df_cars.columns, copy manufacturer, and paste it into groupby. Then I use the agg method with tuples again: the first keyword is sum_sales, equal to the tuple ("sales_in_thousands", "sum"), then a comma, and the second keyword is mean_price, equal to ("price_in_thousands", "mean"). So we have the GroupBy object, then the agg method, and the syntax you already know. Running it gives a DataFrame in which we grouped the data by manufacturer, summed the values in the sales column, and calculated the mean of the price column, for every manufacturer.

Now I want to show you what's under the hood: we're going to get the same result without the agg method, using only groupby and an aggregate function, so you understand much better what we're doing. First, the sum of sales_in_thousands: I group by manufacturer and call .sum(), which, if I run it on its own, sums every column for every manufacturer; then I select just the sales_in_thousands column. Let's compare any value: for Audi the sum is 40 here, and in the sum_sales column of the agg result it's also 40, so it's the same, but this time we used only the sum aggregate function on a single selected column, without agg. For mean_price we can do the same thing: I copy and paste the line, replace sum with mean and sales_in_thousands with price_in_thousands, and checking BMW we get 33 in both results, so it's correct. I did this because I wanted to show you what's under the hood: you can achieve all of this without the agg method, but you should use agg to simplify things, and it's always good to know what's behind the shortcut. And that's it: in this video we learned how to use the groupby and agg methods together.
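The two equivalent approaches can be sketched side by side (made-up numbers):

```python
import pandas as pd

df_cars = pd.DataFrame({
    "manufacturer": ["Audi", "Audi", "BMW"],
    "sales_in_thousands": [20.0, 20.4, 9.2],
    "price_in_thousands": [23.9, 33.9, 26.9],
})

# The shortcut: both aggregations in one agg call, with custom names
combined = df_cars.groupby("manufacturer").agg(
    sum_sales=("sales_in_thousands", "sum"),
    mean_price=("price_in_thousands", "mean"),
)
print(combined)

# Under the hood: one groupby per aggregation, selecting a single column
sum_sales = df_cars.groupby("manufacturer")["sales_in_thousands"].sum()
mean_price = df_cars.groupby("manufacturer")["price_in_thousands"].mean()
print(sum_sales)
print(mean_price)
```

Both paths produce the same numbers; agg just packs them into one call with custom column names.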
Okay, so far we've learned how to apply built-in aggregate functions to our GroupBy object, but in this video we're going to learn how to apply our own function, using a lambda function. As an example of a built-in function, here we group by manufacturer and sum the values inside the groups, as we've done so far, and we get the sum of each column per manufacturer. Now we can apply our own function on top of that. Let's say we don't want the sales in thousands but the real values, so instead of 79 we want to show 79,000: we have to multiply each value by 1,000, and we can do that with a lambda function.

First, a quick reminder of how lambda functions work, in case you don't remember or didn't watch the previous videos. To create one we write lambda, then the input, any variable name, say x, and then the output, say x + 1. I assign it a name, sum_1, and if I call sum_1(1) I should get 2, because 1 + 1 is 2, and when I run it, that's what we get. A lambda function is a one-line function, a short version of the traditional functions we learned in the Python crash course, and it's the preferred way to apply a customized function to a pandas DataFrame.

Now that I've explained that, I delete it and do what I told you before: calculate the real values by multiplying by 1,000. I write df_cars.groupby("manufacturer").sum(), and after this I use the apply method with the lambda inside: apply(lambda x: x * 1000), where x is the input and x * 1000 is the output. If I run this we get the real values of the data: the first one is 79,000, not 79 anymore. But every column was multiplied by 1,000, and not all of them were "in thousands", so let's select only the two that were, with double square brackets: price_in_thousands and sales_in_thousands. Those are the only columns we want to show, because those are the ones that really were in thousands. And that's how you create a lambda function and apply it to a GroupBy object.
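A sketch of rescaling grouped sums with a custom lambda (made-up numbers; the 79,000 in the video comes from the real course dataset):

```python
import pandas as pd

df_cars = pd.DataFrame({
    "manufacturer": ["Audi", "Audi", "BMW"],
    "sales_in_thousands": [20.0, 20.4, 9.2],
    "price_in_thousands": [23.9, 33.9, 26.9],
})

# Sum per manufacturer, then rescale every value with our own lambda
real_values = df_cars.groupby("manufacturer").sum().apply(lambda x: x * 1000)

# Keep only the columns that really were "in thousands"
real_values = real_values[["price_in_thousands", "sales_in_thousands"]]
print(real_values)
```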
calculating the mean of each group. Okay, now that this is clear, I delete this and run it, and here we have the result. For example, in the engine size column the first value is -1, which indicates that this value is under the mean of the engine size, while here we have a 27, which indicates that this value is over the mean. As you can see, the function we created shows us which values are under the mean and which are over it, so we can recognize the good values. Okay, before finishing this video I want to show you what's under the hood of this lambda function. I print the data frame df_cars, and we have these values; this is the original data frame. Now I'm going to group by manufacturer and calculate the mean, just to show you what's under the hood of the lambda function we created. I write df_cars, then group by manufacturer, so I copy and paste this, and after this I calculate the mean. So here we have the groups by manufacturer with their means, and now let's do the operation ourselves: subtract x minus x.mean(). The x is the value inside the original data frame, for example 16, and x.mean() is the average of its group, so for this 16 the relevant average is the average of Acura. Let's check it: for Acura the average is 19, and if we subtract 16 minus 19 we get roughly -3. Now let's verify. In the data frame we got after applying the lambda function, the sales in thousands column has -2.83 in the first row, which is close to -3, so we verified that this is correct; we simply subtracted the mean from these values. Let's check one more example and pick Audi, where in this case the value is 20.39, and
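The group-wise mean-centering walked through above can be sketched on a tiny, made-up stand-in for the cars data (the numbers and simplified column names here are for illustration only; the course's data set has more columns). `transform` is used here because it reliably keeps the original row index; the `apply` form shown in the video behaves the same way for this calculation:

```python
import pandas as pd

# Toy stand-in for df_cars (values are made up for illustration)
df_cars = pd.DataFrame({
    "manufacturer": ["Acura", "Acura", "Audi", "Audi"],
    "sales_in_thousands": [16.0, 22.0, 20.0, 7.0],
})

# For each row, subtract the mean of that row's manufacturer group:
# Acura's mean is 19.0, Audi's mean is 13.5
grouped = df_cars.groupby("manufacturer")["sales_in_thousands"]
centered = grouped.transform(lambda x: x - x.mean())

print(centered.tolist())  # negative values are under their group's mean
```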
now let's check the average of Audi in the sales in thousands column, which is 13.5, and 20 minus 13 is about 7. Let's verify in the data frame: we check the row with index 4, look for sales in thousands, and there it is, 6.87, which is close to 7, so this is correct. With this I wanted to show you that when aggregate functions like the mean are applied inside the lambda function, the mean is applied to the groups we get after using the group by method; in this case we grouped by manufacturer, so we calculated the mean over the manufacturer groups. And that's it: in this video we learned how the lambda function works, and we created more complex functions using the apply method and the lambda function. Okay, in this video we're going to learn how to filter values based on an aggregate function, and to do that we first have to create a group by object. I write df_cars, then the group by method, so group by, then parentheses, and inside I write the name of the column; I'm going to get the column names using the columns attribute so I write it correctly. I'll do this example with the manufacturer column. Now I use the filter method: when used with a group by object, the filter method helps us filter values based on an aggregate function. We only have to write filter and open parentheses, and inside we have to specify a function; we can create our own function with the condition that we want to set for this filter. So let's do that: I create a function using the def keyword, the name of my function is going to be filter_func, I set the parameter x, then a colon, and we return a condition. First x, and then we select a column, because this x parameter represents a data frame, and if we open square brackets we're selecting a
column. Here I want to select the sales in thousands column, because I'm going to create a condition based on it. I paste this, and we're going to compare the sum of sales for each manufacturer group, so we write the sum method and compare it with a value, which is going to be the mean of the sales in thousands. To calculate that mean I use the mean method; I run this, and we see that the mean of sales in thousands is 52. This is the mean of the whole column, so now I use 52, and this filter function is going to filter groups based on the sum of sales in thousands, which has to be greater than 52. If a manufacturer has total sales less than 52, that company or manufacturer is going to be filtered out. Now what we have to do is copy the name of the function and put it inside the parentheses of the filter method, so we specify that we want to apply that function; the function holds the condition, and we filter based on it. Now I'm going to set a name for this data frame, because what we get here is a data frame with the filter applied: only the values that pass the filter will be in this new data frame. I write df_filter, and this is my new data frame. Now everything is ready: we have the function, we have the group by object, and we use the filter method, so that's everything we need. I show you this new data frame, and here it is. I scroll down, and all the values here passed the filter, so all the manufacturers shown passed, which means their sum of sales is greater than 52: Acura passed, Volvo passed, and so on. Okay, great. Now I'm going to show you which manufacturers were filtered out, and to do that I'm going to use the same group by object, so I copy it and paste it. So this is my group
by object, and to calculate which manufacturers were filtered out I'm going to use the sum aggregate function. I run this, and we get the sum of these values, grouped by manufacturer. I'm going to select only the sales in thousands column, because that's the column we used in the function, so I select that column and here it is. Now I'm going to sort the values in ascending order, so I write sort_values with parentheses, and we get the values of the sum of sales in ascending order. We see that the first six manufacturers have total sales less than 52; to see it better I use head with six, so we only see the first six values. So Porsche, Jaguar, Saab, Infiniti, Audi, and BMW were filtered out. Okay, now let's see how many rows these six manufacturers had in the original data frame, and to do that I'm going to use the isin method. I use the same column, so I write manufacturer, copy it, paste it, and after this I use the isin method: parentheses, and inside I write the names of these six manufacturers. Okay, I pasted the six manufacturers, and now I'm going to filter the data frame, so I copy its name and add square brackets around the condition. I run this to see the result: this data frame only contains cars that belong to these six manufacturers, so we see Audi, BMW, and so on. Now let's see the shape of this data frame using the shape attribute, so I write .shape and run, and we see that the shape is 13 comma 7; the first number is the number of rows and the second is the number of columns. Great. Now let's compare the shape of the original data frame and the shape of the df_filter data frame to see how many rows we lost in the filter, so I write df_cars and then the shape attribute, and here we see that in the original data frame
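The group-level filtering described above can be sketched on toy data (the numbers and the simplified column name `sales` are made up for illustration; the course's data set uses a sales-in-thousands column, and the real threshold of 52 is the column mean):

```python
import pandas as pd

# Toy stand-in for the cars data (made-up values)
df_cars = pd.DataFrame({
    "manufacturer": ["Acura", "Acura", "Porsche"],
    "sales": [30.0, 40.0, 10.0],
})

def filter_func(x):
    # x is the sub-DataFrame for one manufacturer group;
    # keep the whole group only if its total sales exceed the threshold
    return x["sales"].sum() > 52

# Acura sums to 70 (kept); Porsche sums to 10 (filtered out)
df_filter = df_cars.groupby("manufacturer").filter(filter_func)
print(df_filter)
```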
we had 157 rows. Now let's find out how many rows we have in the filtered data frame, so I copy this and paste it here, and I use the print function to print both values. I run, and we see that the original data frame had 157 rows but the filtered data frame has 144, so we lost 13 rows, and those 13 rows belong to the six manufacturers that didn't pass the filter we created in the function: their total sales are less than 52, so they didn't pass, and that's why those 13 rows from the six manufacturers were removed. And that's it: in this video we learned how to use the filter method with the group by object. In this video we're going to see the data set that we're going to work with in this section. First we import pandas as pd, so I run this first line of code, and now I'm going to read the two data sets we'll work with in this section: the first is a movies data set from IMDb, and the second is the ratings of those movies. Let's read the first CSV file: I write pd.read_csv, then parentheses, and inside I write the name of the IMDb movies CSV file. I run it, and let's see what happens: we get a warning message along with the data frame. The warning recommends using the low_memory parameter set equal to False, so there's some kind of issue; I set low_memory equal to False, and now if I run this we get the data frame without the warning message. This is an issue with this particular data set, but if you add the low_memory parameter everything will be fine. So now we have the first data frame, and let's see some columns that we're going to use for the section. This data frame has a lot of columns; actually, if we scroll all the way down, we can see that it has 22 columns. We're not going to use all of them, but we're going to use
only five of them, so let's see those five columns. The first is the ID, this imdb title ID column, which contains the IDs of the movies; the IDs are unique, which means we won't find duplicated IDs, so each movie has its own ID. The next column we're going to use is the title column, which is the name of the movie; then the year column, the year in which the movie was published; and we're also going to use the genre and, finally, the country. Those are the five columns we'll use in this section. I'm going to assign this data frame to df_movies, so I run this, and we save the CSV file into the df_movies data frame. Now let's check the second CSV file, the one about the ratings of those movies. I read it with read_csv, so I write read_csv and then the name of the IMDb ratings CSV file. Now it's ready; I comment the previous line out and run this. Here we have the data frame, and each column measures the votes in a different way: for example, here we have the total votes, here the mean and the median, and so on. We have a lot of columns, but we're going to use only three: the ID column, which is the same ID column we found in the df_movies data frame; the total_votes column; and finally the mean_vote column, which is the mean of the votes for each movie. Now I'll give a name to this data frame: I write df_ratings, and I run this. Okay, now let's select the columns we're going to work with in this section. I write df_movies.columns first to get all the column names, and then, to overwrite the data frame, I write df_movies equal to df_movies, but now with double square brackets, and inside I write the columns that we're going to
select, so these are the columns I mentioned before; I just pasted the five columns: the ID, the title, the year, the genre, and the country. Now let's do the same with df_ratings: I overwrite the data frame and select the three columns I mentioned before. To do so, I change the name of the data frame, so I write df_ratings.columns to get all the column names, and then I select just the columns I mentioned. Okay, I pasted the three columns, and now everything is ready, so I run this code. Let's see how the data frames look: I run df_movies, and now we only have five columns, and I copy and paste df_ratings, and we only have three columns. As you can see, both data frames have one column in common, and that is the ID column; keep that in mind, because we're going to use that column a lot in this section. And that's it: in this video we explored the two data frames that we're going to work with in this section. Now it's your turn to explore these data sets further so you understand what we're going to learn in this section much better. All right, in this video I'm going to show you how to concatenate data frames in pandas. In pandas we can use the concat method to concatenate two data frames, and there are two different ways to do it: we can concatenate data frames vertically and horizontally, and in this video I'm going to show you how to concatenate vertically. First, let's see the two data frames we're going to use as examples: here we have DF1 and DF2, and these two data frames have two columns in common, the ID and the age columns. In the first data frame the values inside the ID column start with the letter A and end with the letter D, while in the second data frame the ID values start with the letter E and end with the letter H, so we have different values. Whenever we want to concatenate two
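The loading and column-selection steps above can be sketched as follows; since I can't ship the real files here, a tiny in-memory CSV stands in for the movies file, and the file and column names (`imdb_title_id`, etc.) are assumptions based on the video, so adjust them to your copy of the IMDb data set:

```python
import io
import pandas as pd

# In place of the real imdb movies CSV, a tiny in-memory stand-in
movies_csv = io.StringIO(
    "imdb_title_id,title,year,genre,country,extra_column\n"
    "tt0000001,Some Movie,1999,Drama,USA,x\n"
)

# low_memory=False avoids the mixed-dtype warning the real file triggers
df_movies = pd.read_csv(movies_csv, low_memory=False)

# Keep only the five columns used in this section (double square brackets)
df_movies = df_movies[["imdb_title_id", "title", "year", "genre", "country"]]
print(df_movies.shape)
```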
data frames vertically, we need to have columns in common; in this case the columns in common are the ID and the age columns, and by concatenating we get a data frame that combines the data inside DF1 and DF2. Okay, to indicate in pandas that we want to concatenate two data frames vertically, we write the axis parameter and set it equal to zero; zero means vertical concatenation, and if you go to the pandas documentation you will see that this is also known as concatenating along the rows, so that's another way to refer to this concatenation. Now let's see how to concatenate these two data frames. If you want to concatenate vertically, imagine that one data frame sits on top of the other; this helps us visualize the final output. Putting DF1 over DF2 is like concatenating DF1 and DF2 vertically, so let's see the result: if we concatenate these two data frames, we get this final data frame, which has the ID and age columns and all the data from both DF1 and DF2. In the ID column we have data from A to D, which corresponds to DF1, and also data from E to H, which corresponds to DF2, and all the data inside the age column was concatenated too. Another thing to keep in mind is that by default pandas keeps the original indexes, even if they have the same values. In data frame one we had indexes from 0 to 3, and in data frame two we also had indexes from 0 to 3, so the output data frame will have repeated indexes, which is not a good practice; later I'll show you how to deal with this behavior. But now let's see the code that concatenates vertically. Here is the code: we write pd, which comes from pandas, then the concat method, so we write concat with parentheses, and inside the parentheses we write a list, starting with square brackets, and as elements we write
the two data frames: first df1 and then df2. We can also add the axis parameter and set it equal to zero, because we want a vertical concatenation, but it's not necessary, since axis is set to zero by default in the concat method, so we can omit that parameter. All right, enough talking, now it's time to write code, so let's go to Jupyter Notebook. Here I have the code that creates DF1 and DF2, so I run it, and now I have DF1, which is the same data frame I showed you before, and also DF2. Now let's use pd.concat to concatenate these two data frames: I write pd.concat, we open parentheses, and we open square brackets; first we write df1 and then df2. Here we could omit the axis parameter, but in this case I'm going to add it, so axis equal to zero, which indicates that we want to concatenate vertically. Okay, I run this, and we see the result: this is the output, and it's the same data frame I showed you in the slides, with the ID and age columns holding the data of DF1 and DF2. Now I'm going to show you some other parameters you can add, and the first is ignore_index. If you check here, you can see there is an ignore_index parameter, which helps us ignore the original indexes from DF1 and DF2. If we add ignore_index and set it equal to True, we get 0, 1, 2, 3, 4, up to 7, and we don't get the 0 to 3 indexes twice, because we ignored the original indexes, so the new data frame has a new index. And that's how you use the concat method in pandas. Now we're going to see an exercise, and we'll use the data frames I showed you in the previous videos: the IMDb movies data set and the IMDb ratings data set. In the previous videos we read these two data sets and we assigned the
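The vertical concatenation just described can be sketched like this (the values in DF1 and DF2 are made up to mirror the slides):

```python
import pandas as pd

# Two frames with the same columns and ID values A-D and E-H
df1 = pd.DataFrame({"id": ["A", "B", "C", "D"], "age": [25, 30, 35, 40]})
df2 = pd.DataFrame({"id": ["E", "F", "G", "H"], "age": [22, 28, 33, 38]})

# axis=0 (the default) stacks df2's rows under df1's;
# ignore_index=True builds a fresh 0..7 index instead of repeating 0..3 twice
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked)
```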
df_movies and df_ratings names, and in this video we're going to use these two data frames to solve an exercise. Remember that we also selected some columns, so we focus only on the five columns in df_movies and the three columns in df_ratings. Okay, in this exercise, first we have to extract a 50% sample of the original data frame, and this data frame is the df_movies I showed you before, so we have to use the sample method to extract a 50% sample. Then we have to show the shape of the data frame we created: after extracting the 50% sample, we create a df_sample data frame, and we check its shape. Then we concatenate the df_movies and df_sample data frames, in this case vertically, and that's what you have to do. Now try this yourself, and after you've tried to solve the exercise, continue watching this video to see my solution. Okay, to solve this exercise we start by extracting a 50% sample with the sample method: we write df_movies.sample. We learned this sample method previously in this course, and as you might know it has a parameter named frac that helps us get a percentage of the original data frame, in this case df_movies. We set frac equal to 0.5, which means 50%, then we run this, and here we have the data frame: this is the sample, and we see that the number of rows is 42,000, which is half the number of rows in df_movies. Now I'm going to assign this data frame to df_sample; I run it, and I've created df_sample. Let's see the shape of this data frame: I write .shape, the shape attribute, and we see that it has 42,000 rows and five columns. Let's also see the shape of the df_movies data frame, so I copy and paste, and instead of df_sample I write df_movies. Now I print both values together, and I run, so we see
that we have half the number of rows of df_movies, and we have the same number of columns; actually these are the same columns, because we extracted a sample. If you check, df_movies has these five columns, title, year, genre, country, and df_sample has the same columns, and this is important because, as you might remember, when we want to concatenate vertically we need to have columns in common. These two data frames have columns in common, and that's why we can concatenate them vertically, so don't forget about that. Now let's concatenate these two data frames: I write pd.concat, open parentheses, create a list with square brackets, and write the name of the first data frame, df_movies, and then the second, df_sample. We want to concatenate vertically, so I add axis equal to zero, and remember that we can omit this axis parameter if we want, because it is set to zero by default. Now everything is ready, so we run this, and here we have the data frame that is the output of the concatenation. We can verify that we successfully performed this concatenation by getting the shape of this data frame, so I'm going to assign it a name: df_concat_vertically. I run this, and now let's check the shape: I write the name of the data frame followed by the shape attribute, and we have 128,000 rows, which is the sum of 85,000 rows plus 42,000 rows, the rows of data frame one and data frame two. With this we can verify that we successfully concatenated the two data frames, because by concatenating vertically we sum the number of rows of the two data frames while the number of columns remains the same. And that's it: in this video we learned how to concatenate vertically in pandas. All right, in this video I'm going to show you how to
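The sample-then-concatenate exercise can be sketched on a small stand-in frame (100 made-up rows instead of the real 85,000; `random_state` is added so the sketch is reproducible, which the video doesn't use):

```python
import pandas as pd

# Stand-in for df_movies with 100 rows and one column
df_movies = pd.DataFrame({"title": [f"movie {i}" for i in range(100)]})

# frac=0.5 draws a random 50% sample of the rows
df_sample = df_movies.sample(frac=0.5, random_state=0)

# Vertical concatenation: row counts add up, columns stay the same
df_concat_vertically = pd.concat([df_movies, df_sample], axis=0)
print(df_sample.shape, df_concat_vertically.shape)
```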
concatenate data frames horizontally, and we're going to use the concat method as we did in the previous video. First let's see these two data frames: here the second data frame, DF2, is different from the previous example. In this case DF2 has only one column, named job, and this column is different from the columns DF1 has. In the job column we have the professions of the people in data frame one, so we can say this data corresponds to the same people we have in data frame one, but split across two frames. As you can see, we have different columns, but the indexes are the same: both have the indexes 0, 1, 2, and 3, and this is a requirement for a horizontal concatenation; we need to have the same indexes to concatenate horizontally. In pandas, when you use the concat method, you need to specify that axis is equal to one: the axis parameter equal to one indicates that we want to concatenate horizontally, and if you check the pandas documentation you will see the phrase along the columns, which means the same thing, concatenating horizontally. Okay, now we'll see how we can concatenate these two data frames. Basically we need to put one data frame next to the other: DF1 on the left and DF2 on the right. If we concatenate these two data frames horizontally, we get a data frame with three columns, the two columns of DF1 and the one column of DF2, and four rows, the four rows in those two data frames. Let's see the output of this concatenation: we get this data frame with three columns and four rows, and we could concatenate because we have the indexes 0, 1, 2, and 3 in common. Now you see that we have the job information available in one data frame together with the ID and age columns. Now let's check the code that helps us create this data frame: if we want to concatenate
horizontally, we write pd.concat and, inside, the data frames inside square brackets, because this is a list, and then we specify axis equal to one, which means we want to concatenate horizontally. By default axis is equal to zero, which means concatenate vertically, but when we set it to one we tell pandas to concatenate horizontally, and that's everything you need to do. All right, enough talk, let's write code in Jupyter Notebook. I go to Jupyter Notebook, where I have the code that creates DF1 and DF2, so I run it and show you the data frames, which look exactly like the two data frames we've just seen. Now let's concatenate them: I write pd.concat, then parentheses, then the two data frames inside a list, which is important not to forget, so first df1, then df2, and then we specify axis equal to one, meaning concatenate horizontally. I run this, and we get a single data frame with the three columns: the ID, the age, and the job. Okay, now let's check some other parameters that might come in handy. Here you see ignore_index; in this case we don't need it, because we make a horizontal concatenation and we have the same indexes, which is actually the reason we can make this horizontal concatenation at all, so it's not useful here. A parameter that might come in handy is the sort parameter, which tells pandas to sort the non-concatenation axis when you set it equal to True, and that's it. Okay, now it's time to put everything we learned into practice by solving an exercise. We're going to use the data sets we read previously, the IMDb movies data set and the IMDb ratings data set. In previous videos we read these two data sets and stored them in two data frames, so here we
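The horizontal concatenation just shown can be sketched like this (the values mirror the slides and are made up):

```python
import pandas as pd

# Same four people split across two frames sharing the index 0..3
df1 = pd.DataFrame({"id": ["A", "B", "C", "D"], "age": [25, 30, 35, 40]})
df2 = pd.DataFrame({"job": ["doctor", "teacher", "chef", "pilot"]})

# axis=1 places df2's columns next to df1's, aligning rows by index
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```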
have the df_movies and df_ratings data frames. We're going to use these two data frames for this exercise, and remember that we selected five columns for df_movies and three columns for df_ratings. Okay, in this exercise your task is, first, to show the shape of the data frames that we'll concatenate, df_movies and df_ratings; then you have to concatenate these two data frames on the imdb title ID column, which is the column these two data frames have in common, and you have to concatenate them horizontally. After you concatenate them horizontally, you assign a name to the concatenation and then use the shape attribute to calculate the shape of the new data frame. Now it's your turn to solve this exercise on your own, and after you try, you can continue watching this video to see my solution. Okay, let's start solving this exercise. First we calculate the shape of the two data frames: I write df_movies, then .shape, which returns the number of rows and the number of columns of this data frame. Now I get the shape of df_ratings too, so instead of df_movies I write df_ratings, and I print both shapes with the print function. I run the cell, and here we have the output: we see that the two data frames have the exact same number of rows, but the number of columns is different; in the first data frame we have five columns, and in the second only three. All right, now let's concatenate the two data frames: we use pd.concat, parentheses, then we open the square brackets, which are really important not to forget, and we write the first data frame, which I copy and paste, and then the second, df_ratings. Then we have to specify axis equal to one, because this is a horizontal concatenation. Now everything looks right, but we need to make a little change, so
as I mentioned before in the theory, we need to have common indexes, and in this case we have a common column, the ID column, which we can set as an index so the two data frames share a common index. Let me show you: if I print df_movies, we see the ID column in this data frame, and if I print df_ratings, we see the same column there too. This is the only column these two data frames have in common, and we can use it as an index: if I use set_index, we can set this column as the index, so I write the name of the column and run, and as you can see the ID column is now my new index. We can do the same for the df_movies data frame, so I comment this out and set that column as the index there too. To actually update the data we have to use the inplace parameter and set it equal to True; this saves the changes made by the set_index method. We do the same for the second line of code, so I write inplace equal to True, and with this we set the ID column as the index of both data frames. I run, and if we check either data frame, we'll see that the index is the ID column, so perfect. Only now can we concatenate these two data frames, because the index of df_ratings is now the same index we have in df_movies, which is the requirement for concatenating horizontally; we satisfy that requirement, and now we can run this code. I run it, and here we have the concatenation: this data frame has the four columns of df_movies and the two columns of df_ratings, and the index is the ID column, so perfect. Now I'm going to give this data frame a name, df_concat_horizontally, and let's check its shape, so I write the name of the
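The set-the-shared-column-as-index trick above can be sketched on toy stand-ins for the two frames (two made-up movies; the column names follow the video and may differ from your files):

```python
import pandas as pd

# Toy stand-ins for df_movies and df_ratings sharing an ID column
df_movies = pd.DataFrame({
    "imdb_title_id": ["tt1", "tt2"],
    "title": ["Movie A", "Movie B"],
})
df_ratings = pd.DataFrame({
    "imdb_title_id": ["tt1", "tt2"],
    "mean_vote": [7.1, 6.4],
})

# Make the shared ID column the index of both frames so the
# horizontal concatenation can align the rows on it
df_movies.set_index("imdb_title_id", inplace=True)
df_ratings.set_index("imdb_title_id", inplace=True)

df_concat_horizontally = pd.concat([df_movies, df_ratings], axis=1)
print(df_concat_horizontally)
```

Note that setting the ID column as the index removes it from the columns, which is why the result has one fewer column per frame than you might expect.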
data frame followed by the shape attribute. We see that this data frame has 85,000 rows, which is the same number of rows that df_movies and df_ratings had, and that's fine, because when we concatenate horizontally we don't change the number of rows; in fact, the number of rows has to be the same, since we need indexes in common. In the columns, though, we see that df_concat_horizontally has six columns, while df_movies had five and df_ratings had three, so by concatenating the two data frames we got the sum of their columns. Now you might be wondering: 5 + 3 is 8, which is not equal to 6. But remember that we set the ID column as the index in both data frames, so we lost one column in each; that's why one has four columns and the other has two, and 4 + 2 is 6. With this we can verify that we successfully concatenated these two data frames, and that's it: in this video we learned how to concatenate two data frames horizontally. All right, in this video we're going to see how joins work in pandas, starting with the inner join. The inner join returns the matching values between two tables, in this case the matching data between two data frames. If we have data frame one on the left and data frame two on the right, the data they have in common is the data in the middle, the area with the white dots that you see in this image. Now let's check how DF1 and DF2 look; these are the data frames we'll use in this example. DF1 and DF2 have a column in common, the ID column; DF1 also has the age column and DF2 has the job column. What's important here is that the ID column has some values in common between DF1 and DF2: if you compare the two columns, the C and D values appear in both, while A and B belong only to DF1 and E and F belong only to DF2. So if we keep
that in mind and look again at the diagram from before, A and B are on the left, C and D are in the area with the white dots, and E and F are on the right; this represents the data inside the ID column of the two data frames. This is basically how the inner join works: the data that belongs only to DF1 goes to the left, the data in common goes in the middle, and the data that belongs only to DF2 goes to the right. Okay, this inner join is possible because we have the ID column in common between the two data frames, which allows us to make the join. To write the code that produces this inner join, we use the merge method: we write df1, which is the data frame on the left, then the merge method, .merge, and inside the parentheses the first argument is the second data frame, df2. Then we write the on parameter, where we specify the column both data frames have, in this case the ID column, and finally the how parameter, where we write the type of join we want to perform, in this case inner. When we make this inner join by writing this code with the merge method, we get the data frame you see now on screen: it has three columns, where age belongs to DF1 and job belongs to DF2, and we only see two rows, the row with C in the ID column and the row with D. This is the output after using the merge method. Now let's write some code in Jupyter Notebook: I have the code to create these two data frames, so I run it and show them to you; I write df1 and you see the first data frame, and df2 and you see the second. Now let's use the merge method to merge
To merge these two data frames we write df1.merge, then parentheses, and inside df2; then the on parameter, where we write the column the two data frames have in common, in this case ID; then the how parameter, in this case 'inner', because we want an inner join. That's everything we need, so we run it and get the same output I showed you in the slides: three columns, and the data that corresponds to the C and D IDs. Now let's verify the shapes of these three data frames. I'm going to give this result a name, df_inner_join, and now let's check the shapes: first df1.shape, then I copy and paste it twice for df2 and df_inner_join, and I print the three values with the print function. When I run this code, we see that the first two data frames have the same number of rows and columns, but the last one, the data frame that represents the inner join, has only two rows and three columns. This happens because the two data frames have two rows in common, the C and D rows, and the three columns are ID, age, and job. That's how you make an inner join with the merge method. Now let's do an exercise to put everything we've learned so far into practice. For the exercise we're going to use the df_movies and df_ratings data frames that we created in the previous videos. Here it is: you have to merge the df_movies data frame and the df_ratings data frame. Explore the data frames and find the column they have in common, and after that use the how parameter and set it equal to 'inner' to make an inner join. Try this yourself, and after that you
can continue watching this video to see my solution. Okay, to solve this exercise, first let's look at the two data frames so we can find a common column. Here is df_movies with its five columns, and now let's print df_ratings and see which column they share. The two data frames have the imdb_title_id column in common, so now I only have to use the merge method: open parentheses, write df_ratings inside, then in the on parameter write the name of the common column, imdb_title_id, and then write the how parameter and set it equal to 'inner'. But here's a little detail you should know: the default value of the how parameter is 'inner'. If you press Shift+Tab you can see in the little help window that for how the default is 'inner'; you can use this trick to find the default values of any parameter. This means we can omit the how parameter, because it's already 'inner' by default, so I delete it and run the cell. We can see that the result has the columns of df_movies (the first four columns) and the last two columns of df_ratings, and only the IDs that df_movies and df_ratings have in common are shown. That's how you make an inner join between df_movies and df_ratings. Now I want to show you a different way to use the merge method; this way is similar to the way we used concatenation, but with merge. The only thing you have to do is write pd.merge, then parentheses, and write the data frames you want to merge, df_movies and df_ratings. This looks like the syntax of pd.concat, but instead of pd.concat it's pd.merge. We also have to add the on parameter and set it equal to the ID column, so I set it, comment the previous version out, and run it.
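The two call styles (and the fact that how='inner' is the default) can be checked directly; the frames below are made-up stand-ins for df_movies and df_ratings:

```python
import pandas as pd

df_movies = pd.DataFrame({"imdb_title_id": ["tt1", "tt2", "tt3"],
                          "title": ["Movie A", "Movie B", "Movie C"]})
df_ratings = pd.DataFrame({"imdb_title_id": ["tt2", "tt3", "tt4"],
                           "avg_vote": [7.1, 8.3, 6.5]})

# Method form, function form, and the default how="inner" all agree
a = df_movies.merge(df_ratings, on="imdb_title_id", how="inner")
b = df_movies.merge(df_ratings, on="imdb_title_id")       # how omitted
c = pd.merge(df_movies, df_ratings, on="imdb_title_id")   # top-level form
print(a.equals(b) and b.equals(c))  # True
```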
We get the same result, and the IDs the two data frames have in common are in the imdb_title_id column. And that's it: in this video we learned how to make an inner join in pandas. All right, in this video I'm going to show you how full joins work. A full join lets us join all the elements between two tables, and you can see how it works in this diagram: the white dots cover all the space in both tables, and the whole covered area represents the full join. Now let's see an example to understand this better. Here are the two data frames we used in the previous video, and this time we're going to join them with a full join. Look at the ID columns of DF1 and DF2: C and D are the elements in common, A and B belong only to DF1, and E and F belong only to DF2, but all of the elements are taken by the full join, because a full join takes every element in both data frames. In the diagram you can see that A, B, C, D, E, and F are all covered. Now let's see how to write code that produces a full join. Again we need a column in common, which in this case is the ID column, and we use the merge method: we write df1.merge, inside the parentheses df2, then we set the on parameter equal to the common column, ID, and finally we set the how parameter equal to 'outer', because a full join is also known as an outer join (full join, outer join, and full outer join all mean the same thing). Now let's see the output of this full join, the data frame we get after running this code. The output will
be this: a data frame with all the data that we had in DF1 and DF2. As you can see, only the rows with IDs C and D have non-null values, while the rest have some null values. This happens because DF1 doesn't have the job column and DF2 doesn't have the age column, which produces null values when we make the full join; that's fine, it's just how the full join behaves with this kind of data frame. Now, enough talk; let's produce this full join with pandas. We go to Jupyter Notebook, and I show you the two data frames we'll work with, the same ones we've seen in the slides: first DF1, then DF2. To merge them we only have to write df1.merge, then parentheses, then df2, then on='id' because that's the common column, then how='outer'. This produces the data frame we wanted: the A and B IDs have null values, and so do E and F, while C and D don't, because those are the elements the two data frames have in common. That's how the full join works. Now, to understand all the concepts we've learned so far much better, we're going to solve an exercise. In this exercise you have to merge two data frames, df_movies and df_ratings, the two we've seen in previous videos: find the column they have in common and perform an outer join. Try to solve this exercise on your own, and then continue watching to see my solution. Okay, to solve this exercise, let's explore the df_movies and df_ratings data frames. Here I print df_movies, and we get the data frame we've seen in previous videos.
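The full (outer) join behavior described above, null values included, can be sketched like this (made-up stand-in data):

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A", "B", "C", "D"], "age": [25, 30, 35, 40]})
df2 = pd.DataFrame({"id": ["C", "D", "E", "F"],
                    "job": ["doctor", "statistician", "teacher", "pilot"]})

# Outer join: keep every id from both frames; non-matching sides get NaN
full = df1.merge(df2, on="id", how="outer")
print(full)
# A and B have no "job" (they exist only in df1); E and F have no "age"
print(full["job"].isna().sum(), full["age"].isna().sum())  # 2 2
```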
Now let's find which column they have in common: I print df_ratings, and as we can see, both data frames have the imdb_title_id column. I use the merge method first: .merge, open parentheses, write the second data frame, then the on parameter equal to imdb_title_id, and then the how parameter set equal to 'outer'. This is ready, so we run it and get the data frame that is the full join of df_movies and df_ratings: the first four columns belong to df_movies and the last two belong to df_ratings. This full join looks like the horizontal concatenation we learned in previous videos, and that's an alternative way to solve this type of exercise too; I chose the merge method here because we're learning full joins, but you could also solve it with the concat method. Okay, that's enough for full joins, and now we're going to see a type of join that is similar to the full join but has some differences. Now it's time to see exclusive full joins. An exclusive full join picks only the values that are exclusive to each table. Here I have two tables, and the values exclusive to table one and the values exclusive to table two are taken by the exclusive full join: only the area on the left that is exclusive to the first table (or data frame), and the area with the white dots on the right that is exclusive to the second table (or data frame). This is known as the exclusive area because it belongs to one table or the other; the area in the middle is the common area, and it is not considered in the exclusive full join because it belongs to both tables. Okay, now let's see an example using the same data frames we've been working with, and now let's see how these
elements will fit in the diagram. We see that A, B, E, and F were taken by the exclusive full join, because A and B belong only to DF1 and E and F belong only to DF2; C and D, however, belong to both DF1 and DF2, and that's why they weren't taken, since this is an exclusive full join and we only want exclusive values. Now let's see how to make this exclusive full join with code. First we need a column in common, in this case the ID column, and then we use the merge method: we write df1.merge, inside we write df2, and then the on and how parameters; here how is going to be 'outer', so it's like the outer join (also known as full join), but this time we add an extra parameter, indicator. The indicator helps us know which values are exclusive to DF1 and which are exclusive to DF2: after we add it, we get a new column in the output data frame, named _merge by default, with values like left_only, right_only, and both. The left_only values correspond only to the data frame on the left, in this case DF1, and the right_only values correspond only to the data frame on the right, DF2; these are the values exclusive to each data frame. Once we have this column, we use the query method to select only the values exclusive to DF1 and DF2, filtering out the values the two data frames have in common. All right, now let's see the data frame we get as output after running this merge: we only get the A, B, E, and F IDs, and C and D were filtered out. You also see some null values, because DF1 doesn't have the job column and
DF2 doesn't have the age column, which is why you see some null values here. All right, now let's see how to make this exclusive full join with pandas. I go to Jupyter Notebook and again use the DF1 and DF2 we've seen before, the same data frames from the slides and the previous videos. First I write df1.merge, then df2, then the on parameter equal to 'id' and the how parameter equal to 'outer'. This creates a full join, and that's fine, but we want an exclusive full join, so we have to add indicator=True: I copy and paste the code and add the indicator. With this we create a new column called _merge, and as I mentioned before, it has three kinds of values: left_only, both, and right_only. left_only marks the elements exclusive to DF1 and right_only the elements exclusive to DF2, while both marks the elements that DF1 and DF2 have in common. Great, now let's get only the left_only and right_only values from this _merge column using the query method. I copy and paste, write .query, and then press Enter so you can see the query I'm about to write more clearly. To select only the left_only and right_only values, we first open quotes and write the name of the column, _merge, then compare with a double equals sign and write left_only; then we do the same for right_only, so I copy the comparison and paste it, replacing left_only with right_only. Now we need to add quotes to specify that left_only and right_only are strings, so I add quotes here and here
again. Keep in mind that here I'm using single quotes, while the outer quotes we wrote before are double quotes; this avoids conflicts between quotes. Now I use the or operator to indicate that we want either left_only or right_only, so I write or, and that's how we select the left_only and right_only values from the _merge column. I run this code with Ctrl+Enter, and the output data frame shows only the rows whose _merge value is left_only or right_only. That's everything you have to do to make an exclusive full join. All right, now that you have a good idea of how exclusive full joins work, we're going to solve an exercise. In this exercise you have to merge df_movies and df_ratings, and this time the type of join is an exclusive full join. Explore df_movies and df_ratings (the data frames we've seen in previous videos), find the column they have in common, and then follow the same steps we used here: add the indicator, then the query, and make the exclusive full join. Try to solve this on your own, and then continue watching to see my solution. Okay, to solve this exercise, first let's look at the data frames. We've seen them before, so I have a good idea of the common column, but in case you don't remember, I'll print both: first df_movies, then df_ratings. As you can see, both data frames have the imdb_title_id column, and we're going to use it in the on parameter. I write .merge, then parentheses, put the second data frame inside, add the on parameter equal to imdb_title_id, and then add the how parameter equal to 'outer'. We press Ctrl+Enter.
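Putting the indicator and the query together, the exclusive full join can be sketched as (stand-in data again):

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A", "B", "C", "D"], "age": [25, 30, 35, 40]})
df2 = pd.DataFrame({"id": ["C", "D", "E", "F"],
                    "job": ["doctor", "statistician", "teacher", "pilot"]})

# indicator=True adds a _merge column: left_only / right_only / both
merged = df1.merge(df2, on="id", how="outer", indicator=True)

# Keep only the rows exclusive to one side; drop the rows in common
exclusive = merged.query('_merge == "left_only" or _merge == "right_only"')
print(exclusive)
```

Note the mixed quoting: the query string uses single quotes on the outside so the double quotes around the string values don't conflict, exactly as discussed above.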
With this we get a full join, but what we want right now is an exclusive full join, so I'm going to add the indicator: I copy and paste and add indicator=True. To see this better I press Enter here, and now it's much clearer, so I run the cell. We see the data frame, and a new column was created, the _merge column, where most of the values are set to both. Now I remember that these two data frames, df_movies and df_ratings, have the same IDs: all the values inside the imdb_title_id column are the same, which means all the values inside the _merge column are going to be equal to both. This will affect the exclusive full join, and I'm going to show you what happens, but first let's add the query. I minimize this data frame, paste the code, and add the query: I write .query, parentheses, and copy and paste the same query we created before, because it's the same query we'll use every time we want to make an exclusive full join. This is ready, so I press Ctrl+Enter, and we see that the resulting data frame is empty: we only see the column names, with no data. This happens because all the values in the _merge column were set to both, which means all the IDs are the same between df_movies and df_ratings; the two data frames have the same IDs, and neither has exclusive IDs, so we got an empty data frame. And that's fine, we didn't make any mistake; it happened only because the IDs between these two data frames were the same, but the code that we wrote
here is correct. And that's it for this video; I hope you successfully solved this exercise. All right, in this video we're going to see how the left join works. A left join lets us grab only the data that belongs to one data frame or table: the data that is exclusive to data frame one (table one), plus the data that data frame one has in common with data frame two (table two). Both of those areas are covered by the white dots in the diagram, and together they represent the left join. Now let's see an example, using the same data frames we've been using so far, DF1 and DF2, and let's see how the data inside the ID column fits into the diagram. We have A, B, C, D, E, and F, and only A, B, C, and D are covered by the white dots, because that data belongs to DF1: in its ID column we have the A, B, C, and D IDs. E and F weren't selected because those two IDs belong only to DF2. So the left join returns only the data that corresponds to the left table or left data frame, in this case DF1. All right, now let's see how to make this left join in pandas. As usual we need a column in common, in this case the ID column, which both DF1 and DF2 have, and we use the merge method to create the left join: we write df1.merge, then parentheses, then df2 (the second data frame), then we set the on parameter equal to the common column, ID, and then we set the how parameter equal to 'left', because we want a left join. After we do this, we get the following data frame, and as you can see, all the data that belongs to DF1, the IDs A, B, C, and D, will be in this
data frame. However, some data is missing: we see some null values, because DF1 doesn't have the job column, so that data simply doesn't exist. For the IDs C and D, though, we did get the job data, because it exists in DF2, so when we make the left join we get doctor and statistician. That's what happens in a left join. Enough theory; let's make this left join in Python. We go to Jupyter Notebook, use the same DF1 and DF2, and write the merge: df1.merge, then parentheses with df2 inside, then the on parameter equal to 'id', and the how parameter equal to 'left', because we want a left join. We press Ctrl+Enter and get the data frame: some null data, the DF2 data in the job column, and the IDs A, B, C, and D, because that data corresponds to DF1. That's how you make a left join. Now let's put everything we learned into practice by solving an exercise. In this exercise we're going to use the df_movies and df_ratings data frames and follow these steps: first we extract a 50% sample of the df_movies data frame, then we name this sample df_movies_sample and merge it with df_ratings; the type of join is going to be a left join. Once you have this data frame, get the shape of all the data frames involved in the exercise, compare the shapes, and draw some conclusions. Now it's your time to solve this exercise on your own.
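The left-join steps from this video can be sketched as (stand-in data):

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A", "B", "C", "D"], "age": [25, 30, 35, 40]})
df2 = pd.DataFrame({"id": ["C", "D", "E", "F"],
                    "job": ["doctor", "statistician", "teacher", "pilot"]})

# Left join: keep every row of df1; bring in "job" where the id matches
left = df1.merge(df2, on="id", how="left")
print(left)
# A and B get NaN in "job" (no match in df2); C and D get their jobs
```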
After you try it, you can continue watching this video to see my solution. All right, to start solving this exercise, I'm going to extract a 50% sample of the df_movies data frame using the sample method: I write df_movies, then .sample, then open parentheses. We learned this method earlier in the course, so you might remember the frac parameter, which lets us extract a percentage or fraction of the original data frame; we want 50%, so we set it equal to 0.5, and with this we get the sample. I name this data frame df_movies_sample and run it. Now let's merge the two data frames: first df_movies_sample.merge, then parentheses, then the second data frame, and then the on parameter. The column they have in common is the imdb_title_id, I believe; let me check: this data frame has it, and df_ratings should too, and yes, they have the imdb_title_id column in common, so I write it in the on parameter. As the last parameter we add how and set it equal to 'left', because we want a left join. I run this, and this is the data frame we get after the left join; I'm going to name it df_left. Now it's time to compare the shapes of these data frames: I write df_movies_sample (I'd better copy and paste), then .shape, and print it; then I copy and paste twice more for df_ratings and df_left. After I get the shapes, we compare the values. I run this, and we have the shapes: the first element is the number of rows and the second is the number of columns. The number of rows of df_movies_sample is around 42,000, and the number of rows of df_ratings is
85,000, while the number of rows of the data frame we got from the left join is also around 42,000. This happened because we made a left join, which means only the data that belongs to df_movies_sample remains, and the data that exists only in df_ratings is excluded: we're joining on the imdb_title_id column, so only the IDs that belong to df_movies_sample remain in df_left, which is why we got around 42,000 rows in the data frame that corresponds to the left join. The columns, on the other hand, were combined: four columns from the first data frame, two from the second, plus the imdb_title_id column the two data frames share, so 4 + 2 + 1, seven columns in the final df_left data frame. And that's everything you need to know about left joins. Now let's see a different type of join, the exclusive left join. This is a bit similar to the left join, but now we get only the values exclusive to the data frame on the left. In the diagram, the white dots cover only the area that is exclusive to the first table or data frame. Let's see an example with DF1 and DF2 and check how the IDs fit into the diagram: only the A and B IDs are taken by the exclusive left join, because those are the only IDs exclusive to DF1. In the data frame, A and B are the only values in DF1's ID column that you won't find in DF2; the rest of the IDs, C, D, E, and F, aren't exclusive to DF1, which is why those four IDs were excluded from the exclusive left join. Now let's see how to make this exclusive left join
with pandas. First, as usual, we need a column in common, in this case the ID column, and then we use the merge method again: we write df1.merge, then the second data frame, df2, then ID in the on parameter, and then we set the how parameter equal to 'outer'. As a rule of thumb, we always set how to 'outer' when we want to create an exclusive join, whether an exclusive left join, an exclusive full join, or an exclusive right join. After this we add the indicator parameter and set it equal to True; as you might remember, the indicator creates a new column called _merge, where we see which data is exclusive to DF1, which is exclusive to DF2, and which belongs to both. Once we have this column, we use the query method to get only the data exclusive to DF1. We'll see this in detail when we write the code in Jupyter Notebook, but for now let's look at the output data frame: after making this exclusive left join, the only data we get corresponds to the A and B IDs. In the job column we got null values, because DF1 doesn't have a job column, and as a result we get nulls. Enough theory; let's write the code in Jupyter Notebook and create this exclusive left join. We're going to merge DF1 and DF2: we write df1.merge, then df2, then the on parameter equal to 'id', and then the how parameter equal to 'outer'. With this we create a full join, which I'll show you here, but if we add the indicator parameter we can set this
indicator equal to True. I copy and paste, and with indicator=True we create the _merge column. Here we're going to pick only the left_only values, because those represent the values exclusive to DF1: left_only means the data exclusive to the data frame on the left, and we're going to exclude the both and right_only values, because that data isn't exclusive to DF1. Now let's create the query: I copy, paste, write .query, then parentheses, and press Enter so you can see the query more clearly. I open quotes, write _merge (the name of the column), then a double equals sign, then left_only in single quotes, because that's the data exclusive to DF1. With this our exclusive left join is ready, so I run it, and in the resulting data frame the _merge column contains only left_only values; that verifies the exclusive left join was performed successfully. That's everything you need to do to create an exclusive left join. Now we're going to put everything we learned into practice by solving an exercise. In this exercise you have to follow these steps: first, make a copy of the df_movies data frame we've seen in previous videos; then set the first 1,000 values of the imdb_title_id column to an ID I made up, tt1234567890 (it's just an ID I came up with). After you do that, merge the data frame you got from the copy, which I named df_movies_2, with df_ratings.
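End to end, the exclusive-left-join recipe just shown (outer merge + indicator + query) can be sketched as (stand-in data):

```python
import pandas as pd

df1 = pd.DataFrame({"id": ["A", "B", "C", "D"], "age": [25, 30, 35, 40]})
df2 = pd.DataFrame({"id": ["C", "D", "E", "F"],
                    "job": ["doctor", "statistician", "teacher", "pilot"]})

# Exclusive joins always start from an outer merge with indicator=True
merged = df1.merge(df2, on="id", how="outer", indicator=True)

# Keep only the rows exclusive to the left frame (df1): ids A and B
exclusive_left = merged.query('_merge == "left_only"')
print(exclusive_left)
```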
The type of join you're going to make here is an exclusive left join, and after this you have to find the shape of all the data frames involved in the exercise to compare them and draw some conclusions. All right, now it's your time to solve this exercise; after you try it, you can continue watching to see my solution. Okay, to solve this exercise, first I make a copy of the df_movies data frame: I write df_movies.copy(), using the copy method we learned in this course, and assign it to df_movies_2; the copy method makes an independent copy. We run this, and df_movies_2 is created. After this, to assign the ID I made up to the first 1,000 values, I'm going to use a for loop: I write for index in, and then loop through the indexes of the data frame, df_movies_2.index, then a colon, and press Enter. The index attribute returns all the indexes of the data frame, and we're looping through them. Now I write a condition: if index is less than 1,000, execute the following code, so that we change the values of the first 1,000 rows only. I press Enter, and here I write the code that changes the values of the ID column. I write the name of the data frame, then use the loc method: .loc, then open square brackets. The loc method needs two arguments: first the index label, which is the index variable from the loop, and then the name of the column, the imdb_title_id column. In case you don't remember the loc method, let me quickly remind you what it does.
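The loop with loc can be sketched like this (a tiny made-up frame; the cutoff of 3 rows stands in for the 1,000 rows in the exercise):

```python
import pandas as pd

df_movies_2 = pd.DataFrame(
    {"imdb_title_id": ["tt01", "tt02", "tt03", "tt04", "tt05"],
     "title": ["a", "b", "c", "d", "e"]})

# Loop version, as in the video: overwrite the id for the first N rows
N = 3
for index in df_movies_2.index:
    if index < N:
        df_movies_2.loc[index, "imdb_title_id"] = "tt1234567890"

# A vectorized alternative (label-based slice; the end label is inclusive):
# df_movies_2.loc[:N - 1, "imdb_title_id"] = "tt1234567890"
print(df_movies_2["imdb_title_id"].tolist())
```

The commented-out vectorized form does the same thing in one step, which is the more idiomatic pandas style for large frames.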
With loc we locate specific data in the data frame. For example, I write the name of the data frame, then .loc, and say we want the data at index zero: we press Ctrl+Enter and get all the data that corresponds to index zero. Now say we only want the data at index zero in the imdb_title_id column: I add the column name and run it, and we get that single ID. So now we're going to change the value of these IDs for the first 1,000 rows only: we set the loc expression equal to the ID I made up, tt1234567890. Now that this is ready, I run the code and print df_movies_2, and you can see that the first 1,000 rows of the imdb_title_id column have tt1234567890; we successfully changed the data inside the first 1,000 rows. Now I'm going to make the merge: first I write the name of the data frame, df_movies_2, then .merge, then parentheses, then the name of the second data frame, then the on parameter equal to the imdb_title_id column, and then the how parameter equal to 'outer'. With this we get a full join, but if we add indicator=True and also the query, we get the exclusive left join. I write .query and, inside the parentheses, the same code I wrote before: _merge == 'left_only'. Every time we want an exclusive left join we can use this query; I paste it here, and now it's ready, so we run the cell and see the result. We have the data frame, and in the _merge column we see only left_only values, so apparently everything is correct. Now I'm going to set a name for this data frame, and the name I'm going to give this data frame is
df_exclusive_left. I run this, and now let's find the shape of this data frame: first df_exclusive_left with the .shape attribute inside print(), then I copy and paste that twice more and write df_mov_2 and df_ratings. I run the cell, and we see that the first and second data frames each have around 85,000 rows, while df_exclusive_left only has 1,000 rows. This happens because those 1,000 rows are exclusive to df_mov_2 — they correspond to the ID I created, tt1234567890. I created that ID on purpose, because it didn't exist before, so those IDs were unique to this data frame, and that's why the exclusive left join returned only those 1,000 rows. It also means the other roughly 84,000 rows aren't exclusive to df_mov_2 — they also appear in the df_ratings data frame. And that's it: in this video we learned how to make a left join and an exclusive left join in pandas. Okay, in this video we're going to learn how the right join works. A right join lets us grab the data of the second data frame, or second table. Consider this diagram: say the data frame on the left is the first data frame, DF1, and the one on the right is the second data frame, DF2. A right join grabs only the data that corresponds to DF2 — the data exclusive to DF2 plus the data that DF2 and DF1 have in common. The right join is represented by the white dots in the diagram. Now let's see an example with the data frames from the previous videos: we have DF1 and DF2, and the ID column is the column these two data frames have in common. Let's see how the data inside the ID column fits into the diagram we had before.
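Stepping back to the exclusive left join exercise just solved, the whole walkthrough condenses into a short sketch. The course's actual df_mov and df_ratings CSVs aren't included here, so these are small stand-in frames, and the 1,000-row cutoff is shrunk to 2 to fit them:

```python
import pandas as pd

# Stand-in frames (the real course data isn't included here)
df_mov = pd.DataFrame({"imdb_title_id": [f"tt{i}" for i in range(5)],
                       "title": list("ABCDE")})
df_ratings = pd.DataFrame({"imdb_title_id": [f"tt{i}" for i in range(5)],
                           "rating": [7.0, 6.5, 8.1, 5.9, 7.7]})

df_mov_2 = df_mov.copy()                      # independent copy
for index in df_mov_2.index:
    if index < 2:                             # the course uses 1,000; 2 fits this toy frame
        df_mov_2.loc[index, "imdb_title_id"] = "tt1234567890"

merged = df_mov_2.merge(df_ratings, on="imdb_title_id",
                        how="outer", indicator=True)
df_exclusive_left = merged.query('_merge == "left_only"')

print(df_mov_2.shape, df_ratings.shape, df_exclusive_left.shape)
# → (5, 2) (5, 2) (2, 4)
```

Only the rows given the made-up ID survive the query, mirroring the 1,000-row result in the video.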
Here we have the IDs inside the ID column — A, B, C, D, E, and F — and we see that the right join only covers the IDs C, D, E, and F. This happens because that data corresponds to DF2: C, D, E, and F belong to DF2, while A and B don't. Only the data exclusive to DF2, plus the data DF1 and DF2 have in common, is taken by the right join, as you can see in the diagram. Okay, now let's see how to make the right join using pandas. First we identify the column in common — in this case the ID column — and then we use the merge method again: we write df1.merge, and inside the parentheses the second data frame, the common column in the on parameter, and the type of join in the how parameter; since this is a right join, we write how="right". After we write this, we get the following output: a data frame containing only the data for the IDs C, D, E, and F, because those IDs belong to DF2. Now let's implement this right join in pandas. We go to Jupyter Notebook, where I'm going to use the DF1 and DF2 data frames, and write the code: DF1.merge, then parentheses, DF2 — the second data frame — then the on parameter equal to "id" and the how parameter equal to "right". We run this code and get the same output we've seen before; as you can see, making a right join in pandas is as simple as that. Okay, now let's solve an exercise to challenge ourselves a bit more, following these steps: first, extract a 30% sample of the df_ratings data frame; then give that sample a name — I'm going to name it df_ratings_sample; and then merge df_movies with the df_ratings_sample we created in the first step.
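The basic right join just described can be reproduced with tiny frames. The transcript never shows DF1 and DF2's exact contents, so the columns here are assumptions chosen to match the diagram (A and B only in df1, E and F only in df2):

```python
import pandas as pd

# Assumed toy frames matching the diagram
df1 = pd.DataFrame({"id": ["A", "B", "C", "D"], "col1": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": ["C", "D", "E", "F"], "col2": [30, 40, 50, 60]})

result = df1.merge(df2, on="id", how="right")  # keep every key from df2
print(result)
# Rows for C, D, E, F only; col1 is NaN for E and F, which don't exist in df1
```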
The type of join to make here is a right join, and after that you have to calculate the shape of the data frames so we can see what's going on behind the right join. Okay, now it's your turn to solve this exercise on your own, and after you finish you can continue watching this video to see my solution. Okay, to solve this exercise, first I extract a 30% sample of the df_ratings data frame: I write df_ratings.sample and use the frac parameter, setting it equal to 0.3. Now I give this data frame a name — df_ratings_sample — and run, and we've created the sample. Next I merge df_mov with this new data frame: I write df_mov.merge, then the name of the second data frame — the one I just created — then the on parameter and the how parameter. In the on parameter we have to write the column these two data frames have in common; that column is the ID, which I'll show you using the columns attribute, and I paste it in here. The how parameter I set equal to "right", because we want a right join. This is ready, so I press Ctrl+Enter, and now we've made the right join: we have the data that belongs to df_ratings_sample, plus the columns that belong to df_movies. Now I name this data frame df_right — the result we got after making the right join — and run it. Then I calculate the shapes: first df_right.shape, then I paste it twice more with the names of the other data frames, df_movies and df_ratings_sample, and print all three with the print function. Looking at the shapes, we see that the number of rows in df_right and df_ratings_sample are the same.
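The sample-then-right-join exercise above can be sketched as follows. The frames are stand-ins for the course's CSVs, and random_state is pinned here purely for reproducibility (the video doesn't set it):

```python
import pandas as pd

# Stand-in frames with matching unique IDs
ids = [f"tt{i:03d}" for i in range(10)]
df_movies = pd.DataFrame({"imdb_title_id": ids, "title": list("ABCDEFGHIJ")})
df_ratings = pd.DataFrame({"imdb_title_id": ids, "rating": range(10)})

# Step 1: 30% sample (random_state is an assumption for repeatable output)
df_ratings_sample = df_ratings.sample(frac=0.3, random_state=0)

# Step 2: right join keeps exactly the sampled IDs
df_right = df_movies.merge(df_ratings_sample, on="imdb_title_id", how="right")

# Step 3: compare shapes — df_right has as many rows as the sample,
# since every sampled ID exists exactly once in df_movies
print(df_right.shape, df_movies.shape, df_ratings_sample.shape)
```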
This happens because we made a right join, which keeps only the IDs that belong to the second data frame — in this case df_ratings_sample. All the data corresponding to those IDs is available in df_right, while the IDs that belong only to the first data frame, df_movies, are excluded. That's why some of the 85,000 rows from df_movies were not considered in this right join. And that's everything you need to know about the right join — now let's see the next type of join. Okay, now it's time for the exclusive right join. This type of join is similar to the right join, but here we exclude the data that data frame one and data frame two have in common: the only data the exclusive right join grabs is the data that is exclusive to the right table, or second table, also known as the second data frame. That's the area covered by the white dots — it represents the exclusive right join. Now let's see an example with the same data frames we've seen so far, DF1 and DF2, and how the IDs fit into the diagram. Here we see that only the E and F IDs will be taken by this exclusive right join, because that's the data exclusive to DF2, the second data frame. Looking at the two data frames again, we can verify that the IDs E and F belong only to DF2, while the other IDs either belong only to DF1 or belong to both DF1 and DF2. Now let's see how to make this exclusive right join in pandas. First, as always, we find the column in common — in this case the ID column — and then we use the merge method: we write df1.merge, and inside the parentheses the second data frame, the common column in the on parameter, and in the how parameter we have to choose the outer join. As I told you before in another video, as a rule of thumb, every time we want to make an exclusive join we write
the word "outer" in the how parameter. After this, we add the indicator parameter and set it equal to True. This indicator creates a new column named _merge, and in that column we can see which data is exclusive to DF2, the second data frame. After that, we use the query method to pick only the data that is exclusive to DF2. All right, let's look at the output of this exclusive right join: it's the data corresponding to the IDs E and F, and the H column is empty because the second data frame, DF2, doesn't have that column — for that reason we get null values in the H column. Now let's implement this exclusive right join in pandas. We go to Jupyter Notebook and write the code, using DF1 and DF2 as in previous videos: DF1.merge, then the second data frame, DF2, then the on parameter equal to "id", then the how parameter equal to "outer". With this we make a full outer join, but we want an exclusive right join, so we add the indicator parameter and set it equal to True. The indicator creates the _merge column, which tells us which data is exclusive to the second data frame — that data is labeled right_only, meaning it belongs only to the right data frame, which in this case is DF2. Now let's create a query that picks only the right_only data inside the _merge column: I write .query, then parentheses, then quotes; I copy the column name to make it faster, then write _merge == "right_only". And that's everything we have to do.
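The outer-merge-plus-query recipe just described can be sketched with the same assumed toy frames as before (the transcript never shows DF1/DF2's columns, so these are stand-ins):

```python
import pandas as pd

# Toy frames matching the diagram: E and F exist only in df2
df1 = pd.DataFrame({"id": ["A", "B", "C", "D"], "g": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": ["C", "D", "E", "F"], "h": [3, 4, 5, 6]})

# Full outer join, tagging each row's origin in the _merge column
merged = df1.merge(df2, on="id", how="outer", indicator=True)

# Keep only rows that came exclusively from the right frame
df_exclusive_right = merged.query('_merge == "right_only"')
print(df_exclusive_right)
# Only the E and F rows remain; g is NaN because df1 has no such IDs
```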
Now we can run this cell, and we get only the data that is exclusive to the second data frame. Here you can see that this data is exclusive to DF2, so we've successfully created our exclusive right join. All right, now let's put everything we've learned into practice by solving an exercise. In this exercise you have to make a copy of the df_ratings data frame, and after you do that, you have to set the first 1,000 values of the ID column to the ID I created here — this is an ID I just came up with. After that, you have to merge df_mov and df_ratings_2 (the copy from the first step), and the type of join you have to make is the exclusive right join. Finally, you have to calculate the shape of all the data frames involved in this exercise, so we can draw some conclusions. That's everything you need to do, so now try to solve this on your own, and after you finish you can continue watching this video to see my solution. Okay, to solve this exercise I start by making a copy of the df_ratings data frame using the copy method: I write df_ratings.copy(), with parentheses, and with this we create an independent copy. I name this data frame df_ratings_2 and run, and now we have a copy of df_ratings. Then, to set the first 1,000 values of the ID column to the ID I created, I use a for loop: for index in df_ratings_2.index, so we loop through the indexes of the df_ratings_2 data frame, then a colon. Here I write a condition that limits the changes we're about to make to the first 1,000 values only: if index is less than 1,000, execute this line of code, and in that line we change the value of the ID column using the .loc
indexer. I write df_ratings_2.loc, and inside the square brackets I use the index and the column — the index comes from the index variable of the for loop, and the column is the ID column, imdb_title_id. Then I set it equal to the ID, which I copy and paste in. A quick recap in case you didn't follow what I did: first we loop through the indexes of df_ratings_2 — I'll show you the indexes so you understand better; they start at zero and end at 18,855 — and we loop through all of those values, but we limit the changes made with .loc to the first 1,000 indexes, so only the first 1,000 rows get the new ID value. Okay, let's run this code; the values are changed, and when I print df_ratings_2 you can see that the first 1,000 rows of the ID column have the value tt1234567890. Now I merge the two data frames, df_mov and df_ratings_2: I write df_mov.merge, then the second data frame, then the on parameter equal to the column in common — the ID column, as you might remember from previous videos — then the how parameter equal to "outer", because we want to make an exclusive join, and then indicator=True. Next I make the query to select only the data that is exclusive to the second data frame, reusing the code we previously created: the condition _merge == "right_only" lets me pick only the data exclusive to the second data frame, the one on the right. I copy it and paste it inside the query, and now everything is ready. We run the code, the two data frames are merged, and we see some null values — but that's fine, because the second data frame doesn't have the title, year, genre, and country columns.
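As a side note on the loop-plus-.loc idiom used twice in these exercises: the row-by-row loop works, but pandas can write the same values with a single label slice, which is both shorter and faster. This is an alternative not shown in the video, sketched on a small stand-in frame; note that .loc label slices are end-inclusive, a common gotcha:

```python
import pandas as pd

# Small stand-in frame with a default RangeIndex, as in the video
df = pd.DataFrame({"imdb_title_id": [f"tt{i}" for i in range(5)],
                   "rating": [7.0, 6.5, 8.1, 5.9, 7.7]})

# One-line equivalent of the "first N rows" for-loop update.
# .loc label slices are END-INCLUSIVE, so :1 covers indexes 0 AND 1
# (the course's first-1,000-rows version would be df.loc[:999, ...]).
df.loc[:1, "imdb_title_id"] = "tt1234567890"

print((df["imdb_title_id"] == "tt1234567890").sum())  # → 2
```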
So those null values are expected. Now let's verify that we successfully made this exclusive right join by analyzing the shapes of the data frames. I'm going to give this exclusive right join a name — df_exclusive_join — and run again, and now we have this data frame. Let's examine the shape: I write df_exclusive_join.shape, paste it twice more, and write the names of the other data frames, df_ratings_2 and df_movies. I put this one first, wrap everything in print so we can see the shapes of the three data frames, and run the code. We see that the data frame we got after making the exclusive right join only has 1,000 rows, and this happens because only those 1,000 rows are exclusive to df_ratings_2, our second data frame — the data frame on the right. That data is exclusive to the second data frame, and for that reason we only got 1,000 rows. And that's it: in this video we learned how to make the right join and the exclusive right join in pandas. All right, in this video we're going to learn some metacharacters used in regular expressions. In regular expressions, metacharacters are characters with a special meaning, like \d, \w, and \s, and in this video we're going to see the meaning of these characters, along with some examples, to explain how regular expressions work in Python. On the left I have a website called regex101, which lets me test regular expressions, and a text we're going to use as an example; on the right I have a list of metacharacters with their meanings. First I'll explain the meaning of each metacharacter, and afterwards you can refresh your memory by looking at the right side. After we learn all these special characters, we'll go back to Jupyter Notebook and test
some regular expressions using the Jupyter Notebook text editor. Okay, let's start with the first metacharacter, \d (backslash d). This metacharacter matches digits — the numbers from 0 to 9. Let's test it: all I have to do is write a backslash followed by d, and on the left the numbers are highlighted in blue, which shows that the regular expression \d matches all the digits from 0 to 9. Now, \d, like most metacharacters, has a negation, and the negation matches the opposite of the metacharacter. The negation of \d is \D — backslash followed by an uppercase D. If we write \D, we match "not digits": every character in the text that is not a digit. Let's try it: I write backslash and an uppercase D, and you see that everything that is not a digit is matched, including whitespace and even this hyphen — anything that is not a digit matches \D. Okay, now the next metacharacter: \w, which means "word character". This includes the letters from a to z, the letters from A to Z, the digits from 0 to 9, and also the underscore — keep that in mind; I used to forget about the underscore, but it is included as a word character. Now let's try it out: I write backslash and then w, and as you can see, all the word characters are matched. The ones that weren't matched aren't word characters: for example, the period is not a word character, the exclamation marks are not, and neither are the hyphens. The rest matches because we have digits
and also letters, both uppercase and lowercase. To get the negation of \w we use \W — backslash followed by an uppercase W — which matches anything that is not a word character, as you can see on the right. Let's write \W: we see that the spaces are matched — a space isn't a word character — along with the period and the exclamation marks, as we've seen before. Keep in mind that a blank space counts as a character, and it's not a word character; people sometimes forget the blank space is there, but it is, and it's not a word character. Now it's time for \s, which means whitespace. This includes a space, a tab, and a newline — in case you're not sure what I mean, a space is what you get from the space bar, a tab from the Tab key, and a newline is what you get when you press Enter. It's as simple as that. Let's try it out: I write backslash and then the letter s, and now all the whitespace is matched — the space between the words "hello" and "world", the newline after this period, this long run of blank spaces, the newline here, and so on. Just keep in mind that the last line doesn't have a newline after it, because there is no next line; but if I press Enter here, we automatically get a blue highlight, because we created a newline — and if I press the Delete key, that newline disappears and isn't matched anymore. Okay, the negation of \s is \S — uppercase — meaning "not whitespace", the opposite of whitespace, as simple as that. So now I write \S.
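The character classes covered so far can also be checked directly in Python with the standard re module, on small made-up strings:

```python
import re

# \d digits, \D non-digits, \w word characters, \W non-word, \s whitespace
assert re.findall(r"\d", "abc 123") == ["1", "2", "3"]
assert re.findall(r"\D", "a1 -") == ["a", " ", "-"]
assert re.findall(r"\w", "a_1 !") == ["a", "_", "1"]   # underscore counts as a word character
assert re.findall(r"\W", "a b!") == [" ", "!"]
assert re.findall(r"\s", "a\tb\nc") == ["\t", "\n"]    # space, tab, and newline are whitespace
assert re.findall(r"\S", "a b") == ["a", "b"]          # everything except whitespace
print("all character-class checks passed")
```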
As you might expect, we match everything that is not a space, a newline, or a tab — that's what we get. Okay, now let's see another metacharacter: the dot. The dot sign matches any character except the newline, so when we write a dot we match any character that is not a newline. Let's test it: I write the dot sign, and we match all the characters except the newlines. You can see we match this space, the literal dot, and the exclamation marks, but not the newline that is supposed to come after the period, nor the newlines at the end of each sentence. Okay, now, there's no such thing as a negation of the dot, but if you want to treat the dot — or any special metacharacter — as a literal character, you can use a backslash to escape it. Let's try it with the dot: the dot is a metacharacter, a special character in regular expressions, but if we add a backslash in front of it, it no longer matches any character except newline. The backslash escapes the metacharacter, turning it into a regular character, so the only thing that matches is the literal dot sign in the text — everything else no longer matches, because now it's just an ordinary dot. And that's how the backslash works in regular expressions. Okay, now let's see some special symbols. We have two special symbols that help us match elements at the beginning or at the end of a string. First, the caret symbol — the one you see on the right — which matches the beginning of a
string. Let's try it out and see how it works. Say we want to match the word "hello", this one here: we only have to write hello, and we match the hello in "hello world". But what happens if we have two hellos — "hello world" and then another "hello"? If we only want to match the first hello, we have to use the caret symbol to match only the beginning of the string. If I write a caret before hello, the only element that matches is the first hello, because that's where the string begins; since the second hello isn't at the beginning, it isn't considered by the caret symbol. But what if we want to match the second hello instead? In that case we can use the dollar symbol, which matches the end of a string. If we delete the caret and write a dollar after hello, nothing matches at first, because this first sentence doesn't end in "hello" — it ends in "hello" followed by a period — so we add the period, and now we match the "hello." at the end of the string, thanks to the dollar sign. Okay, now let's delete the second hello so the sentence is back to how it was at the beginning. I want to show you that we can match the words hello and world just by writing them: if we write hello, we match hello in the test string, and if we write world, we match world. That obviously makes sense, but the power of the caret and the dollar sign comes when you have repeated values in the string and you only want to match the one at the beginning or at the end, not the ones in the middle. For example, say you want to match only the last exclamation mark in this second sentence: if we write
the exclamation mark followed by the dollar sign, we get only the last exclamation mark and not the others. So whenever you want only the last element, you can use the dollar symbol — and the caret symbol when you want the first. Something important I want to mention: regular expressions can change their behavior slightly based on regex flags. Regex flags let us modify how a regular expression behaves, and on this website you can see them by clicking the regex options button. Here we have the multiline flag activated, which makes the caret and dollar symbols match the start and end of each line. That means each line is treated like a separate text: this is one text and this is another, so here we have one beginning and one end of a string, in the second line another beginning and another end, and so on down to the last line, which also has its own beginning and end. In total we have six beginnings and six ends of a string, because we have six lines — and that happens because multiline is turned on. If we turn it off, we only have one beginning and one end of the string: the beginning here at "hello" and the end here at "456". I believe this is how Python behaves by default — we'll see that later when we go to Jupyter Notebook — but keep in mind that you can control these regex flags in Python. And that's it: in this video we learned some metacharacters that we can use in regular expressions.
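The anchor-and-flag behavior just described looks like this from Python, using a made-up two-line string; without re.MULTILINE the anchors apply to the whole string, with it they apply per line:

```python
import re

text = "hello world\nhello again"

# Without re.MULTILINE, ^ and $ anchor to the whole string
assert re.findall(r"^hello", text) == ["hello"]
assert re.findall(r"again$", text) == ["again"]

# With re.MULTILINE, ^ and $ anchor to every line
assert re.findall(r"^hello", text, flags=re.MULTILINE) == ["hello", "hello"]
assert re.findall(r"world$", text, flags=re.MULTILINE) == ["world"]
print("anchor checks passed")
```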
All right, in this video we're going to see some quantifiers used in regular expressions. Some of the most common quantifiers are the asterisk, the plus sign, and the question mark. Let's try them out, starting with the asterisk. Say we want to match the exclamation marks after the word python: here we have the word python followed by five exclamation marks. First I write python and then one exclamation mark, and we match python with a single exclamation mark, since those are exactly the same characters. Now, the asterisk matches zero or more occurrences, and it applies to the closest preceding character — in this case the exclamation mark. If we write python!*, we match python and all five exclamation marks, because the asterisk matches the exclamation mark as many times as it appears in the text. Next, let's see how the plus sign works: it matches one or more occurrences, and I'd expect it to behave similarly to the asterisk here, so let's find out. I write python!+ and it does behave the same way — we get python and the five exclamation marks, because the plus sign applies to the exclamation mark and captures one or more of them, and all five are matched. Now let's see how the question mark works: it matches zero or one occurrence, so if we write python!? we get only one exclamation mark — the quantifier says zero or one, and since there is at least one exclamation mark, we match exactly one.
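The three quantifiers just demonstrated can be checked in Python on the same "python!!!!!" example:

```python
import re

text = "python!!!!!"   # five exclamation marks

assert re.search(r"python!*", text).group() == "python!!!!!"  # * : zero or more (greedy)
assert re.search(r"python!+", text).group() == "python!!!!!"  # + : one or more (greedy)
assert re.search(r"python!?", text).group() == "python!"      # ? : zero or one
print("quantifier checks passed")
```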
That's why we got only python and the first exclamation mark. Now you might be wondering what "greedy" and "lazy" mean, because I wrote "greedy" in parentheses for the asterisk and the plus sign, and "lazy" for the question mark. Okay, first: greedy means "match the longest possible string" — match as many characters as you can. For example, the asterisk is greedy because it can match zero or more characters, and if we write it again we see it matches the maximum number of characters, the longest possible string — python with all five exclamation marks. The opposite is the question mark: if I write the question mark, it matches the shortest possible string that fulfills the requirement, and in this case one exclamation mark is the minimum that satisfies my regular expression. The plus sign is also greedy — it behaves the same way the asterisk does, matching as many characters as it can. Okay, to understand this much better, let's look at a different example using the dot sign, which, as you might remember, matches any character except newlines. If I write the word hello, the word hello in the string is matched — but that's not what I want. What I want is to match only the text between an h and an l. To do that, I write the h and the l, and for everything in the middle I can use the dot together with the plus sign: the plus matches one or more occurrences, so .+ means one or more of any character except newline. If I write h.+l, we see that everything between the h and the final l is matched, and this is kind of unexpected
because here we matched everything from the h all the way to the l that belongs to the word "world", when I only wanted the text between the h and the l of the word "hello". We got this behavior because the plus sign is greedy: it matches as many characters as it can, so we got the longest string — from the h to the last l. The regular expression we wrote became a greedy regular expression, because it matched the longest possible string. If we want the shortest possible string instead, we have to convert this greedy expression into a lazy expression, and to do that we only have to add a question mark: the question mark converts any greedy regular expression into a lazy one. Let's try it — I add a question mark after the plus sign, writing h.+?l, and now we get the shortest possible string: not from this h to the last l, and not even to the second l of "hello", but only the characters between the first h and the first l. That's because the question mark here acts as a lazy operator, and it converted the greedy regular expression into a lazy one. Anytime you want the shortest possible string, just add a lazy operator like this question mark. Okay, now that everything is clear about greedy and lazy matches, let's continue with the quantifiers. The next quantifier is curly braces with a number inside, which means "exact number": instead of zero or more, one or more, or zero or one, we can now specify the exact number of characters we want to match. On the left, I write python with the exclamation marks again.
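Before moving on, the greedy-versus-lazy behavior just demonstrated can be reproduced in Python with the "hello world" example:

```python
import re

text = "hello world"

# Greedy: .+ runs as far as it can, so the match reaches the LAST l
assert re.search(r"h.+l", text).group() == "hello worl"

# Lazy: adding ? after + stops at the FIRST l
assert re.search(r"h.+?l", text).group() == "hel"
print("greedy vs lazy checks passed")
```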
Now, instead of the asterisk, the plus sign, or the question mark, I write curly braces. Say we want to match three exclamation marks: we only have to write the number 3 inside, and we get python with three exclamation marks. That's how you define the exact number of matches you want for a specific character — you can change it to 2 and get two exclamation marks, or 5 and get five. You set the number you want, and the curly braces match exactly the number of characters you indicate inside them. Okay, now, if you want to match n or more characters — for example three or more — you only have to add a comma. Let's go back to the number 3: we match python and three exclamation marks, but say we want more than three, without being sure how many — we only know it's three or more. In that case we write 3 followed by a comma, and voilà, we get all five exclamation marks, because we specified that we want at least three. This is also greedy, so it matches as many characters as it can. With the next quantifier you can specify a range of numbers: the first element is the minimum number of characters you wish to match and the second is the maximum. Say we want to match between three and four characters: we write 3 comma 4, and we get four exclamation marks, because it always tries to match as many characters as it can. And that's it — here we learned how to use quantifiers in regular expressions. Okay, now let's see some other metacharacters we often use in Python. First we have parentheses, which represent groups: to capture a specific group of characters, we only have to use parentheses.
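The curly-brace quantifiers above can also be verified in Python on the same test string:

```python
import re

text = "python!!!!!"   # five exclamation marks

assert re.search(r"python!{3}", text).group() == "python!!!"     # exactly 3
assert re.search(r"python!{3,}", text).group() == "python!!!!!"  # 3 or more (greedy)
assert re.search(r"python!{3,4}", text).group() == "python!!!!"  # between 3 and 4 (takes the max)
print("curly-brace checks passed")
```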
Now I'm going to show you an example so you understand much better how this works. For this example, let's create a regular expression that matches these numbers: nine digits separated by hyphens, one hyphen after every three digits, and we're going to focus on these three sentences. Since these are numbers, we have to match digits, and the metacharacter that lets us match digits is \d. So I write \d, and now we matched all the digits. That's fine, but what if we want to get only the first group of three digits? In that case we need a quantifier, the plus sign. As you might remember, plus matches one or more of the preceding character, in this case one or more digits, which is why we get roughly the same result. But if we add the hyphen, the pattern behaves differently: now we only match up to the first hyphen. If we write \d+ then a hyphen, then \d+ and a hyphen again, and finally \d+, each \d+ represents one group of three digits separated by hyphens. We built this because we want to capture only the first three digits, and to do that we add parentheses around the first \d+. Parentheses capture a group: with an opening and closing parenthesis there, the first three digits appear in green while the rest are in blue. The green means we captured that group, and it's the first group we captured. Let's capture the second group of digits, the one between the hyphens: I add parentheses around the second \d+, and the second capture group shows in yellow; adding parentheses to the third group shows it in purple. So we have three groups, and we captured them by adding parentheses. In the match box it says match two, but the important thing is that below it says group one: all the matches in green are group one because of the parentheses, the yellow ones are group two, and the purple ones are group three. That's how the parentheses metacharacter works, and it's extremely useful when you want to extract specific information from text data. Say you have a data frame with a column of dates: if you match the whole date but capture only the year, you can extract those years and build a new column from them. Okay, now it's time to see the square brackets metacharacter, which matches any of the characters written inside the brackets. Unlike parentheses, square brackets don't capture an expression; they only match what's inside them. For example, we can write square brackets with the numbers 7, 8 and 9 inside, and every 7, 8 and 9 is matched: here a 7, here an 8, here a 9. This doesn't mean we match 789 together; it means we match anything inside the brackets, like 7 or 8 or 9. Instead of writing 789 we can write 7-9 and get the same result: the hyphen means a range from 7 to 9, so anything between 7 and 9 is included. If we write 5-9, we match 5, 6, 7, 8 and 9. The same goes for letters: if we write a-z, we match all the lowercase letters between a and z.
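A minimal Python sketch of the grouping and bracket behaviour described above (the sample string is hypothetical):

```python
import re

text = "987-654-321"

# each \d+ matches a run of digits; parentheses capture groups
m = re.search(r"(\d+)-(\d+)-(\d+)", text)
print(m.group(0))   # '987-654-321' — the whole match
print(m.group(1))   # '987' — the first capture group
print(m.groups())   # ('987', '654', '321')

# square brackets match any single character listed inside them
print(re.findall(r"[7-9]", text))  # ['9', '8', '7']
```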
In this string you see that all the lowercase letters between a and z were matched, but those in uppercase were not. If you want to match uppercase letters too, you add A-Z inside the brackets, which says we want both lowercase and uppercase letters. That's how square brackets work. Now, if you add a caret at the beginning of the square brackets, it inverts their behavior: a caret at the beginning means match characters not in the brackets. The syntax is an opening bracket, then the caret, then whatever you want to exclude, then the closing bracket. If we add the caret here, it means match anything that is not between a and z in lowercase or uppercase, and as you can see, the spaces, the dot, the exclamation mark and the digits were matched. If we write a caret followed by 0-9, we match anything but digits, and in the result every character in blue is a non-digit. As a final note, keep in mind that expressions inside square brackets can get really long, and sometimes people get scared because they're hard to read. But it doesn't matter how long a bracket expression gets: you can split its contents into groups and think of each group as separated by an or condition. For example, if I write 0-9, then a-z, then A-Z, we can read these ranges with or conditions in between: from 0 to 9, or from a to z, or from A to Z. You can keep adding characters, and however long it gets, you can split it up, put an or in the middle, and understand anything inside square brackets that way. Speaking of or conditions, let's review how the or condition works in regular expressions. I delete this, and to write an or condition we use the pipe sign: it's the equivalent of or in Python and works the same way. Let's see it in action. We want to match a string that has 9 or 8 followed by two digits. We write 9, then the pipe, then 8, and with this we match both 9 and 8. Now, to require two digits after the 9 or 8, we add \d followed by curly braces with the number two. But we also have to group the 9 or 8, so that only those two digits are evaluated by the or condition; right now the pipe is comparing 9 against everything to its right, the 8 plus the \d, and we don't want that. So we add parentheses around 9|8, and now we get the correct match: a string with 9 or 8 followed by two digits. The text highlighted in blue is the match, and the text highlighted in green is the capture group, captured because we used parentheses. If you don't want to capture a group, there's an alternative: instead of parentheses you can use square brackets. If I write [98], this still matches 9 or 8 followed by two digits, but without capturing the 9 or the 8; you can see the text is highlighted in blue and nothing is in green, because there's no capture group.
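The alternation and its square-bracket alternative, sketched in Python (hypothetical sample string; note that `findall` returns the captured group when the pattern contains one):

```python
import re

text = "Hello World! 987 and 854"

# (9|8) groups the alternation so only 9-vs-8 is evaluated,
# and the parentheses also capture the digit — so findall
# returns the captured group, not the whole match
print(re.findall(r"(9|8)\d{2}", text))   # ['9', '8']

# [98] matches 9 or 8 without capturing anything,
# so findall returns the whole match
print(re.findall(r"[98]\d{2}", text))    # ['987', '854']
```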
As a side note, I want to remind you that it's not necessary to write the or operator inside square brackets, because each character inside square brackets is already evaluated with an implicit or. If we delete the pipe it looks like we wrote 98, but in reality it still means 9 or 8. Okay, now let's see what \b means in regular expressions. The \b symbol is another metacharacter: it matches the position between a word character (\w) and a non-word character (\W). This is called a word boundary. Let's see how it works: I write \b, and multiple positions are matched in purple; all of them are word boundaries. Among those purple marks there are three kinds of positions that qualify as word boundaries. The first is the position before the first character in the string, like the one before the h in hello. The second is the position after the last character in the string, if that character is a word character. And the third is the position between two characters where one is a word character and the other is not. A good example is the hyphen between 987 and 654: the 7 and the 6 are digits, so word characters, but the hyphen is not a word character, and as a result we get the two word boundaries shown in purple around it. As you might expect, \b has a negation, \B with an uppercase B, which matches every position that is not a word boundary: if I write \B, we get all the positions that are not word boundaries. Okay, before finishing this video, let's see how backreferences work in regular expressions. A backreference lets us reuse a capture group we already defined.
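The word-boundary behaviour described above can be checked in Python (hypothetical sample string):

```python
import re

text = "cat catalog concat"

# \b anchors the pattern at word boundaries,
# so only the standalone word 'cat' matches
print(re.findall(r"\bcat\b", text))  # ['cat']

# without boundaries, every occurrence of the substring matches
print(re.findall(r"cat", text))      # ['cat', 'cat', 'cat']
```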
To understand this much better, let's capture this pattern: 1 2 3 followed by a hyphen. I write 123 and a hyphen, and you see that we match every 123-: here's the first match, the second, and the third. But we only want to match a repeated 123-, meaning 123-123- should be a single match, and the 123- that appears alone should not match, because it isn't repeated. To do that, first we capture the 123- by putting it in parentheses, but that alone isn't enough: we also have to add the backreference \1, which says repeat whatever group one captured. As a result, this 123- and the next 123- together form a single match. What we did is repeat everything inside the parentheses; it's like copying and pasting 123- in place of the \1, and indeed if I do that I get the same result. Sometimes, though, you don't want to write the same expression twice, or you want a shortcut for a long capture group, and in that case you just use \1 to reference whatever is inside the parentheses. Now, if we add another \1, nothing is matched, because we don't have a sequence of 123- three times, only twice. But if I create a new line with 123- repeated three times, we get a match, and all of it is a single match. And that's how backreferences work in regular expressions. If you have more than one capture group, you can specify which group you want to reference by writing its position. Say you have two capture groups instead of one: 123- as the first and 456- as the second. If we want to reference the second capture group, the 456-, we write \2, while \1 still represents the 123-. You only have to consider the position of the capture group in the regular expression. Okay, that's it for this video; in the next one we'll go through examples and exercises so we understand the metacharacters much better. All right, in this video we're going to use all the metacharacters we've learned so far, but now in Python code. To use regular expressions in Python we have to import the re module, a built-in Python package for working with regular expressions. First we write import re and run it, and the module is imported. Now we're going to see two common methods used to match regular expressions. The first one is the search method. Here I have this text variable with the same text we used in the previous videos, the same hello world, I love Python, and all the same content. Let's say we want to match only the first digit in this text; we can use the search method for that. We write re.search, open parentheses, and first we need to pass the regular expression that matches the digits.
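A minimal sketch of backreferences in Python (the sample strings are hypothetical):

```python
import re

text = "123-123- and then 123- alone"

# \1 repeats whatever group 1 captured,
# so only the repeated '123-' is matched
m = re.search(r"(123-)\1", text)
print(m.group(0))   # '123-123-'

# a single '123-' does not satisfy (123-)\1
print(re.findall(r"(123-)\1", "123- 456-"))  # []
```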
As you might remember, to match digits we write \d, which matches a digit in a string. To indicate that this is a regular expression and not ordinary text, we open quotes and write the letter r before them; this r before the quotes marks the string as a regular expression. The second argument is the text we want to evaluate, so we pass the text variable. Let's review the method: we write re.search, the first argument is the regular expression, marked with the r before the quotes, and the second argument is the text to evaluate. I run the text variable first, then run this, and we get an re.Match object with match equal to 9. This means we got exactly one match, the first number 9, because the search method only finds the first match in the whole text: even if there are more digits, it returns only the first one that appears. If we check, hello world has no digits, I love Python neither, but the third sentence has many digits starting with 9, so only the 9 is returned and the rest is ignored. Great. Now let's see how flags work in Python code. On the regex website we used before, the global and multiline flags were turned on by default; in Python we control flags with the flags parameter of the search method. Say we want to match the word hello, the capitalized one in the text, but written in lowercase in our pattern. We write re.search, open parentheses, write the r and the quotes to indicate a regular expression, and write hello in lowercase as the first argument. The second argument is the text variable, and as the next argument we pass the flags parameter: we write flags equals, then re followed by the flag we want to turn on. To ignore the case of the text, we write re.I, where I stands for insensitive. If I run it, the match is Hello: even though that Hello has a capital H, we got the match because we turned on the case-insensitive flag. If we hadn't included the flags parameter and only passed the two arguments, we wouldn't get any match, because hello in the pattern is lowercase and Hello in the text is capitalized. With the flags parameter we got a match because we ignored the case. By the way, you can see the complete list of flags on the website we checked before; that's how I knew I means insensitive, because the I is highlighted there, and if we wanted to turn on the multiline flag we would write M, since the M is highlighted. You can check all the flags in that list. Okay, now let's see how the findall method works. Unlike the search method, the findall method returns all the matches. Let's find all the digits inside this text: I copy the re.search line, and instead of re.search I write re.findall. The regular expression is going to be the same, and so is
the text variable. Now if we run this, we see that we get all the digits inside the text variable: 9, 8, 7, 6, 5, 4 and so on. These are exactly the numbers you see in the text, listed in the order they appear: they start with 9 and 8 and end with 5 and 6, so it's correct, and we got all the matches using the findall method. And that's it: in this video we learned how the search and findall methods work. Okay, now it's time to put everything we've learned into practice by solving three exercises. Here are the tasks. First, match only the punctuation in the text variable. Second, match the right date format, which in this case is month, day and year. Third, match the right username format, which here means four to fourteen characters, only letters or digits. For these exercises I recommend using the findall method together with the metacharacters we learned in the previous videos. Try to solve them yourself, and afterwards check my solution. Okay, let's start with the first exercise: we have to match only punctuation, which means the dot, the exclamation marks, the hyphen and so on. First we write re.findall and open parentheses; I already imported the re module, so you should do the same. The first thing to do is build the regular expression that matches only punctuation, and there are many valid approaches here, none clearly better than the others, though some are shorter. What I'm going to use is a simple approach.
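The search, flags and findall behaviour described above can be sketched like this (hypothetical sample text):

```python
import re

text = "Hello World. I love Python. 987-654-321 55 56"

# search returns only the first match in the text
print(re.search(r"\d", text))                  # match on the first digit, '9'

# flags=re.I makes the pattern case-insensitive
print(re.search(r"hello", text, flags=re.I))   # matches 'Hello'
print(re.search(r"hello", text))               # None — case matters without the flag

# findall returns every match, in order of appearance
print(re.findall(r"\d", text))
```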
This approach consists in using square brackets. I write square brackets, which match any character inside them, and then add the caret, so we match any character that is not inside the brackets. Inside these square brackets I'm going to put metacharacters that do not represent punctuation. For example, \w represents word characters: as we've seen, that's a-z in lowercase, A-Z in uppercase, the digits 0-9 and the underscore, and none of them are punctuation, so that covers a good number of non-punctuation characters. Another character that is not punctuation and is not included in \w is \s, which means whitespace. So we have word characters and whitespace, and if we put \w and \s inside square brackets preceded by the caret, it means match any character that is neither a word character nor a space, and what remains is only punctuation; if you think about it, anything that is not a word character and not a space is punctuation. Let's test it: I copy this regular expression, write the r, open quotes and paste it inside. The r marks it as a regular expression, and what's inside says match anything that is not a word character and not a space, in other words, match only punctuation. The second argument is the text variable we want to evaluate, and now everything is ready, so I run the cell, and we get a list of only punctuation signs: the dot, the exclamation mark, the hyphen and so on.
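A sketch of the punctuation approach above (the sample string is hypothetical):

```python
import re

text = "Hello World! I love Python. 987-654!"

# [^\w\s] = any character that is neither a word
# character (\w) nor whitespace (\s) — i.e. punctuation
print(re.findall(r"[^\w\s]", text))  # ['!', '.', '-', '!']
```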
Okay, in the second exercise we have to build a regular expression that matches the right format of a date, and the right format I specified is shown within parentheses: month, day and year. We write re.findall, open parentheses, write the r followed by quotes, and inside we build the regular expression. The date format should be two digits, a separator, two more digits, another separator, and finally four digits. We can build that with \d, which represents a digit: I write \d, then curly braces with a two inside, which says we want exactly two digits. I copy and paste this twice more. The second \d{2} stays the same, because we need two digits again, but the last one should be \d{4}, because the year needs four digits. Now, I just realized I made a little mistake: I wrote a forward slash as the separator, but it should be a hyphen, because in the text the numbers are separated by hyphens, so the pattern should have the same shape. The final format is two digits, then a hyphen, then two digits, then a hyphen, and finally four digits. I cut this, paste it inside the quotes, and add the second argument, the text variable. If we press Ctrl+Enter, we see that only the first row has the right date format, because it has two digits, two digits and four digits.
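The date pattern built above can be tested like this (the sample strings are hypothetical stand-ins for the exercise text):

```python
import re

text = """13-04-2021
2021-04-13
4-13-21"""

# two digits, hyphen, two digits, hyphen, four digits
print(re.findall(r"\d{2}-\d{2}-\d{4}", text))  # ['13-04-2021']
```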
The others have four digits at the beginning, so those two rows have the wrong date format and this one has the right format, which is why we got 13-04-2021. Okay, now let's solve the last exercise: matching the right username format, which is four to fourteen characters, only letters or digits. Here is the text, and we start by writing re.findall to find all the matches, then the r and quotes, and inside, the regular expression. We have to match only letters or digits, meaning letters from a to z or digits from 0 to 9, and to combine them we put both inside square brackets. As you might remember, square brackets match any character inside them, and the groups inside are evaluated with an implicit or condition, so this means characters from a to z or characters from 0 to 9. To require four to fourteen characters, we add the curly-brace quantifier with four as the minimum and fourteen as the maximum, separated by a comma. If you want, we can also add A-Z in uppercase to make it more complete. That's the regular expression, so I copy and paste it inside the r with quotes, add the second argument, the text, and run. The only username that matched is this username10, because it has between four and fourteen characters, while the others have either three or two characters. And that's it; I hope you successfully solved these three exercises. Okay, in this video we're going to look at the data frame we'll work with in this project. The data set is about Netflix movies and Netflix TV shows, and here I'm going to read it: first I import pandas as pd, and then, to read this Netflix data set, I write pd.read_csv
and inside the parentheses I write the name of the file, netflix_titles, which is the name of my CSV file. I read it, run this, and we get the data frame. I chose this Netflix data set on purpose because it contains missing data, which will help us apply some data cleaning techniques with pandas. Let's start with the first column: here we see the ID, and IDs are usually unique. The next column is the type, where we see movies and TV shows; then the title of the movie or show, the director, the cast, the country, the date added and the release year. We'll use the type column a lot to know whether an element of this data set is actually a movie or not. Then we have the categories and finally the description of the movie or TV show. Now I'm going to assign this data frame a name, df_netflix, and run it. Next, let's see the data types of the columns of this Netflix data frame. For that we use an attribute we've seen before in this course, the dtypes attribute, which tells us the data type of each column. I write df_netflix.dtypes and press Ctrl+Enter, and we see that the only integer column is release_year, which means it's the only column with numerical data. That's a bit odd, because duration and date_added should have some kind of integer or date format, but they show object as their data type, so something strange is going on there, and we'll look at it in detail in the following videos. Now let's check the shape of the data frame, meaning its number of rows and columns, which we get with the shape attribute. We write df_netflix, the name of the data frame, followed by .shape, and if I run this we see the number of rows as the first value and the number of columns as the second. If we print the data frame df_netflix, below it we'll also see the number of rows and columns, so we can verify this is correct. That's it for now; in the following videos we'll see how to identify missing data and how to deal with it. All right, in this video we're going to see how to identify missing data. First let's look at the Netflix data frame again: I write df_netflix, and in the data set we can see some NaN values. NaN represents a null value, meaning there is no value, for example for the cast at index 0 or the director at index 1. Here we can recognize them easily because they say NaN, but checking that across the thousands of rows this data frame has is almost impossible, so we use the isnull method instead. I'll show you how: first we write df_netflix, then .isnull, open parentheses, and if I run this we get the same data frame but filled with True and False values. The True values are the NaN values, the nulls, and the False values are non-null values. For example, cast at index 0 is True, because we saw before that it was a NaN, a null value, and the same goes for director at index 1. So this is how we can recognize all the null values inside a column
by using isnull together with the sum method. I write .sum(), and what this does is add up all the values that are True. For example, in the director column we saw some True values: the sum method adds them up, the first True, then another, so 1 plus 1 is 2, then another True makes 3, and so on through every True value that represents a null, and in the end we get the total number of null values per column. Let's have a look: I run this, and we see that the ID column doesn't have any null values, while the director column has 2,634, and the cast and country columns also have some nulls. It's important to know how many null values each column has, because it tells us how our data set is structured: if a column has many null values, it might be largely incomplete, and we should think about dropping, that is, deleting, that column. In this example, the director, cast and country columns are good candidates to be deleted. Now I want to sort these counts, because right now they follow the original column order, so that the columns with the most null values appear on top. For that I use the sort_values method: inside the parentheses I write ascending=False, run it, and I get the column with the highest number of null values on top, in this case the director column. As we said, these three columns are good candidates to be deleted, but we still don't have enough evidence or enough reasons to delete them.
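The isnull, sum and sort_values chain described above, sketched on a small hypothetical data frame (the real project uses the Netflix CSV):

```python
import pandas as pd
import numpy as np

# tiny stand-in for the Netflix data; NaN marks missing values
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "director": [np.nan, "A", np.nan, np.nan],
    "cast": ["X", np.nan, "Y", "Z"],
})

# True where a value is missing
print(df.isnull())

# summing the booleans counts the True values per column
print(df.isnull().sum())

# columns with the most nulls on top
print(df.isnull().sum().sort_values(ascending=False))
```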
values but maybe these 2,000 null values represent a small portion of the total number of rows that the data frame has so we have to verify the percentage that these null values represent from that total data frame and to do so we have to use the mean method so I'm going to show you how to do it here but first I'm going to copy the name of the data frame and now I paste it here so now I'm going to write the name of any column I want to evaluate in this case let's try with the directory column so here I put the name inside quotes and now I'm going to use again the is null method so here if I run this we get the true and false values in this column or in this series so here again the true values represent null values and now to get the percentage we have to use the mean method so we write mean and what this is going to do is to calculate the average number of true values in this series so it's going to sum all the true values and then divided by the total number of rows so in this case 8,86 rows so if I run this we get 0.29 and this represents that 29% of the director column are null values so this is a good number of null values and we have to analyze later if we want to delete or drop this column but first let's see the percentage of null values of each column and to do that we have to use a for loop so here I write for then I write column in df_netetflix column so here I'm using the columns attribute and now I'm going to show you how this works in case you forgot so here I paste it and this columns attribute should get all the columns that this data frame has so if I run this we get a list of all the columns so from the ID to the description so all the columns this data frame has and now this is a list so we can loop through it so here in my for loop I'm looping through this list so here I'm saying for each column in this list of columns do this so execute this line of code so instead of writing here the directory column we have to introduce the column variable 
here. This way we iterate through all the columns the Netflix data frame has and get the percentage of null values for each of them. To show you this, I'm going to assign a name to this percentage — I'll call it percentage — and then print it. I write print(percentage), run it, and we get the numbers; the column with the highest percentage of nulls is this one with 29%. But we don't know which column corresponds to each value, because nothing is labeled, so we also have to print the column variable. I write column and then a plus sign to concatenate the two variables — but since percentage is a number, we first have to convert it into a string, so I wrap it in str to turn the number into a string. Now everything should be fine, so I run it, and we see the name of each column along with its percentage. I'm going to add a colon in the middle so we can clearly separate the column name from the percentage: I open quotes, write the colon symbol, and run again. Now we can clearly see the names of the columns and their percentages of null values — and the column with 29% nulls is the director column, as you can see here. We can even tidy up these numbers: I multiply percentage by 100 and round it, and since I want two decimals I add a comma and a 2 inside round. If I run this we get the same values, but now expressed as percentages with two decimals. And that's it — in this video we learned how to identify missing data, and in the next video we're going to decide whether the columns with a large share of nulls should be deleted. All right, in this
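The loop described above might look like this on a small made-up data frame (column names are assumed for illustration):

```python
import pandas as pd
import numpy as np

# Synthetic stand-in: two of four director values are null
df_netflix = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "director": [np.nan, "A", np.nan, "B"],
})

lines = []
for column in df_netflix.columns:
    # mean() of the boolean mask = share of nulls in the column
    percentage = df_netflix[column].isnull().mean()
    # concatenate name and value; str() converts the number, round() trims decimals
    lines.append(column + ": " + str(round(percentage * 100, 2)))
print("\n".join(lines))
```

Here the director column reports 50.0, mirroring the 29% the lesson finds on the real data.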
video we're going to see how to deal with missing data. In the previous video we identified some columns with a large share of null values — for example, this director column has 29% null data — and now we're going to decide whether to delete or drop those columns, or whether it's a better idea to drop rows, that is, to remove the rows that contain the nulls. Let's see what the best solution is for this case. I'm going to use three different methods: drop, dropna, and isnull. Some of them we've already seen in this course and some we haven't, but don't worry — I'll explain how they work from scratch. First, let's say we want to drop the director column because we consider it has too much null data. To do that we use the drop method: we write the name of the data frame, df_netflix, open parentheses, and specify the name of the column we want to drop — in this case director, because it has too much null data. As you might remember, we also have to specify the axis, so I write axis=1, because this is a column: columns are axis 1 and rows are axis 0. So we're saying, drop this director column. Now it's ready, I run it, and we get a copy of the data frame without the director column — we successfully deleted it. But I'm not going to take this approach here, because the director column might contain important data we could use later, so it's not a good idea to drop the whole column when it may still hold useful information. Sometimes it's much better to get rid of only the specific rows that contain the null data, and that's what we're going to do now. So here I'm going to write df_
netflix, and now I'm going to find which rows are null. I write director — this is the director series — and use the isnull method to get the null values inside it. What we do next is filter the data frame based on this condition: the condition says "get the null values", and when we apply the filter we get the data frame that contains only the null rows. If I then write index, this attribute gets the index of that data frame, and since the data frame contains only nulls, we get the indexes that correspond to those null values. I comment the previous line out and run this, and as you can see I get a list of indexes: 1, 3, 4, and so on. These numbers are the indexes of the rows with null values in the director column. Now let me verify this in the real data frame, df_netflix: I run it, and in the director column the null values are at row 1 (index 1), then row 3, where we see NaN, and then index 4, NaN again. So we should get 1, 3, and 4, because those indexes contain nulls in the director column — and indeed, that's what the list shows: 1, 3, 4, and so on. Every element in this list is the index of a null value in the director column. I'm going to name this no_director, meaning the rows in this expression have no director value — no, underscore, director. Now I'm going to drop these indexes, because, as you might remember, this is a list of indexes. I write the name of the data frame as I did before — actually, I'll copy that earlier line so I have something to edit — and here, instead of writing the column name director, I'm going to
write the variable no_director. I paste it, and instead of axis=1 I change it to axis=0, because 0 represents rows, or indexes, and what we have here is a list of indexes. Now everything is ready: I comment the previous line out and run. In the resulting data frame the indexes show 0, but not 1, because we removed — we dropped — that index; then 2, then 5, but 3 and 4 are gone too, because those rows contained nulls in the director column. We can easily verify how many null values remain in the director column using the same method as in the previous video: I append isnull(), parentheses, then sum(), parentheses. I run it, look at the director column, and it says zero, meaning the column no longer has any null values — because we dropped all the null indexes using the drop method and the no_director list. Now I'll remove that check, and we have the data frame without nulls in the director column. And that's it — I consider this a better way to deal with missing data. I could add inplace=True here, so the changes made with the drop method are saved and the values in the df_netflix data frame are updated, but I'm not going to do that, because I need these null values for the following examples — I'm skipping it only for the sake of this video. You should add it if you want, because dropping only the null rows is a better approach than dropping the whole column as we did with the earlier line of code — this approach is much better. But in my case I'm not adding the inplace parameter, so I delete it, and now let's continue with the following
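A compact sketch of the row-dropping approach just shown, using a small invented data frame in place of the Netflix data:

```python
import pandas as pd
import numpy as np

# Synthetic data: director is null at indexes 1, 3, and 4
df_netflix = pd.DataFrame({
    "director": ["Kirsten Johnson", np.nan, "Mike Flanagan",
                 np.nan, np.nan, "Jane Doe"],
    "title": list("ABCDEF"),
})

# Indexes of the rows where director is null
no_director = df_netflix[df_netflix["director"].isnull()].index

# axis=0 drops rows by index; axis=1 would drop a whole column instead
df_clean = df_netflix.drop(no_director, axis=0)
print(df_clean["director"].isnull().sum())
```

After the drop, only indexes 0, 2, and 5 remain, and the null count for the column is zero.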
example. Here I want to show you how to do the same thing — drop rows — but now using only the isnull method. Before, we used isnull together with the drop method to drop those rows; now I'm going to use isnull with the negation operator. Let me show you how. First I write the name of the data frame, df_netflix, and open square brackets. The filter is exactly the same as before, so I copy it and paste it inside the square brackets. But now, instead of getting the index as we did before, we use the negation operator: I add parentheses around the condition and put the negation operator, the tilde (~), in front. Inside the parentheses I'm saying "get the null values inside the director column", but with the ~ operator I'm saying: I don't want the null rows, I want the non-null rows. So by negating the condition and applying the filter, we get the data frame with only non-null values — it's as if we dropped the rows with nulls, just like in the previous part, but this time using the ~ operator. If I run this — commenting the previous line out — we can see that index 1 is not here, and indexes 3 and 4 are gone too. And if I now use isnull and then sum to count the null rows, we get that the director column doesn't have any null values. So I delete this check, and as you can see, the isnull filter with the ~ operator does a similar job to isnull combined with the drop method. Okay, finally I'm going to show you a third way to drop null values, this time with the dropna method — the simplest one. First we write the name of the data frame followed by the dropna method, so we open
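The same filter with the ~ operator can be sketched like this (synthetic data again):

```python
import pandas as pd
import numpy as np

df_netflix = pd.DataFrame({
    "director": [np.nan, "A", np.nan, "B"],
})

# ~ negates the boolean mask: keep only the rows where director is NOT null
non_null = df_netflix[~df_netflix["director"].isnull()]
print(non_null)
```

Only the rows at indexes 1 and 3 survive; the null rows at 0 and 2 are filtered out.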
parentheses, and now we have to specify the column we want to analyze for null values. I write the name of the column — in this case the director column — inside square brackets, and I add a parameter called subset: I write subset=, then the square brackets with the column I want to analyze. We do this because that's the syntax of the dropna method: the subset parameter, an equals sign, and then the square brackets. If the director column has null values, dropna will drop those rows. Let's check it out: I comment this line out, and these two as well, and run. We can see that index 1 is not here, and 3 and 4 are gone too. Then I quickly verify the number of null rows with isnull and sum, and again the director column doesn't have any null values. And that's it — in this video we learned different ways to remove a column or row with the drop, dropna, and isnull methods, and we also covered good practices for dealing with missing data in pandas. In the next video we're going to see how to deal with missing data using the fillna method. Okay — instead of dropping rows and columns as we did in the previous video, we can handle missing data with the fillna method, replacing the null values with the mean, the median, or the mode, and I'm going to show you how. The first thing to do is calculate the mean, the median, or the mode of the column whose nulls we want to replace. Let's say we want to work with the rating column: it has some null values, and we want to replace them with, say, the mode. So we write df_netflix, the name of the data frame, open square brackets, and write the name of the column:
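The dropna variant with the subset parameter, on invented data — note that nulls in other columns are left alone:

```python
import pandas as pd
import numpy as np

df_netflix = pd.DataFrame({
    "director": ["A", np.nan, "B", np.nan],
    "cast": ["x", "y", np.nan, "z"],
})

# Drop only the rows whose director is null; nulls elsewhere survive
df_clean = df_netflix.dropna(subset=["director"])
print(df_clean)
```

Rows 1 and 3 are dropped for their null director, while the null cast value in row 2 is kept.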
rating — this is the column we want to evaluate. Now let's calculate the mode (we'll use the mode in this example): we write mode and then parentheses, since this is a method, and it calculates the mode of the rating column. Let's see what it is — I run this, and by the way, the mode is the most common value in a group of elements; in this case TV-MA is the most common value in the rating column. We can use this value to replace the nulls, and this is a practice people sometimes use: if a value is already common, already popular, filling the nulls with it shouldn't affect the end result much, since many rows already contain TV-MA. So what I'm going to do now is assign this value to the null values, and to do that I use the fillna method: I write the name of the column again, then fillna, open parentheses, and the argument we have to pass is the mode. But before we do that, I want to show you something: this mode value, as you can see, is not a plain string or an integer — it's a series; if you check the data type it says object, and it has the value-and-index format of a series. If we pass it to fillna as it is, we won't get the behavior we want, because what we need to pass here is a single value like a string or an integer. So what we have to do is convert this object — which is really a series — into a string, so we can pass it to fillna as the argument and have it replace the null values. To do that I'm going to use the join method, which joins strings, and in this case I'll use it to convert this
value into a string. What I do is add parentheses, then call join: the join method concatenates all the elements it's given, and since we only have one element here, we get that same element back, but now in string format. We only need to add the separator, so I just open empty quotes — that way the value inside join stays the same, TV-MA remains TV-MA, but its data type is now a string. Let's try it: I assign this to a variable I'll name mode, and then check its data type with the type function and parentheses. I run it, and it says str, which means this is now a string, and we can pass it to fillna. So I copy this mode variable and paste it in, and now we can replace the null values with the mode. I also add the inplace parameter set to True, to update the values in the df_netflix data frame. Before I run this code, I want to quickly show you how the rating column looks. Here is the data frame, and this is the rating column — let's see how many null values it has. To find the nulls we use the isnull method: first we write the name of the column, then isnull with parentheses, and to filter I put the condition inside square brackets. I run it, and we get the rows with null values in the rating column — this one, these three — just four rows in total. What we're going to do is replace those NaN values with the mode. I remove this check, run the fillna line, and now let's see how many null values
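A minimal sketch of the mode-filling step, on a made-up rating column. I use a plain assignment instead of inplace=True, which newer pandas versions prefer; the ''.join(...) trick from the lesson collapses the one-element mode series to a plain string:

```python
import pandas as pd
import numpy as np

# Synthetic rating column: TV-MA is the most common value
df_netflix = pd.DataFrame({
    "rating": ["TV-MA", "TV-MA", "PG", np.nan, "TV-MA", np.nan],
})

# mode() returns a Series; joining with an empty separator yields a string
mode = "".join(df_netflix["rating"].mode())
# (df_netflix["rating"].mode()[0] is a common alternative)

# Replace the nulls with the mode
df_netflix["rating"] = df_netflix["rating"].fillna(mode)
print(df_netflix["rating"].isnull().sum())
```

After filling, the column has no nulls and the two former NaN rows now hold TV-MA.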
this rating column has. As you might remember, we only need the isnull and mean methods: I delete this, write isnull with parentheses and then mean with parentheses, run it, and we get each column with its share of null values on the right. We look for the column named rating — here it is — and we get zero, which means the column doesn't have any null values anymore: we successfully replaced the nulls with the mode. Now I delete this. By the way, you could also use the mean or the median, but in this case I chose the mode because the rating column — let me show you — contains text. All these values are text, which is known as categorical data, and when we have categorical data we replace null values with the mode. When we have numeric data — like release_year, for example, which contains numbers — we replace null values with the mean or the median. So if this rating column held numbers we would use either the mean or the median, but since it's categorical, we use the mode. Just keep that in mind. Okay, now I'm going to show you how to replace null values with an arbitrary number. Sometimes, instead of the mean or the median, we fill the nulls with an arbitrary number; usually this number won't affect the final result, but it lets us work with the column as if all its values were non-null. This is useful because when a numeric column has null values, you won't be able to perform some operations — those nulls get in the way. Let me show you how. As usual, we write the name of the column first; this time I'm going to use a column with numeric data, and that's duration. So I write duration, and before we apply the fill
na method, I'm going to show you how this column looks. Here it is, and you can see it mixes numbers and text, so it isn't purely numeric. I'm going to put the number zero into this column, because in the next videos we're going to split this data so that one column holds only the numbers and another only the text — I'll show you how to do that in the next videos — and if there's null data in this column, I won't be able to do that split. That's why I'm replacing the null values with zero. So I write .fillna, open parentheses, and the first thing to pass is the arbitrary number — in this case I add the number 0 — then I write inplace=True, and that's it: the nulls are replaced by zero. I run it now. Okay, next I'm going to show you a parameter that lets us fill null values backwards and forwards. First I have to write this again — actually, I'll copy all of it and paste it here — and now I delete what's inside the parentheses. Instead of the mean, the median, an arbitrary number, or anything else we've seen so far, we're going to use the next or the previous non-null value to replace each null. To do that I add the method parameter, which has two options. The first option we'll look at is the backward one: with it, a null value is replaced by the non-null value that follows it — the one in the next position. Let me show you with an example: I print df_netflix, and here in the data frame let's find the director column — here we see one null value. If we apply the fillna method with the backward option, what we're going to get is that this
director, Julien Leclercq, will replace that null value — the non-null value Julien Leclercq fills backwards into the null. But if we use the same method with the second option, forward — I write it here — then the first director, Kirsten Johnson, replaces the null value: the non-null value fills forward. Okay, now that this is clear, let me give you the real names of these options: the backward one is called 'bfill', for backward fill, and the forward one is 'ffill', for forward fill. Those are the names, and now I'm going to apply this to the whole data frame — not only the director column — so I delete this part and this part too; this way all the columns are affected by the bfill or ffill. Let's try the backward fill first: I run it, and we see that the director Julien Leclercq replaced this null value — this former null, because it now holds Julien Leclercq. Now let's try the other method, the forward fill: I run it, and this time it's different — the former null now holds Kirsten Johnson, because the director above filled it forward. This is usually used when a column holds a sequence of values. For example, if we have minute-by-minute weather data for a city and for some reason one specific minute is missing, we can forward fill or backward fill, because the weather shouldn't change much in one minute — there it makes a lot of sense. In other cases it wouldn't make as much sense, so it depends on the data: sometimes it's much better to fill with the median, the mode, or the mean, or just drop the column or the row, as we saw in the previous video. And that's it — in
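A small sketch of backward and forward filling. Recent pandas versions also expose these directly as the .bfill() and .ffill() methods, which I use here instead of fillna(method=...) (the latter is deprecated in newer releases); the names are stand-ins echoing the lesson's example:

```python
import pandas as pd
import numpy as np

# Synthetic director series with one null in the middle
s = pd.Series(["Kirsten Johnson", np.nan, "Julien Leclercq"])

# Backward fill: each null takes the NEXT non-null value
bfilled = s.bfill()
# Forward fill: each null takes the PREVIOUS non-null value
ffilled = s.ffill()

print(bfilled)
print(ffilled)
```

The null at index 1 becomes "Julien Leclercq" with bfill and "Kirsten Johnson" with ffill, matching the two behaviors described above.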
this video we learned different ways to deal with missing data. Remember that every data set is different, so you have to analyze your data set and take the approach that fits it best. All right, in this video we're going to learn how to extract data from a column — in this case from the duration column — using the split and extract methods. First I select the duration column: I write the name of the data frame, df_netflix, and then the name of the column, duration. I run this, and we can see this data holds the duration of movies and TV shows: the duration of movies is in minutes, while the duration of TV shows is in seasons — two seasons, three seasons, and so on. We're going to analyze only the movies, because in the following videos we want to compare durations in minutes; minutes will be our unit of measure, and for that purpose we only need movies. So what we do now is filter out the TV shows, using the methods we learned in previous videos. We write df_netflix first, then the column that holds the type of each show: if it's a movie, it stays in the data frame; if it's a TV show, it gets filtered out. The column is called type, so I write type, then the equality comparison, and then 'Movie' — I copy and paste 'Movie'. We only want movies; this is our condition, and we put it inside square brackets. I assign the result to a new data frame that I'll name df_movie, and now I use this data frame for the selection: I paste it and select the duration column from df_movie. I run this, and as you can see we only get the durations of movies, together with the unit of measure, minutes — which is great, because now we can make a better comparison: all of
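The movie filter can be sketched like this on a tiny invented data frame:

```python
import pandas as pd

df_netflix = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons", "104 min", "1 Season"],
})

# Keep only the movies, so every duration shares the same unit (minutes)
df_movie = df_netflix[df_netflix["type"] == "Movie"]
print(df_movie["duration"])
```

Only the two "Movie" rows survive, so every remaining duration is expressed in minutes.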
them are in minutes. But this still isn't enough, because each value mixes numbers and text, and to compare we need only numbers — we can compare one number with another, but we can't say one element is greater or less than another when there's text involved. So we have to extract only the numbers, and that's what we're going to do using the split and extract methods. First, let's look at the split method. To use it we first access the str attribute — as you might remember, str stands for string, and it helps us work with the string data inside this column. Now that we have the string accessor on the duration column, we can call the split method. By default, split divides the data every time it encounters a blank space, and here there's a blank space between the number and the text, so that's where it splits. I run it, and you can see the results: for each row we get a list whose elements are the split pieces — the numbers on the left and the text, the minutes unit, on the right. The text isn't relevant here, so we'll select only the numbers, the first element in the list. One parameter that's frequently used with the split method is expand: if I write expand=True, the lists are expanded into columns — with two elements per list we get two columns, each holding one element. The first column will contain only the numbers, and the second only the unit text. I run this, and we get a data frame with 0 and 1 as column names: column 0 says 90, 91 — only numbers — and column 1 holds only the unit, and that's what we
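A minimal example of str.split with expand=True, on made-up duration strings:

```python
import pandas as pd

durations = pd.Series(["90 min", "104 min", "127 min"])

# Split on the default blank space; expand=True turns the lists into columns
parts = durations.str.split(expand=True)
print(parts)
```

Column 0 holds the numbers (still as strings at this point) and column 1 the unit text.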
wanted. Now we can select only column 0 using square brackets and writing the column name, 0, and if I run this we get only the column with the numbers: the duration in minutes, numbers only. Great, so far so good. But there's a small detail: all the data here look like numbers, yet below, in the dtype attribute, we see that the data type is object, which is a bit odd, since everything we see is numeric. We have to convert the data in this series into integers, and for that we use the astype method: I write astype, a method that converts from one data type to another, and inside I write int, which stands for integer. So I try to convert this object data into integers, I run it — and I get an error. Let's see what happened: I scroll down and read the error, and it says "cannot convert float NaN to integer". That means there are null values inside this column, which is why the conversion fails. And that's a bit strange, because in the previous videos we replaced the null values in this duration column with an arbitrary number — zero, if I'm not wrong; yes, here it is. We replaced all the nulls in the duration column with the number 0 and updated the data frame with inplace=True. So what's going on? There's a simple explanation. When we used the fillna method to replace the nulls in the duration column, we passed the number 0, which is an integer. But here below, when we call this duration column, we're getting access to the string data inside it through the str accessor, and there's a conflict: the accessor expects strings, but we inserted integers. That's why, when we use the split method, it doesn't find any string when we
get to the rows with those zero values — at those rows it sees non-string data, so the split produces null values as a result. And that's why, when we use the astype method to convert the data into integers, we get this error saying there are null values: of course we cannot convert nulls into integers. A simple solution is to modify what we added: instead of filling with 0 as an integer, we fill with '0' as a string, and to do that we only have to add quotes. Now this zero is a string, so when we access the str attribute, this time it will pick up the '0', and the split will succeed. Let's try it out. To apply this change I click on the Cell menu and then on the option that says "Run All Above" — I do this because I want to start from scratch, so all the changes are properly executed. After running all the cells above, I run this cell too, so this change also takes effect, and we should now get the string '0' instead of the null values. I scroll down, run the astype code again, and let's see if everything is fine now: I run it, and as you can see there's no error — we get the series, the data type is integer, and everything is fine. Okay, in the following videos we're going to use this column, so I'm going to give it a name: I write df_movie and name the new column minute. If we assign it this way we'll probably get a warning message, and that's fine — let me show you: here you see we got the SettingWithCopyWarning. Warnings are not errors; they just let us know
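The string-zero fix can be sketched end to end on synthetic data — filling with the string '0' keeps the whole column string-typed, so the split and the astype(int) conversion both succeed:

```python
import pandas as pd
import numpy as np

durations = pd.Series(["90 min", np.nan, "104 min"])

# Filling with the integer 0 would break .str.split (the str accessor
# skips non-string values), so we fill with the STRING "0" instead
durations = durations.fillna("0")

# Split on whitespace, keep the numeric part, convert to integers
minutes = durations.str.split(expand=True)[0].astype(int)
print(minutes)
```

The former null becomes 0, and the dtype of the result is integer, just as the lesson achieves on the real duration column.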
that something is going on and that we should check it out — it doesn't mean we did something wrong; it's just a warning. You can even turn these warnings off, but I don't recommend it, because sometimes they're really useful. By the way, we could avoid this warning by using the assign method, but in this case I wanted to keep things simple and make a plain assignment, and it's fine — we don't have to worry about the warning. Let's continue; I'll show you how the data frame looks now. I print df_movie, and the last column should be this minute column you see here, which now contains only numeric data — great. And if we check the data types, the minute column is integer, which is exactly what we wanted. As I said before, we're going to use this minute column for some analysis in the following videos, but first I want to continue with the split method, because it has other parameters that can be really useful when working with data. So let's work with a different column this time: the date_added column. It has a clear pattern, as you can see: the name of the month, then the number of the day, a comma, and the year. So we could split this date by the comma. If I write df_movie, then the name of the column, then str to get to the string data, and then split, we can pass a separator: before, we didn't pass one because we wanted the default blank space, but now we want to separate by a comma, so I write the comma inside quotes. Let's see what happens: I run it, and as you can see we get a list with two elements, separated at the comma — the year on one side, and the month and the day on the other. We can even add the expand parameter to create two columns here, so I write expand=
True, and now we see that we have two columns; if I select column number 1, we get only the years. Great. Now I'm going to show you how the extract method works — it's a similar method, but with some differences. I delete this, and to use extract we first access the string accessor: we write str, then extract, then parentheses, and inside we write the pattern for what we want to extract. For example, if you want to extract only the commas, you write the comma inside quotes — and the one thing you must add is parentheses, because when we use this method we have to add capture groups, and capture groups are represented with parentheses. I add them, and now only the commas are extracted, as you can see here. But the extract method is really powerful when we use it with regular expressions. If you've never heard of regular expressions, they are expressions that help us extract data from text based on patterns. Let me show you a pattern that extracts only the years. A year has four digits, right? So we can write the digit symbol, represented by a backslash and a d — \d, which in regular expressions matches a digit. And since we want four digits, like the four figures of a year, we add curly braces with the number 4: \d{4}, which in regular expressions means four digits together. That's exactly what a year is — four consecutive digits — and that makes it different from the day, which has only one or two digits. So I write this pattern to capture the four digits that represent the year, and before I run the code I add the parentheses, because, as I told you before, when we use the
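A small sketch of str.extract with the \d{4} pattern, on invented dates:

```python
import pandas as pd

dates = pd.Series(["September 25, 2021", "April 1, 2016", "July 4, 2019"])

# \d{4} matches four digits in a row; the parentheses form the capture group
years = dates.str.extract(r"(\d{4})")
print(years)
```

Each row yields only its four-digit year (as a string); the month and day, having fewer digits in a row, never match.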
extract method we need to capture groups with parenthesis so now I run this and here I got a column with only the years so 2021 2016 2019 and so on and we got here this because we use this regular expression that represent four numbers together and by the way you can even use regular expressions in this split method you only have to add it here inside quotes and that's it in this video we learn how to extract data using the split and extract methods remember that in the following videos we're going to use this minute column that we created in this video to identify outliers okay in this video we're going to learn how to identify outliers and we're going to use histograms to identify outliers within numerical data so here first we're going to make a histogram from the minute column we created in the previous video so I'm going to show you this column so I write df movie and now I write the name of this column we created before so it's this one so this column has this numerical data that represent the duration of a movie and now we can identify outliers inside this column and in case you don't know what's an outlier outliers are uncommon elements that you will see in a group of data so for example in a movie movies usually last between 1 hour and 2 hours and if you find a movie that has like 30 minutes of duration or a movie that has like 4 hours of duration that's extremely uncommon so those type of movies will be considered outliers and the criteria to select outliers will depend a lot on your data set okay now let's find these outliers inside this minute column by using histograms so we're going to use the plot method that we learned in previous videos so this plot method help us make visualizations in pandas so here we need to add the kind parameter and here we have to set it equal to hist to make a histogram now I'm going to add the beans parameter to specify how many beans this histogram will have and in this case I'm going to set it to 10 bins so now it's 
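The split and extract steps above can be sketched like this; the column name and sample values are invented for illustration, so the video's actual data frame will differ:

```python
import pandas as pd

# Toy stand-in for the column split in the video; the real data differs
df_movie = pd.DataFrame({"duration": ["90 min, 2021", "104 min, 2016", "82 min, 2019"]})

# split with expand=True spreads the pieces into separate columns
parts = df_movie["duration"].str.split(", ", expand=True)

# extract needs a capture group: (\d{4}) keeps only four digits in a row
years = df_movie["duration"].str.extract(r"(\d{4})")
print(years)
```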
Now it's ready, so I run this and we get the histogram. Here it's easy to recognize that the outliers are in the first bar and in these tiny bars that you can barely see from 200 to 300. You can say those are outliers because those bars have almost no values inside, so only a small proportion of the data falls in those ranges, and if there is little data in a range we can say the data there consists of outliers. Okay, so we know that the first bar and the last two bars contain outliers, but it would be great if we could recognize the exact ranges. The first bar starts at zero, but we don't know the value where that range ends, because it doesn't say anything on the x-axis. We can get those values with the value_counts method: apart from getting this nice visualization and recognizing where the outliers are, we can also get the ranges. I write df_movie, then minute, and now I use value_counts. We only need to add the bins parameter, and it has to be exactly the same as the number of bins we used in the histogram, so since I added 10 bins in the histogram, I also add 10 bins in value_counts. I run this and you can see we get the ranges in the indexes, but they're not properly sorted, so I'm going to use the sort_index method to sort the indexes. I run this and now it's sorted ascending. Here we have the first range, which represents the first bar I'm pointing at now, and the last two ranges represent these tiny bars that you can barely see. So we can say that in this data set outliers are movies with a duration of less than 31 minutes or more than 249 minutes. However, if you analyze this data set a bit more, you'll notice you can expand the range of the outliers: for example, the range from 218 to 249 could also be considered outliers, because it only holds six movies out of the thousands in this data set, a tiny proportion. Again, this depends a lot on your criteria and on the data set you're working with. Okay, after you decide which ranges you consider outliers, you can filter them out; we only have to use conditions, as we learned in previous lessons. Let's say the first range is an outlier and also the last three ranges. To filter them out, we write the data frame, then square brackets, then minute, and I compare this column with the value 31, so greater than 31, and that's my first condition. My second condition is going to be less than 218, so I write it next to it. I add parenthesis to condition number one and also to condition number two, I join them with the and operator, and to filter on these two conditions I wrap them in the name of the data frame and square brackets. If I run this I get an error, because I wrote df_netflix but it should be df_movie, my bad, so I replace df_netflix with df_movie, run it again, and now everything is fine: this data frame only contains movies with a duration greater than 31 minutes and less than 218 minutes. Let's check the last column, and we see 90 minutes, 91, 125, so yes, it's fine, we filtered the outliers out. In case you want to see the outliers instead, I can use the not operator here, with parenthesis around the conditions, to print the data frame with only the outliers. I run it and here we have the data frame with only outliers; looking at the last column we have a movie with 23 minutes, 13 minutes, 229 minutes and so on. These are the outliers from before we filtered them out. Okay, now we're going to identify outliers with box plots.
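A compact sketch of the filtering just described, with made-up durations; the cutoffs 31 and 218 come from the histogram ranges:

```python
import pandas as pd

# Made-up durations; 23, 13 and 229 play the role of outliers
df_movie = pd.DataFrame({"minute": [90, 23, 91, 13, 125, 229, 110]})

keep = (df_movie["minute"] > 31) & (df_movie["minute"] < 218)
usual = df_movie[keep]      # movies inside the normal range
outliers = df_movie[~keep]  # ~ is the "not" operator
print(outliers)
```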
Another way to identify outliers within numeric data is using box plots. A box plot lets us see how the values in a data frame are spread out, and we can use it to recognize outliers. First we write the name of the data frame, then the name of the column, minute, then the plot method, and then the kind parameter set equal to box, and that's how we make the box plot. I run this and here we have the box plot: all the dots you see are the outliers, and everything inside the whiskers is the range where the data is usually located. Now I'm going to customize this box plot with some parameters. To see it better I add the vert parameter and set it equal to False, which puts the box plot in a horizontal position; then I add the color parameter to make this box plot blue, and finally the figsize parameter just to make the plot bigger, which I set equal to 10 comma 5. I run it and now we can see the box plot much better. On the x-axis we have the minutes, the duration of the movies, and the dots outside the whiskers of the box plot are the outliers. By looking at this box plot we can see that the outliers on the left are less than 48 approximately, and the outliers on the right are greater than 155, but that's only a guess, because I don't know exactly what those values are; I just know they're close to 50 on the left and close to 150 on the right. To find these values we can use some formulas I have here that calculate the range where the outliers are located. First we have the IQR, which is Q3 minus Q1, and as you might remember from previous lessons, Q3 is the top of the box and Q1 is the bottom of the box. We can use the describe method to get the Q3 and Q1 values, with those we can calculate the IQR, and with the IQR we can use two formulas that give us the ranges where the outliers are: the first formula, Q1 minus 1.5 times the IQR, represents the minimum value of the whiskers, and the second, Q3 plus 1.5 times the IQR, represents the maximum value of the whiskers, and both help us find the ranges of the outliers. So let's find the Q3 and Q1 of this box plot using the describe method: I write the name of the data frame, then the column I want to analyze, minute, and then describe. We've seen this method before; it gives us some descriptive statistics, and here the 25% represents Q1 and the 75% represents Q3, so Q1 is 87 and Q3 is 114. With these values we can substitute into the first formula to calculate the minimum value of the whiskers: instead of Q1 we write 87, then minus 1.5, multiplied by the IQR, and the IQR is Q3 minus Q1, so 114 minus 87. I'm going to name this value min_boxplot, since it's the minimum value of the whiskers of the box plot. Now I copy this line, paste it below, and replace it with the second formula: Q3 is 114, then plus 1.5 times 114 minus 87, which is the IQR, and I change the name to max_boxplot. With this we calculate the min and max values of this box plot, so now I'm going to show these values: I comment the describe line out and print both values, first min_boxplot and then max_boxplot. If I run this we get the minimum and maximum values of the whiskers: the minimum is 46.5, which should be this minimum value of the whisker here, and the maximum is 154.5, which is this one here. Now that we have these values, we can say that in this data set the outliers are movies with a duration of roughly 46 minutes or less and movies with a duration of roughly 154 minutes or more, and such movies are represented by the dots outside the whiskers of this box plot. Now we can filter out these outliers as we did previously with the histogram: we copy the earlier condition and just modify the values, so instead of 31 I write 46, and instead of 218 I write 154; these are my new limits for the outliers. As I told you before, the criteria for selecting outliers depend a lot on your data set and on the project you're working on, so it depends on the goals of your project; in this case, filtering with the values I saw in the box plot, I write 46 and 154. I run this and we get the outliers, because I'm using the not operator here, and we see 161 and 166; these movies are outliers because those values are uncommon. And that's how you use a box plot to identify outliers within numeric data. Okay, now we're going to see how to use bars to identify outliers within categorical data. When we have to find outliers within categorical data, data that has many categories, we cannot use histograms or box plots, because those are for numeric data; we have to use bar plots. I'm going to show you how categorical data looks by printing the rating column. This rating column has categorical data: we don't have numbers, we have categories, and you can see all the categories by using the value_counts method, so I write value_counts with parenthesis and we get all the categories.
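The whisker formulas can be checked in code; the durations below are invented, but the Q1/Q3 lookup mirrors the describe output used in the video:

```python
import pandas as pd

# Invented durations in minutes
minute = pd.Series([23, 80, 87, 95, 100, 110, 114, 130, 161, 300])

stats = minute.describe()
q1, q3 = stats["25%"], stats["75%"]  # Q1 and Q3
iqr = q3 - q1

min_boxplot = q1 - 1.5 * iqr  # lower whisker
max_boxplot = q3 + 1.5 * iqr  # upper whisker

outliers = minute[(minute < min_boxplot) | (minute > max_boxplot)]
print(min_boxplot, max_boxplot)
```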
For example, the TV-MA rating is the category with the highest number of values, and the last three categories, 84 min, 66 min and 74 min, are probably outliers, because we only have one value in each of them; those three categories are uncommon, and for that reason outliers. We can see this much better in a plot, so I'm going to use the plot method: I write plot, then parenthesis, and add the kind parameter set to bar, then another parameter to make the plot bigger, figsize, which I set to 10 comma 5. I run this and now it's evident that the last six categories in this bar plot are outliers, because we can barely see their bars, and that indicates those categories are uncommon; in contrast, the TV-MA category has the highest number of values, which we can see because it has the tallest bar. Now that we've recognized these outliers, we can filter them out using conditions: in this case we could use a condition that says the rating is different from a list made of these six categories, and by doing that we'd filter them out of the df_movie data frame. I'm not going to do that here, because it's the same as what we did with the histogram and the box plot, but you can do it as an exercise. And that's it: in this video we learned how to identify outliers using bar plots, box plots and histograms, and also how to deal with these outliers in pandas. All right, in this video we're going to learn how to deal with inconsistent capitalization using the lower, upper and title methods. This helps us standardize the letters in a column, so we can set all the letters to uppercase, to lowercase or to title case. Let's try this with the title column, and I'm going to show you this one: I write df_movie, then open square brackets, then title, and here we can see the titles of the Netflix movies; these titles by default have this title capitalization.
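The bar-plot filtering left as an exercise could look like this; the category names are a guess at the malformed values mentioned above:

```python
import pandas as pd

# Hypothetical rating column with three one-off categories
df_movie = pd.DataFrame(
    {"rating": ["TV-MA", "TV-MA", "PG-13", "R", "84 min", "66 min", "74 min"]}
)

counts = df_movie["rating"].value_counts()
# counts.plot(kind="bar", figsize=(10, 5))  # draws the bars in a notebook

rare = ["84 min", "66 min", "74 min"]
clean = df_movie[~df_movie["rating"].isin(rare)]  # drop the rare categories
print(sorted(clean["rating"].unique()))
```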
We can modify this capitalization using the methods, but first we have to get access to the str attribute, so we write str. Now let's say we want to make these letters lowercase: we use the lower method, so we write lower, and with this we change the case of the text to lowercase. I run this and we can see that now all the words are in lowercase. This is great, but now let's say we want the text in uppercase: I paste the line again and instead of lower I write upper, and with this we make the words inside the title column uppercase. I run it and all the words in these movie titles are now in uppercase. Finally, the last case of text is title case: instead of upper we write title, and we get the same titles we had in the original data frame, in title case. Now let's say we want to update the data frame with the text in uppercase: I uncomment this, paste it here, and override the values inside the title column. We run this and I print the data frame, df_movie. We get a warning here, but it's fine, nothing wrong happened, just a warning, and we see that the title column now has all its words in uppercase. But this looks a bit weird, because all the letters in the other columns are in title case and this is the only column in uppercase, so I'm going to change it back to title case, but using a different method, the apply method. Instead of getting access to the str attribute, I'm going to use apply with a lambda function and then the title method. I write df_movie, then the title column, then apply, and we've seen this method a lot in this course. Now I write lambda to create a lambda function, and here's the input variable, which I name x, then a colon, then the output: I want to convert this x variable into title case, so I write x.title(). As you might remember, this x represents an element of the title column inside df_movie, and that's why we use the title method without getting access to the str attribute. Before running this code, I print df_movie below this line of code and comment the other line out, and let's update the values inside this title column by overriding them: I set the column equal to the same column but with the apply method. I run this, we get the warning message again, but everything is fine, and in the data frame we have the words in title case, which means the first letter of every word is in uppercase and the rest is in lowercase. And that's it: in this video we learned how to deal with inconsistent capitalization using the lower, upper and title methods. All right, in this video we're going to learn how to remove blank spaces using the strip, lstrip and rstrip methods. First I'm going to show you how the strip method works. I'm going to create a variable named movie_title and assign it the value Titanic, and I'm going to add blank spaces at the beginning and at the end of this word on purpose, for the sake of this video. Now I'm going to show you how lstrip works: I write movie_title, then .lstrip, and what this does is remove the blank spaces at the beginning of the string. I print this and we can see that we don't have the blank space at the beginning of the text anymore; we removed it.
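Before moving on to rstrip, the capitalization methods from a moment ago can be sketched like this, on invented titles:

```python
import pandas as pd

df_movie = pd.DataFrame({"title": ["the dark knight", "BREAKING BAD"]})

lower = df_movie["title"].str.lower()
upper = df_movie["title"].str.upper()
titled = df_movie["title"].str.title()

# Same result via apply: each x is one element of the column
titled_apply = df_movie["title"].apply(lambda x: x.title())
print(list(titled_apply))
```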
Now let's try with the rstrip method, which removes the blank spaces at the end of the string. I print it and we can see the trailing blank spaces are not in the string anymore. And if we want to remove both the leading and trailing spaces, we use the strip method: instead of only removing the blank spaces at the beginning with lstrip, or the blank spaces at the end with rstrip, strip removes both. I copy the line, paste it, delete the l, and apply the strip method. I run it and now we only have the Titanic text without any blank spaces, and that's how you remove blank spaces with the strip method. This is a really useful technique, because sometimes you don't know whether a data set has blank spaces in a column, and if you don't remove them you can get unexpected behavior; that's why it's always recommended to remove blank spaces from the text you're working with. And that's what we're going to do with the title column in our data frame df_movie. I write the name of the column, title, and print it; from the result it's not obvious whether we have blank spaces, but the column could have them, so we use the strip method just in case. I write str, because we need to get access to the str attribute before we use the strip method, and only then we write strip with parenthesis; with this we remove leading and trailing spaces from the column. I run it and we get the column without blank spaces, and to keep the change I override the title column by reassigning it. Now I show the data frame, df_movie; I get the warning message again, but it's fine, and we see that the title column doesn't have any blank spaces.
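The three methods together, on the same toy string used above, plus the pandas versions:

```python
movie_title = "   Titanic  "

print(movie_title.lstrip())  # removes leading spaces only
print(movie_title.rstrip())  # removes trailing spaces only
print(movie_title.strip())   # removes both

# On a pandas column, either of these works
import pandas as pd
titles = pd.Series(["  Titanic ", " Up"])
cleaned = titles.str.strip()
cleaned_apply = titles.apply(lambda x: x.strip())
```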
Great. Now I'm going to show you a different way to get rid of leading and trailing spaces, using the apply method. I write df_movie, select the title column, and use apply; inside the parenthesis I use a lambda function, so lambda, then the input, then a colon, then the method I want to apply, x.strip() with parenthesis. This x variable is an element from the title column, and that's why we can use the strip method without getting access to the str attribute. I comment the previous line out, run this, and we get the same result. And that's it: in this video we learned how to remove blank spaces with the strip, lstrip and rstrip methods. Okay, in this video we're going to learn how to replace strings using the replace and sub methods, and as a first example we're going to remove punctuation signs using the replace method together with regular expressions. But first, here's the column we're going to work with, the title column: I write title, run it, and here we have the data. Some of the movie titles listed here have punctuation; for example, the second movie has the colon sign, which is a punctuation sign, and when we work with text data it's sometimes not a good idea to have numbers or punctuation signs; sometimes it's recommended to remove everything but letters. So that's what we're going to do: remove the punctuation signs, and I'm going to use a regular expression to do it, with the replace method. First I'm going to show you the regular expression we're going to use for this example. We have to create a regular expression that matches punctuation signs, so first I'm going to show you some popular symbols in regular expressions, and here I have two of the most popular ones.
The first symbol, \w, matches letters from a to z, digits from 0 to 9, and also the underscore, and the second symbol, \s, matches white spaces. With these two symbols I'm going to create a regular expression that matches punctuation signs. First I write \w, which covers letters, digits and the underscore, then \s, which covers white spaces. Something you need to know is that any symbol that doesn't fall into these two categories is considered a punctuation sign, so we can use a not operator to get the opposite of these two symbols and, as a result, match the punctuation. In regular expressions the not operator is represented with square brackets and a caret symbol at the beginning of the brackets. Let me show you this better in a code cell: I type y to turn the cell into a code cell, and now you can see the symbol much better. In regular expressions we get the not operator by writing the caret at the beginning of the square brackets, so [^\w\s] means match anything that is not letters, digits or underscore, and not white space either; what remains is only punctuation signs. This is the regular expression we're going to use with the replace method. Now that we have the regular expression, I use the replace method: we need the str attribute to get access to the string inside this title column, and then I write .replace and open parenthesis. As the first argument we write the text we want to replace, in this case the regular expression inside quotes, which means we want to replace anything that matches it. As the second argument we write the text we want to insert instead, and since we want to remove the punctuation signs, I write just quotes with nothing inside. As the final argument I write the regex parameter and set it equal to True, which means the first argument is treated as a regular expression; if we write regex equal to False, the first parameter is treated as a plain string and not a regular expression. Now let's try this out. I'm going to run this code, but first remember the second movie, because it has punctuation, the colon; after running this line of code we shouldn't see that colon in the second title. I run it and in this second movie we don't see the colon sign anymore, so we successfully removed the punctuation signs using this regular expression with the replace method. Okay, now I want to show you how to do this with only the replace method. This looks similar to what we just built, but now we don't need the str attribute, so I can delete it and use only the replace method. The replace method works differently from str.replace, because it can replace not only strings but also other data types, like integers; str.replace can only replace strings, because the str attribute limits the method to string data, and that's the big difference between these two methods. To show you this better, I'm going to use the title column again and try to replace the number one with the number two using both methods: first with replace, so I write replace, then 1, then 2, and regex equal to False, because this is not a regular expression, and then the same with str.replace. Let's see how it works with the first line of code: I comment the other one out, run it, and as you can see it replaces all the ones with twos. Now let's see if it works with str.replace: I run it and I get an error saying it must be a string, because str.replace only accepts strings as the first argument, and that's one of the advantages of the replace method over str.replace. Okay, now I'm going to show you an alternative to the replace method: the sub function, which belongs to the re module, the regular expressions module. First let's import it, so I write import re. After that we can use this function, but first we select the title column, so I copy and paste it here, and now we use the apply method: apply, then parenthesis, then the lambda function with input x, and the output is re.sub. I open parenthesis; the first argument is going to be the same regular expression as before, so I copy and paste it, the second is the element we want to insert instead, so I write empty quotes to remove the punctuation, and the third argument is the variable I want to evaluate, in this case the x I had as input. Now it's ready, so I comment the earlier lines out, run this, and let's see the results: the second movie doesn't have the colon sign, so we verified that we successfully removed all the punctuation signs in this column. And that's it: in this video we learned how to replace strings using the replace method and the sub function. In this video we're going to see the data set we're going to use in this section: a data set that contains Boston house prices, which we're going to use to predict the values of the houses in Boston. Let's start by importing pandas as pd, and then let's read the CSV file with pd.read_csv and the name of the Boston house prices file.
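The three punctuation-removal variants from this section, side by side, on invented titles:

```python
import re
import pandas as pd

df_movie = pd.DataFrame({"title": ["Movie One", "Sequel: The Return", "What?!"]})

# [^\w\s] matches anything that is not a word character or white space
pattern = r"[^\w\s]"

via_str = df_movie["title"].str.replace(pattern, "", regex=True)
via_replace = df_movie["title"].replace(pattern, "", regex=True)
via_sub = df_movie["title"].apply(lambda x: re.sub(pattern, "", x))
print(list(via_sub))
```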
Now I run this and we get a data frame that contains three columns. The first column is the rooms column, which contains the number of rooms the house has; usually, if a house has more rooms its price will be more expensive, and if it has fewer rooms it will be less expensive. Then we have the distance column, which contains the weighted distance to five Boston employment centers; in this case, houses that are closer to the employment centers will be more expensive. Finally we have the value column, which contains the median value of the owner-occupied homes in thousands of dollars. What we're going to do later is predict the price of the houses in Boston, so we're going to predict the value column, and to predict it we're going to use two columns as inputs: the rooms column and the distance column. Okay, now let's name this data frame: I write equal to df_boston, so that's the name of the data frame, I run it, and I show it again. Now let's see some statistical values of this data frame using the describe method: I write describe with parenthesis, and we can see the number of rows this data frame has, plus the mean and the minimum and maximum values. For example, in the rooms column the mean number of rooms in these houses is six, the minimum is three and the maximum is eight. Now it's your turn to get familiar with this data frame, so you can understand the following lessons much better. In this video we'll see one of the most basic machine learning algorithms: linear regression. Linear regression is an approach for modeling the relationship between two or more variables: when we have only two variables we're dealing with a simple linear regression, and when we have more variables we're dealing with a multiple linear regression.
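Loading and summarizing the data set might look like this; the rows below are placeholders, since the real CSV ships with the course:

```python
import pandas as pd

# In the course notebook you would load the file instead:
# df_boston = pd.read_csv("...")  # path to the course's CSV
df_boston = pd.DataFrame({
    "rooms": [6.5, 5.9, 7.1, 6.2],
    "distance": [4.1, 3.8, 5.0, 4.4],
    "value": [24.0, 19.5, 34.7, 22.9],
})

summary = df_boston.describe()
print(summary.loc[["mean", "min", "max"], "rooms"])
```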
regression one variable is considered the predictor also known as the independent variable while the other variable is known as the outcome also known as the dependent variable this is the linear regression equation here y is the dependent variable also known as the target value and x1 x2 until xn is the independent variable here also b 0 is the intercept while b1 b2 until bn are the coefficients and n is the number of observations this equation is represented by this graph so in this graph we can see a linear relationship so if one independent variable increases or decreases that dependent variable will also increase or decrease linear regression can be used to make simple predictions such as predicting exams scores based on the number of hours studied also the salary of an employee based on years of experience and so on okay as I mentioned before the equation behind this graph is this one that now you see so the first one is the general equation or the complete equation and the second one is the simplified equation this one represents a simple linear regression because we only have one dependent variable and one independent variable some people use this notation but some others use y= a multiply by x + b so these are two popular notations and you have to keep in mind that people use both okay now let's see what are the dependent and independent variables of our data set that we've seen in the previous video so in the previous video we read a Boston house prices CSV file and this CSV file had three columns this data set has information about the number of rooms also the weighted distance to five Boston employment centers and also the value of a house so here first we're going to see that simple linear regression and in a simple linear regression we have one predictor and one target and in this case our predictor is going to be the number of rooms and our target value or target variable is going to be the value of a house so we're going to predict the value of a 
house based on the number of rooms so we can say that the number of rooms is the independent variable and the value of a house is the dependent variable so this is for our simple linear regression but now for the multiple linear regression we're going to have two independent variables so the number of rooms and also the distance and the target variable is going to be the same so the value of a house so for this multiple linear regression we have two predictors and one target all right that's it for this video and in the following videos we'll implement all of these using Python okay in this video we're going to implement linear regression in Python and there are two ways to implement linear regression in Python and two simple ways to implement linear regression in Python is using the stats model and the scikitle learn libraries both are popular libraries used in machine learning and we're going to learn how to do it with both and in this video we're going to use a stats models so to use a stats models we have to import this writing the following so we write import stats.model and then we write API then as and here we write SM so this SM represents this stats models API so now I press Ctrl enter to import this stats models library and now we imported this and stats models is a module that helps us conduct a statistical test and estimate models and some of the benefits of stats models over scikitlearn is that it provides an extensive list of results for each estimator and now before we create this linear regression we have to read the data set we've seen before so this Boston house prices so now I run these two cells and here we have the data set so now let's create this linear regression so we'll start with a simple linear regression and the first thing we have to do is to define a dependent and independent variable this is the first step you have to do always when you want to create a simple linear regression in the previous videos we discussed what are the 
dependent and independent variables: the dependent variable is the value we want to predict, also known as the target value, while the independent variable is the predictor. In our df_boston data frame we have two predictors, so we can use either one in a simple linear regression; I'll choose the rooms column as the independent variable, and the dependent variable is going to be the value column. Now let's define them in Python. I write df_boston and select the value column; this is my dependent variable, so I set it equal to y. I do the same with the independent variable: I name it X and select the rooms column. With this we have the dependent and independent variables. Okay, before we create our simple linear regression, let's explore the data set. A common way to explore a data set is with a scatter plot, which lets us see how the data is distributed. To make the scatter plot I'll use the plot method that comes with pandas: I write df_boston.plot with the kind parameter equal to scatter, then add the x and y parameters, setting x to rooms and y to value. With this we can plot the data frame. I press Ctrl+Enter and here is our scatter plot: on the x-axis we have the rooms, on the y-axis the value, and most of the data is located in the middle. What we're going to do in the next step is create a simple linear regression that best fits these data points: the line that fits best is the one we're going to choose. So let's create this line and write the code here, and here to make a
regression we have to add a constant and fit the model. First we need to add a constant, because by default statsmodels doesn't add the constant we have in the simple linear regression equation; if you remember, there was a b0 that represented the constant. We can add it manually: I write sm.add_constant and, inside the parentheses, the independent variable, which is X. I set X equal to this, so X now represents the independent variable with a constant added. Now that we added the constant we can fit the model, using the OLS method: I write sm.OLS and, inside the parentheses, first the dependent variable y and then the independent variable X. Then to fit this model we use the fit method, so I write .fit(), and with this we fit our model. By the way, fitting the model means finding the optimal values in the linear regression equation, so we obtain a line that best fits the data points; a well-fitted model produces more accurate outcomes, and only after fitting the model can we predict the target values using the predictors. Okay, before I predict the values I'm going to name this lm, which stands for linear model. I run this and get an error, because I didn't run the previous cell, the one where we define the dependent and independent variables. So I press Ctrl+Enter there first to define the variables, then run this cell, and now we've added the constant and fitted the model. Now we can predict the values of these houses, so I'm going to use lm, the linear model we created, and now I
use the predict method: I write predict and pass the independent variable X, which represents the number of rooms. With this we predict the values, and here we get the values of the houses, predicted from the number of rooms. Okay, now let's see how this linear model performs, using the regression table. This table provides an extensive list of results that reveal how good or bad our model is. To get it we only have to use the summary method: lm.summary(). The output has three parts. The first is the OLS regression results; OLS stands for ordinary least squares, the most common method to estimate a linear regression. In the second and third tables we'll see more values. Now let's look at the most important ones. Dep. Variable means dependent variable, and we can see ours is value, the value of the houses. Then we have R-squared, which takes values from 0 to 1: values close to 0 correspond to a regression that explains none of the variability of the data, while values close to 1 correspond to a regression that explains the entire variability of the data. The R-squared we got tells us that the number of rooms explains 48.4% of the variability in house values. Then we have the coef column with the coefficients of the linear regression equation: the first is the constant, and the second is the coefficient that corresponds to the rooms variable. The std err column represents the accuracy of the prediction; the lower the standard error, the better the prediction. Finally, the t and P values are used for hypothesis tests. Okay, now let's use these coefficients to create the line that best fits the data points that
we've seen before in the scatter plot. I already have the coefficients: about 9.1 for the rooms variable and -34 for the constant. The linear equation here is a·x + b, and we're going to replace a and b with the coefficient values. I create y_pred equal to 9.1021 multiplied by the x variable, the number of rooms, so I select the rooms column, plus b, which in this case is -34, so I just write minus 34. With this we have the equation of the line that best fits the data points we've seen before. Now let's plot this line. To do it I'll use the data visualization libraries seaborn and matplotlib, because these two libraries let me put the scatter plot and the line plot in one graph. Here is the code to plot the line with the scatter plot; you don't have to memorize it. The most important part is this line, where we indicate that we want a line plot: as the y parameter we pass y_pred, the equation we created before, and as the x parameter the rooms column of the data frame. That's everything you need to know. I run this and we see the scatter plot again, but now we also see the line plot, in red, and we can say that this red line is the line that best fits the blue data points. By the way, you don't need to create this plot; I just made it to show you how the regression line looks. That's it for this video; in the following video we'll see how to make a multiple linear regression with Python. All right, in this video we're going to create a multiple linear regression with statsmodels. First we read the same data frame, df_boston, and then we import statsmodels: import statsmodels.api as sm. With this we can start this and
here I just need to add the s, and with this I've imported statsmodels. Now let's define the dependent and independent variables; in this case it's variables, plural, because we're going to have more than one. In df_boston the dependent variable is the value column, and the independent variables are going to be the rooms and distance columns. Let's define this with pandas: the dependent variable is df_boston, square brackets, value, and I set this equal to y. Now I can copy this one (here I forgot to write _boston, so I fix that) and use it to create the independent variables too: I write X, and instead of value I select rooms and distance, with double square brackets because we want to select two columns. I copy, paste, add a comma and the quotes, and with this we have the dependent and independent variables. I run this, and now it's time to make the regression, so let's add a constant and fit the model. To add the constant we follow the same steps as in the simple linear regression: we write sm.add_constant and, inside the parentheses, the independent variables X, and set X equal to this, so we get the independent variables with a constant. Now I fit the model with OLS: I write sm.OLS, passing first the dependent variable y and then the independent variables X, then fit the model with the fit method, and set this equal to lm, which stands for linear model. With this we run and we've created our multiple linear regression, so now we can see how this model performs
by using the summary method: we only have to write lm.summary(), and we get the same kind of table as in the simple linear regression. The table is similar, but in the second part you'll see an extra row, corresponding to the independent variable we added: before we only had the rooms variable, now we also have distance. We can also see that R-squared increased a little, and overall the analysis of this table is similar to the one we obtained for the simple linear regression. That's it for this video; we learned how to implement multiple linear regression in Python. All right, in this video we're going to create a linear regression using sklearn. sklearn, also known as scikit-learn, is the standard machine learning library in Python, and it can help us make either a simple or a multiple linear regression. Here we're going to make a multiple linear regression with sklearn, but the steps for a simple linear regression are the same. Let's start by importing linear_model from sklearn: from sklearn import linear_model. This linear_model has everything we need to create our linear regression, so I run this. After importing, we have to define the dependent and independent variables, exactly as we did with statsmodels, so I'll write it again: the dependent variable is df_boston with the value column selected, which is the target; then I copy and paste it to create the independent variables, writing X equal to df_boston with double square brackets, since we're selecting two or more columns, and pasting rooms and distance. With this we create the independent variables, so after we do this
we're going to fit the model, using the fit method. By the way, with scikit-learn we don't need to add a constant as we did with statsmodels, because scikit-learn adds the constant by default, so we only have to fit the model. I write linear_model, which comes from what we imported before, and use the LinearRegression class: I write LinearRegression(), and set it equal to lm, which stands for linear model. Then to fit the model I use the fit method and write X comma y. Keep in mind that the order here is different from statsmodels: in statsmodels you write y first and then X, but in scikit-learn you write the independent variables first and then the dependent variable, so keep that in mind. Now let's run this cell; with this we fit the model. Now let's predict the values: we write lm again and use the predict method, passing the independent variables X. I run it and we predict the values of the houses using the independent variables; we get an array that represents all the predicted values. If you want only the first five predicted values, you write square brackets with colon five, [:5], to select the first five values of the array. Then, if we want the values from the summary table we got with statsmodels, we have to use individual methods, because scikit-learn doesn't have a single method that produces the same summary table as statsmodels. For example, to get the R-squared score we use the score method: I write lm.score and pass X comma y, the independent and then the dependent variables. I run this and we get the R-squared, which is the same value as the one we got with statsmodels.
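A minimal scikit-learn sketch of the fit/predict/score steps just described, on synthetic two-predictor data (the names and coefficient values are hypothetical stand-ins for df_boston, not the course data set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for df_boston: two predictors plus noise
rng = np.random.default_rng(1)
rooms = rng.uniform(4, 9, 300)
distance = rng.uniform(1, 12, 300)
value = 8.0 * rooms - 0.5 * distance - 30 + rng.normal(0, 1, 300)

X = np.column_stack([rooms, distance])   # two predictors; no constant needed,
y = value                                # scikit-learn adds the intercept itself

lm = LinearRegression()
lm.fit(X, y)                 # note the order: X first, then y
preds = lm.predict(X)
print(preds[:5])             # first five predicted values
print(lm.score(X, y))        # R-squared
print(lm.coef_)              # coefficients for rooms and distance
print(lm.intercept_)         # the constant b0
```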
We can also get the coefficients: we write lm.coef_ and run it, and we get the same coefficients we got with statsmodels. We can also get the intercept, using the intercept_ attribute: we write lm.intercept_ and run it, and we see the intercept is -34, the same value we got in the statsmodels summary table. I'm not going to explain the meaning of these values again, because they're the same values we got before, so the analysis is the same; if you want, you can compare the values to verify that we got the same result. In the following videos we're going to use the scikit-learn library, because it's the standard machine learning library used in Python and is more powerful than statsmodels. All right, in this video we're going to see the data set we'll work with in this section. First let's import pandas and then read the data set: import pandas as pd, then we use the read_csv method to read the IMDb dataset CSV file; this is the name of our data frame, which we now read. The data set we have is an IMDb data set that contains 50,000 movie reviews, with two columns. The first column is the review column, where you'll find what users think about the movie, that is, whether the movie was good for them or terrible; based on that review we get a positive or negative sentiment, and you'll find that data in the sentiment column. If it's a good review, with words like great, awesome, or good movie, the sentiment will be positive; if the review has words like bad or terrible, the sentiment will be negative. So our sentiment column only has positive and negative values. I chose this data set to create a basic machine learning model
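Since the 50,000-row IMDb CSV isn't bundled here, a toy frame with the same two columns can illustrate the structure just described (the rows below are invented for illustration):

```python
import pandas as pd

# Tiny stand-in for the IMDb file: same columns, hypothetical rows
df = pd.DataFrame({
    "review": ["Great movie, awesome cast", "Terrible plot, bad acting",
               "Good movie overall", "Bad pacing, terrible ending"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

print(df["sentiment"].value_counts())          # class counts per sentiment
pos = df[df["sentiment"] == "positive"]        # filtering, as used later
print(len(pos))
```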
so the model that we're going to create will predict whether a review is positive or negative, and we're going to use this data set as input to fit our model. But first we need to transform this data frame in order to build the model. We're going to try different machine learning models in this section, and our final goal is to find which model is best suited to predict the sentiment given a movie review. So the sentiment is going to be the output of our model and the movie review is going to be the input; remember that, because it's the foundation of this project. All right, as you can see this data frame has 50,000 rows; however, to train our model faster we're going to take a smaller sample of 10,000 rows. This small sample will contain 9,000 positive and 1,000 negative reviews, making the data set imbalanced on purpose, so I can teach you undersampling and oversampling techniques in the following videos. We're going to create this small sample with the code below: code that takes a sample of 10,000 rows to make processing faster and gives us imbalanced data. Before we create the sample, let's give a name to this data frame: I set it equal to df_reviews and run the code, and now we have our data frame. Now let's take the sample of 10,000 rows. I write df_reviews and then a condition so I select only the data I want: I select the sentiment column, which is the one we want, and write a double equal sign to select only the positive sentiment. Now I make the selection by writing the name of the data
frame and putting this condition inside square brackets; we learned how to filter a data set like this in the previous videos. I run this and look at the output: here we have the data frame, and all the data in the sentiment column is positive, so we only have positive sentiment. We have 25,000 rows here, but as I mentioned before we only want 9,000 rows of positive sentiment, so I write colon and then 9000 to get only the first 9,000 positive reviews. Now let's do the same with the negative sentiment: I write negative and select only the first 1,000 rows. With this we have 9,000 positive reviews and 1,000 negative reviews. Let's name these two data frames df_positive and df_negative: 9,000 rows in df_positive and 1,000 in df_negative. I write it here and run the code, and now we have our two data frames. Now let's concatenate them vertically so we get a single data frame, as we had before in df_reviews: I write pd.concat, then parentheses, open square brackets, and write the names of the two data frames, first df_positive and then df_negative. I run this and we've concatenated the two data frames vertically, so now we have a data frame that contains 9,000 positive and 1,000 negative reviews. Let's name it df_reviews_imb; imb stands for imbalanced. We say we have imbalanced data when we have a large amount of data for one class and much fewer
observations for another class; this is known as imbalanced data because the number of observations per class is not equally distributed. Here we have 9,000 positive reviews and 1,000 negative reviews, and 9,000 is much bigger than 1,000, so we can say we have imbalanced data because the sentiment of the reviews is not equally distributed. Okay, this df_reviews_imb now represents our imbalanced data, so I run this and let's see how the data frame looks: this is what we got after concatenating the positive and negative reviews. Now let's see how it's distributed, using the value_counts method: I write value_counts on the sentiment column, so I open quotes and write sentiment, then run it, and we see there are 9,000 positive and 1,000 negative reviews, so we verified that the data is indeed imbalanced. In the following videos we'll see how to deal with imbalanced data. That's it for this video: we had a look at the IMDb data set, selected only some positive and some negative reviews to make the data imbalanced, and created the df_reviews_imb data frame. In the next video we'll see how to deal with imbalanced data. In this video we're going to see the undersampling and oversampling techniques. Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set, and both help us deal with imbalanced data. In some cases we should avoid imbalanced classification because it can hurt the performance of our machine learning algorithms. That said, both oversampling and undersampling involve introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data or likely to develop if a purely random sample were taken. Okay, let's see how both undersampling and oversampling work.
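As a concrete illustration of adjusting the class distribution, here is a minimal pandas-only sketch of both ideas on toy data (the rows are invented; the 9-to-1 ratio mirrors the 9,000/1,000 split above):

```python
import pandas as pd

# Toy imbalanced frame: 9 positive vs. 1 negative (scaled-down 9,000 vs. 1,000)
df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": ["positive"] * 9 + ["negative"],
})

n_min = (df["sentiment"] == "negative").sum()   # minority class size
n_maj = (df["sentiment"] == "positive").sum()   # majority class size

# Undersampling: delete samples from the majority class
pos_under = df[df["sentiment"] == "positive"].sample(n=n_min, random_state=0)
under = pd.concat([pos_under, df[df["sentiment"] == "negative"]])

# Oversampling: duplicate samples from the minority class
neg_over = df[df["sentiment"] == "negative"].sample(n=n_maj, replace=True, random_state=0)
over = pd.concat([df[df["sentiment"] == "positive"], neg_over])

print(under["sentiment"].value_counts().to_dict())
print(over["sentiment"].value_counts().to_dict())
```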
First, let's see undersampling. Undersampling is deleting samples from the majority class. Here we have two classes, blue and yellow: the blue class is the minority class and the yellow class is the majority class. Now imagine we have movie reviews, where the minority is the negative reviews and the majority is the positive reviews. If we want to balance the data set, we have to undersample the majority class, that is, delete samples from it; in this case we delete samples from the positive reviews. We do this because we want the same number of observations in both categories, the same number of positive and negative reviews, so we reduce the number of positive reviews until the counts match. Now let's see oversampling. Oversampling is duplicating samples from the minority class. Again the majority class is the positive reviews and the minority class is the negative reviews; to balance the data by oversampling, we increase the number of observations in the minority class by duplicating some of its samples, and in the end we have the same number of observations in both classes, which means the same number of positive and negative reviews. Now you might be wondering which one is better, oversampling or undersampling. The answer depends a lot on the data you're working with and also on your project's goal: sometimes it's better to oversample your data, and in other cases it's better to undersample it. All right, now let's apply the oversampling and undersampling techniques to our project. Let's go to Jupyter Notebook, where we start by making a bar plot to show how the data is distributed. Now I'm going to
write the name of the data frame that we created before, df_reviews_imb, where imb stands for imbalanced; this is our imbalanced data. We use the plot method with the kind parameter set to bar, then the value_counts method to get the number of elements in the sentiment column: I copy sentiment, write value_counts(), and plot the output. I press Ctrl+Enter and we have the bar plot. This plot is useful for showing how the data is distributed; for this particular example it might look a bit unnecessary, but when you have many categories it becomes really useful. In the plot we can see that the majority class is the positive reviews, with 9,000, and the minority class is the negative reviews, with only 1,000. So we have two options: undersample the positive reviews or oversample the negative reviews. For this example I'm going to undersample the positive reviews, and I'll show you two different ways to undersample a class. Let's scroll down and sample the positive reviews with the sample method. To do this, first we have to calculate the length of the negative reviews: I write df_reviews_imb, select only the negative reviews by writing sentiment equal to negative, and filter with square brackets, so we only have negative reviews. To calculate their length we use the len function, and I set this equal to length_negative so we can see the length of the negative reviews. I print it: n is 1,000. Okay, now we're going to extract a sample from the positive reviews, and this sample is going to be equal to length_negative, so we get in
the end the same number of observations. I write df_reviews_imb, select only the positive reviews by writing sentiment equal to positive, filter with square brackets, and use the sample method: I write .sample and, inside the parentheses, the n parameter set equal to length_negative. With this we extract a sample of 1,000 from the positive reviews. Let's try it: in the output we see the sentiment column has only positive reviews, and we have only 1,000 rows, so we got 1,000 positive reviews, which is the output we wanted. I set this equal to df_reviews_positive; this is my new data frame. Now let's extract only the negative reviews, and then we'll concatenate the positive and negative reviews. I reuse the code that already gets the negative reviews, run it, and as you can see we have only negative reviews in the sentiment column, with 1,000 rows, so I set this equal to df_reviews_negative and run it to create that data frame. Now let's concatenate the two data frames, the positive and the negative: I write pd.concat, open parentheses, open square brackets, write the name of the positive data frame, then a comma, then the negative reviews data frame. We concatenate them and run, and here we have a data frame with positive and negative reviews, now with 2,000 rows. If we check the distribution of the sentiment column we'll see that the data is now equally distributed: 1,000 negative and 1,000 positive reviews, which is what we wanted; the data is balanced. Now I'm going to give this data frame a name: df_reviews_bal, where bal stands for
balance. I run this and we've created the data frame. Now let's check again, because I think the indexes are not properly assigned: first I delete this value_counts so we get the concatenation itself, run again, then run df_reviews_bal, and we see the indexes are not sorted in ascending order. Let's reset the index so it starts with 0 and then 1, 2, 3, and so on: I write reset_index and, inside the parentheses, drop=True, and I add the inplace parameter set to True to update the data frame in place. I run this, print the data frame, and now the index starts at 0. As you might remember, with value_counts we can verify again that the sentiment is equally distributed: now we have 1,000 positive and 1,000 negative. So this is the first way to balance data, using the sample method; now I'm going to show you an easier way. The second way to balance data is using the RandomUnderSampler, and to use it we first need to install the imblearn library: we write an exclamation mark and then pip install imblearn. We run this and wait a couple of seconds until the installation is done. Now that it's finished we can import the library: I delete this, and to import the RandomUnderSampler we write from imblearn.under_sampling import RandomUnderSampler; here I'll just copy and paste this to write it faster. I run it and we've successfully imported the RandomUnderSampler. Now, to sample the positive reviews, first we create an instance of the RandomUnderSampler class: I write rus = RandomUnderSampler and, in the parentheses, the random_state parameter; this is just a way to control
the randomness of the RandomUnderSampler. It isn't required, but here I set it equal to zero, random_state=0, and with this we create the rus, which stands for RandomUnderSampler. Okay, now, to resample the imbalanced data df_reviews_imb, we have to call the fit_resample method on the rus variable we created here. The x represents the data to be resampled and the y corresponds to the labels for each sample in x; in this case X is the review column and y is the sentiment column. Now let's resample the imbalanced data: instead of x I write df_reviews_imb, the imbalanced data set, and select only the review column with double square brackets; in the y argument I write the sentiment column, so I copy the name of the data frame and select only sentiment. Keep in mind that the first argument should be a data frame while the second should be a series, that is, 2D and 1D; this is how fit_resample works. You can also see that I used double square brackets for the data frame: double square brackets return a data frame as output, while a single pair of square brackets returns a series, so keep that in mind. I run this, get an error because I didn't write the name of the data frame correctly, add the missing s in both places, and run again. This is the output of the last line of code we wrote: the review column and the sentiment labels. Now let's assign this to a new data frame, which is going to be the balanced data frame: I write df_reviews_bal. Something I almost forgot to tell you: the output we got from the fit_resample method has two elements, and the first one is
this data frame, a data frame with only one column, and the second element is a series. It looks like one element here, but there are actually two: the first is a data frame because we passed a data frame as the first argument, so it makes sense that a data frame in gives a data frame out, and the second is a series because we passed a series, so a series in gives a series out. Let's continue: first, the data frame is named df_reviews_bal, and it represents the first element. Now I'm going to handle the series corresponding to the second element, the sentiment column: I write df_reviews_bal but select only one column, sentiment. Since this df_reviews_bal data frame is new, what we're doing here is creating a new column for it, the sentiment column, and adding it to the df_reviews_bal we're creating right here. We set all of this equal to the expression we wrote, and you can see that we resample the reviews and the sentiment together. Now let's run this and see the result: here we have df_reviews_bal, with the review and sentiment columns, and we see that we now have only 2,000 rows, so apparently we undersampled the data. Let's verify that: we use the value_counts method to count the elements inside the sentiment column; I copy the name of the column, open the parentheses, and run the code. We have 1,000 negative reviews and 1,000 positive reviews, so we successfully undersampled the data using
By the way, in this example I used the RandomUnderSampler because I chose to undersample the positive reviews, but if you'd rather oversample the negative reviews, you only have to use the RandomOverSampler, which you can also import from imblearn. And that's it: in this video we learned the oversampling and undersampling techniques. Okay, in this video we're going to see how to split data into train and test sets. Before we build our model, we have to split the data into a train set and a test set. The train dataset will be used to fit the model, while the test dataset will be used to provide an unbiased evaluation of the final model fit on the training dataset. Remember that we always have to split our data into train and test sets before building our model. All right, to split the data we're going to use the sklearn library, which comes with Anaconda, so we already have it installed. I write from sklearn.model_selection import train_test_split; this train_test_split will help us split the data. Now the only thing we have to do is use train_test_split and open parentheses. Inside, we introduce the name of our dataset, in this case df_review_bal, the balanced dataset we created before, and then we set the size of the split: I write the test_size parameter and set it equal to 0.33, which indicates that we want to assign 33% of the df_review_bal DataFrame to the test data, while the remaining 67% will be assigned to the train data. Now I assign the result to train, test, since train_test_split returns two values: the train set and the test set, the 67% and the 33%. And before running this code, I'm going to add one extra parameter.
You don't strictly need it, but I'm going to add it to control the random state. I add random_state and set it equal to 42; this gives me the same random split no matter how many times we run the code, so with this I control the randomness. Now I can run this, so I press Ctrl+Enter... and I got an error, because I didn't write the module name correctly: it's not model_select, it's model_selection. I fix that, run again, and now we've successfully created the train and test DataFrames. Let me show you both: this is the train DataFrame, with approximately 67% of the data, and the test DataFrame has 33% of the data. Another good practice to follow before building the model is to create X and y variables, where X represents the independent variables and y represents the dependent variable. For example, we should split train into train_x and train_y: train_x should hold only the predictors, also known as independent variables, and train_y should hold only the dependent variable, also known as the target label. In case you don't remember the definitions of independent and dependent variables, here's a quick recap: the dependent variable is the value we want to predict, also known as the target value, and the independent variable is the predictor. In our example the dependent variable is the sentiment, because that's the value we want to predict, and our independent variable is the review, because that's our predictor. So with the review, we predict the sentiment. Okay, now let's set the independent and dependent variables within our train and test sets. I write train and select only the review column, then a comma, then train again, selecting only the sentiment column. The review is my predictor, so it's my independent variable.
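The split described above can be sketched in a few lines; the toy DataFrame stands in for the real df_review_bal, while train_test_split, test_size, and random_state are the actual scikit-learn API used in the video:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced data standing in for df_review_bal (contents made up)
df_review_bal = pd.DataFrame({
    "review": [f"text {i}" for i in range(100)],
    "sentiment": ["positive", "negative"] * 50,
})

# 33% of rows go to the test set; random_state=42 makes the split reproducible
train, test = train_test_split(df_review_bal, test_size=0.33, random_state=42)

# X = predictors (independent variables), y = target (dependent variable)
train_x, train_y = train["review"], train["sentiment"]
test_x, test_y = test["review"], test["sentiment"]

print(len(train), len(test))
```

Because random_state is fixed, re-running the cell always produces the same rows in each split.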
I set it equal to train_x: X is the independent variable, also known as the predictor. The sentiment is the dependent variable, so I set it equal to train_y: y is the dependent variable. So X is independent and y is dependent. I'm going to do the same for the test dataset, writing test instead of train, and here I'll show you a shortcut: if I press the F key (in Jupyter's command mode), the find-and-replace dialog opens. I write the word train, it matches the four occurrences, and I can replace them all with test. I click replace all, and voilà, we have test instead of train. Now I copy and paste this, run the code, and we see the result. As I mentioned before, the train dataset, so these two variables, will be used to fit the model. Let's see how they look: first train_x, which contains only the review column, and then train_y, which should contain the sentiment column. Now let's count the number of positive and negative reviews: I use value_counts with parentheses, and we get 675 positive reviews and 665 negative reviews. And that's it; in the following videos we'll use the train_x and train_y we just created. In this video we're going to learn what bag of words is. The bag-of-words model is a representation that turns text into fixed-length vectors; this lets us represent text as numbers so we can use it in machine learning models. The model doesn't care about word order, only about the frequency of words in the text. The typical bag-of-words workflow involves cleaning raw text, tokenization, building a vocabulary, and generating vectors. We can do all of this in Python using a tool called CountVectorizer, which takes care of most of the bag-of-words workflow.
To understand how the CountVectorizer works, let's see an example. In this example I have two reviews that show opinions people have on Python and on Java. To implement bag of words through the CountVectorizer, we only have to count the number of times each word appears in both sentences, so we can build a table showing the number of times each word appears in each sentence. Let's have a look: this is the table I built, and as you can see, it has the two sentences, or two reviews, and how many times each word appeared in each one. For example, the word code appeared two times in the first review; in that review you can see the word code after the word writing and again after the word Python, so yes, it appears twice. This table represents a document-term matrix, which we'll see more of later, and we'll obtain this document-term matrix by using the CountVectorizer in Python. By the way, keep in mind that by default, single-character words are not taken into account by the CountVectorizer, since its default token pattern only matches words of two or more characters. All right, now let's see how to implement the CountVectorizer in Python. We go to Jupyter Notebook and import it: from sklearn.feature_extraction.text import CountVectorizer. Now I'm going to paste the text we had before... okay, I just pasted it, and now I'm going to create a DataFrame. I use pd.DataFrame, open parentheses, and create the DataFrame from a dictionary, so I open curly braces and create the first column, review, with the elements review1 and review2, and then I create a second column.
It's named text, and its value is the text list I have here, so I just paste the variable. Now I'm going to create an instance of the CountVectorizer: I copy and paste CountVectorizer, open parentheses, and that's how I create an instance. I'm going to name it cv, which stands for CountVectorizer. We can add a parameter here: I add the stop_words parameter, which will help me filter out common stop words. After this I'm going to use this cv and fit it to the data, so I use the fit_transform method, and inside the parentheses I write the name of the data I want to fit, in this case the text column: df, square brackets, text. This column has the data I want to fit to the cv, the CountVectorizer. By the way, the fit_transform method works in two steps: fit learns the parameters from the data (here, the vocabulary), and transform then transforms the data using those learned parameters. Okay, we can run this to see the result so far. As you can see, the result of fitting and transforming this data is a sparse matrix that we can't actually read directly, but if I create another DataFrame, we can display the sparse matrix as a table. I'm going to create that DataFrame and name it df_dtm, where dtm stands for document-term matrix, which is what we're going to build. I use pd.DataFrame again, open parentheses, and the first argument is the data, and the data we want to show in this DataFrame is the sparse matrix, so I'm going to give it a name.
The name is going to be cv_matrix. I run that, and then we pass this sparse matrix as the first argument so it's displayed in the DataFrame. The second argument is the index, and the index is equal to df, square brackets, review, so we get review1 and review2 as the indexes. By the way, I almost forgot to add toarray(): right now this is a sparse matrix, but I want an array, so I use toarray() to get the values inside. And here we also want just the values: right now we have a Series, the review Series, but I only want its values, so I use the values attribute. Next I add the columns parameter and set it equal to cv.get_feature_names() (called get_feature_names_out() in newer scikit-learn versions), open parentheses, and with this, each word that appears in the two sentences becomes a column of the df_dtm DataFrame. Our DataFrame is ready, so I run it and show you the result. Here it is: this is the same table we had in the slides, built with Python. In this case it might seem unnecessary to build such a little table, but imagine you have many sentences, or a movie script with a lot of text; in that case you'll need a tool like the CountVectorizer to show the frequency of the thousands of words that script has. And that's it for the CountVectorizer. Now let's see another way to implement bag of words in Python: TF-IDF, which stands for term frequency-inverse document frequency, so the first part, the term frequency, is the TF, and the last part is the IDF. Unlike the CountVectorizer, TF-IDF computes weights that represent how relevant a word is to a document in a collection of documents, also known as a corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
TF-IDF has applications in information retrieval, like search engines that aim to deliver the results most relevant to what you're searching for: when you search for a keyword, Google returns the most relevant information to you, which is a basic TF-IDF-style application. Okay, before we see the Python implementation of TF-IDF, let's go through an example so you have a good idea of how the TF and the IDF are calculated. The TF-IDF is calculated by multiplying the TF and the IDF, so first let's see how to calculate the TF. The TF is the term frequency, and it looks like the CountVectorizer, because here we only count the number of times a word appears in a text. As an example, let's take the same two sentences we used for the CountVectorizer. We get a similar output, but in this case the two columns are TF1 and TF2, the term frequencies of the first and second sentences. In the table we see, for example, that the word love appeared two times in TF1 but didn't appear in TF2; TF1 represents the first sentence and TF2 the second. With this we've calculated the TF, exactly the same as the CountVectorizer. Keep in mind that some people use a different definition of term frequency, but in most cases it's simply the number of times a word appears in a text, exactly what the CountVectorizer computes. Okay, now that we have the term frequency, let's see how to find the IDF, the inverse document frequency. The IDF is calculated with the formula you see on screen; it may look intimidating, but when we implement this in Python, we don't have to do any calculation by hand, we just use the sklearn library. By the way, sklearn uses the natural logarithm instead of the log you see in this formula. So now let's calculate the IDF values for each word the way sklearn does it, using the natural logarithm.
Okay, in this table I replaced the values in the formula with the data from the two sentences I showed you before. As you can see, the numerator in all the calculations is three, and that's because the numerator is the total number of sentences plus one: we have two sentences, and 2 + 1 = 3, which is why all the numerators are three. The values in the denominator change, however: we have two, then three, then two, and so on, because the denominator depends on the number of sentences containing that specific word (plus one), which changes from word to word; for example, the word writing appears in more sentences than the word love. Okay, now we have all the IDF values: here is IDF1 (I mislabeled it on the slide) and here is IDF2. Now that we have IDF1 and IDF2, we can calculate TF-IDF1 and TF-IDF2: we simply multiply TF1 by IDF1 and TF2 by IDF2, and here are the values. In this table, the higher the TF-IDF score, the more unique the term. For example, the highest scores in TF-IDF1 are for love and Python, so these two words are more unique to the first sentence, while in the second sentence the words hate and Java are more unique, because they have the highest scores there. So we can say that if the TF-IDF score is high, the term is more unique, or more valuable; in contrast, if the score is low, the word is not unique and probably appears in more sentences or documents. For example, the words writing and code appear in both sentences, and that's why they got a low TF-IDF score. Okay, now let's see how to implement TF-IDF in Python; let's go back to Jupyter Notebook.
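The hand calculation above can be reproduced with a few lines of plain Python. This uses scikit-learn's smoothed convention idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is how many of them contain the word, which matches the "plus one" numerators and denominators in the table:

```python
import math

n_docs = 2  # our two example reviews

def idf(df_count, n=n_docs):
    # scikit-learn's smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + df_count)) + 1

# "love": appears twice in review 1 (tf = 2) and in only one document (df = 1)
tfidf_love = 2 * idf(1)

# "code": also appears twice in review 1, but in both documents (df = 2)
tfidf_code = 2 * idf(2)

print(round(tfidf_love, 2))  # high score: unique to review 1
print(round(tfidf_code, 2))  # lower score: shared across documents
```

With the same term frequency, the document-unique word ends up with the clearly higher score, which is exactly the point of the IDF factor.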
I'm going to copy the code that created the CountVectorizer example, the text variable, and paste it here. Now I'm going to import the vectorizer we need, the TfidfVectorizer: from sklearn.feature_extraction.text import TfidfVectorizer. Next I create an instance: I write tfidf equals TfidfVectorizer, open parentheses, and add the stop_words parameter so I get rid of English stop words. Now I'm going to create a DataFrame so I can show you the table with the values we calculated manually before, but this time computed by sklearn. I write df and create the same DataFrame as before; I just copy and paste it, since it's the same. Now let's fit and transform: I write tfidf.fit_transform, open parentheses, and introduce the text data we want, df, square brackets, text. I run this to see the result, and again we get a sparse matrix. I'm going to name this result tfidf_matrix, with an underscore, and run it to create the new variable. With this I'm going to create a new DataFrame to show the TF-IDF scores. By the way, fit_transform works here the same way as with the CountVectorizer: fit learns the parameters (the vocabulary and the IDF weights), and transform transforms the data using them. So now let's create the DataFrame: I write pd.DataFrame, open parentheses, write the name of this tfidf_matrix, and transform it to an array with toarray(). The index is going to be the same as before, so to save time I just copy it and paste it right here.
The index and the columns are the same because we're using the same text variable; we're just changing the approach, using the TfidfVectorizer instead of the CountVectorizer. We do have to make one slight change: instead of cv, here we write tfidf. Now it's ready, so I run it. By the way, the TfidfVectorizer uses L2 normalization by default, which is why we got these numbers. To get the same result I got in the slides with the formulas I showed you before, we have to add a parameter to this TfidfVectorizer: norm, set equal to None. By default the vectorizer applies L2 normalization, but with norm=None it uses the raw formulas from before, and we'll get the same result as in the slides. Let's try it: first I run this cell, then this one, and as you can see, we now have the same numbers we got in the slides. For example, in review one, love is 2.81, and in the slides TF-IDF1 for love is also 2.81, so it's the same result. With this we've successfully built our TfidfVectorizer, and in the next video we'll see how to use it in our project. All right, now it's time to put into practice the concepts we learned in the previous videos and turn our data into numerical vectors. In the previous videos we balanced our data: we had imbalanced data, and we used the RandomUnderSampler to balance it. Actually, we saw two ways to balance the data: first the sample method and then the RandomUnderSampler. For the analysis we're going to make in this and the following videos, we're using the data we got with the RandomUnderSampler. If you only used the first method, the sample method, there's nothing wrong; the result will be almost the same, the output will change slightly, but it won't affect the result much. But if you want the same output as in these videos, you should use the RandomUnderSampler too.
So I'm going to run this, and after this we split the data into train and test, so I run that again. Now we're going to take this train_x and turn its text data into numerical vectors using the TF-IDF we learned about in the previous video. I'm choosing TF-IDF because I want to identify unique and representative words for positive and negative reviews, and TF-IDF gives a higher score to words that are unique within sentences. Okay, to do this, let's first import the TfidfVectorizer as we did in the previous video: from sklearn.feature_extraction.text import TfidfVectorizer. I press Ctrl+Enter, and we've imported it. Now let's create an instance of this TfidfVectorizer: I write it with parentheses, add the stop_words parameter set to english, and assign it to tfidf. Now let's fit this tfidf with the data in train_x; we created train_x before, and this variable has the reviews we're going to fit. I write tfidf, then the fit_transform method, then copy and paste train_x, so with this we fit and transform train_x, and I assign the result to train_x_vector. I run this, and now we have train_x_vector, so I print it. This is a sparse matrix, a type of matrix in which most of the elements are zero. If I use the toarray method, we can turn it into a regular array; I run it, and as you can see, most of the values are zero. If you look at the output, you'll see this is a matrix of 1,340 rows and 20,625 columns, and if you multiply those numbers, you get a huge number of elements in the matrix.
However, only about 118,000 are nonzero values, so the percentage of nonzero values is tiny, and we can see that this sparse matrix has a lot of elements equal to zero. We can also create a DataFrame from this train_x_vector; it's not necessary, but it can help you visualize how the matrix looks. I'm going to create a DataFrame as I did in the earlier example, but in this case we'll have a lot of columns and a lot of rows. Here I pasted the code that creates this DataFrame. It's similar to the code we wrote before, but this time I'm not converting the matrix with toarray; I'm using the DataFrame's sparse accessor and its from_spmatrix method, so with pd.DataFrame.sparse.from_spmatrix I don't need to turn train_x_vector into a dense array as I did in the previous example. I run this, and now you can see the DataFrame; it has a lot of zeros, and in the columns we have the vocabulary used in the reviews, all the words mentioned in the reviews of this train_x. In case you forgot, all the vocabulary used in the reviews appears in the columns (every word of at least two characters is taken into account), and the indexes represent the number of each review. Okay, before we finish the video, we're also going to transform test_x, because test_x will help us evaluate the model we're going to build. I add a cell, copy the previous one, and paste it here, but instead of using fit_transform on train_x, I'm going to use it on test_x. Actually, we don't need fit_transform here, because we already learned the parameters when we fit and transformed train_x, and those parameters live in the tfidf object, so here I can use only the transform method. There is a fit method, a transform method, and a fit_transform method that does both.
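The fit-on-train, transform-on-test pattern described here looks like this in isolation (the toy review lists stand in for the real train_x and test_x):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data standing in for the real train_x / test_x Series
train_x = ["a good movie", "a terrible movie", "an excellent film"]
test_x = ["what a good film", "a terrible waste of time"]

tfidf = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and IDF weights from the training data...
train_x_vector = tfidf.fit_transform(train_x)

# ...while transform reuses those learned parameters on the test data,
# so words unseen during training are simply ignored
test_x_vector = tfidf.transform(test_x)

print(train_x_vector.shape, test_x_vector.shape)
```

Both matrices share the same columns (the training vocabulary), which is what lets a model fitted on train_x_vector score test_x_vector later.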
In this case we only need transform, because the tfidf already knows the learned parameters. With this we transform test_x and name the result test_x_vector. I run it, and again we get a sparse matrix; let me show you, and this is the result. All right, in the following videos we're going to use train_x_vector to build the model and test_x_vector to evaluate it. In this video we're going to see the machine learning algorithms we can use in our project, but first let's see the difference between supervised and unsupervised learning. Machine learning algorithms are divided into supervised and unsupervised learning: in supervised learning, models are trained on labeled data, while in unsupervised learning, patterns are inferred from unlabeled input data. In our project the input and the output are clearly identified: the input is the movie reviews and the output is the sentiment, either positive or negative, so we can say we have labeled input and output data, and therefore we're dealing with supervised learning. Two common types of supervised learning tasks are regression and classification. Regression is used to predict continuous values, such as price, salary, or age, while classification is used to predict categories, such as male/female, spam/not spam, or positive/negative. There can be more than two categories, for example very positive, positive, negative, and very negative, so it doesn't need to be only two. Now let's recap and see which type of algorithm we should use in our project: first, we're dealing with supervised learning, because we have labeled input and output data; second, we have to use classification, because we want to predict the category of a review, whether it's positive or negative.
That's why we choose classification. Some of the popular algorithms used for classification are SVM, decision trees, naive Bayes, and logistic regression, and we're going to see the concept behind each of them as well as how to implement them in Python, starting in the next video. In this video we're going to see the support vector machine algorithm. The support vector machine algorithm, also known as SVM, is a supervised learning algorithm mostly used in classification problems. We only need to feed the SVM model labeled training data in order to categorize new text. This algorithm is a good option for text classification problems because it is fast and performs well with a limited number of samples, and in text classification we usually work with datasets of only a few thousand tagged samples. To understand better how SVM works, let's see an example. In this graph we have two tags, green and yellow, and two features, X and Y. In a dataset, a feature is simply a column representing a measurable piece of data that can be used for analysis. In this example we want to build a classifier that determines whether our data is green or yellow, so we plot each observation, also known as a data point, in an n-dimensional space, where n is the number of features. In our example we only have two features, X and Y, so the observations are plotted in a two-dimensional space. The SVM takes the data points and finds the hyperplane that best separates the classes; since the observations are in two dimensions, the hyperplane is simply a line, the red line you see now. This red line is also known as the decision boundary: it determines whether a data point belongs to one class or the other. In our example, if a data point falls on the left side, it will be classified as green, and if it falls on the right side, it will be classified as yellow.
Now you might be wondering what the best hyperplane is. Well, it's the one that maximizes the margins from both classes: you can draw many hyperplanes, but the best one maximizes the margin from both classes, and in this case that's the red line. So now let's see how to implement the SVM algorithm in Python. We can use the sklearn library: we only have to write from sklearn.svm import SVC; this SVC class implements the SVM algorithm. Now let's create an instance of SVC (which I should have written in uppercase, and now it is): I write SVC, then parentheses, and add the kernel parameter set equal to linear, then assign the result to svc. Then we fit the model: I write svc.fit, and inside the parentheses we pass the input and the output, in this case train_x_vector as the input and train_y as the output. This train_x_vector is the vector we got when we turned our text data into numerical vectors using the TfidfVectorizer, and later we'll use test_x_vector to evaluate the models we build. So now let's go back and fit this model by running the cell; I run it, wait a moment, and it's done. Here I used a linear kernel, but you can change the kernel type if you want. Now we can use this svc to predict sentiment: we write svc.predict, open parentheses, and inside we use the tfidf we created before, which will transform the text I'm about to write. Let's write a review that says a good movie. With the tfidf I initiated before, in the previous video, I transform this text into vectors, and then with the svc I predict the sentiment based on those vectors.
Now I run this, but first I wrap the text in square brackets, because predict expects a list. I run it, and we get that the sentiment is positive; the SVC probably detected good as a positive adjective, which is why it output a positive sentiment. Now let's try two more sentences. Okay, I just came up with two more movie reviews: this one says an excellent movie, and the last one is I didn't like that movie at all, plus some extra text. Let's see what the SVC predicts. I run this: the first is positive, and a good movie is indeed a positive review, so it was correct; the second is positive too, and an excellent movie is a positive review, so correct again; and the last one, I didn't like that movie at all, was predicted as negative. So it successfully predicted all three reviews, but with more complex sentences it may predict wrong, so we have to build the other algorithms and compare them to find the best one. And that's it for this video: we learned the concepts behind the SVM algorithm and how to implement it in Python. All right, in this video we're going to see how the decision tree algorithm works. The decision tree algorithm is a supervised learning algorithm that can be used for both regression and classification problems. We use a decision tree to build a model that predicts the class or value of the target variable by learning decision rules inferred from the training data. To predict a class label for a record, we start from the root of the tree, the first node you see here. Each node in the decision tree evaluates the record following a specific rule; then we follow the branch that corresponds to the result of the comparison and jump to the next node. Let's understand better how the decision tree works with an example.
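The whole fit-and-predict flow from this video can be sketched end to end with a tiny made-up training set (the real project uses the train_x_vector and train_y built earlier; the six reviews here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Tiny made-up training set standing in for the real review data
train_x = ["a good movie", "an excellent movie", "loved every minute",
           "a terrible movie", "awful acting", "hated this film"]
train_y = ["positive", "positive", "positive",
           "negative", "negative", "negative"]

# Turn text into TF-IDF vectors, then fit a linear-kernel SVM on them
tfidf = TfidfVectorizer(stop_words="english")
train_x_vector = tfidf.fit_transform(train_x)

svc = SVC(kernel="linear")
svc.fit(train_x_vector, train_y)

# New reviews must go through the same tfidf before predicting
preds = svc.predict(tfidf.transform(["a good movie", "a terrible movie"]))
print(preds)
```

Note that predict takes a list of documents, which is why the single review in the video is wrapped in square brackets.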
In this example we decide whether a customer will churn or not based on some rules; churning is defined as customers leaving a company, and here we're going to see whether a customer churns based on the monthly charges. At the root I have monthly charges less than 40, which is kind of a question: if the answer is yes, we follow the branch on the right, and if the answer is no, we follow the branch on the left. Let's say the monthly charge is 30; now we evaluate the next node. Imagine this is a telco company: if a customer has been with the company more than 3 years, we say the tenure is more than 3 years. If that's the case, say the customer has been with the company four years, we follow the branch on the right and set churn equal to no: a long-standing customer probably won't leave the company. But if the tenure is less than 3 years, so the customer has been with the company only a year or two, or just a few months, this customer probably will leave; if the answer to the question is no, we follow the other branch and churn equals yes, because a customer with only a few months or years with the company will most likely leave if the monthly charges are less than $40. With this we finish the analysis of the right branch. Now let's say the monthly charges are greater than 40, say this customer is charged $100 per month, so we follow the branch on the left and ask a new question: does the customer have more than one product? If the answer is yes, churn equals no: a customer with several products most likely won't churn, since companies usually give promotions and benefits to customers with more than one product. However, if the answer is no, then we keep asking more and more questions, and this is basically how a decision tree works.
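The rules in this walkthrough can be written directly as nested conditions, which is essentially what a trained tree encodes; the thresholds are the made-up ones from the example, and the final "keep asking questions" branch is collapsed into a single answer for brevity:

```python
def predict_churn(monthly_charges, tenure_years, n_products):
    """Hand-written version of the example decision tree from the video."""
    if monthly_charges < 40:
        # Right branch: low monthly charges, so tenure decides
        if tenure_years > 3:
            return "no"   # long-standing customer, unlikely to churn
        return "yes"      # newer customer on a cheap plan, likely to churn
    # Left branch: high monthly charges, so product count decides
    if n_products > 1:
        return "no"       # bundled customers usually get retention benefits
    return "yes"          # simplified: the real tree would keep asking questions

print(predict_churn(30, 4, 1))   # cheap plan, four-year tenure
print(predict_churn(100, 1, 1))  # expensive plan, single product
```

A fitted DecisionTreeClassifier learns thresholds like these from data instead of having them written by hand.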
So now let's see how to implement the decision tree in Python. Let's go back to Jupyter Notebook; we're going to use the sklearn library again. First we write `from sklearn.tree import DecisionTreeClassifier`. Then we create an instance of this DecisionTreeClassifier: open parentheses, and I'm not going to add any parameters, I'm going to leave the default values. I assign a name to this classifier: dec_tree. Now I copy and paste, and I fit this model: I write `.fit()` and use the x and y variables again, so train_x_vector and then train_y. Now I press Ctrl+Enter to fit this model, and we've successfully built it. And that's it for this video: we learned the core concept behind the decision tree algorithm and how to implement it in Python.

All right, in this video we're going to see how the Naive Bayes algorithm works. Naive Bayes is a supervised learning algorithm that uses conditional probability to predict a class. It assumes that every feature is independent of every other feature, which is not always the case, so we should always analyze our data before choosing the algorithm. The Naive Bayes algorithm is based on Bayes' theorem, which is the one you see now on screen. The first element represents the probability of event A given that event B has already occurred; the next one represents the probability of event A; this one represents the probability of event B given that event A has already occurred; and finally the one in the denominator represents the probability of event B. The assumption that features are independent of each other makes this algorithm fast compared to more complex algorithms; however, this assumption also makes Naive Bayes less accurate. So now let's see an example of the Naive Bayes algorithm. In this example we have data that shows how many times players played a match under
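The fitting step above can be sketched in a self-contained way. The tiny made-up dataset below stands in for the course's train_x_vector and train_y (which in the project come from a TF-IDF vectorizer):

```python
from sklearn.tree import DecisionTreeClassifier

# toy stand-ins for the course's train_x_vector / train_y
train_x_vector = [[25], [30], [80], [100]]   # e.g. monthly charges
train_y = ["yes", "yes", "no", "no"]         # churn labels

dec_tree = DecisionTreeClassifier()          # default parameters, as in the video
dec_tree.fit(train_x_vector, train_y)

print(dec_tree.predict([[28]]))  # -> ['yes']
```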
certain weather conditions. We have to find whether players will play if the weather is sunny; this is our task. Here we have the formula rewritten for our task: we want the probability of event "yes" given that event "sunny" has already occurred, P(yes | sunny) = P(sunny | yes) · P(yes) / P(sunny). Now let's calculate each element and then the probability we want. First we have to calculate the total sum in each row and column. Here we have the total sum of each row and column, and also the sum of all values, which is nine. Based on this data we can calculate the probability of each row and column; as you can see, the denominator is always nine because that's the total number of values. Now let's see the probability of yes, the probability of sunny given yes, and the probability of sunny. The first one, the probability of yes, comes from this first column: it's four divided by 9, where four is the number of times the answer is yes and nine is the total number of values, so we get 0.44 for this first probability. Now the second one: the probability of sunny given that yes has already occurred. In this case we divide three, which is the number of times players played on a sunny day, by four, which is the number of times players played in general; three divided by four is 0.75. Finally we calculate the probability of sunny, which comes from this row: five divided by the total number of values, so 5 divided by 9, which is 0.55. Now we replace these values in the formula and calculate the final probability, which is 0.6. So now let's see how to implement this in Python. We're
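The arithmetic above can be checked in a few lines; the numbers come straight from the weather table (9 observations in total):

```python
# P(yes | sunny) = P(sunny | yes) * P(yes) / P(sunny)
p_yes = 4 / 9              # 4 "yes" outcomes out of 9 observations
p_sunny_given_yes = 3 / 4  # 3 sunny days among the 4 "yes" outcomes
p_sunny = 5 / 9            # 5 sunny days out of 9 observations

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # -> 0.6
```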
going to use sklearn again, and we write `from sklearn.naive_bayes import GaussianNB`. Now we initiate an instance of this GaussianNB: I open parentheses and name this one gnb, which stands for GaussianNB. Now I fit this model: I write `.fit()` and, inside, train_x_vector and train_y. Here there is a little detail you have to know: unlike the previous algorithms we've seen, here we have to pass train_x_vector as a dense array, so we have to use the toarray() method. With this we transform train_x_vector into an array; if you don't do that, it's going to throw an error. Now let's run this, and with this we fit the model. And that's it: in this video we learned the concepts behind the Naive Bayes algorithm and how to implement it in Python.

OK, in this video we're going to see how the logistic regression algorithm works. Logistic regression is a supervised learning algorithm that is commonly used for binary classification problems. We can use logistic regression to predict whether a customer will churn or not, whether an email is spam or not, whether a sentiment is positive or negative, and more. Logistic regression is based on the logistic function, also known as the sigmoid function, which takes in a value and assigns a probability between zero and one; that's where we get the S-shaped graph of logistic regression that you see on the screen. To understand much better how logistic regression works, consider a scenario where we need to classify whether an email is spam or not. In the graph, if z goes to infinity our target value y becomes one, so the email is spam; and if z goes to negative infinity, y becomes zero, which means the email is not spam. As I mentioned before, the output value is a probability, so if we obtain a value of 0.7, there is a 70% chance that the email is spam. OK, now let's see how to implement this
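A minimal sketch of the GaussianNB step with toy dense data. Note the comment about toarray(): in the course, train_x_vector is a sparse TF-IDF matrix, which is why the conversion is needed there; the toy data below is already dense:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# toy dense stand-ins; with the course's sparse TF-IDF vectors you would
# call train_x_vector.toarray() first, since GaussianNB needs a dense array
train_x_vector = np.array([[1.0], [2.0], [8.0], [9.0]])
train_y = ["negative", "negative", "positive", "positive"]

gnb = GaussianNB()
gnb.fit(train_x_vector, train_y)

print(gnb.predict([[8.5]]))  # -> ['positive']
```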
algorithm in Python. We go back to Jupyter Notebook, and here I'm going to use sklearn one more time: I write `from sklearn.linear_model import LogisticRegression`. Now I initiate an instance of this LogisticRegression, open parentheses, and assign it to log_reg; this is the variable I'm creating. Now I'm going to fit this with the train_x_vector and the train_y: I write `.fit()`, then train_x_vector and then train_y. With this I fit the model, and that's it for this video. In the next video we're going to compare the four models we built in this project.

All right, in this and the following videos we'll see traditional metrics that are used to evaluate models, and the first one is the confusion matrix. A confusion matrix is a table that allows us to visualize the performance of an algorithm. This table typically has two rows and two columns that report the number of false positives, false negatives, true positives, and true negatives. To understand what these four values mean, let's see the following example. Suppose we have a computer program that recognizes dogs in photographs, and we give it the task of recognizing the five photos you see now on the screen. Here we have three photos of dogs and two photos of cats, so our computer program should recognize which photos belong to dogs. The program recognizes these three photos as photos of dogs; however, only two of them are actually dogs, and one of them is not a dog but a cat. The two photos of dogs are the true positives, because we wanted to obtain dogs and we obtained dogs. But the photo of the cat that was incorrectly recognized by the program is called a false positive: according to the program it's a photo of a dog, but we know it's the photo of a cat. So this is a false positive, because it was selected by the program but we know it's false because
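The logistic regression step can be sketched the same way as the other models, again with tiny made-up data standing in for the course's variables. The predict_proba call shows the sigmoid output mentioned above, a probability between zero and one:

```python
from sklearn.linear_model import LogisticRegression

# toy stand-ins for the course's train_x_vector / train_y
train_x_vector = [[0.0], [1.0], [4.0], [5.0]]
train_y = ["negative", "negative", "positive", "positive"]

log_reg = LogisticRegression()
log_reg.fit(train_x_vector, train_y)

print(log_reg.predict([[4.5]]))        # predicted class label
print(log_reg.predict_proba([[4.5]]))  # sigmoid output: probabilities in [0, 1]
```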
it's not a dog but a cat. Then the values that were not selected by our program are either true negatives or false negatives. In this case, the cat that was not selected is a true negative, because we asked the program to recognize photos of dogs and this is indeed not a photo of a dog but of a cat. In contrast, the photo on the right is a photo of a dog, but our program didn't select it for some reason; this photo should have been selected, and this is known as a false negative. OK, now that we know the meaning of the values in the confusion matrix, let's see the convention for the axes. The confusion matrix I showed you before, which is now on the screen, is the traditional confusion matrix you will see in books. Confusion matrices usually follow this convention: the true positive is in position number one, the false positive in position number two, the false negative in position number three, and the true negative in position number four. However, sklearn doesn't follow this convention; sklearn has a different ordering for the axes, and we should keep that in mind. Let's see how the confusion matrix looks by default in sklearn: by default we have the true negative in the first position, then the false positive, then the false negative, and then the true positive. This is the default order sklearn follows, and as you can see it's a bit different from the traditional confusion matrix. However, we can customize the layout a little bit by using the labels parameter. Now we're seeing sklearn with the labels parameter, which I'm going to show you later when we go to Jupyter Notebook. If we set the labels parameter to the values "positive", "negative" (or to the values 1, 0), we get an order that is quite similar to the traditional confusion matrix, and we're going to use this layout in this project. OK, now let's see how to
implement the confusion matrix in Python. We go back to Jupyter Notebook, and now we're going to calculate the confusion matrix of only the SVC, the one we created before. As you might remember, the SVC was the first model we built, and it came from support vector machines. We're going to create the confusion matrix of this SVC; afterwards you can create the confusion matrix of the decision tree, the Naive Bayes, or the logistic regression in exactly the same way. To do this we import confusion_matrix: we write `from sklearn.metrics import confusion_matrix`. Now we use this confusion_matrix and open parentheses. The first argument we introduce is test_y, also known as the true labels. The second argument is going to be the prediction, so we have to predict using the SVC model: we write svc.predict and introduce test_x_vector. With this we predict the values that correspond to test_x_vector, and we get the predictions of our SVC model. So we have the true labels first and then the predicted labels; remember that order when you build your confusion matrix. Now let's add the third parameter, the labels parameter, to customize the axes of the confusion matrix: I set it equal to "positive", "negative" (you can also write 1, 0), and with this we get the confusion matrix in the order we've seen before. Now I run this and we get an array with four numbers: the true positives, then the false negatives, then the false positives, and then the true negatives; so 45 false negatives and 60 false positives. And that's it; in the next video we're going to use these four values to evaluate the models we built.

In this video we're going to see the first metric to
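Here is a small self-contained example of the call described above, with made-up labels standing in for test_y and the SVC predictions, showing how the labels parameter changes the ordering:

```python
from sklearn.metrics import confusion_matrix

# made-up stand-ins for test_y and svc.predict(test_x_vector)
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]

# default order (labels sorted alphabetically): [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))

# with labels=["positive", "negative"]: [[TP, FN], [FP, TN]]
print(confusion_matrix(y_true, y_pred, labels=["positive", "negative"]))
```

Rows are the true labels and columns the predicted labels, in the order given by labels; that is why passing labels=["positive", "negative"] flips the matrix into the book-style layout.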
evaluate our models, and this one is the mean accuracy. The model accuracy is defined as the number of classifications a model correctly predicts divided by the total number of predictions made; that is, the true positives plus the true negatives divided by the total number of predictions. In the previous video we calculated the four values using the confusion matrix, so now let's look at that matrix. These are the four values we calculated using sklearn; by the way, this is the confusion matrix of only the SVC model. There are confusion matrices for the other models too, but to simplify things I'm only calculating the confusion matrix of the SVC model. Now let's see the accuracy of the SVC model: we replace the values in the formula, and with this we can calculate the accuracy, which is 0.84. Now let's see how to compute this accuracy in Python. In Python we use the score method: we write the name of the model, in this case svc, then score, and inside parentheses the input and the output. Let's go to Jupyter Notebook to see how to do it. Here in Jupyter Notebook I start by calculating the accuracy of the SVC: first I write svc, then score, then open parentheses, and here we use test_x_vector and test_y. In this case we use the test set and not the train set, because the test dataset provides an unbiased evaluation of a final model fit on the training dataset; we cannot use the training dataset again, but should use the test dataset to evaluate the model. I copy test_x_vector and test_y and paste them here; in fact, I can calculate the accuracy of all the models, I only have to change the names. Here I can write dec_tree, which was the name of the decision tree model, then gnb, which stands for GaussianNB, and then log_reg, which stands for logistic regression. Now to show all the
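The score() call can be sketched end to end with a toy train/test split standing in for the course's data; score() returns the mean accuracy on the set you pass in:

```python
from sklearn.svm import SVC

# toy stand-ins for the course's train/test split
train_x_vector = [[0.0], [1.0], [4.0], [5.0]]
train_y = ["negative", "negative", "positive", "positive"]
test_x_vector = [[0.5], [4.5]]
test_y = ["negative", "positive"]

svc = SVC()
svc.fit(train_x_vector, train_y)

# score() = mean accuracy on the held-out test set
print(svc.score(test_x_vector, test_y))
```

Because the toy data is cleanly separable, this prints a perfect score; real data like the course's review dataset gives values such as the 0.84 above.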
values, I'm going to print all four of them, and then we'll compare which one has the best accuracy. I print this and I get an error, because I forgot to transform test_x_vector into an array: here we have to use the toarray() method, and this is only required for the GaussianNB, as we've seen before. Now we run it, and we get the accuracy of the four models; the model with the highest accuracy is the SVC model with 0.84, which is the same value we obtained by calculating it manually. That said, this is not the only way to assess the performance of a model, so in the next video we're going to see a different metric.

OK, in this video we're going to see another metric that will help us evaluate our models: the F1 score. The F1 score is the weighted average of precision and recall. Accuracy is usually used when the true positives and true negatives are more important, while the F1 score is usually used when the false negatives and false positives are crucial. F1 also takes into account how the data is distributed, so it's useful when you have data with imbalanced classes. The F1 score is calculated with this formula: we multiply precision and recall, divide by the sum of precision and recall, and multiply by two. Now you might be wondering what precision and recall are. To explain the meaning of these two values, let's look at this graph, which shows the elements of the confusion matrix: true positives, false positives, true negatives, and false negatives. Let's say again that we have a computer program, and that it selected this circle area with green and pink sides. This circle contains the selected elements, and the left rectangle contains the relevant elements; these relevant elements cover all the elements we're interested in. This is why the selected
elements that are relevant are called true positives, and the relevant elements that were not selected are called false negatives. With that in mind, to get the precision we should ask ourselves the following question: how many selected items are relevant? The selected elements are in the circle, and those that are relevant are in the green area, so the precision is the true positives divided by the true positives plus the false positives; that is, the green area divided by the whole circle. If we replace the values we got in the confusion matrix, we get 290 divided by 350, which is 0.828, and this is the precision of the SVC model. To calculate the recall we ask ourselves another question: how many relevant items are selected? The items that were selected and are relevant are inside the green area, and all the relevant elements are inside the left rectangle, so the recall is the true positives divided by the true positives plus the false negatives. Using the values from the confusion matrix of the SVC model, we get 290 divided by 335, which is 0.865. Now if we replace the precision and recall in the formula, we get an F1 score of about 0.845, and that's how you calculate the F1 score. So now let's implement the F1 score in Python, so we don't have to do all these operations manually, but with sklearn instead. We go back to Jupyter Notebook, and here I'm going to import f1_score: I write `from sklearn.metrics import f1_score`. Now we use this f1_score, open parentheses, and again use the test dataset: we write test_y, then the prediction, which comes from the SVC because we're getting the F1 score of the SVC model, so svc.predict, then parentheses, and we use test_x_vector. Here we have the true labels and the predicted labels. Now we add the labels: labels equal to "positive", "negative". So
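The arithmetic above can be reproduced directly from the confusion-matrix values of the SVC model (TP = 290, FP = 60, FN = 45); any tiny differences from the on-screen numbers come from rounding:

```python
tp, fp, fn = 290, 60, 45  # SVC confusion-matrix values from the example

precision = tp / (tp + fp)  # 290 / 350
recall = tp / (tp + fn)     # 290 / 335
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3))  # -> 0.829
print(round(recall, 3))     # -> 0.866
print(round(f1, 3))         # -> 0.847
```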
with this we can calculate the F1 score, but I'm going to add another parameter, average, and set it equal to None. Now I press Ctrl+Enter to calculate the F1 score, and we see that this first value is similar to the one we obtained manually. And that's it: in this video we learned the concepts behind the F1 score and how to implement it in Python.

OK, now let's see how to create a classification report. A classification report shows the main classification metrics we've seen so far, for example the precision, the recall, and the F1 score. Let's see how to create one in Python. We go back to Jupyter Notebook, and here we use sklearn again: we write `from sklearn.metrics import classification_report`. Now we use this classification_report: we open parentheses, write test_y, then predict the values using the SVC model, so svc.predict with test_x_vector, and then we add the labels, which are going to be "positive" and "negative". As you can see, the arguments are in the same order we've seen in the metrics before. Now I press Ctrl+Enter to run this, and we get the classification report, but it's not properly printed, so I'm going to use the print function: I write print, open parentheses, and run it again, and we get the classification report properly printed. Here we have the metrics we calculated so far: the precision, recall, and F1 score. If you want, you can compare these values with those we calculated before; they should be the same. And that's it: in this video we learned how to create the classification report in Python.

OK, now it's time to see how to maximize our model's performance based on the metrics we've seen so far. We found that the SVC model performs better than the other models, but we can still maximize
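A self-contained sketch of the classification report call, again with made-up labels standing in for test_y and the SVC predictions:

```python
from sklearn.metrics import classification_report

# made-up stand-ins for test_y and svc.predict(test_x_vector)
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "positive", "negative"]

report = classification_report(y_true, y_pred, labels=["positive", "negative"])
print(report)  # per-class precision, recall, f1-score and support
```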
our model's performance, and we can use GridSearchCV to make an exhaustive search over specified parameters in order to obtain the optimal hyperparameter values. To do this we have to import GridSearchCV from sklearn, and after this we have to define the parameters to search over. I write parameters equal to, and here I open a dictionary: the first key is going to be "C", with a list of values, 1, 4, 8, 16, 32; then I write the "kernel" key, with a list of two types, "linear" and "rbf". Then we initiate an instance of the SVC algorithm: we write SVC with parentheses and assign it to svc. Then we use GridSearchCV: as the first argument we pass svc, as the second the parameters, and I add another parameter, cv, set equal to five. I name this svc_grid, because it comes from GridSearchCV. Now I use this variable and fit it with train_x_vector and train_y: I write .fit, then train_x_vector and then train_y. As you can see, this code is similar to the code we wrote to fit the SVC model before, but now we specify some parameters to obtain the optimal model. Let's fit the model: I press Ctrl+Enter. This might take some time, so I'm going to cut the video and come back when it finishes. Now the execution is finished, and before I show you how to get the best parameters: you can pass more parameters to this GridSearchCV, and you can see a list of them by pressing Shift+Tab. There is a list of more advanced parameters; for example, you can set the number of jobs and customize this even more. Now let's see how to get the optimal parameters of this SVC model: here I copy svc_grid, and then we write the following attribute, best_params
underscore, so best_params_. We run this, and we get the optimal parameters of this SVC model. We can also get the best estimator of this model by using the best_estimator_ attribute: I copy and paste this, and instead of params I write estimator. With this we get the best estimator of the SVC model; I run it, and here we see the best estimator. And that's it: in this video we learned how to find the optimal parameters of a model using GridSearchCV.
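The whole grid search described above can be sketched end to end. The toy data below stands in for the course's train_x_vector and train_y; the parameter grid matches the one built in the video:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy stand-ins for the course's train_x_vector / train_y
train_x = [[float(i)] for i in range(10)]
train_y = ["negative"] * 5 + ["positive"] * 5

# same grid as in the video: five C values, two kernels
parameters = {"C": [1, 4, 8, 16, 32], "kernel": ["linear", "rbf"]}
svc_grid = GridSearchCV(SVC(), parameters, cv=5)
svc_grid.fit(train_x, train_y)

print(svc_grid.best_params_)     # dictionary with the best C and kernel found
print(svc_grid.best_estimator_)  # fitted SVC using those parameters
```

GridSearchCV fits one model per parameter combination and cross-validation fold (here 10 combinations x 5 folds), which is why the search in the video takes a while on real data.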