Transcript for:
Introduction to Data Analysis with Python

Welcome to our Data Analysis with Python tutorial. My name is Santiago, and I will be your instructor. This is a joint initiative between freeCodeCamp and RMOTR. In this tutorial, we'll explore the capabilities of Python and the entire PyData stack to perform data analysis. We'll learn how to read data from multiple sources such as databases, CSV, and Excel files, how to clean and transform it by applying statistical functions, and how to create beautiful visualizations. We'll show you all the important tools of the PyData stack: pandas, matplotlib, seaborn, and many others. This tutorial is going to be useful both for Python beginners who want to learn how to manage data with Python and for traditional data analysts coming from Excel, Tableau, etc. You'll learn how programming can power up your day-to-day analysis. So let's get started.
Welcome to our Data Analysis with Python tutorial. My name is Santiago, and I am an instructor at rmotr.com, an online data science academy. This tutorial is the result of a joint effort by RMOTR and freeCodeCamp, and it's totally free. It includes slides, Jupyter notebooks, and coding exercises. Let me tell you a little bit more about RMOTR. We're an online, hands-on data science academy. We specialize in data science, including data analysis, programming, and machine learning. We have a complete course catalog, and we're adding more content every month. If you're interested in learning data science or data analysis, check us out. As part of this joint effort between freeCodeCamp and RMOTR, you can get a 10% discount on your first month by using the following discount coupon.
Let's quickly review the contents of this tutorial. In the description of this video, we have included direct links to each section, so you can jump between them. This is the first section, and we are going to discuss what data analysis is. We'll also talk about data analysis with Python and why programming tools like Python, SQL, and pandas are important.
In the following section, we'll show you a real example of data analysis using Python, so you can see the power of it. We will not explain the tools in detail; it's just a quick demonstration for you to understand what this tutorial is about. The following sections will be the ones explaining each tool in detail. There are two more sections that I especially want to point out. The first one is section number three, the Jupyter tutorial. This is not mandatory, and you can skip it if you already know how to use Jupyter notebooks. The other is the last section, Python in under 10 minutes. This is just a recap of Python. If you're coming from other languages, you might want to take this first. If that's the case, again, you can use the links in the video description to jump straight to it.
All right, now let's define what data analysis is. I think the Wikipedia article summarizes it perfectly: a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Let's analyze this definition piece by piece. The first part of the process of data analysis is usually tedious. It starts by gathering the data, cleaning it, and transforming it for further analysis. This is where Python and the PyData tools excel. We're going to be using pandas to read, clean, and transform our data. Modeling data means adapting real-life scenarios to information systems, using inferential statistics to see if any pattern or model arises. For this, we're going to be using the statistical analysis features of pandas and visualizations from matplotlib and seaborn. Once we have processed the data and created models out of it, we'll try to derive conclusions from it, finding interesting patterns or anomalies that might arise. The word "information" here is key. We're trying to transform data into information.
Our data might be a huge list of all the purchases made at Walmart in the last year; the information will be something like "Pop-Tarts sell better on Tuesdays." This is the final objective of data analysis. We need to provide evidence of our findings, create readable reports and dashboards, and aid other departments with the information we've gathered. Multiple actors will use your analysis: marketing, sales, accounting, executives, etc. They might need to see a different view of the same information. They might all need different reports or levels of detail.
What tools are available today for data analysis? We've broken these down into two main categories. First, managed tools, or closed products: tools you can buy and start using right out of the box. Excel is a good example. Tableau and Looker are probably the most popular ones for data analysis. At the other extreme, we have what we call programming languages, or open tools. These are not sold by an individual vendor; they are a combination of languages, open source libraries, and products. Python, R, and Julia are the most popular ones in this category. Let's explore the advantages and disadvantages of each. The main advantage of closed tools like Tableau or Excel is that they are generally easy to learn. There is a company writing documentation, providing support, and driving the creation of the product. The biggest disadvantage is that the scope of the tool is limited; you can't cross its boundaries. In contrast, using Python and the universe of PyData tools gives you amazing flexibility. Do you need to read data from a closed API using secret-key authentication, for example? You can do it. Do you need to consume data directly from AWS Kinesis? You can do it. A programming language is the most powerful tool you can learn. Another important advantage is the general scope of a programming language. What happens if Tableau, for example, goes out of business?
Or if you just get bored with it and feel like your career is stuck and you need a career change? Learning how to process data using a programming language gives you freedom. The main disadvantage of a programming language is that it's not as simple to learn as a tool. You need to learn the basics of coding first, and it takes time.
Why are we choosing Python to do data analysis? Python is the best programming language to learn to code. It's simple, intuitive, and readable. It includes thousands of libraries to do virtually anything, from cryptography to IoT. Python is free and open source. That means that there are thousands of eyes, very smart people, looking at the internals of the language and the libraries. From Google to Bank of America, major institutions rely on Python every day, which means that it's very hard for it to go away. Finally, Python has a great open source spirit. The community is amazing, the documentation is exhaustive, and there are a lot of free tutorials around. Check out conferences in your area; it's very likely that there is a local group of Python developers in your city.
We couldn't be talking about data analysis without mentioning R. R is also a great programming language. We prefer Python because it's easier to get started with and more general in the libraries and tools it includes. R has a huge library of statistical functions, and if you're in a highly technical discipline, you should check it out.
Let's quickly review the data analysis process. The process starts by getting the data. Where is your data coming from? Usually it's in your own database, but it could also come from files stored in a different format, or from a web API. Once you've collected the data, you will need to clean it. If the source of the data is your own database, then it's probably in the right shape. If you're using more extreme sources like web scraping, then the process will be more tedious.
With your data clean, you'll now need to rearrange and reshape it for better analysis: transforming fields, merging tables, combining data from multiple sources, etc. The objective of this process is to get the data ready for the next step. The process of analysis involves extracting patterns from the data that is now clean and in shape, capturing trends or anomalies. Statistical analysis will be fundamental in this process. Finally, it's time to do something with that analysis. If this were a data science project, we could be ready to implement machine learning models. If we focus strictly on data analysis, we'll probably need to build reports, communicate our results, and support decision-making. Let's finish by saying that in real life this process isn't so linear. We're usually jumping back and forth between the steps, and it looks more like a cycle than a straight line.
What is the difference between data analysis and data science? The boundaries between data analysis and data science are not very clear. The main differences are that data scientists usually have more programming and math skills, which they can then apply in machine learning and ETL processes. Analysts, on the other hand, have better communication skills, creating better reports with stronger storytelling abilities. By the way, this chart you're seeing right here is available in the notes, in case you want to check out the source code.
Let's explore the Python and PyData ecosystem, all the tools and libraries that we will be using. The most important libraries that we will be using are pandas for data analysis, and matplotlib and seaborn for visualizations. But the ecosystem is large, and there are many useful libraries for specific use cases.
How do Python data analysts think? If you're coming from a traditional data analysis background, using tools like Excel and Tableau, you're probably used to having a constant visual reference of your data. All these tools are point and click.
This works great for a small amount of data, but it's less useful when the number of records grows. It's just impossible for humans to visually reference too much data, and the processing gets incredibly slow. In contrast, when we work with Python, we don't have a constant visual reference of the data we're working with. We know it's there. We know what it looks like. We know its main statistical properties, but we're not constantly looking at it. This allows us to work with millions of records incredibly fast. This also means you can move your data analysis processes from one computer to another, for example to the cloud, without much overhead. And finally, why would you like to add Python to your data analysis skills? Aside from the advantages of freedom and power, there is another important reason: according to PayScale, data analysts who know Python and SQL are better paid than the ones who don't know how to use programming tools. So that's it. Let's get started. In our following section, we'll show you a real-world example of data analysis with Python. We want you to see right away what you will be able to do after this tutorial.
We're going to start this tutorial by working with a real example of data analysis and data processing with Python. We're not going to get into the details yet. The following sections will explain what each one of the tools does, what the best way to apply them is, and how to combine them. This is just for you to have a quick, high-level reference of the day-to-day processes of data analysts, data managers, and data scientists using Python. The first data set that we're going to use is a CSV file that has this form. You can find it right here, under the data directory. The data we're going to be using is this; I have just transformed it into a spreadsheet so we can look at it from a more visual perspective.
But remember, as we said in the introduction, as data analysts we are not constantly looking at the data. We don't have a constant visual reference. We are driven more by our understanding of the data in the back of our heads: we understand what the data looks like, what its shape is, and that's what is conducting our analysis. So the first thing we're going to do is read this CSV into Python, and you can see how simple it is: just one line of code gets the CSV read into Python. Then we're going to take a quick look, and this is what the DataFrame we have created looks like. DataFrame is a special word: it's a special data structure we use in pandas, and again, we're going to see it in detail in the pandas part of this tutorial. The DataFrame is pretty much the CSV representation, but it has a few more things enforced. For example, each column has a strict data type, and we won't be able to change it freely. It's a better way to conduct our analysis.
The shape of our DataFrame tells us how many rows and how many columns we have. You can imagine that with this number of rows, it's not so simple to follow a visual representation of it; it's pretty much infinite scrolling at this point, with 100,000 rows. So the way we work is that immediately after we load our data, we want to find some sort of reference for the shape and the properties of the data we're working with. For that, we're first going to call info to quickly understand the columns we're working with. In this case, we have date, which is a datetime field; we have day, month, and year, which are just complementary to date; we have the customer age, which is an integer, which makes sense; age group, which you can see right here, for example youth; customer gender; and so on. We have an idea, again, of the entire data set: we know the columns we have, and we also know how large it is.
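To make the loading step concrete, here is a minimal sketch of reading a CSV and checking its shape and column types with pandas. The file contents and column names below are invented stand-ins for the tutorial's sales file, so treat this as an illustration under those assumptions, not the notebook's actual code.

```python
import io

import pandas as pd

# Hypothetical stand-in for the tutorial's sales CSV: a few invented rows.
csv_text = """Date,Customer_Age,Age_Group,Customer_Gender,Country,State,Order_Quantity,Unit_Cost,Unit_Price
2013-11-26,19,Youth,M,Canada,British Columbia,8,45,120
2015-11-26,19,Youth,M,Canada,British Columbia,8,45,120
2014-03-23,49,Adults,M,Australia,New South Wales,23,45,120
2016-05-15,47,Adults,F,United States,Kentucky,4,420,1120
"""

# One line reads the CSV; parse_dates turns the Date column into datetimes.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(sales.head())  # first rows, as a quick visual reference
print(sales.shape)   # (number_of_rows, number_of_columns)
sales.info()         # column names, non-null counts and data types
```

With a real file you would pass its path to `pd.read_csv` instead of the in-memory text used here.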
We don't care what's in between. We will probably be cleaning it, but we don't need to actually start looking row by row with our very limited eyes. We have a better understanding of the structure of our data this way. Going one step further, we can also get a better understanding of the statistical properties of this DataFrame with the describe method. For all the numeric fields, I get an idea of their statistical properties. For example, I know that the average age in this data set is 35 years old. I also know that the maximum age in this sales data is 87 years old, and the minimum is 17. And I can start building my understanding of the statistical properties of the data. In this case, the median age is very close to the mean, so this is telling me something, and the same thing is going to happen for each one of the columns we are using. For example, we have a negative profit here, and we have very large values here. Are these correct? Is there maybe a mistake? By having a quick statistical view of our data, we drive the process of analysis without needing to constantly look at all the rows we have. It's a more general, holistic overview.
So we're going to start with unit cost. Let's see what it looks like. We're going to do a describe on unit cost only, which is pretty much what we had in the previous line. What we did there was for the entire DataFrame; in this case, we're just focusing on the unit cost column. The mean, the median, all those fields we already know from this, and we're going to quickly plot them. We're going to use these tools to visualize them, and it's the same tool: it's pandas, which is used on top of matplotlib. So the visualization is created with matplotlib.
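As a rough sketch of the describe step, the snippet below summarizes a whole DataFrame and then a single column. The numbers are invented for illustration (chosen so the mean age comes out at 35), and the column names are assumptions, not the data set's confirmed schema.

```python
import pandas as pd

# Invented values standing in for the Customer_Age and Unit_Cost columns.
sales = pd.DataFrame({
    "Customer_Age": [17, 25, 33, 35, 41, 59],
    "Unit_Cost": [2, 9, 45, 420, 1252, 2171],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = sales.describe()
print(summary)

# The same summary for one column only, plus the mean and median on their own
unit_cost = sales["Unit_Cost"]
print(unit_cost.describe())
print(unit_cost.mean(), unit_cost.median())

# The plots in the video go through pandas' matplotlib-backed plot API, e.g.
# unit_cost.plot(kind="box", vert=False) or unit_cost.plot(kind="density").
# They are left as comments so this sketch runs without a plotting backend.
```

Comparing the mean against the median, as done in the video, is a quick way to spot skew or outliers before plotting anything.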
We're calling it directly from pandas, though. And again, don't worry, this is all explained in the pandas lessons. So this is unit cost, and this is the box plot we have just created. We have the box, which shows us the first and third quartiles and the median, plus the whiskers, and then we see all the outliers that we have right here. So we see that a product that costs around $500 is considered to be an outlier. The same goes if we do a density plot; this is what it looks like. We're going to draw two more charts, in which we point out the mean and the median on the distribution, and we're going to do a quick histogram of the cost of our products.
Moving forward, we're going to talk about age groups and the age of the customer. At any moment, we can always do something like sales.head() to get a quick reference. We know that the age of the customer is expressed in actual years, but customers have also been categorized into four age groups: seniors, youth, young adults, and adults. So we have categories that were created to better understand these groups, and we count them with value_counts. We can quickly get a pie chart out of it, or we could get a bar chart out of it. As you can see right here, we're doing an analysis of our data: we see that adults are the largest group, for our data at least.
Moving forward, what about a correlation analysis? What is the correlation between some of our properties? We will probably have a high correlation between, for example, profit and unit cost, or order quantity; that's kind of expected, but it's something that we can check right here. This is a correlation matrix, showing high correlation in red. So, order quantity and unit cost... where is profit? Profit is right here. So we see high correlation of cost with profit.
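The two steps just described, counting categories and computing a correlation matrix, can be sketched as follows. The data is made up and deliberately perfectly linear, so the correlations come out at the extremes; real sales data would be noisier, and the column names are assumptions.

```python
import pandas as pd

# Invented rows: Profit is exactly 40% of Revenue, and Order_Quantity
# falls as Revenue rises, so the correlations are +1 and -1 by construction.
sales = pd.DataFrame({
    "Age_Group": ["Adults", "Adults", "Adults", "Young Adults",
                  "Young Adults", "Youth", "Seniors"],
    "Order_Quantity": [7, 6, 5, 4, 3, 2, 1],
    "Revenue": [100, 200, 300, 400, 500, 600, 700],
    "Profit": [40, 80, 120, 160, 200, 240, 280],
})

# How many rows fall in each category, largest group first
group_counts = sales["Age_Group"].value_counts()
print(group_counts)

# Pairwise Pearson correlation between the numeric columns only
corr = sales.select_dtypes(include="number").corr()
print(corr)

# Pie, bar and heatmap charts like the ones in the video would start from
# these two objects, e.g. group_counts.plot(kind="bar").
```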
Actually, it's the opposite: blue is high correlation, I'm sorry. The diagonal, which is blue, has a correlation equal to one, so high correlation is blue. We see that profit has a lot of positive correlation with unit cost and unit price. Negative correlation is shown in dark red. So again we can get a quick idea. Let's see, for example, profit here: it has a negative correlation with order quantity, which is interesting; we would dig deeper into that. Of course, profit has a high positive correlation with revenue. Again, it's just a quick correlation analysis. We can also do a quick scatter plot to analyze customer age and revenue, to see if there is any correlation there, and the same thing for revenue and profit. This one is obvious: we can quickly draw a diagonal here, so there is a linear dependency between these variables. Then a few more box plots, in this case to understand the profit per age group, so we can see how the profit changes depending on the customer's age. And a few more box plots: we're creating this grid of year, customer age, unit cost, etc., for multiple things.
Moving forward, something that we can quickly do when we're working with Python, especially with pandas, is reshape our data or derive new columns from other columns. This is pretty common in Excel. We can create this revenue per age column. If you're here in Google Sheets, you're going to add something like "revenue per age", and you're going to type something like equals revenue divided by age. I don't remember if this is the correct formula we're using, but it's just for you to have a reference. And we're going to extend this down the whole column. There we go. Well, it's processing, and I have 100,000 rows.
You can see how slow it is. Let's compare that to the way Python works. I'm going to execute this thing. It was instant, extremely fast, and it was all calculated. It seems that we have the same results, as expected. And we can quickly plot it, both as a density plot and as a histogram, as you can see right there. Not that revenue per age is going to be relevant in any case; it's just to show you the capabilities of what we can do.
Let's analyze... well, we're going to create a new column, calculated cost, which is the quantity of the order times the unit cost. An extremely simple formula and a very fast process. And we're going to get right here how many rows had a different value from what was provided in cost. What we're doing is quickly checking whether the cost provided by the data set at some point doesn't align with the actual cost we are calculating. Were there any mistakes made by, I don't know, the original system, or by people doing data entry? If this new column is different from cost, we want to know about it. And that doesn't happen. Then a quick regression plot: in this case, it's very obvious that there is a linear dependency between calculated cost and profit. More formulas: in this case, calculated revenue is cost plus profit. There is no difference between the revenue and the calculated revenue that we are getting, so that all makes sense. We're going to do a quick histogram of the revenue.
We can, for example, add 3% to all the prices we are using. Say we need to increase prices. How are you going to do that? Well, it's very simple with Python: we're just going to multiply everything by 1.03, and now all the prices have changed. What else are we going to be able to do? Quick filtering. Let's get all the sales from the state of Kentucky.
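Here is a small sketch of the column arithmetic walked through in this part: a derived column, a consistency check against the provided values, a 3% price increase, and a simple filter by state. The rows and column names are invented for illustration and are only assumed to resemble the tutorial's data set.

```python
import pandas as pd

# Invented rows; Cost and Revenue are consistent with the other columns.
sales = pd.DataFrame({
    "State": ["British Columbia", "New South Wales", "Kentucky"],
    "Customer_Age": [19, 49, 47],
    "Order_Quantity": [8, 23, 4],
    "Unit_Cost": [45, 45, 420],
    "Unit_Price": [120.0, 120.0, 1120.0],
    "Cost": [360, 1035, 1680],
    "Profit": [590, 1366, 2420],
    "Revenue": [950, 2401, 4100],
})

# Derived column: the whole column is computed in one vectorized operation.
sales["Revenue_per_Age"] = sales["Revenue"] / sales["Customer_Age"]

# Recompute cost and count the rows where it disagrees with the given value.
sales["Calculated_Cost"] = sales["Order_Quantity"] * sales["Unit_Cost"]
cost_mismatches = (sales["Calculated_Cost"] != sales["Cost"]).sum()

# Same idea for revenue: cost plus profit should match the given revenue.
sales["Calculated_Revenue"] = sales["Cost"] + sales["Profit"]
revenue_mismatches = (sales["Calculated_Revenue"] != sales["Revenue"]).sum()
print(cost_mismatches, revenue_mismatches)

# Raise every price by 3% in place.
sales["Unit_Price"] *= 1.03

# Boolean filtering: keep only the rows for one state.
kentucky = sales.loc[sales["State"] == "Kentucky"]
print(kentucky)
```

Counting mismatches with a boolean comparison followed by `sum()` works because `True` counts as 1.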
So these are all the sales from the state of Kentucky. We can also get the average of the sales for a given age group, on revenue only. All these filtering options are extremely simple to get with Python. In this case, we say: give me all the sales from this age group and also from this country, and we're going to get the average revenue from the groups we are selecting. And to modify the data, we can make just a few quick modifications. In this case, we're going to say: for all the sales from this country, take the revenue and multiply it by 1.1. I don't know why; I'm just doing it arbitrarily, to show you how it works. So far, so good. We've done a couple of things. You don't need to know the details; we will go through them in the NumPy and pandas sections of this tutorial. This is just for you to have a quick reference. There are exercises associated with these lectures, so if you want to pause right now and get into the exercises, that's going to be very helpful.
We're going to move forward now with the second lecture, in which we will be using a database, the Sakila database. Instead of reading data from a CSV file, as we did before, we're going to read data from a database. Reading data from a SQL database is as simple as reading it from an Excel file or a CSV file, as we were doing in our previous example. And once you've read the data, which is what we're going to do now, the process is the same. What we have right here is a SQL query. If you don't know SQL, you can check our courses or other courses online. Basically, we're pulling the data from the database. This is one of the advantages of Python: there are connectors for pretty much every database out there, Oracle, Postgres, MySQL, SQL Server, etc. In this particular example, we're going to be using MySQL.
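As a hedged illustration of pulling query results into a DataFrame, the sketch below uses an in-memory SQLite database from Python's standard library instead of the MySQL server used in the video, so that it runs anywhere. The table, columns, and rows are invented; only the overall pattern (connect, query, get a DataFrame) is the point.

```python
import sqlite3

import pandas as pd

# SQLite stands in for MySQL here; the schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rental (
    rental_id INTEGER PRIMARY KEY,
    customer_lastname TEXT,
    film_title TEXT,
    film_rental_rate REAL
);
INSERT INTO rental VALUES
    (1, 'HANSON', 'FILM A', 0.99),
    (2, 'SMITH', 'FILM B', 2.99),
    (3, 'HANSON', 'FILM C', 4.99);
""")

# The query result comes back as a regular DataFrame.
df = pd.read_sql("SELECT * FROM rental;", conn, index_col="rental_id")
conn.close()

print(df.shape)
print(df.head())
```

From this point on, the DataFrame behaves exactly like one loaded from a CSV file.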
Once you construct the query and pull the data from the database, the process is the same: we have just converted this outside data into a DataFrame that we can use with our Python skills. The first step, as usual, is to check the shape, info, and description of our DataFrame. We want, again, to understand its structure. We want to know how many rows we have, 16,000; we want to know a little bit more about our columns, how many records we have for each one of them, and the type of each one. We also want a better statistical understanding of our data, so we do a quick describe and get more details about it. If we want to focus on individual columns, we can do that too. In this case, we're going to focus on film rental rate, which is pretty much how much you pay to rent a film. We're going to see the kind of distribution we have. We can call it a distribution, but it's pretty much a categorical field in this case: the rentals are divided into three main price categories, 0.99, 2.99, and 4.99. So this box plot, this pretty much perfect, never-seen-in-real-life box plot, gives you those prices. Moving forward, we can also very quickly do a categorical analysis, understanding the distribution of rentals between cities. We have two cities, and it's pretty much even, as you can see right here.
Creating new columns and reshaping the data for further analysis is relatively simple. In this case, we're going to analyze the return on rentals, that is, which films are going to be more profitable for the company, by dividing the rental rate, how much we charge, by the cost, how much it costs us to acquire the film. We can see the distribution of that: most rentals are here at the beginning.
And then we have more profitable rentals, where the return goes up to 60%. We can quickly look at the mean and the median of it to get a quick idea of all that.
Finally, selection and indexing. If you want to go deeper into the data, to zoom in and get a better understanding, you start filtering. In this case, we filter by customer, but you could do it per city, per state, per film, per price category, etc. It's very simple to filter and zoom in on one particular characteristic of your data, so you can perform a more detailed analysis. In this case, we have all the films rented by customers with the last name Hanson, which doesn't mean it's the same person. But again, it's very simple to filter that. And here we can very quickly see which films have the highest replacement cost; basically, we're isolating those films. And we can also see right here, just for you to have an idea, all the films that are in the category PG or PG-13. It's very simple to filter that data.
So this is the process we usually follow. We import the data and reshape it somehow, creating columns. There is an important process of cleaning, which we're not highlighting in this part of the tutorial; we're going to talk about it in the tutorial itself. There's the process of cleaning, then reshaping, creating new columns, combining data, and creating visualizations. This is the process we're following here with our Python skills, but there's a ton more to it, as you might imagine, from creating reports to running machine learning processes, creating linear regressions, etc. For now, this is just a quick overview of the process we follow.
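The rental-return column and the filters described in this section could look roughly like the sketch below. The rows and column names are hypothetical, chosen only to mirror the kind of fields mentioned in the video.

```python
import pandas as pd

# Invented rental rows; column names are assumptions for illustration.
df = pd.DataFrame({
    "customer_lastname": ["HANSON", "SMITH", "HANSON", "LEE"],
    "film_title": ["FILM A", "FILM B", "FILM C", "FILM D"],
    "film_rental_rate": [0.99, 2.99, 4.99, 4.99],
    "film_replacement_cost": [9.99, 19.99, 29.99, 9.99],
    "film_rating": ["PG", "R", "PG-13", "G"],
})

# Return per rental: what we charge as a percentage of what the film cost us.
df["rental_gain_return"] = (
    df["film_rental_rate"] / df["film_replacement_cost"] * 100
)
print(df["rental_gain_return"].mean(), df["rental_gain_return"].median())

# Filter by one customer last name (may match several different people).
hanson = df.loc[df["customer_lastname"] == "HANSON"]

# Isolate the films with the highest replacement cost.
highest_cost = df["film_replacement_cost"].max()
costliest = df.loc[df["film_replacement_cost"] == highest_cost]

# Films rated PG or PG-13: isin is a compact way to write an "or" filter.
pg_films = df.loc[df["film_rating"].isin(["PG", "PG-13"])]

print(hanson, costliest, pg_films, sep="\n\n")
```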
Starting now, we're going to move forward with more details on each one of the individual tools. We're going to talk about Jupyter notebooks, NumPy, pandas, matplotlib, seaborn, etc. The first thing we're going to see is what this whole thing I've been using is: this Jupyter notebook. If you don't have experience with it, I want you to have an idea of how it works. Then we're going to move on to the individual tools, NumPy, pandas, etc. Remember, there are exercises associated with this particular lecture too, so you can always go back and work on them once you have a better understanding of the tools we are using.
Before we jump into the actual data analysis course and start talking about Python, pandas, and all the tools we're going to use to import files, read data from databases, etc., I want to show you the environment that we work with. It's our primary environment, the tool that we use 99% of the time, and it's Jupyter Notebook. There are going to be different terms here. I'm going to be referring to it as Jupyter Notebook, but as you are going to see in this part of our tutorial, Jupyter is actually a whole ecosystem of tools, and it's a very interesting project. Jupyter is a free and open source ecosystem of multiple tools. We're going to talk first about what a Jupyter notebook is; what you're seeing right here, and you're going to see live in a second, is the thing we're going to use. And we are also going to talk about JupyterLab, which is the evolution of the regular Jupyter Notebook. I think this could be familiar to you already. Usually the question is: what's the difference between Jupyter Notebook and JupyterLab?
Well, the difference is that JupyterLab is just a nicer interface on top of Jupyter notebooks. It's not just the plain notebook. This is a notebook that I'm scrolling right now, but there's also the addition of a tree view, Git tools, a command palette, and multiple other things. You can open some files with a nice preview in it, etc. So JupyterLab and Jupyter Notebook are similar; JupyterLab is, again, the evolution of Jupyter Notebook, and that's what we're using. Again, Jupyter is a free and open source project, so anybody can install it, anybody can download it, and it's very simple to get it set up on your local computer. In this case, we're using something called Notebooks AI. It's a project that provides a Jupyter environment for free in the cloud, so you don't need to install things locally and you don't need to keep things in sync on your own hard drive. That means you don't need to back it up, for example, because it's just a service; it all works in the cloud.
That said, I want to tell you that we have compiled a very quick list of everything we're going to talk about in this part of the tutorial in this tweet. It's just a thread with multiple hints on how to use Jupyter notebooks. After the video, if you forget some of these concepts, you can always go back to this tweet; it's a quick reference for you to have.
So let's get started. Why do we use a Jupyter notebook? Because it's an interactive, real-time environment to explore our data and do our data analysis. It's a tool where you fire commands and it immediately responds with something back. It's a very interactive tool. When we're working with data analysis, the main difference from other tools like Excel, Tableau, etc., is that we are not constantly looking at the data. There is no constant visual reference like the one you have, for example, in Excel.
In Excel, you're constantly looking at the data. You have it in front of you; there are 100,000 cells, and you can scroll and see them. The problem is that this isn't scalable. Nobody can work with 100,000 rows in their mind; we will always forget something. So the way we work with Python in data analysis is by always having a reference of what our data looks like, but at the back of our head; we're not constantly looking at it. We're like that person from The Matrix, you know, the one who commands people to get in and out. We're basically asking questions of the data and having a picture in our mind of how that's going to work. We're not constantly looking at it; we just have a reference, in the back of our heads, of what our data looks like. That's why this tool is very useful.
This tool is also useful if you're just practicing your Python skills, or other programming language skills, because what you're going to see is that it's just a regular Python interpreter. In this case, I can execute some code: two times... actually, one plus three. There we go, and the result is four. So this is a fully featured Python interpreter. The good thing is that, again, it responds to us pretty much immediately: I write a command and I immediately get a response. I can do a print here, "Hello world", and I immediately get a response. I can do "Hello world" times three. Again, it's a fully featured Python interpreter, but it's not being accessed from a terminal, which you can also do; this is a good thing about JupyterLab, which has a terminal. I can type python, I can do two times three, and I get an answer back.
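For reference, the expressions typed in this part look like the following when written as a plain script. In a notebook cell the value of the last expression is displayed automatically; the explicit variables and print calls here are just so the sketch runs anywhere.

```python
# Simple arithmetic, as typed in the first cell
result = 1 + 3
print(result)

# Printing a string
print("Hello world")

# Multiplying a string by an integer repeats it
repeated = "Hello world " * 3
print(repeated)

# The same kind of expression typed into the terminal interpreter
product = 2 * 3
print(product)
```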
But this is not convenient for working with our data. We need something a little bit more interactive, and we can also mix code with documents; that's going to be the advantage of a Jupyter notebook. So how do we work with Jupyter notebooks? There are a few very important concepts we are going to follow. A Jupyter notebook is just a sequence of multiple cells. Everything is a cell. As you can see, when I click on these cells, even if something doesn't look like a cell, it is one. You will see that this blue bar right here is pretty much following me, because I'm clicking on a cell and selecting that particular cell. Everything happens within a cell. If I want to execute some code, I can do, again, one plus five, and I get a result back. That's how it works. So I'm creating a cell, I'm deleting a cell, I create another cell again. Everything happens within a cell, and I'm going to tell you how to add cells, how to remove them, how to execute code, etc. The interesting thing about a cell is that it can either be Python code (or any other programming language you're using; in this case it's a Python data analysis course), as we were doing before with one plus three, or it can be what we call Markdown, which is a text formatting syntax to create text that will be rendered as sort of HTML in the output. So in this case, this is what the source of the Markdown looks like. In Markdown, any line that starts with a pound sign is going to be a title. The biggest title you can have is just one pound sign, and then you keep adding pound signs to reduce the size; in this case, this is a level three title. And then you can have, for example, a quote, bold, italics, a link. So let me copy the cell and open the source. There we go.
So this is a link, right? See, it's rendered as a link. So what Markdown is, is a text formatting tool, or protocol we could say. We just have some rules to use in our text, and Markdown knows how to interpret them and return a formatted document from them. So for example, here we have this green divider, which is a picture, and we know it's a picture because it starts with an exclamation mark. That's what you're seeing right here. So again, a cell can be either Python code or Markdown. Markdown is an entire thing on its own. You can find any free tutorial online; it's fairly simple to get started with. And it's also very important, because when you're creating your reports, you want them to look pretty, and you can use Markdown for that. What we're going to see later is that you can export these notebooks and generate PDFs. So this whole thing can be a PDF or an HTML page. After you're done with your data analysis, you can hand over a PDF report to whoever asked for the analysis, which is pretty neat. So moving forward, again, any cell is going to be either Markdown or code. This one is code, and you can switch the modes. You can say this cell is code, or actually, let's make it Markdown. Right now, even if it contains code, it doesn't matter; it's not executing anything, because the cell is interpreted as Markdown. Now I switch back to code, and now it works. It can also be raw, but to be honest, we don't use raw very often. So again, you have this general cell type. This cell we're using, what type is it? Is it code? Is it Markdown? You can switch it with the selector right here.
So, a few more things that I have to tell you right away, so you can start internalizing them. It's going to take some time to get used to them, but once you do, you're going to move very fast in your data analysis with Python and Jupyter notebooks. The first thing is, as you're seeing right here, every cell has been given an execution number. The cells will be moving around, you will be moving them around, but you will always know which one was executed before another one. That's because every execution you run is assigned an execution number. In this case, this is the seventh time I have executed code. If I execute code again, for example two times two, this is the eighth time that I've executed code. And if I move this cell right here and you're reading the notebook top down, you will not be fooled. You will understand that the cell was moved and the structure of the notebook changed, but this cell was executed after this other one: this is eight and this is seven. So the execution order is always preserved. That's an important thing. Something else: you're seeing me change the structure and do things with the notebook without using any menu. That's because I know how to use keyboard shortcuts to run most of these commands. So for example, how can I add a new cell? This is a Markdown cell, this is a code cell. If I need a cell before this one, what's the command I'm going to issue in order to create it? In this case, the command is the letter A. I just type A, and there is a new cell created. How can I delete the cell? By pressing the D key two times. And again, this is all in the reference we built. So for example, right here, you can press A to create a new cell above, and you can press B to create a new cell below.
So let me put something here as a reference. I'm going to press the letter B, and it's going to create a cell below the currently selected one. The selection is here, in blue. Let me delete this one. I hit B, and again, it creates a cell below the previously selected one. If I hit A, it creates a cell above the previously selected one. So these are the mnemonics for creation. Something else, and it's very important: why is it that when I'm in this cell and I hit the letter A, literally just the letter A on my keyboard, no Control, no Command, just A, it creates a new cell and doesn't type an A inside the document? Right here, if I type A, it's adding an actual A character in the cell. Why didn't that happen before? You're going to notice that when I change to what I'm going to call a mode in a second, the content of the cell is grayed out. Now when I press the letter A, it actually creates a cell, and it's not adding content to the cell itself. If I go back to the other mode (I'm going to give you a better explanation in a second) and type anything, in this case A, it's actually appended to the text within it. So this is my introduction to cell modes, and this is very important. The Jupyter notebook is a mode-based editor. There are other editors like this, for example Vim or vi. Those are mode-based editors, which basically means that the behavior of your keys will change depending on the mode that is currently active. So for example, in this case, I am in edit mode, because any character that I type will be appended to the cell: A, B, C, D, etc. If I switch out of edit mode to what we're going to call command mode, the cell is grayed out, and any key that I hit is going to do something different, associated with that key.
So A is going to create a new cell above, B is going to create a new cell below, and double D is going to delete this cell. That's the important part about modes. It's one of the most important things to understand in order to work with Jupyter notebooks: the mode you're currently working in. And there are only two modes, so it's fairly simple. This is command mode, and we recognize command mode because the cell is grayed out. When we get into edit mode, there is a regular prompt, as you were seeing before, and the cell is actually subject to editing. That's the way we can recognize it. How are you going to switch between modes? In this case, I'm in edit mode. If I'm using my mouse, just pointing, I can click outside and I get out of edit mode into command mode. If I point inside, I go back again to edit mode. But let me tell you something right away: we don't like to use our mouse. We don't like to point and click, because that's very slow. We like to use our keyboard; we move very fast with our keyboard. So how are you going to switch from edit mode back to command mode? That's going to be with the Escape key. The Escape key switches out of edit mode into command mode. And if you actually want to make modifications to the cell, that is, you want to get into edit mode, you hit the Return key; that's going to get you into edit mode again. So we have tackled multiple things already. Again, we said that in Jupyter notebooks we're going to use Python code to interact very quickly with our data. We need a real-time, I'm-asking-you're-answering type of editor. That's what the Jupyter notebook is. The Jupyter notebook has these two modes, edit mode and command mode.
And then there are the cells, which are pretty much everything; they're the fundamental part of the notebook. A cell is going to have one of two types: it can be either code or Markdown. Now I'm going to start showing you more features. I'm going to show you the most important commands and, of course, what the keyboard shortcuts for those commands are, so you can move freely and work with Jupyter notebooks in the most efficient way. So let's get started. First of all, one of the most important commands is moving. It's very simple to navigate: just use your arrow keys, up and down, and you're going to move around in your notebook. If you want to switch the cell type, going from Markdown to code, etc., you can use this drop-down, or you can press a specific key to switch to either Markdown or Python. For Markdown, you hit the M key; that's going to make it Markdown. For Python, you hit the Y key; that's going to make it Python code. So M and Y are going to switch you back and forth. Keep an eye on the selector: I hit Y, M, Y, M, and it switches between code and Markdown. What else? How can you execute code? Once you are typing code and you want to execute it, there are two types of executions you can run. The first one keeps the selection: the currently selected, active cell is going to stay in the same place. That's done by keeping the Ctrl key pressed and hitting Return. That's going to run the code in the cell, and the prompt, the currently selected cell, will remain the same. So I'm running this thing a couple of times already, and the currently highlighted cell stays the same. I can change that by using Shift Return. I keep the Shift key pressed and I hit Return, and it's going to execute the code.
But it will immediately move the prompt, the currently selected cell, to the following one. That's useful when you have multiple cells and you want to execute one after the other. You can keep Shift pressed and hit Return, Return, Return, and it keeps you moving from top to bottom. All right, so Ctrl Return or Shift Return: the execution is the same; what changes is what happens with the currently selected cell. We already saw how to create cells: with the A key we create a cell above, with the B key we create a cell below. To delete a cell, you hit the D key two times, one after the other, very quickly. D D is going to delete the cell. What happens if you made a mistake and you want to undo the previously issued command? Well, the mnemonic here is Ctrl Z. That's the mnemonic, not the command: you only need to press the Z key, you don't need Ctrl, and it's going to undo whatever you did in your previous command. All right, so A, B, D D for deletion, and then Z to undo. All the commands we're seeing have a correspondence in this toolbar or in this command palette. So for example, right here, I could run this code by pressing this play button. You see, the execution count is changing. There are multiple commands, and you can search for them right here if you don't remember. And the neat thing about it is that you actually see the shortcut to issue the same command. So let's say you don't remember how to execute and stay in the same cell, or execute and move, whatever. You can search for run, and you can see the name and the actual shortcut right there. So at least for your first month or so working with Jupyter notebooks, you will usually need to go back to these commands and try to remember the quick shortcuts. With time and practice, those will just come naturally.
So moving forward, what else? We have a few other commands. In this case, we have something to cut and paste a cell somewhere else. That's going to be X to cut it, or you can also use the scissors here. To paste it, you can use these buttons, or you can just press the V key. V is going to paste it wherever you're currently standing. So I'm going to cut it, removing it from here, and I'm going to paste it below there. Or you can also copy it: instead of cutting it, you press the C key, which is going to copy, and then you choose where you want to paste it. In this case, we have duplicated the same cell. And look at something interesting here: the execution count remains the same. So again, there is this unique identifier for your executions, which means that you know when and where something was executed. Moving forward, we're going to use some code here, we're going to import some tools, and you can see some characteristics or advantages of Jupyter notebooks and why we use them so often compared to, for example, the regular Python terminal. One very important thing is visualizations. We as data analysts are constantly getting data and expressing it through images or animations, but most commonly images. The main library we use in Python is matplotlib. And matplotlib is a first-class citizen in Jupyter notebooks, which means that you can just run the figures from matplotlib and they will show up directly in your notebook without the need to do anything crazy. Can you imagine showing this beautiful picture in this terminal? That's very hard, of course. So again, that's one of the main advantages of a Jupyter notebook. Moving forward, what we're going to do first is get some data from a public API.
So there is this Cryptowatch service, which basically has crypto information: Bitcoin, Ether, etc. You can check the docs; we can actually open them. It's going to give you market data. You can check the docs and how you can get, in this case, BTC, Bitcoin, to euro. Let's see if we can change it to the USD price. There we go. So this is the current price of Bitcoin: result, price, etc. And in markets, do we have Kraken, BTC USD? Let's actually issue the same query we're going to use, which is open, high, low, close: OHLC. Don't worry, this looks ugly, but this is actually what we're using. There's a list of results for all the different candles, as we call them. For each one we get the open price, close price, high price and low price. So we're going to issue these requests over the internet to this API, the Cryptowatch API, so we can get information about Bitcoin to do some analysis. I say Bitcoin, but you can actually get it for Ether or for other types of cryptocurrencies. The function we're defining is get historic price. It's a very simple function that uses pandas, which is one of the most important tools we're going to be using in this course, and the requests library, which is also a very famous library for Python. What we're going to do here is get Bitcoin and Ether prices for an entire week, from February 25 up to today, so it depends on when I'm shooting this video. And we're going to get a quick reference of the prices: open, high, low, close. In this case, we have four pieces of information per hour. This is something you can actually change in the request you're making to the API: you can reduce the candle size. In this case, we're keeping it per hour. So we have by-the-hour information about Bitcoin in this particular market, which is Bitstamp.
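As a rough sketch of what such a function could look like, here is one way to write it with pandas and requests. Everything specific in it is an assumption made for illustration, not something confirmed in the video: the URL pattern, the `periods` and `after` parameters, the seven-column layout of each candle, and the helper name `candles_to_frame`. The service may also no longer be reachable, so the parsing step is kept separate and can be tried offline with made-up rows.

```python
import pandas as pd

# Assumed layout of one candle row; the last column is left unnamed ("NA").
COLUMNS = ["CloseTime", "OpenPrice", "HighPrice", "LowPrice",
           "ClosePrice", "Volume", "NA"]


def candles_to_frame(candles):
    """Turn a list of candle rows into a DataFrame indexed by close time."""
    df = pd.DataFrame(candles, columns=COLUMNS)
    # Timestamps are assumed to be Unix seconds.
    df["CloseTime"] = pd.to_datetime(df["CloseTime"], unit="s")
    return df.set_index("CloseTime")


def get_historic_price(symbol, exchange, after):
    """Fetch hourly candles for `symbol` (e.g. 'btc') since the date `after`."""
    import requests  # imported here so the parsing helper also works offline

    # Hypothetical endpoint and parameters; check the service docs before use.
    url = f"https://api.cryptowat.ch/markets/{exchange}/{symbol}usd/ohlc"
    params = {
        "periods": "3600",  # one candle per hour, expressed in seconds
        "after": str(int(pd.Timestamp(after).timestamp())),
    }
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return candles_to_frame(resp.json()["result"]["3600"])


# Offline usage with two made-up rows (not real market data).
sample = [
    [0, 9600.0, 9650.0, 9550.0, 9620.0, 120.5, 0.0],
    [3600, 9620.0, 9700.0, 9600.0, 9690.0, 98.2, 0.0],
]
frame = candles_to_frame(sample)
print(frame.head())
```

Keeping the download and the parsing apart means the DataFrame logic can be checked without touching the network.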
Here we have, for each day and hour, the open, close, highest price and lowest price, and also the volume that was traded within that time period. And we're going to immediately plot the price. We see that in this time, which I thought was an entire day but is actually a few days, like an entire week, the price dropped from $9,600 to below $9,000. So it was a pretty significant drop. Let's see how Ether performed. We have here all the records and how it moved. So this is what I mean when I tell you that when you're doing data analysis with a programming tool like Python or R, you're not constantly looking at the data. What I'm showing you right here are the first five records. We actually have, let's check, 169 records. And this is per hour. So if we do 169 hours divided by 24 hours, we have seven days, plus a little bit more; I'm going to get to that in a second. 169 records is, to be honest, something you could be looking at in a spreadsheet. But I want you to get the concept here. We're not just looking at our data; we have it in our brain. We know what shape it has, we know how many records it has, we know summary information: for the close price, what's the standard deviation, what's the average, the mean, the median. So we have information about our data. It's sitting in the back of our brain, but we're not looking at it. This is a very simple example with only 169 records, but in real life we're dealing with millions of records, so it's impossible to look at it all. Have you ever tried scrolling in an Excel spreadsheet through millions of records? It's crazy. It's not possible. It's just unusable.
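The arithmetic above, and the kind of summary numbers we keep in our heads, can be sketched with nothing but the standard library. The list of close prices is made up for illustration, and the reading that the one extra record comes from counting both the first and the last hour is an assumption.

```python
import statistics

# 169 hourly records: how many full days is that?
records = 169
full_days, extra_hours = divmod(records, 24)
print(full_days, extra_hours)  # 7 full days plus 1 extra hourly record

# Summary statistics on a made-up list of close prices (not real data).
closes = [9600.0, 9500.0, 9400.0, 9100.0, 8900.0]
mean_close = statistics.mean(closes)
median_close = statistics.median(closes)
std_close = statistics.stdev(closes)   # sample standard deviation
print(mean_close, median_close, round(std_close, 2))
```

With a pandas DataFrame the same questions are usually answered by looking at the first rows, the shape, and a summary of each column, rather than by scrolling through the data.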
So that's, again, the way we work with data analysis in Python, R and other tools. We don't constantly keep an eye on the data. We know the shape of it, and we just have these quick references, like show me the first five records, show me the last five records, show me this chunk down there, but that's it. So again, these are the visualizations we're creating in Jupyter notebooks. It's very simple to get the plot done right there. We're also going to see a few other pretty neat things in Jupyter notebooks. The first one is that we can use another library, which is called Bokeh. The difference is that Bokeh has charts that are interactive. So I'm moving it right here; it uses JavaScript and it's interactive. Look back again at what we had here. This is a static chart. It's just a PNG; you can actually export it as a PNG, but there is nothing else you can do with it. With Bokeh, it's a dynamically generated, interactive chart. So I can zoom in on a piece of data, I can move it around, I can do whatever I want with it. I can refresh and reset it to whatever it was. The difference is this: if you're working with data dynamically in your analysis, in your exploration, then Bokeh is a better tool, because you can zoom in and ask what's going on here; let's look at these things. If we're working on a mean-reverting strategy, for example, we see a high volume, we see a low volume, and the mean is going to be here, so we see some mean reversion in there. It's very interesting. If you need to, for example, export a PDF or export a huge HTML file, then static images are probably going to be better. So that's the difference between them. To be honest, matplotlib is a lot more popular than Bokeh. We use matplotlib a lot more, because we actually have a few other tools, like seaborn, that make it very easy to access and use it.
What else? Jupyter notebooks work very well with Excel files and with other file formats: CSVs, XML, etc. And that's also thanks to JupyterLab. JupyterLab can immediately interpret and open CSV files, and with some extensions it can open XLS files, XML files and JSON files; it has a very nice editor and tree view for JSON. So the JupyterLab environment combined with Python and Jupyter notebooks will give you a good idea of Jupyter in general. In this case, I'm not going to execute these cells, but you can try it out. You can run what we have just done and export this crypto data as an Excel spreadsheet. You can just click on it here and download it. Let's open it and see what it has. There we go. Let me reduce the size of this thing. There we go. So you can see that we have just exported two sheets, in this case Bitcoin and Ether, with the data that we had in our previous notebook. So that's, again, the combination of Jupyter, Python and JupyterLab, tools which just work very well together. We're going to keep moving forward. In this tutorial, I'm going to talk more about data analysis in general, and we're going to do a quick review of Python. Maybe when I was running these commands, you felt a little bit lost about what I was doing. So we're going to do a quick review of Python, and then, of course, we're going to get deep into data analysis with pandas and some other tools. I want to tell you something before we finish this chapter. It's very important for you to get familiar with Jupyter notebooks, because you're going to spend a ton of time with them.
And it's a very, very valuable skill to have if you get proficient and comfortable with Jupyter notebooks: creating cells, deleting cells, cutting, pasting, moving things around, etc. For generating reports, Jupyter notebooks are going to be excellent. So keep an eye on it and keep practicing; it's the only way to learn it. Keep the command palette open, so if you forget something, for example how to cut a cell, well, here it is: the cut command is X. It's going to tell you up front. Keep working with it and practicing. Once you get familiar with Jupyter notebooks, you're going to move very, very fast. Remember, you have this nice list of compiled commands as a reference you can always access if you need extra help. And we're going to keep moving forward now with more data analysis.
Now it's time to talk about NumPy, one of the most important libraries in the Python ecosystem for data processing. In general, it's the one that got pretty much everything started. If you trace NumPy back, it's a very old library, maybe 20 years old. It's an extremely important library; I'm not going to say popular, and I'm going to explain why in just a second. NumPy is a numeric computing library. It's just there to process numbers, to calculate things with numbers, and that's it. So NumPy has a very limited scope, we could say. It is, on purpose, a very simple library when you look at it and at its API, which is very consistent, by the way. Why is NumPy so important? Well, in pure Python, numeric processing is very slow. Python is not slow in itself compared to other programming languages.
But when you go down to very deep levels of performance, when you are processing large amounts of data and you need to squeeze out even that tiny bit at the end of your pipeline, when you need to squeeze every FLOP from your CPU, then Python is not the right tool, at least not Python as a pure Python programming language. NumPy solves exactly that. NumPy is a very efficient numeric processing library that sits on top of Python and gives you the same kind of API you would work with when just writing Python code, as you're seeing here. But at a low level, it's going to be using high-performance numeric computations, arrays of numbers, efficient representations, etc. That's it for NumPy. It's extremely simple from an API perspective, but it's extremely powerful. Why did I say that it's not so popular, but it is so important? Well, because in reality we don't usually employ NumPy directly. You will not see yourself using NumPy very often, but you will be using other tools in Python, like pandas and matplotlib, and they all work on top of NumPy. They all rely on NumPy for their numeric processing. So that's why NumPy is so important. At least for this part of the tutorial on NumPy, I'm going to divide it into two pieces. The first one is going to be a very detailed, low-level explanation of how NumPy works, why we need to use NumPy, and what the differences are between different byte sizes for numbers. We're going to talk about integers, but this also applies to decimals and other data types, and to why you need very low-level, optimized number types. Now, you can skip this part. You're going to find in the description of this tutorial the precise moment in time, so you can skip and go directly to the second part, which is when we actually start using NumPy and I show you how to create arrays, how to make computations, etc.
So for now, we're going to divide it into two parts. We're going to start with the low-level explanation, which you can skip if you want, because it's not going to be crucial; you can easily use NumPy without it. We have found that for some of our students it's important to understand the low-level basics, especially if you don't have a computer science background. It can help raise your level of understanding of computers and of how to make your computations more efficient. But don't worry: if you don't want to go through that now, it's fine. You can skip this part and come back later, at any other moment. You don't need this to use NumPy. Seriously, you don't need it. It's going to be beneficial, but you don't absolutely need it, so you can just skip it and come back later. With that said, let's go into a deep explanation of how computers store integers in memory, and what bytes, bits, etc. are. In order to understand why NumPy is so important, we have to go back to the basics: what numbers are, how they are represented in computers, etc. As you might know already, a computer can only process ones and zeros: bits. It can't process decimal numbers, to be more correct; it can only process ones and zeros. A computer is always storing and processing ones and zeros. It's a binary machine. Your RAM, the random access memory in your computer, is the central place where your computer stores the data that it's actively processing. You also have, for example, a hard drive, which stores long-term data. But the computer can't process data directly from your hard drive. Before doing that, it has to load it into your RAM, into your random access memory. Usually a computer is going to have what, 8 gigabytes, 16, 32? It doesn't matter.
Let's say you have eight gigabytes of memory. That, at some point, is going to translate to the number of bits that your computer can store. If you follow what we have right here, you can see the total number of bits available in a regular computer with eight gigabytes of memory. Why is this important? Because the objective of this part, at least, is to explain how you can squeeze every single bit you can out of your computer. How can you make your numeric processing more efficient, both in storage, using less memory for the same data, and in speed, making your calculations faster? So in terms of memory storage, how can we optimize to use the least amount of memory for a given problem? That's the objective. To optimize it, we need to understand how integers in the decimal numeric system are represented in binary. This table right here shows you the first nine numbers, 0, 1, 2, 3, 4, etc., and their binary representations. In your computer, let's say you want to store the age of a user, which is 32. You can't store 32 as is, because your computer, again, doesn't know about decimals; it only knows about binary. To do that, you need to find the correct representation of 32 in ones and zeros, which is not this one, to be honest; I'm just making it up as we go. But again, you need to know the correct binary representation of this number in order to store that data. How can you know that? Well, there is this whole binary arithmetic; there's a whole part of math dedicated to binary. It doesn't matter for now, but I'm going to give you the intuition behind it so you can have a better understanding. And if you're interested, you can dig deeper later.
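A minimal sketch of these ideas in plain Python: the built-in `format` and `int` functions convert between decimal and binary notation. The bit count for eight gigabytes assumes that one gigabyte means 1024 x 1024 x 1024 bytes; with the 1000-based definition the number would be slightly smaller.

```python
# Binary representation of the first nine numbers, 0 through 8.
for number in range(9):
    print(number, format(number, "b"))

# The real binary representation of the age 32.
age = 32
age_bits = format(age, "b")
print(age_bits)                 # 100000
print(int(age_bits, 2))         # back to decimal: 32

# Total bits in 8 gigabytes of RAM (assuming 1 GB = 1024**3 bytes).
bytes_in_8_gb = 8 * 1024**3
bits_in_8_gb = bytes_in_8_gb * 8
print(bits_in_8_gb)             # 68719476736
```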
So basically, any decimal number needs to be stored in a binary format, which of course only takes ones and zeros. What we usually do is keep adding zeros and ones in positions. In this case, we have the number zero and the number one; that's fine. Once we need to store the number two, we need to increase the number of positions, so two becomes 1 0. The number three is 1 1, and then for the number four we need to increase positions again, because we only have two symbols, zero and one. So as you're seeing right here, up to this level we need only one position. Up to this level, we need two positions. At this level, we need three positions. And this level is going to need four positions. You see how the size of each of these is increasing, and there is an explanation behind it that we're going to see in a second. So the question is: how many decimal numbers can you store with n bits? Let's say n equals three. That means that you only have three positions, three bits. How many total decimal numbers can you store with them? Well, we can store 000, we can store 001, we can store 010, and so on. With this size, we can store up to the number seven: 111 is equal to seven. Once we've filled all the positions, we've reached the limit, the largest binary number for this amount of positions. That's the number seven. So this means that with three bits you can go from 000 up to 111. In total, you can store eight decimal numbers: 0, 1, 2, 3, 4, 5, 6, 7. That's eight numbers, from zero to seven. The equation behind this, if you want it, is as follows.
If you have n equals three bits, then the number of decimal numbers you can store with those bits is two to the power of n, in this case a total of eight. So if we go back to our drawings, we said that with three bits we can store eight decimal numbers, from zero up to seven. And again, two to the power of n gives you how many decimal numbers you can store. You can always do the opposite process using a logarithm and get how many bits you need to store a given decimal number. I'm not going to get into that, so we don't complicate it, but the math behind it is extremely simple. So now, moving forward, we're going to delete this whole thing. Why is this important? When you're doing your data analysis, you know what type of data you're working with. They are all numbers, but numbers usually have a meaning behind them. Let's say you have a table of people, with the total net worth of each person and also the age of each person. The age is a value that will range from zero, just born, to, I don't know, 120. I don't know what the maximum registered age is right now, the oldest human being, but zero to 120 seems reasonable. In your other column, net worth, the range is completely different. We can go from something like $0 up to, I don't know, $60 billion, I think that's Mark Zuckerberg or Jeff Bezos or one of those. So we go from zero to 60 billion in this case, if these are dollars. What happens if this is a highly devalued currency? We would have to go to trillions. So even though these two columns are just plain numbers, and we can say they are both integers, they have a different meaning, and they will have different requirements in terms of storage size.
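The counting rule described above (n bits give 2 to the power of n values, and a logarithm goes the other way) can be checked with a few lines of plain Python. This is only a sketch for building intuition; the variable names are made up for the example.

```python
import math

# With n bits we can represent 2**n distinct values: 0 through 2**n - 1.
n = 3
count = 2 ** n          # how many different numbers fit in n bits
largest = 2 ** n - 1    # the largest of them (all positions set to 1)

# Every 3-bit pattern, from 000 up to 111.
patterns = [format(i, "03b") for i in range(count)]
print(count, largest)
print(patterns)

# The opposite direction: how many bits are needed for `count` distinct values.
bits_needed = math.ceil(math.log2(count))
print(bits_needed)
```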
So if we say that age goes from zero to 120, we don't need so many bits to store it in memory. We can actually do the math: how many bits do we need in order to store 120? Well, if you do the math, you will see that two to the power of seven is 128. So if you have seven bits here, you're going to store from zero up to 1111111, which is 127. That number, seven ones in binary, equals 127 in decimal. In total we can store 128 numbers, from 0 up to 127. That means that for our age column, the size we need is going to be seven bits per user, or customer, or person, whatever. What about this other number right here, if we have to go up to a couple of billion? In that case the numbers are a little bit bigger. We could say 64 or 32 bits. With 32 bits you can store from zero up to roughly 4.29 billion, so a couple of billion fits in 32 bits, but for tens of billions it's actually 64. Again, I don't know about the currency we're using or anything, so we have to assume. And now you can do the math: how much memory space do you need in order to process this data? How many records do you have? If you have only 1,000 records, that's not significant. You can use whatever; you can use 64 bits to store the age and you're not going to have a problem. But what happens if you have more? What happens if you have the entire population of the earth, 7 billion records? Then every bit that you save in these columns is going to be important, because it's going to add up to a ton of data. And of course, you have a ton more columns, right?
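To make the arithmetic above concrete, here is a small pure-Python sketch. It uses `int.bit_length()` to count the bits a value needs; the figures (120 years, 60 billion, 7 billion records) are the rough ones mentioned in the talk, and the one-byte versus eight-byte comparison assumes the age column is stored at the nearest standard width.

```python
max_age = 120
max_net_worth = 60_000_000_000   # the "60 billion" figure from the talk
records = 7_000_000_000          # roughly the population of the earth

# Minimum number of bits needed to represent each maximum value.
age_bits = max_age.bit_length()
worth_bits = max_net_worth.bit_length()

# Largest value an unsigned 32-bit integer can hold.
max_uint32 = 2 ** 32 - 1

# Memory for the age column if each age takes 1 byte versus 8 bytes.
age_column_gb_1_byte = records * 1 / 10 ** 9
age_column_gb_8_bytes = records * 8 / 10 ** 9

print(age_bits, worth_bits, max_uint32)
print(age_column_gb_1_byte, age_column_gb_8_bytes)
```

Under these assumptions, 60 billion does not fit in 32 bits, which is why a 64-bit type would be the practical choice for that column.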
What happens if you are processing trillions of records from financial transactions? You want to be very efficient and optimize every single bit you can. And that means, again, selecting the correct number of bits for the columns you're currently processing. So far, so good. Again, we understand that the decimal number we need to store has a correspondence in bits; eight bits is one byte. And the more we can optimize that, the less memory we're going to use for our applications. Where does NumPy come into play? Why are we talking about this in the NumPy lessons? The idea is that NumPy is a library with very advanced numeric processing that lets you select the number of bits you want to use for an integer. Even more, let's say you forget about NumPy and you want to process this thing with pure Python. So you write x = 5, for example: working with Python, you create a number, and we're storing an age of five. How many bits do you think this simple variable takes in memory? Well, even though we think it should be around three bits, or eight, let's say, to be simplistic, in reality for Python this is going to take more than 20 bytes. So we are wasting a ton of memory in order to store this number. And why is that? Because Python is a high-level, object-oriented programming language. The reasoning behind it is that Python is simple to write, simple to read, and simple to code on top of. But in order to create that simplicity, it wraps all the numbers in objects, which have all these attributes that, if you know advanced Python, you're going to recognize, and that are not necessary for storing the value. So this takes a ton of memory, and a regular, very simple number in Python ends up consuming many times more memory than it should.
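The overhead of a plain Python integer can be measured directly. The sketch below compares `sys.getsizeof` on a Python int with the `itemsize` of a couple of NumPy integer types; the exact Python figure depends on the interpreter and platform (28 bytes is typical on 64-bit CPython), so treat the printed value as indicative only.

```python
import sys
import numpy as np

x = 5

# Size in bytes of the Python int object (value plus object bookkeeping).
python_int_bytes = sys.getsizeof(x)

# Size in bytes of a single element for two NumPy integer types.
numpy_int8_bytes = np.dtype(np.int8).itemsize
numpy_int64_bytes = np.dtype(np.int64).itemsize

print(python_int_bytes, numpy_int8_bytes, numpy_int64_bytes)
```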
And this is where NumPy comes into play. In NumPy you can control the size of a number in terms of bits. You can say: I want to create a number that has only eight bits, and that's it. You're going to create a one-byte integer, and you're very precise about how much memory it takes. You can also create a number that needs a little bit more space: we type np.int and, if we hit Tab here, we get autocompletion: 16 bits, or 8, or 32, or 64. So we can be a lot more precise about the number of bits that we need, and this is extremely important for, again, our high-performance processing. On top of that, NumPy is an array processing library. NumPy is 99% about processing arrays. The built-in data structures we have in Python, for example the list and the dictionary, are not optimized for high-performance computing. So if you have a list of numbers in Python, let's say l = [3, 2, 4], you have three numbers in your list. In Python it's not guaranteed that the list is going to keep the numbers 3, 2, 4 in contiguous positions; it might put them in separate positions in memory. On top of that, you can't rely on advanced CPU directives and instructions for processing matrices, because Python, again, is wrapping these things in objects, so there is no access to those high-performance, low-level instructions. With NumPy, that changes, because when you create an array in NumPy you say: I want to create an array of three numbers, and they are all int8. And forget about the positions in this drawing; these are not bytes, I'm using the drawing as a general representation of memory.
So in that case, when you create this three-element int8 array in NumPy, it's going to create those three elements in contiguous positions in memory, 3, 2, 4, and they are only going to take the amount of memory we said they were going to take. On top of that, we can rely on a bunch of very efficient low-level instructions from your CPU for matrix calculations. This is something that is a little bit more advanced, and it's something that has exploded in the past 10 years: CPUs with richer instruction sets, and the same thing for GPUs. You might have heard that, especially with machine learning, we need fast array processing when we are storing features and weights and all that. That's a topic for a different story. But again, the idea is that we can use all these important and very efficient low-level directives from our CPU, which makes our computations a lot faster. So, as a recap: first, you don't need to know all this to work with NumPy. Second, you don't need to be extremely conscious about all the numbers you use. At the beginning, you're just going to use NumPy as it is, with the default types that it picks, int8, int32, or int64; that's fine. But then, when you run into bottlenecks because you're working with larger amounts of data, you might need to get into the details of the size of the integers you're using. And this all applies to floats too; I'm just using integers because it's simpler. So again, the main advantage of NumPy is that it has very fast built-in arrays that take advantage of CPU instructions for matrices and arrays, and it also has very efficient representations of numbers, which are not the regular objects of Python. Again, as a recap, you don't need all this.
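As a small illustration of the three-element int8 array discussed above, the sketch below builds it and inspects how much memory the data takes and whether it is laid out contiguously. The variable name `ages` is made up for the example.

```python
import numpy as np

# Three numbers stored as 8-bit integers: one byte each.
ages = np.array([3, 2, 4], dtype=np.int8)

bytes_per_element = ages.itemsize            # bytes used by one element
total_data_bytes = ages.nbytes               # bytes used by all the elements
is_contiguous = ages.flags["C_CONTIGUOUS"]   # stored in one contiguous block?

print(ages.dtype, bytes_per_element, total_data_bytes, is_contiguous)
```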
If you want to get into more details, I recommend you get a little bit more understanding of binary arithmetic and computer architecture, how numbers are stored in memory, etc., especially for floats, which have a completely different representation. With that said, we're going to see now how we actually use NumPy without worrying so much about the low-level details, and that's the beauty of NumPy. So we have already done our low-level explanation of binary arithmetic, why NumPy is important, and all that. If you skipped it, that's perfectly fine; you will not need it. The reasoning for including it was that if you're in this tutorial, you're probably looking for fast and efficient options to process large volumes of data, and that's when all those things come into play. So without further ado, let's get started and start using NumPy as a library. As I told you, NumPy is a very simple library for array processing and numeric processing. It has a few objects: numbers (integers and floats), arrays, and that's it. It's very simple, but it's extremely powerful. In NumPy, we're going to create these arrays, which look a lot like Python lists, but there are going to be significant differences. The first one is, of course, performance. If you go to the previous part, where we discussed the binary representation of an array of numbers in Python and in NumPy, you're going to see the difference between them. In this case, we're creating two arrays, and you can see that the creation is extremely simple. The only thing that changes is that we need to add this np.array, and then we're passing, in this case, a list of numbers. This is something we will usually be reading from external sources. Now, how can you access individual elements of a NumPy array? This works in the same way as with a Python list: you can say give me the first element, give me the second element.
And it's zero-indexed, like, again, a Python list. Slicing works the same way. So in this case, we slice from zero up to something, or from one up to three, just giving the lower limit and the upper limit of the index. Negative indexing and steps all work in the same way as with a Python list. So if you know how to use a Python list, you will know how to use a NumPy array. There is one new thing right here, different from a Python list, and it's what is called multi-indexing. Let's say you have an array, in this case b, and you need to extract three elements out of it: the element at the first position, the third position, and the last position. You can just type b[0], b[2], b[-1], and this also works for a list. Or you can use multi-indexing, which says: from b, I want to select the elements at 0, 2, and -1, the first element, third element, and last element. So you pass in another list containing the indices of the elements that you want to select. In this case, the important part is the result. It's another NumPy array, not just individual elements. You're creating another NumPy array, which, again, is going to be a lot faster if you keep processing it. So arrays have types associated with them, and this is related to what we were speaking about before. As a NumPy array is assigned contiguous memory, the NumPy library needs to know the type of the objects you're storing. You can't just store anything, a string and a number within the same array, because it would not be able to provide performance and optimizations for arrays of inconsistent sizes. So for example, when we created this array that only had integers, by default NumPy selected int64. That is because of the platform; it's a 64-bit platform. You can tune this and select other sizes, as we're going to see in a second. When we created the array b, which contains decimals, or floats, it assigned a different type, which is float64.
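The indexing, slicing, multi-indexing and dtype behaviour described above can be tried with a short sketch like the following. The array contents are assumptions chosen for illustration, and the default integer type shown by `a.dtype` depends on the platform, so it is printed rather than relied on.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([0, 0.5, 1, 1.5, 2])

first = a[0]      # zero-indexed, like a list
middle = a[1:3]   # slicing: lower limit included, upper limit excluded
last = b[-1]      # negative indexing

# Multi-indexing: pass a list of positions, get a new NumPy array back.
picked = b[[0, 2, -1]]

# Arrays carry a single element type.
print(a.dtype)    # platform default integer type
print(b.dtype)    # float64

# The type can be chosen explicitly when the array is created.
as_float = np.array([1, 2, 3, 4], dtype=np.float64)
as_small = np.array([1, 2, 3, 4], dtype=np.int8)
```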
Again, the default size is always 64, at least on this platform, which is 64 bits: it's going to be float64 and int64. You can always change that. You can say: even though these are all integers, I want you to create them using a float type. Or, as we saw in our previous video, we can say it should actually be of type int8, so smaller integers, for better performance. All right. Moving forward, we're also going to see a few other types, like, for example, strings or regular objects. But as you're going to see, there is no point in storing these things in NumPy. NumPy stores numbers, dates, and Booleans, but not regular individual objects like the ones we're seeing right here. There is a way to store strings; it's perfectly valid and it has its own type, which is related to the Unicode representation in memory, etc. But again, NumPy is usually used for numeric processing. So the idea of NumPy arrays is that we can create multidimensional arrays. What we had created before is a one-dimensional array, just one dimension. You can create matrices, which in this case are two-dimensional; we have two rows and three columns. And NumPy has a ton of attributes and functions to work with multidimensional arrays. The first thing we're going to see is the shape of an array, which is two rows by three columns. How many dimensions does it have? It has one vertical and one horizontal, so we have two dimensions. And what's the total size of the array? In this case the total size is six, the total number of elements we have. Let's go one dimension further. Let's create a three-dimensional array, which is basically a cube. In this case, for B, we have that the shape is two by two by three, the number of dimensions is three, and the size, the total count of elements, is 12. You always have to be careful when you're creating these multidimensional arrays.
If the dimensions don't match, like in this case right here, where the second list has one less element in it, then NumPy will just tell you that the array is of type object, and the shape is only two: it only has two elements, this one element and this other element. So in this case, we've done it wrong, basically. You have to be careful when creating these objects by hand. So how can you index and slice matrices? We've done it for a one-dimensional array, where we were selecting individual elements: give me the first element, give me the second element, etc. How can we do it with a matrix? With a matrix, what we're going to do is very similar to what we did before. The difference is that now we have to account for multiple dimensions. When I say give me A at 1, is it the column at 1, or is it the row at 1? Well, as you can see, it's the row. So the rows are indexed right here as 0, 1, 2. And there is also another dimension, the columns, which is also 0, 1, 2 in terms of index positions for our slicing. So how can you get the first element of the second row? In that case, you first select the second row, and then select the first element, and what you get is the number four. But there is a better way, which is using the multidimensional selection of NumPy. In this case, you're going to say: from this matrix I want to select, and here you pass a selector for dimension one, dimension two, dimension three, dimension four, etc. These are selectors for each one of the dimensions. In this case, we say: at the row level, position one, the second row, and at the column level, we want the first element in it. And it gives the same thing as we did before.
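The shape, number of dimensions, size, and the two ways of picking an element out of a matrix can be sketched as follows. The matrices here are small made-up examples, not necessarily the ones shown on screen.

```python
import numpy as np

# A two-dimensional array: 2 rows by 3 columns.
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
])
print(A.shape, A.ndim, A.size)

# A three-dimensional array: 2 blocks of 2 rows by 3 columns.
B = np.array([
    [[12, 11, 10], [9, 8, 7]],
    [[6, 5, 4], [3, 2, 1]],
])
print(B.shape, B.ndim, B.size)

# First element of the second row, two equivalent ways.
chained = A[1][0]    # select the row, then the element
multi_dim = A[1, 0]  # one selector per dimension: row 1, column 0
print(chained, multi_dim)
```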
The advantage of this syntax, and of keeping it in mind and remembering it, is that it will also let you add slicing. So you can say: I want to select something from dimension one, which is rows. In this case, you say from zero up to two, which gives these two rows; the two is not included, since the upper limit is excluded, the same as in Python. And then you can also pass selectors for the other dimensions. You say: I want to select every row, that's fine, but at the column level I only want to select the elements up to two. So these two, and these two, and these two: 1, 2, 4, 5, 7, 8. This all works as intuitively as it gets. This syntax is the important thing that you need to keep in mind. Moving forward to modification, you can say: I want to assign this new array to this entire row. If the dimensions match, that is going to work, and now the row of tens is assigned to the second row. Or you can use what we usually call an expand operation. We just say: for row number two, I want to assign the number 99, and NumPy is going to take care of expanding it into the corresponding array, given the number of dimensions that you have. So, so far, selection is simple. What we're also going to see is that NumPy has the huge advantage of containing a ton of operations you can perform on top of your arrays and matrices, your multidimensional arrays in general. The first ones are the basic summary methods we have. Given an array, all these methods are already built in: the sum, the mean (the average), standard deviation, variance, etc. And that also works for matrices. So in this case, we can get the sum, the mean, the standard deviation, or we can do it per axis, which is very useful. Let's compare these two. There we go: we can get the sum of the first column, the second column, and the third column, or we can get it for the first row, second row, and third row.
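Here is a minimal sketch of the slicing, row assignment, and summary methods just described, assuming a 3 by 3 matrix with the values 1 through 9. A copy is modified so the original stays available for the statistics.

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

first_two_rows = A[0:2]   # rows 0 and 1; the upper limit is excluded
left_columns = A[:, :2]   # every row, columns 0 and 1

# Modification, done on a copy so A keeps its original values.
A2 = A.copy()
A2[1] = np.array([10, 10, 10])  # assign a whole row (shapes must match)
A2[2] = 99                      # a single value is expanded across the row

# Summary methods, for the whole array or per axis.
total = A.sum()
average = A.mean()
column_sums = A.sum(axis=0)     # one result per column
row_sums = A.sum(axis=1)        # one result per row
print(total, average, column_sums, row_sums, A.std(), A.var())
```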
So it's either one dimension, axis equals zero, which gives the result per column, or the other dimension, axis equals one, which gives the result per row. Or, if you have more dimensions, you can just keep increasing the number of the axis, and that's just going to work as expected. Broadcasting and vectorized operations: this is a fundamental topic that we're going to talk about, and it's going to be closely related to Boolean arrays. These are a few new things that you have to keep in mind when working with NumPy. And now we're going to talk about vectorized operations and broadcasting, which can be a counterintuitive topic at the beginning, but then you're going to understand how much sense it makes. It's one of the fundamental pieces of NumPy. We've seen how NumPy works in a very general way; we saw the multidimensional arrays and all those advantages. But you might be thinking: I don't need another library just to compute the sum or the mean. When I show you the vectorized operations and broadcasting part, it's going to make a little bit more sense why NumPy is so important. To get started, we're going to have this array, a, which is just a very simple array. Vectorized operations are operations performed between arrays and arrays, or between arrays and scalars, like in this case right here, and they're optimized to be extremely fast. In this case, what we're going to do is add 10 to the entire array. We're going to see an example of how this is done in pure Python, but what it means, and let me show you the result, is that the same operation is applied to each one of the elements within the array. So that's the concept of vectorizing an operation: you have the number, and then the operation is applied to each one of the elements in here, or actually in this other one, so here and here and here.
And here, resulting in this new array. The operation is expressed at the array level: we say a plus 10, and that's it. But internally, it is broadcast to each one of the individual elements within the array. And the same goes for a times 10, for example, in which case we're applying the times-10 operation to each one of the elements in the array, resulting in a new array with the result of that operation. And this "resulting in a new array" is very important, because, as we're going to see, NumPy is an immutable-first library: any operation you perform on an array will not modify it, but will return a new array. If we check the state of a, you're going to see that the elements are the same. It has never changed; we are creating a new array and returning it. There are ways to override this behavior if you want. The operations that do this always have the form plus-equals, minus-equals, times-equals, etc., which will indeed modify the array. In this case, we're making a broadcasting operation, adding 100 to each one of the elements in this array, and now the operation was not immutable: a was modified, and it hasn't returned a new array. If you remember from your pure Python skills, the counterpart of vectorized operations is list comprehensions, in which you're expressing an operation for each one of the elements in your collection. So that's a list comprehension, and it's pretty similar to what we're doing with NumPy. The main difference is that in NumPy this is all optimized and extremely fast. These vectorized operations, this broadcasting, don't need to be only between arrays and scalars; they can also be between arrays and arrays. So in this case, we have a and we have b, which I'm showing you right here, and we can do something like a plus b.
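The vectorized operations described above can be sketched like this, assuming `a` holds the numbers 0 to 3 and `b` holds four tens (values chosen for illustration). The list comprehension at the end is the pure-Python counterpart.

```python
import numpy as np

a = np.arange(4)                  # array([0, 1, 2, 3])
b = np.array([10, 10, 10, 10])

plus_ten = a + 10                 # new array; a is not modified
times_ten = a * 10                # new array; a is not modified
plus_b = a + b                    # element by element, same shape required
snapshot = a.copy()               # a is still [0, 1, 2, 3] at this point

# In-place operators (+=, -=, *=, ...) do modify the array.
a += 100

# Pure-Python counterpart of a vectorized operation: a list comprehension.
l = [0, 1, 2, 3]
times_ten_list = [x * 10 for x in l]

print(plus_ten, times_ten, plus_b, snapshot, a, times_ten_list)
```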
And what you're seeing is that there is a correspondence between elements: zero plus 10, one plus 10, two plus 10, and three plus 10. There we go. And that's the result that we get right here. For this to work, you of course need the arrays to be aligned and to have the same shape. But when that holds, the operation is extremely fast in memory, and it's the same kind of vectorized operation we've seen so far. Why is this topic of vectorized operations so important? Well, because of what follows, which is Boolean arrays. And this is a very, very important thing. If you don't completely get it now, I ask you please to go and check the exercises we have for this lesson, because we're going to use it a ton. And we're going to see that in pandas the same syntax, the same primitives of Boolean arrays apply, and we're going to use the same things. So why are Boolean arrays similar to vectorized operations? Well, all the operations we've performed here are just arithmetic operations, mathematical operations: plus something, times something, etc. If you look at the operators that you have in your programming language, there are not only mathematical operators, like plus, minus, or times; you also have Boolean operators. And the question now is what happens when you apply Boolean operators to an array. So, given our array, what ways did we have to select different numbers? For example, in this case we need the first and last element, so we select 0 and -1. We saw the traditional Python way, where we get a[0] and also a[-1]. So that is the first way of selecting these elements. We know there's a second way, with multi-index selection, which is the one we saw with NumPy. And there is a third way, and this one is new, which is with Boolean arrays, right here.
So in this case, we're going to say which elements we want to select, in order. You pass either True or False depending on whether you want to select each element or not. So if you have four elements, you have to pass four Boolean values, saying: I want to select this element, I don't want to select this one, I don't want to select this one either, and I do want to select this element right here. So I want the first one and the last one, and the result will be the same: 0 and 3. So far, it's nothing terribly new. I mean, it is new, but it's not extremely complicated. We are showing you a brand new way of selecting data: you can select with regular Python indexing, with a multi-index, or with a Boolean array. Now, you might be thinking: will I manually write True, False, False, True, True, False for, I don't know, a million records? That is not scalable; you will not sit down and write all these Trues and Falses. But this is actually very important, because these arrays are the ones that result from broadcasting Boolean operations. We saw regular arithmetic operations like this, but we also have broadcasting for Boolean operations. So what happens if we ask whether a is greater than or equal to two, where the array a is 0, 1, 2, 3? The result is False for zero and False for one, because they are not greater than or equal to two, True for number two, of course, and True for number three. So all the individual elements that match the condition will have True, and the others will have False. This is the power of Boolean arrays: we are now able to combine these operations. So now we can write a[a >= 2], that is, a indexed by a being greater than or equal to two. The advantage of this is filtering: we're filtering numeric arrays very quickly with a very familiar syntax, a >= 2, and we just provide that as the index of the operation.
It's pretty much what is happening right here: we're saying, use this Boolean array, or this list of Booleans, a Python list with Booleans, to select elements. But the question is, how do we construct that list of Booleans? Well, in this case, we have constructed it with a predicate, a condition that needs to be matched. The result, again, is filtering. It's a query method: you're looking up some data, and you're saying, give me all the elements that match this condition. These values can of course be calculated. You can say, for example, give me all the elements that are greater than the mean. Or you can use other Boolean operators, for example to get all the elements that are not greater than the mean, which means they're less than or equal to the mean. You can also include other Boolean operators like or and and. In NumPy, or and and are expressed with a pipe and an ampersand, because we can't use the regular or and and keywords from Python on arrays; it's a good choice they've made. So again, this is the concept of Boolean arrays. We are going to construct these arrays of Booleans based on conditions. So we have this matrix, and we're saying: I want to select this one, and this one, and this one, etc. And this is the result of that, right here. And we generate the Boolean array dynamically. We never manually type all these; we don't sit and say True, False, False, True, etc. We just run a query, a filtering operation, a Boolean operation, which results in a Boolean array, and now we can use it for filtering. So again, the idea here is that the broadcasting we saw before for operations like a times 10 is also defined for Boolean operators. Boolean operators return Boolean arrays, which can be used for filtering. That's the idea of all of it.
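The Boolean-array selection described above can be tried with the short sketch below, assuming `a` holds 0, 1, 2, 3. The specific conditions in the last two lines are made up to illustrate the pipe and ampersand operators.

```python
import numpy as np

a = np.arange(4)                        # array([0, 1, 2, 3])

# Selecting by hand with one Boolean per element.
manual = a[[True, False, False, True]]  # first and last element

# A comparison is broadcast to every element and returns a Boolean array.
mask = a >= 2
filtered = a[mask]                      # same as a[a >= 2]

# Conditions can use computed values and the negation operator ~.
above_mean = a[a > a.mean()]
not_above_mean = a[~(a > a.mean())]

# "or" is |, "and" is &; each condition needs its own parentheses.
either = a[(a == 0) | (a == 3)]
both = a[(a >= 1) & (a <= 2)]

print(mask, manual, filtered, above_mean, not_above_mean, either, both)
```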
And you can even combine these operations. You can say a equal to zero or a equal to one; or a less than or equal to two and also divisible by two. You can combine all these queries. So now it looks a lot more powerful than what we were doing before. Moving forward, let's talk about linear algebra very quickly; we're approaching the end of the NumPy lesson. The important part about linear algebra is that NumPy already contains all the most important operations for it, already optimized with low-level instructions, so it's going to be extremely fast. The dot product, cross products, transposing matrices: all of that works as expected. And again, this can be very important, especially for machine learning, for example. Finally, to wrap up what we saw in our binary explanation at the beginning, which you might have skipped: the difference in sizes between NumPy and Python, and the difference in performance between them. In Python, for a regular number, just a regular integer, the total size is 28 bytes. Just let that sink in for a second. The total number of bytes, not bits, bytes, that you need in Python to store a simple number like the number one is 28. You need 28 bytes to store just the number one. It's extremely space consuming; it's not very efficient. Larger numbers will take even more bytes to store. What's the size of the NumPy integers? Well, we've seen it: for example, we can create integers with eight bytes, or we can create integers with one byte. Here we have np.int8, and we already know how many bytes it has, only one byte. So you have control over how many bytes, or bits, your numbers will take. And you can see here the difference between the size of an integer in Python, which is extremely large, 28 bytes, and in NumPy, and also the difference in performance.
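For the linear algebra operations mentioned above (dot product, cross product, transposing), a minimal sketch could look like this. The matrices and vectors are made-up examples; `@` is the matrix multiplication operator and is equivalent to `A.dot(B)` for these two-dimensional arrays.

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])
B = np.array([
    [6, 5],
    [4, 3],
    [2, 1],
])

product = A.dot(B)        # matrix product, shape (3, 2)
same_product = A @ B      # the @ operator does the same thing
transposed = B.T          # rows become columns, shape (2, 3)
bt_times_a = B.T @ A      # shape (2, 3)

# Cross product of two three-dimensional vectors.
cross = np.cross(np.array([1, 0, 0]), np.array([0, 1, 0]))

print(product, transposed, bt_times_a, cross)
```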
Here you also have the difference in the size of lists, which is also significant. But I want to focus on performance. We have two objects: a list that has the first 1,000 numbers, and a NumPy array that has the first 1,000 numbers. We're going to perform the same operation on both of them. Let's do the Python one first. In this case, we're squaring all the elements in the list and then summing all the results. So we express it by saying: create a new list of x squared for x in l, and then sum everything. How much time does it take? 321 microseconds. We're going to do the same thing with NumPy: we say np.sum of a squared. And you're going to see that it's a lot faster on the NumPy side than on the Python side. And these are all very tiny operations with small numbers. What happens if we add more numbers? Let's add two more zeros here, and two more zeros here, and we're going to run the same two operations. As you see here, the units have even changed: we're still in the microsecond range with NumPy, and we've gone to the millisecond range in Python. So as the size of your objects increases, NumPy will prove to be extremely fast compared to Python. There are a few other functions you can see here, for example generating normal random numbers, etc. I'm going to leave these for you to look at if you're interested in them. Remember that you have the exercises, which can help you solidify all the concepts we discussed. We're going to move forward now to work with pandas, we're going to see visualizations too, and we're going to keep moving forward in this data analysis with Python tutorial. Now it's finally time to talk about pandas. It is the most important library that we use for data analysis on a day-to-day basis with Python.
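The list-versus-array comparison from the NumPy part can be reproduced outside a notebook with the standard `timeit` module. This is a rough sketch: the number of repetitions is arbitrary, and the timings will differ from the ones quoted in the video depending on the machine, so only the results of the computation are meant to be stable.

```python
import timeit
import numpy as np

l = list(range(1000))   # the first 1,000 numbers as a Python list
a = np.arange(1000)     # the same numbers as a NumPy array

# Same operation on both: square every element, then sum.
python_result = sum([x ** 2 for x in l])
numpy_result = int(np.sum(a ** 2))

# Time 200 repetitions of each version (total seconds for all repetitions).
python_time = timeit.timeit(lambda: sum([x ** 2 for x in l]), number=200)
numpy_time = timeit.timeit(lambda: np.sum(a ** 2), number=200)

print(python_result, numpy_result)
print(f"python: {python_time:.6f}s  numpy: {numpy_time:.6f}s")
```

Increasing the size of `l` and `a` should make the gap between the two timings more visible.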
Pandas is a library that will aid you in the entire process of your data analysis project. You're going to start by getting the data, step one, from multiple sources like databases, Excel files, CSV files, etc. That's all going to get into pandas. You're going to be processing the data: combining, merging, doing different types of analysis. You're going to be visualizing the data, say with a bar chart, directly from pandas. You're going to be creating reports, doing simple statistical analysis, and doing machine learning, or something close to it, with the help of other libraries, but everything from the platform that the pandas library provides. It's, again, one of the most important libraries in the data analysis and data science ecosystem with Python. Pandas has recently released version 1.0, so we are talking about a very mature library. It's been around for a long time now. And again, it's the primary library that we use in Python for data analysis and data science. So I'm going to do a quick introduction to the data structures that pandas has, and we're going to understand how they work, so you can start building the foundations. I need you to be very familiar with the way the pandas data structures work. Then we're going to move into other things like reading files, grouping data, etc. To get things started, we're going to talk about the first data structure pandas has, which is the series. In reality, pandas has two main data structures that it uses all the time: the series and the data frame. The data frame is the one you will probably be more familiar with; it looks just like an Excel table. But we're going to start first with the series. Okay, so just stay with me here. We're going to talk about the series for a second.
In this case, we have imported pandas, and we have also imported NumPy. As I told you before, in the NumPy part of this tutorial, NumPy is fundamental for data analysis because every other library, pandas, matplotlib, they all sit on top of NumPy, and you can see it right here. We're going to be using some features from NumPy within this lesson too. So this is a series in pandas, what you see right here. The concept of a series is an ordered sequence of elements, all indexed by a given index. And you might think that this looks a lot like a Python list, right? In this case, we're storing the population of countries, in millions of inhabitants. It's called g7_pop because we're getting the population of the Group of Seven; you can consult the Wikipedia page. But basically, we are storing population in this series. And again, it looks a lot like a list, but we're going to find a ton of differences here. The first one is that the series has an associated data type. This is something we saw in NumPy, where a NumPy array couldn't hold different types of objects; we only had one type of object. In this case, it's float64, so all the numbers of the series will be of type float64. The underlying data structure that pandas is using to store these objects is a NumPy array. A second difference we see very quickly is that a series can have a name. So now, when we display the series, we see that it has a name. Right now it might not make a ton of sense, but once this series is part of a data frame in the form of a column, the name is going to make a lot more sense. So moving forward: again, we saw that the series has a dtype, and this is because the data is backed by a NumPy array that you can always consult. You can check the values attribute of a series and you get the NumPy array that backs it.
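A minimal sketch of the series just described is shown below. The population figures and the name string are illustrative assumptions, not values confirmed by the transcript; what matters is the dtype, the name, and the NumPy array behind the series.

```python
import numpy as np
import pandas as pd

# G7 population in millions of inhabitants (illustrative figures).
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

# A series can carry a name, which becomes useful once it is a DataFrame column.
g7_pop.name = "G7 Population in millions"

print(g7_pop.dtype)          # float64: one data type for every element
print(type(g7_pop.values))   # the NumPy array backing the series
print(g7_pop[0], g7_pop[1])  # default index is 0, 1, 2, ... like a list
```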
The array you get back is the one backing that pandas series, so you can see that it's a NumPy array. Once you have this series, the g7_pop we were just looking at, you can select elements as you would in a regular list. For example: give me the first element, give me the second element, the last element, etc. And that's because a series inherently has an index, similar to a list. When you create a list in Python, say l equals a, b, and c (there is something wrong here, a missing quote), we don't see the index, but the idea is that there is an index: this is zero, this is one, and this is two. In the pandas series, this is a lot more explicit: each element has an associated index value. And you might think that it's pretty much the same thing; both the list and the series are ordered sequences of elements. But we're going to see that there is a fundamental difference: we can arbitrarily change the index of a series. When we created it, we didn't assign any index, so by default it was a range index from zero up to n minus one. But you can arbitrarily say what the index of your series is. And in this case, this series now has the indices that we're seeing right here. Why is this important? Because now we're going to be referring to these values not by a sequential position, but by a label, by an index that has a meaningful name for us humans. Okay. So now this thing looks a little bit more like a dictionary than a list. We started out thinking that a series was similar to a list, but now we can think that a series is more similar to a dictionary. But wait, don't get me wrong here. The series has a fundamental trait: it's still ordered, something that traditionally didn't happen with dictionaries.
Dictionaries in Python were traditionally not ordered. Actually, since Python 3.7 they do keep their order, but we shouldn't be thinking of them as ordered data structures. A series, in contrast, is ordered. So it has both advantages: it's ordered, Canada is always before France, just as we decided when we created it, but it also has names, or labels, or keys, associated with the values, like a dictionary. So far we've been creating the series from scratch. You can also create a series by passing the index; it doesn't have to be a two-step process where you first create the series and then add the index. In this case, you can do everything at once. And the indexing is now going to be done by those labels: the labels that make up the index will be used to look up specific data. So g7_pop, as we see, has these countries with these populations. Before the index, if I wanted to get the population of Canada, I had to remember the position of Canada. Oh, it's the first of the countries, so I have to do g7_pop at zero. With the index, now we can just ask what's the population of Canada, what's the population of Japan. And as you can see, the syntax is pretty much the same as with a Python dictionary: you pass the key and you get the value. So again, in summary, the advantage of a series is that it's an ordered sequence of elements backed by a NumPy array, very efficient, very fast. But it also has an index that can take any labels we pass, which makes it a lot better for indexing. When you have a series, you can still get the elements by their sequential ordering. After all, it's a sequential data structure, and it doesn't matter if you have an index: you can still select by position.
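Creating the series together with its index, and then looking values up by label, might look like the sketch below. The country labels follow the G7 members mentioned in the transcript; the population figures are illustrative assumptions.

```python
import pandas as pd

# Series and index created in one step (figures are illustrative).
g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy",
           "Japan", "United Kingdom", "United States"],
    name="G7 Population in millions",
)

# Dictionary-like lookup by label instead of by remembered position.
canada = g7_pop["Canada"]
japan = g7_pop["Japan"]
print(canada, japan)

# Unlike a classic dictionary, the order of the labels is part of the structure.
print(list(g7_pop.index))
```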
If you want to get the last element, or the first element, or the second element, you do that by using the iloc attribute. I say: from this series, locate by sequential position the element in position zero, or the last element. And that still works as expected. Series also support multiple indices, as we saw with NumPy. In this case, we can get two elements, or three, or n elements; you can pass multiple indices. And the same thing works with sequential positions. Series also support range selection, or slices. But there is a fundamental difference here, and this is very important, so pay attention. There's a fundamental difference with Python: in Python, the upper limit of a slice is not returned. So from the list that we created before, if I do l up to index two, I don't get the element c. This is zero, this is one, this is two, and two is not included. In a pandas series, the upper limit is indeed included. So when you ask from Canada up to Italy, Italy is in the result. This is something to consider when using label-based slices in pandas. I think it's still valid, and I understand the reasoning behind it; it's just different from Python, so you should remember it. Boolean arrays, which was a topic we discussed in our previous lesson on NumPy, are still a thing in pandas. The difference is that instead of saying Boolean arrays, we should say Boolean series. The idea is that we are able to perform operations on top of series. For example, right here we have mathematical operations on top of series. In this case we have the series g7_pop, which, as I told you at the beginning, is in millions of inhabitants.
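The positional access, multiple selection, and inclusive label slicing described above can be sketched as follows. The population figures are illustrative assumptions.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy",
           "Japan", "United Kingdom", "United States"],
)

first = g7_pop.iloc[0]     # by sequential position
last = g7_pop.iloc[-1]

by_labels = g7_pop[["Italy", "France"]]   # several labels at once
by_positions = g7_pop.iloc[[0, 1]]        # several positions at once

# Label slices include the upper limit: Italy is part of the result.
label_slice = g7_pop.loc["Canada":"Italy"]

# Plain Python slices exclude the upper limit: "c" is not returned.
l = ["a", "b", "c"]
list_slice = l[:2]

print(label_slice)
print(list_slice)
```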
If we want to get the series in units, we need to do g7_pop times 1 million, and there we go, now it's in terms of units. These vectorized operations, these broadcasting operations, can also be performed with Boolean operators. So instead of a multiplication, an addition, a subtraction, etc., we can use a comparison. In this case, we ask which countries have more than 70 million inhabitants, and the result we receive is a Boolean array, or a Boolean series in this case. It's the same concept as with a NumPy Boolean array. Canada and France do not have more than 70 million inhabitants. Germany does have more than 70 million inhabitants, here it's 80, and the same goes for Japan, and the same for the US: the US also has more than 70 million inhabitants. So again, the Boolean array, or Boolean series in this case, works in the same way as with NumPy. And selection also applies. I can now say: give me, from this series g7_pop, all the countries that have more than 70 million inhabitants, where the value is more than 70. So now, again, we are building filtering; we're building a query language, if you want, on top of pandas. We're selecting data based on a condition. If you ever have trouble remembering all this, the idea is that you can always track down the way the selection is being built. It's not that the selection knows anything about how to select countries with more than 70. The comparison was performed first, which resulted in a Boolean series. And then the original series is indexed by that Boolean array.
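The two steps just described, building a Boolean series and then using it to select, can be sketched like this. The population figures are illustrative assumptions.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy",
           "Japan", "United Kingdom", "United States"],
)

# Vectorized arithmetic: millions of inhabitants to individual inhabitants.
in_units = g7_pop * 1_000_000

# Step 1: the comparison alone produces a Boolean series with the same index.
over_70 = g7_pop > 70

# Step 2: the Boolean series is used to index the original series.
big_countries = g7_pop[over_70]

print(over_70)
print(big_countries)
```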
And the result is as you can see here. Again, these operations can be combined with statistical methods and with all the operators we saw in our previous lesson: the not, which is the tilde; the or, which is the regular pipe; and the ampersand, which is the and. All of these can be applied in any order you want. So if we read this expression, which is complicated on purpose, it's saying: give me all the elements that are above the mean minus the standard deviation, or above the mean. I'm not sure I'm reading it exactly right, but it doesn't matter; it's just an or operation between two conditions. So the operators we had before, the not, the or, and the and, all work with Boolean selection as well. The statistical operations we saw in NumPy, sum, mean, standard deviation (we were actually using standard deviation just now), are all still relevant in this case. You can also use traditional NumPy functions with a pandas series because, again, a pandas series is internally backed by a NumPy array. So this is all the same. Here is an example that's a little bit clearer: we're getting all the countries that have more than 80 million inhabitants and less than 200 million inhabitants. So it has to be above 80, but it also has to be below 200. Or in this case, we say either above 80 or below 40. So that's with the or operator, and the not operator works in the same way. Modifying series is relatively simple. Whenever you have selected a value, you can just assign to it. In this case, we're saying Canada is now 40.5. I don't know why, we just wanted to do it. This is by index label; you can also do it by sequential position. So in this case, we're going to say the last country should have 500 now.
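Combining conditions and then modifying values might look like the sketch below. The population figures are illustrative assumptions, and the modifications are done on a copy (named `modified` here only for illustration) so the original series stays available for comparison.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=["Canada", "France", "Germany", "Italy",
           "Japan", "United Kingdom", "United States"],
)

# Each condition needs its own parentheses when combined with &, | or ~.
between = g7_pop[(g7_pop > 80) & (g7_pop < 200)]   # and
outside = g7_pop[(g7_pop > 80) | (g7_pop < 40)]    # or
not_over_80 = g7_pop[~(g7_pop > 80)]               # not
above_mean = g7_pop[g7_pop > g7_pop.mean()]        # statistics inside a condition

modified = g7_pop.copy()
modified["Canada"] = 40.5            # assignment by label
modified.iloc[-1] = 500              # assignment by sequential position
modified[modified < 70] = 99.9       # assignment through a Boolean selection

print(modified)
```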
And we see right here that the last country has 500 now. You can also modify elements based on Boolean selection. You can say: all the countries that have less than 70 million inhabitants, all these from our previous query, will now be 99.9. As you can see, it has changed all these countries. So assignment works by direct indexing, and it also works by Boolean indexing. This is going to be extremely important when we are cleaning data. So let's move forward and start talking about data frames. Before that, remember you have exercises for series and also for data frames, so I recommend you check them out. Talking about data frames, this is what a data frame looks like. It's pretty much the same thing as an Excel table. This was our series, and this is going to be our data frame. It's a table, so it looks a lot like an Excel spreadsheet. And actually, it's very common to create pandas data frames out of CSV files, which are basically tables. I'm creating it with this DataFrame object, and there you go, this is our data frame. As you can see, it has the columns that we assigned, and we have rows of values right below each one of these columns. What's the similarity with series? It's that a data frame column is basically a series. So we can think of a data frame as a combination of multiple series, one per column. We're going to assign an index to the data frame the same way that we did with our series. So this, right here, is our data frame that has the index, and it has the columns as we had before. What columns do we have? What's the index of the data frame? These are all attributes that you can consult. There are a couple of very interesting methods on data frames that we use all the time. The first one is the info method.
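A data frame like the one described could be built as in the sketch below. The column names follow the transcript (Population, GDP, Surface Area, HDI, Continent); all the figures are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "Surface Area": [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
        "Continent": ["America", "Europe", "Europe", "Europe",
                      "Asia", "Europe", "America"],
    },
    columns=["Population", "GDP", "Surface Area", "HDI", "Continent"],
)

# Assign a meaningful index, the same way as with a series.
df.index = ["Canada", "France", "Germany", "Italy",
            "Japan", "United Kingdom", "United States"]

print(df.columns)
print(df.index)

# Each column of the data frame is a series.
population = df["Population"]
```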
The info method is going to give you quick information about the structure of your data frame. It's going to tell you what columns you have: Population, GDP, Surface Area, HDI, Continent. And it's also going to tell you the types and how many null values you have; it's actually telling you how many non-null values you have. We use this when we're cleaning data to quickly identify which columns have missing values. We can check the size of the data frame, and we can check the shape. This is similar to a matrix: a two-dimensional array in NumPy is pretty much a data frame. And similar to info, which again is a way to check a summary of the structure of the data frame, we can also use describe, which is going to give you a summary of the statistics of the data frame. In this case, what we see is that only the numeric columns are included. Continent is not here, for example; you can see its type is object, which is basically a string. For all the numeric columns we get summary statistics. For example, for population: how many elements we have, what's the mean, what's the standard deviation, the minimum, the maximum, and in between a few percentiles, the 25th, 50th, and 75th. So these are quick summary statistics, and we do this a lot, so keep in mind that the describe method is very popular. As you could see in the info method, the columns have associated types, and this is very important. Continent is an object, which means it's basically a string; HDI is a float; and Surface Area is an integer. That's because pandas, through NumPy, automatically recognizes the correct type to assign to each one of the columns. This is similar to what we saw with a series, in which the series had a given data type. So that's something you cannot change.
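The inspection methods mentioned above can be tried on a small frame, as in this sketch. Only three countries are used to keep it short, and the figures are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940],
        "Surface Area": [9984670, 640679, 357114],
        "HDI": [0.913, 0.888, 0.916],
        "Continent": ["America", "Europe", "Europe"],
    },
    index=["Canada", "France", "Germany"],
)

df.info()                # column names, non-null counts and dtypes
shape = df.shape         # (rows, columns)
size = df.size           # total number of cells
dtypes = df.dtypes       # one dtype per column

# Summary statistics; by default only numeric columns are included.
summary = df.describe()
print(summary)
```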
And in this case, by checking the value counts of the dtypes, you get a quick reference of the types of your columns. So moving forward: how will we be selecting data from data frames? Well, there are a couple of methods, and this might be a little bit confusing. So what I'm going to do is skip ahead and give you a quick reference first, and then, if you want, you can read through the process we follow here. Given a data frame, there are just a few quick rules. You select by index using the loc attribute. The loc attribute lets you select individual rows, so for example, when I ask for Canada, I get the row for Canada. The iloc attribute lets you select, similar to the series, a row by sequential position. Let's say we want to select the last row; in this case, it's the United States of America. So again, loc lets you select rows by index: give me the row under this index. iloc lets you select rows by sequential position: give me the last row, the first row, the second row, etc. And finally, without using loc or iloc, just by saying df of something, you are selecting that column: give me the entire Population column, right here. So this is the quick reference: loc gives you a row by index, iloc gives you a row by position, and just doing df of something gives you the column you are passing. It's like loc and iloc work in a horizontal manner, getting you a given row, while df of whatever works in a vertical manner, getting you a given column. But something more interesting here is that all the results, this one and this one and this one, are series. What's being returned are series. So that's what we saw before.
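The three selection rules from the quick reference can be sketched as follows, on a small frame with illustrative figures.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 127.061, 318.523],
        "GDP": [1785387, 4602367, 17348075],
        "HDI": [0.913, 0.891, 0.915],
    },
    index=["Canada", "Japan", "United States"],
)

canada_row = df.loc["Canada"]    # row selected by index label
last_row = df.iloc[-1]           # row selected by sequential position
population = df["Population"]    # column selected by name

# All three results are series. A row's name is its index label,
# and a row's index is made of the column names.
print(canada_row)
print(last_row.name)
print(population.name)
```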
The way it works is this. If we focus on this last example, the column, we see that it's pretty standard: the series right here is the one returned, and remember it has a type and everything. So that's fine. If we ask for a row, for example Italy, there you go, the result is also a series. But what you can see here is that this thing is kind of transposed, in a way. What was a column header is now part of the index: Population is here, GDP is here, Surface Area is here, HDI and Continent, and here you have the values. So again, the row is being transposed from horizontal to vertical, in our regular series manner, and the index of this series is extracted from the names the columns had. The name of the series, right there, is the value of the index that the row had. You can read more about it right here. But I just want you to remember these rules: with loc you select by index, with iloc you select by sequential position, and with df of something you select by column. There are times when these might not apply, when there will be some issues. For example, if your index is numeric, you might have issues with this form. But for now, just respecting these three rules is going to get you any element you want, either by row or by column. From what we've seen, all the slicing also works as expected. We can get, for example, Italy, or we can get France up to Italy, and the upper limit is included. It's still loc, and we select by indices, from France to Italy. We can also use the second dimension, similar to the way we worked with NumPy. We can get all the countries from France to Italy, including Italy, but only the Population column, or Population and GDP. So here you can see the second dimension, the concept of multiple dimensions in selection, being applied. The same goes for iloc.
iloc works in the same way as loc when it comes to multi-indexing and slicing. We can get, for example, from one to three in sequential positions. In this case, the upper limit is not included, so that's another difference from what we had with loc. And we can also use multiple dimensions. We can say: give me the countries from one to three, and the column should be three, so zero, one, two, three, the fourth column, the column at position three, which is HDI. That also works as expected. And again, my recommendation: always use loc and iloc to select rows, and just use the plain data frame to select columns, as we saw before. Now moving forward: conditional selection, Boolean arrays, Boolean series, whatever you want to call it. This also works for data frames, and it's very important. It's a way to filter data, a way for us to query the data. In this case, we want to select all those countries whose population is greater than 70, so all the countries that have more than 70 million inhabitants, similar to what we did with a series, but now with a data frame. What you're going to see here is that we construct a Boolean series, as we did in our previous video: every country with more than 70, False, False, True, False, and so on. And we inject that result, that Boolean series, into a loc selection: give me all the countries that have a True value in it. And remember, as a kind of mnemonic, the way pandas knows how to filter things is by matching the index of the resulting Boolean series with the index of the data frame. These are two different objects, completely different objects, but their indexes match.
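The slicing and conditional selection rules for data frames can be sketched as below. The figures are illustrative assumptions; the column at position three is HDI, as in the transcript.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "GDP": [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
        "Surface Area": [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    },
    index=["Canada", "France", "Germany", "Italy",
           "Japan", "United Kingdom", "United States"],
)

# loc slices by label and includes the upper limit (Italy).
rows = df.loc["France":"Italy"]
rows_and_cols = df.loc["France":"Italy", ["Population", "GDP"]]

# iloc slices by position and excludes the upper limit (position 3).
by_position = df.iloc[1:3]
hdi_by_position = df.iloc[1:3, 3]

# Conditional selection: a Boolean series injected into loc.
over_70 = df["Population"] > 70
big = df.loc[over_70]
big_subset = df.loc[over_70, ["Population", "GDP"]]

print(big_subset)
```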
Here Japan matches and Germany matches; Germany and Japan are the same in both objects, and that's why this works as expected. This is just the first dimension, which is: give me these rows. You can also use the second dimension, saying: give me this column, or these columns. That still works as desired. So what about dropping stuff? Whenever you have a data frame, you can say give me just these pieces, or you can say drop the others; it's pretty much the same. Dropping is very simple. You can drop by index: drop this value, drop Canada altogether, period, or drop these indices, Canada and Japan. Or you can also drop columns: drop Population and HDI as columns. These methods also have a more advanced usage, with the axis parameter, similar to NumPy. I don't recommend it so much, but you can still use it and see it here. All the operations we've seen so far still work. The most important part here is the broadcasting operation that we're going to do with a series. We're going to create a new series, crisis, and I'm going to show you what it looks like. So we have crisis here, and we're going to perform a broadcasting operation between this data frame and crisis; let me show you what the data frame looks like first. The result will be that we subtract, what's this number, 1 million, from each GDP value here, and we subtract 0.3 from each HDI value. What you can see here is, again, this alignment between columns and indices: the GDP here is matched with this GDP, and the HDI is matched with this HDI. They are two different objects, two independent objects, the series and the data frame. But when we combine them with an operation like this, the columns, GDP and HDI in this case, are aligned and they work together.
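Dropping rows and columns, and the broadcasting operation with the crisis series, might look like the following sketch. The crisis values (minus 1 million GDP and minus 0.3 HDI) follow the transcript; the data frame figures are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 60.665, 127.061],
        "GDP": [1785387, 2167744, 4602367],
        "HDI": [0.913, 0.873, 0.891],
    },
    index=["Canada", "Italy", "Japan"],
)

no_canada = df.drop("Canada")                        # drop one row by index
no_canada_japan = df.drop(["Canada", "Japan"])       # drop several rows
no_pop_hdi = df.drop(columns=["Population", "HDI"])  # drop columns
same_with_axis = df.drop(["Population", "HDI"], axis=1)  # axis-based form

# The index of the series (GDP, HDI) is aligned with the column names.
crisis = pd.Series([-1_000_000, -0.3], index=["GDP", "HDI"])
adjusted = df[["GDP", "HDI"]] + crisis

print(adjusted)
```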
So you're going to subtract this value from all the values in this column, and this other value from all the values in that column. That's the way it works. Moving forward, what about modifying data frames? I want to show you something. When we were dropping stuff before, we were not actually modifying the data frame. Here we did df.drop Canada, but df still has Canada in it. That's because, similar to what happened with NumPy, these operations are all immutable. They are not changing the underlying data frame; we are creating new data frames that store the result of the given operation. So in this case, you drop Canada, and the result is this new data frame, but the underlying data frame is not changed. That's because, again, they are immutable operations. 99.9% of the operations in pandas are immutable. There are ways to change that, ways to make the changes permanent, but for now I want you to think that everything is immutable. Whenever you perform an operation, it's going to create a new data frame or series. If you want to keep the result, you just need to do something like df2 equals that, or even df equals that, to replace the current data frame. Again, there is a way to avoid that, and we're going to see it in a sec. So, modifying data frames more explicitly: how can you create a new column? Very simple: just assign a column. It's similar to going to the spreadsheet here and typing Language in a new column. Oh, it's read-only, but I would say Language equals, and then just write whatever I want.
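The point that drop returns a new data frame and leaves the original untouched can be checked with a tiny sketch; the figures are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"Population": [35.467, 63.951], "HDI": [0.913, 0.888]},
    index=["Canada", "France"],
)

# drop returns a new data frame; df itself is not changed.
result = df.drop("Canada")
print("Canada" in df.index)       # True: the original still has Canada
print("Canada" in result.index)   # False: only the new frame lost it

# To keep the change, store the result under a new (or the same) name.
df2 = df.drop("Canada")
```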
In this case, let me show you what langs had. It was a tiny series; it didn't have elements for all the indices in the data frame, but that doesn't matter. Pandas will match all the indices that exist and leave the rest as NaN. This NaN is what we use for a blank; it stands for "not a number" and comes from NumPy. We're going to talk more about it when we start doing data cleaning. So again, langs had France, Germany, and Italy, and you can see the values are all up there. What happens if the Language column already exists and you want to change it? In this case, we're going to say df Language equals English. We change it all together: df is now affected, and all the values of Language will be English. How can you tell when an operation is changing the underlying data frame, the underlying series, or the underlying NumPy array? It's usually when you have an equals symbol. Remember, in NumPy we saw something like plus-equals. Whenever you have an equals symbol, you're modifying the underlying data frame. For example, check this out. The rename method of a data frame lets you pass columns and indices to rename. In this case, we want to change United States to USA, United Kingdom to UK, and Argentina to AR. Argentina doesn't exist in this data frame, but that doesn't cause a problem, and that's what I wanted to show you. USA and UK were renamed correctly, HDI was renamed correctly, and the column that doesn't exist didn't cause any problems. Now, why am I showing you this? Because, remember, these operations are immutable.
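Adding a column from a shorter series, overwriting it, and renaming labels could be sketched as follows. The language values, the new HDI column name, and the non-existent column "Some Missing Column" are hypothetical choices for illustration; the figures are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
        "HDI": [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    },
    index=["Canada", "France", "Germany", "Italy",
           "Japan", "United Kingdom", "United States"],
)

# A short series: only three of the seven index labels are present.
langs = pd.Series(
    ["French", "German", "Italian"],
    index=["France", "Germany", "Italy"],
    name="Language",
)

df["Language"] = langs                   # labels are aligned; the rest become NaN
missing = df["Language"].isna().sum()

df["Language"] = "English"               # overwrite the whole column with one value

# rename returns a new data frame; labels that do not exist are ignored.
renamed = df.rename(
    columns={"HDI": "Human Development Index", "Some Missing Column": "SMC"},
    index={"United States": "USA", "United Kingdom": "UK", "Argentina": "AR"},
)
print(renamed)
```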
If I check the state of the data frame, we see that the original data frame has not been changed. HDI is still HDI; it doesn't matter that we renamed it before, it's still the same data frame. The same thing goes for the indices: all these operations are immutable. There are a few more examples of modifying data for you to look at. Something that is very common for us is creating columns that are combinations of other columns. Again, this spreadsheet is read-only, but you can imagine what I could do here: something like GDP per capita. If I go here and add a column, GDP per capita, I'd say it's equal to the GDP column divided by the Population column. So I'd do something like C3 divided by B3, and then extend the formula all the way down. In pandas, we can do something very similar. We can take any columns and perform broadcasting operations between them, in this case GDP divided by Population, and the result, right there, is a series. We are going to assign that series to a new column. So GDP per capita, there you go, is now a column of our data frame. Again, all these broadcasting operations are extremely fast, they are backed by NumPy arrays, and they result in a series. Now, very quickly, statistical information: there are a few methods to do summary statistics. We saw them with the describe method, but minimum, maximum, mean, median, all of that works as expected. Something that I want you to note here, and I'm going to change colors and use red, is that with pandas you have this concept of a data frame that has multiple columns and multiple rows, and many of these operations result in just one series. So in pandas, you have your data frame, and you have your series. And we could say we also have individual numbers.
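Creating a column from other columns, the pandas equivalent of the spreadsheet formula above, can be sketched as follows, together with a few summary statistics. The figures are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Population": [35.467, 63.951, 80.940],
        "GDP": [1785387, 2833687, 3874437],
    },
    index=["Canada", "France", "Germany"],
)

# Dividing two columns is a vectorized operation that returns a series.
ratio = df["GDP"] / df["Population"]

# Assigning that series creates the new column.
df["GDP Per Capita"] = ratio

# Summary statistics on a column return individual numbers.
pop = df["Population"]
print(pop.min(), pop.max(), pop.mean(), pop.median())
print(df)
```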
And it's like the data frame is always resolving back to this: some operations will just return a series, and the series can then be used in a data frame. In this case, the operation resulted in a series, and then we used that series to set the value of a column. That's why understanding series is so important. There are a few more assignment exercises for you here, so you can check them out and complete them; it's going to make a little more sense once you're working with it. Finally, I want to give you a very quick introduction to reading external data and plotting. To do that, we're going to use a few functions that are very popular. Maybe we can look them up very quickly here: the read_csv function from pandas. And just as we have read_csv, we actually have a few others: read_sql, read_excel, read_xml, read_json, there are multiple ones. read_html will even automatically parse an HTML page and read it. Right here is the structure of read_csv. These functions let us import data from an external source into our pandas workflow. In this case, what we're going to read is this BTC market price file. It's right here; if I open the CSV, this is what it looks like. It has the date the price was read and the value: the timestamp and the value. As a side note, this is the price of Bitcoin in 2017; now it's close to $9,000, I think. But that's just a side note. Again, this is a CSV, and this is the CSV that we're going to be reading. To do that, we're going to use read_csv, which will automatically parse the CSV, as expected. And there you go. The process now will be for us to start tuning it to get it to the right point.
So I'm going to show you a few customizations we can do with the read_csv function. But first, let me tell you, we have a ton of parameters here, so there is a ton of customization available in read_csv. You will not remember everything off the top of your head, so don't worry; you can always go back to the documentation and just practice, and it's going to come naturally. The first thing: the first row of the CSV was considered to be the column names. In this case, the file doesn't have column names. Let's say I add them: I type Timestamp, Price, save the file, and re-read it. There you go. So by default, pandas assumes that the first line of the CSV holds the column names. I'm going to put the file back the way it was and show you again: that's the assumption pandas is making. We're of course going to change that assumption, because in this case our CSV file does not have column names. So we just say header=None. And this is where we start seeing the parameters we're going to use from the read_csv function. When I pass header=None, that means: don't try to infer a header from the CSV file. The columns are now 0 and 1. So now I'm going to change the columns and name them Timestamp and Price. And now I'm going to show you the first rows. You can see that I'm calling this df.head method; that's because this is a fairly large file. Well, not that long, but at least it doesn't fit on my screen. What's the shape of the CSV, or rather the DataFrame? It has 365 rows and two columns.
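The header behaviour described above can be sketched as follows. To keep the example self-contained it reads from an in-memory string instead of the tutorial's file, and the three rows are invented stand-ins for the real price data.

```python
import io

import pandas as pd

# Made-up rows standing in for the tutorial's CSV: timestamp,price with no header line.
csv_text = (
    "2017-04-02 00:00:00,1000.0\n"
    "2017-04-03 00:00:00,1010.5\n"
    "2017-04-04 00:00:00,1005.25\n"
)

# Default behaviour: the first line is consumed as column names, so a data row is lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# header=None tells pandas not to infer a header; the columns are numbered 0 and 1.
df = pd.read_csv(io.StringIO(csv_text), header=None)
default_columns = list(df.columns)

# Rename the columns afterwards, then inspect the first rows and the shape.
df.columns = ["Timestamp", "Price"]
first_rows = df.head()
n_rows, n_cols = df.shape
```

With the default call, the first data row ends up as the header, which is why `inferred` has one row fewer than `df`.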
So we can do df.info(), for example, to get a little more detail: we have 365 values, there are no null values, Price is a float, and Timestamp is an object, which we're going to fix in a second. Sorry, df.head and df.tail are the methods we use to get either the first n rows or the last n rows, which is five rows by default. You can change that and say, show me the last three rows, for example. And again, the dtypes: in this case, the Timestamp column was not properly parsed as a date; it was parsed as an object, as a string, which we don't want. So we're going to use the function pd.to_datetime, something we're going to explore in more detail in the data cleaning part of this tutorial. We use the to_datetime function to turn the column df['Timestamp'] into an actual date. Then we say df['Timestamp'] equals the result of that function, and now everything looks as expected. There is one more change we want to make: we want to set the index of the DataFrame to be the timestamp, because by doing so we can quickly access price information. Let me see what the price of Bitcoin was on 2017-09-29. I made a mistake here, I forgot the .loc. There you go, so we have the value of Bitcoin on that particular date. Remember that to get the value from a particular row, you have to use .loc. Because we've made the timestamp the index, we get that value directly from the index. Now, what happens if you want to turn this into an automated script, for example one that runs every day at 5am? We want to read the CSV, set the columns, rename them, turn them into timestamps, and so on. This is what we've done so far.
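A minimal sketch of the steps walked through above: read without a header, name the columns, convert the timestamp, set it as the index, and look up one date with .loc. The rows and prices are invented for the example, so the looked-up value is not the real Bitcoin price.

```python
import io

import pandas as pd

# Invented rows around the date used in the talk; prices are placeholders.
csv_text = (
    "2017-09-28 00:00:00,4100.0\n"
    "2017-09-29 00:00:00,4200.0\n"
    "2017-09-30 00:00:00,4300.0\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=None)
df.columns = ["Timestamp", "Price"]

# The timestamp is read as text; pd.to_datetime turns it into real datetimes.
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

# Use the timestamp as the index so rows can be selected by date.
df = df.set_index("Timestamp")

# Row access by label goes through .loc.
price_on_day = df.loc["2017-09-29", "Price"]
last_two = df.tail(2)
```

Without `.loc`, `df["2017-09-29"]` would be interpreted as a column lookup, which is the mistake mentioned in the talk.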
Read the CSV without a header, create the columns, turn the timestamp into a datetime, and assign it to the index. And that's the result. Well, actually, the read_csv function is so powerful that it will let us do all these actions in just one call. There are parameters that let you customize the behavior to achieve the same result that we got with four lines of code right here. In this case, we say: read this CSV; don't infer a header from the first line, which is something we did already; these are the column names, so we don't need an extra line for them; oh, and by the way, the first column is going to be the index of the DataFrame; and also, parse the dates, because the index is a date. And we have the same result as before. So now I'm going to try the same lookup. There we go, you can see it works. Now, very quickly, pandas plotting. I don't know what this thing is with the vertical scrolling. I want to show you very quickly that creating plots with pandas is a breeze; it's so simple to create a plot. Given a DataFrame, you can always invoke the plot method. What the plot method does is use the matplotlib library, something you can check in the docs if you want, but for now that's not necessary; this will be more than enough. It's just using the regular matplotlib library, as you can see here, which is part of the standard PyData stack. And accessing it through pandas is extremely simple: just df.plot() and you're done. You can customize the plot as you want. We're going to see more details of matplotlib later, so don't worry too much about that for now.
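The single-call version of read_csv and the one-line plot might look like the sketch below. The data is again an invented in-memory stand-in, and the Agg backend is selected only so the example runs without a display; neither is part of the tutorial's notebook.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so this runs headless

import pandas as pd

# Invented rows; prices are placeholders, not real market data.
csv_text = (
    "2017-09-28 00:00:00,4100.0\n"
    "2017-09-29 00:00:00,4200.0\n"
    "2017-09-30 00:00:00,4300.0\n"
)

# One call: no header, explicit column names, first column as index, parsed as dates.
df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["Timestamp", "Price"],
    index_col=0,
    parse_dates=True,
)

price_on_day = df.loc["2017-09-29", "Price"]

# DataFrame.plot delegates to matplotlib and returns the Axes it drew on.
ax = df.plot()
```

This is the kind of call that fits an automated script, since the whole load step is a single expression.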
There is a more challenging example here that I can just run very quickly; you can inspect the process we followed to fix the data. But this is what we have, there we go. What you can see right here is the difference between Bitcoin and Ether in this period of time, and they are both plotted in the same chart. That's because this is the resulting DataFrame: we have Bitcoin on one side and Ether on the other, and we're plotting it right here, creating one plot with all of it. And we notice this empty value right here. So what we can do is go from December 1 up to January 1; we select that period using .loc and just plot it again. And this is what you see right here, the gap that we were seeing. So again, this was the introduction to pandas. We have a real-life example of pandas following up. We also have a bit more data cleaning, and reading all the interesting files and sources of data for getting more data into the pipeline. The idea is to show you how you can import data from Excel and from SQL, and then do the actual processing and analysis. Now it's time to talk about data cleaning. We have arrived at the point in our tutorial where we have pulled the data, I've shown you how to manipulate it with pandas, at least the introduction to data manipulation with pandas, and now it's time to properly fix it. For the sake of brevity, we are skipping a few parts of the data cleaning process. In particular, you're going to find them in this first notebook, which covers the basics: the concepts of missing data in Python and NumPy. We're going to skip a few other things too, but I'm going to mention them in a pretty general form, and then you can of course dig deeper. You can check our courses if you want to know more about it.
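The date-range selection mentioned above (December 1 to January 1, using .loc) can be sketched like this. The two price columns are synthetic numbers and the three-day gap is placed arbitrarily; the example only shows the slicing mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic daily data; values and the position of the gap are made up.
idx = pd.date_range("2017-11-20", "2018-01-10", freq="D")
df = pd.DataFrame(
    {
        "Bitcoin": np.arange(len(idx), dtype=float),
        "Ether": np.arange(len(idx), dtype=float) / 10,
    },
    index=idx,
)
df.loc["2017-12-08":"2017-12-10", "Ether"] = np.nan  # simulate missing values

# With a DatetimeIndex, .loc accepts date strings; both endpoints are included.
window = df.loc["2017-12-01":"2018-01-01"]
missing_in_window = window["Ether"].isna().sum()
```

Calling `window.plot()` would then draw only that period, which makes a gap like this one easier to see.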
Usually, when we talk about data cleaning at a more conceptual level, we talk about a four-step process. The first step is usually finding missing data, which is the simplest problem to identify in a data set: something is missing. So you have car sales data, and there is a car that has no name, or a car that has no price. There is a number missing, or a category missing, or a string missing. And of course, each one of those is going to have a different meaning. Fixing a data set that is missing data can be very simple if you can, for example, just drop the record, or if you can fill in the value. For example, if the price is missing, you can fill it with the average value of the sales data, or something like that. Or it can be very complicated, if the value is important and you can't move forward until you actually find it. That can involve something like picking up the phone, calling your ETL team, and asking why the data is missing. Or, if you're buying the data, you have to call the vendor and ask them why there is data missing when you're paying for it. So it can be a very political process; it depends on your use case. But again, from a technical perspective, identifying missing data and fixing it is going to be extremely simple. Once you have fixed the missing values, assuming the data is not clean yet, you keep going. The second step is looking for invalid values. For example, you have a column that is price, where we're expecting only numbers, and there are strings in it. That's not going to be complicated to identify, and it's not going to be too complicated to fix.
But again, we keep increasing the complexity until, at the end of this data cleaning process, we reach problems that have to do with the domain of the data you're looking at. For example, you have a column that is customer age, and there is a value that is 170. That is not an invalid value; it's a perfectly valid integer. The problem is the domain: speaking about customer age, it is highly unlikely that a customer is 170 years old. So in that case, the value is completely valid, there is no missing data, there are no invalid values; it's just about the domain. And this is when things get very complicated. The age example resonates with all of us, because we all know about human ages. But if you're working as a data analyst in a domain that you don't know much about, you might not be able to judge whether a value is invalid or not. If I'm working in a biology lab and I have something like white cell count per milliliter of blood, I don't know what's a good value and what's an invalid value. You need to know the domain. That's usually the most complicated part of data cleaning: when you reach the point where everything is valid, everything checks out, and now you need to make sure that each value is valid for the domain you're working in. So this is the spectrum we're going to be revisiting today. To get things started: the way pandas works with null values is that it has four functions, some of which are actually synonyms. It's going to be relatively simple, just trust me on that. A few things first: everything that pandas does when handling missing values is related to the way NumPy works. Again, we're skipping that; you can go to that notebook and check it out by yourself, but it's extremely simple.
NumPy has this object, NaN (not a number), to identify a missing or null value. In the Python world we have the None value, but in pandas and NumPy we're going to use NaN and None. At the beginning, we have these two functions, isnull and isna, which are complete synonyms. We also have notnull and notna, and they're also complete synonyms. So null and na mean the same thing for pandas; you can use the one you prefer. Personally, I like isna, because it's the way I learned it. For my students I usually recommend isnull, because it feels more correct and more self-explanatory. If you can use isnull, I think that's going to be better; if you get used to isna, then you're on my side. Just do whatever you prefer. So isnull is going to say True or False depending on whether the value is NaN or None. And of course notnull, or notna, is going to be the opposite: notna of not-a-number is False, and notna of 3 is True. If you go through that first notebook, you're going to see all the falsy and truthy values in detail. In terms of Python, anything that is not empty, None, and so on is considered truthy. So anything you pass here that, again, is not an empty string or a None is going to be considered a truthy value. isnull and notnull, or isna and notna, also work with entire Series or entire DataFrames; it's not just for one value. You can pass an entire Series, and the result tells you which values in the Series are null or not null, depending on the question you're asking. In this case, we ask which values of the Series are null: we go value by value, and only the null one comes back as True. And it's the opposite for the next function we apply.
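A short sketch of the four functions named above, applied first to single values and then to a whole Series. The Series contents are arbitrary example values.

```python
import numpy as np
import pandas as pd

# isnull/isna are synonyms, and so are notnull/notna.
nan_is_null = pd.isnull(np.nan)
none_is_null = pd.isnull(None)
nan_is_na = pd.isna(np.nan)
three_not_null = pd.notnull(3)
nan_not_na = pd.notna(np.nan)

# The same functions accept a whole Series and answer element by element.
s = pd.Series([1, np.nan, 7])
null_mask = pd.isnull(s)
not_null_mask = pd.notnull(s)
```

The two masks are always element-wise opposites of each other.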
And again, the same thing works with an entire DataFrame. A few tricks we usually apply with notnull and isnull are the count, or actually the sum, of all the null or not-null values. We have this entire Series, and we can ask how many not-null values we have: if we sum the result of notnull, we get the total number of not-null values. The same thing happens with isnull: if I do isnull().sum() here, we get how many nulls we have, which is pretty much the opposite question. The way it works is that in Python, Booleans are pretty much integers; they're ones and zeros. Every True value counts as one and every False counts as zero. So if you ask for the sum of a Boolean Series, you get the number of True values in that Series. In this case, we have two null values, we ask how many null values we have with isnull().sum(), and we get 2. You can use these tricks to filter the data in a Series. In this case, we can say, give me all the values that are not null. Also, something interesting is that, both for DataFrames and for Series, the notnull, isnull, isna, and notna functions also work as methods. So instead of pd.isnull(s), we can say s.isnull(), or s.isna(). So now it gets a little bit simpler.
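The counting and filtering tricks above, sketched on a small Series with two missing values (the numbers themselves are arbitrary):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# True counts as 1 and False as 0, so summing a Boolean mask counts the True values.
not_null_count = s.notnull().sum()
null_count = s.isnull().sum()

# A Boolean mask can also be used to filter: keep only the non-null values.
valid = s[s.notnull()]

# The functions are available as methods too: pd.isnull(s) and s.isnull() agree.
same_result = pd.isnull(s).equals(s.isnull())
```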
But if the final objective of this code, s[s.notnull()], selecting only the values that are not null, was to drop the null values, then there is a simpler form, which is dropna. In this case, we can say s.dropna(), and we're basically doing the same thing that happens here: we're excluding all the missing values in the Series or the DataFrame, because this also works for DataFrames. One important thing to remember is that all these methods are immutable. We are not actually modifying the original Series; the underlying Series is untouched, and a new Series is returned. So if I evaluate s again, nothing has changed: you're creating a new Series, and that's the one that doesn't have the missing values. Everything we've said also works for DataFrames. With this DataFrame, the first thing is usually to start with the info method. We see that there are four entries in total, four rows. We can also check the shape if we need more information about the structure of our DataFrame. So there are four entries in our index. Column A has only two non-null values, which means two of its values are null. Column B has three non-null values, which means one value must be null. So info usually gets you very close to understanding the structure of your DataFrame and how many values are missing. The same thing works with sum: we can do df.isnull().sum(), or isna, and we get a quick reference of how many null values we have in each column of that DataFrame. dropna works the same way, but there is a significant difference. The way dropna works on a DataFrame by default is by dropping any row that has at least one null value.
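A sketch of dropna on a Series and on a four-row DataFrame shaped like the one described above (two non-null values in Column A, three in Column B). The concrete numbers are illustrative and may not match the notebook exactly.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# dropna returns a new Series without the missing values; s itself is unchanged.
cleaned = s.dropna()

# Illustrative DataFrame with four rows and a few holes.
df = pd.DataFrame({
    "Column A": [1, np.nan, 30, np.nan],
    "Column B": [2, 8, 31, np.nan],
    "Column C": [np.nan, 9, 32, 100],
    "Column D": [5, 8, 34, 110],
})

# Null count per column.
nulls_per_column = df.isnull().sum()

# Default dropna on a DataFrame: drop every row with at least one null value.
complete_rows = df.dropna()
```

Only the row at index 2 survives the default `dropna`, because every other row has at least one hole.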
So this row has a null value: dropped. This row has a null value: dropped. This row has two null values: dropped. This is the only one that is not being dropped. So it's very harsh in that respect. You can change that to work on columns instead, keeping only the columns that have no null values, by switching to axis=1. There is also a way to choose a strategy or a threshold, for example to only delete rows that have fewer than three valid values. In that case, you use the how parameter of dropna: you say, drop the rows (or the columns, because this also works for columns) that have all their values null; or, and this is the default behavior, drop all the rows that have any null value. Or you specify a threshold, which basically says: I need this many valid values in order to keep the row. That's the way it works: which ones to drop and which ones to keep is based on the threshold. Once you have identified the null values, it's extremely simple to clean them, sorry, to fix them. The first method we're going to see is fillna with a particular value. We say: from this Series, fill the missing values, the NaNs, with the number zero in this case. So these two are now zero. Or, of course, you can use any statistical method you want; in this case, we can use the mean. Remember, this is not altering the Series. The original Series is still the same; we're creating a new Series, because all these methods are immutable. The next way this method works is by passing a method argument, which is forward fill (ffill) or backward fill (bfill); those are the possibilities. Basically, it carries the values over from top to bottom, at least with ffill: starting here, it copies this value down here, then this value down here.
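The dropna variations and the fillna calls described above can be sketched as follows, reusing an illustrative four-row DataFrame (the values are examples, not necessarily the notebook's).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column A": [1, np.nan, 30, np.nan],
    "Column B": [2, 8, 31, np.nan],
    "Column C": [np.nan, 9, 32, 100],
    "Column D": [5, 8, 34, 110],
})

# axis=1 works on columns: keep only the columns with no null values.
complete_columns = df.dropna(axis=1)

# how="all" drops a row only if every value in it is null (how="any" is the default).
not_all_null = df.dropna(how="all")

# thresh=3 keeps the rows that have at least three valid (non-null) values.
at_least_three = df.dropna(thresh=3)

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

# fillna returns a new Series; the original s keeps its missing values.
filled_zero = s.fillna(0)
filled_mean = s.fillna(s.mean())
```

Note that `s.mean()` skips the missing values, so the fill value here is the mean of the four valid numbers.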
And now it copies the three down here. As this one is a NaN, it gets replaced, so it's three now, and that three gets carried down again, so this one is three as well. That's what we have right there. And of course, backward fill works the other way: it starts with four, moves it up here, then up here, and so on. You have to be careful when using these, because if you have null values at the beginning or at the end, you're going to end up with null values again: there is nothing to fill forward from, since this is the first value you have in there. All we've seen also works for DataFrames. Both backward fill and forward fill can work along rows or along columns. With this data set, a row-based forward fill gives one, two, and here two again, and then five; that's forward fill with axis=1. If you use forward fill with axis=0, it's a vertical fill: going down here, we get 1, 1, 30, 30 for that column. So it's forward filling either in this direction or in that direction, depending on the axis you pass. Let me put it in the correct form: with axis=0, the fill runs down each column, in this direction; with axis=1, it's row-based, in this direction. We had a null value here that got filled this way. Okay, moving forward, what else do we have? We have checking for null values, and we've pretty much seen this already: you can use isnull and the sum method to get how many null values you have. There are also any and all, which give you a very quick answer. These are usually called Boolean tests: you can ask whether any of the values are valid, or whether all the values are valid, and they're there to build more complicated queries. So far, so good. The process, we said, starts with fixing missing data: missing values, where there is nothing in there.
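Forward and backward filling, sketched with the dedicated ffill and bfill methods. The talk passes a method argument to fillna; recent pandas releases deprecate that form, so this sketch uses the equivalent methods instead. The DataFrame values are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

forward = s.ffill()   # carry the last valid value downwards
backward = s.bfill()  # carry the next valid value upwards

# A leading NaN has nothing before it, so a forward fill leaves it in place.
edge_forward = pd.Series([np.nan, 3, np.nan, 9]).ffill()

df = pd.DataFrame({
    "Column A": [1, np.nan, 30, np.nan],
    "Column B": [2, 8, 31, np.nan],
    "Column C": [np.nan, 9, 32, 100],
    "Column D": [5, 8, 34, 110],
})

down = df.ffill(axis=0)    # vertical fill, down each column
across = df.ffill(axis=1)  # horizontal fill, along each row

# Boolean tests on a mask: is any value null? are all values null?
has_nulls = bool(s.isnull().any())
all_nulls = bool(s.isnull().all())
```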
We have read a DataFrame, where's our DataFrame, right here. We have read our DataFrame from a CSV or from a database, and a value is missing; there is a hole in it. We have quickly identified it with isna or isnull, and we were able to drop the ones we didn't want to keep with dropna, or fill the values we wanted to fill with fillna. That was simple: isna, dropna, fillna. What happens when you're cleaning data that actually has a value, so there is nothing missing, but those values are invalid? For example, here the Sex column is a categorical column that only accepts M or F; D and the question mark are invalid. It's very simple to spot an invalid value here, because it's completely out of scope. The same goes, for example, for a question mark in the Age column, where we'd have a string in the Age column; it's very simple to identify that. How are we going to clean those? Let's start with Sex, because it's simpler. The first check we can do is with either unique or value_counts; I'm going to use value_counts. We've seen this method before. It's a quick summary of all the unique values you have, and value_counts also gives you a total count for each of those values. How can you fix them? Well, there is a replace method, which is extremely intuitive. In this case, we're changing all the D's to F's and the N's to M's, and it can work on multiple columns. Then there are the values that, as we said, are more complicated to fix. In this case, the age is 290, and we know, because we know the domain, that 290 is an invalid age for a human. In those cases we usually need more complicated fixes, and that will involve more programming; that's the reality, you have to be better at coding. In this case, we know that this value is invalid because it probably has an extra zero.
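A sketch of the value_counts check and the replace fixes described above. The DataFrame is a small illustrative stand-in with one stray 'D', one '?' and one age of 290; the exact rows are assumptions.

```python
import pandas as pd

# Illustrative data: 'D' and '?' are invalid categories, 290 is an implausible age.
df = pd.DataFrame({
    "Sex": ["M", "F", "F", "D", "?"],
    "Age": [29, 30, 24, 290, 25],
})

# value_counts lists every distinct value together with how often it appears.
counts = df["Sex"].value_counts()

# replace with a mapping: D becomes F and N becomes M; values not listed are untouched.
fixed_sex = df["Sex"].replace({"D": "F", "N": "M"})

# A nested mapping applies different replacements to different columns at once.
fixed_both = df.replace({"Sex": {"D": "F", "N": "M"}, "Age": {290: 29}})
```

The '?' survives both calls because the mapping says nothing about it; deciding what it should become is a separate judgment.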
Imagine you're pulling a CSV with ages, and there are values like 180, 290, 320: invalid values in the hundreds. That's because there were typos when the ages were entered. How are you going to fix that? Well, it involves a little more programming: we're dividing them by 10. Something else that may be useful is dealing with duplicates. We first need to define what counts as a duplicate value, and this is usually a little more political, if you want. In this case, we have a Series that contains ambassadors: the ambassador is the index, and the country of the ambassador is the value. That's the important part. The premise here is that we're hosting a party and we want to invite one ambassador per country; we don't want repeated ambassadors for a country. With our human eyes, at least, we can clearly and quickly see that these two belong to the same country, and these three belong to the same country. But again, we have to define which ones are the duplicates and which ones are not. For example, maybe we say the first one is not a duplicate and the rest are, or we say the last one is the one I invite, so that one is not the duplicate. So we're going to have political rules, if you want, for each one of those. Let's see the duplicated method and the way it works by default. By default, duplicated returns True for duplicates, but, to put it the other way around, it does not treat the first instance it sees as a duplicate. The method is walking top down, asking: do I have France? No, I don't have France, so I'm going to keep it here.
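The "extra zero" age fix mentioned at the start of the passage above could be sketched like this. The cut-off of 100 is an assumption made for the example, and integer division is used so the column stays integer-typed; the talk only says the values are divided by 10.

```python
import pandas as pd

# Illustrative ages; 290 is assumed to be 29 with an extra zero typed in.
df = pd.DataFrame({"Age": [29, 30, 24, 290, 25]})

# Assumption: anything above 100 is treated as a typo in this domain.
too_old = df["Age"] > 100

# Fix only the flagged rows, leaving the valid ages as they are.
df.loc[too_old, "Age"] = df.loc[too_old, "Age"] // 10
```

Restricting the fix to the flagged rows matters: dividing the whole column would corrupt the ages that were already valid.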
Because it's the first time I see France. Do I have the UK? No, I don't have the UK, so it keeps it here. Then it sees the UK again and realizes the UK is already present, so this one is considered a duplicate. Italy is here, and that's fine. The first occurrence of Germany is fine too. But then it sees Germany two more times, realizes Germany was already there, and so those are duplicates. That's the way it works by default. We can change that with keep='last', so that the last element is not considered a duplicate and the other two are. The same thing here: Kim is now the one considered a duplicate. So it's either top down or bottom up, depending on the parameter you pass: either the default, keep='first', or keep='last'. Or you can be a little harsher and say that everything that is duplicated needs to be considered a duplicate, with keep=False. So these two are duplicates, and these three are all duplicates, as you can see right there. Similar to the duplicated method, which tells you which values are duplicated and helps you identify them, you also have drop_duplicates. What this method does is basically the same thing as before, but it drops all the values that are flagged as True: if a value is marked as a duplicate, it just drops it. And the same rules apply: first (the default), last, and False. Now for subsets. In this case, we have multiple players in the DataFrame, and this player, Kobe, is present three times; with our human eyes we see Kobe three times. The way we're going to think about duplicates here is by understanding the correct subset of columns that we should check.
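The three keep rules can be sketched on a Series shaped like the ambassadors example. The countries follow the walkthrough above; the index labels are placeholders rather than the real names used in the notebook.

```python
import pandas as pd

# Countries as in the walkthrough; index labels are hypothetical placeholders.
ambassadors = pd.Series(
    ["France", "United Kingdom", "United Kingdom", "Italy",
     "Germany", "Germany", "Germany"],
    index=[f"Ambassador {i}" for i in range(1, 8)],
)

first_kept = ambassadors.duplicated()            # default keep="first"
last_kept = ambassadors.duplicated(keep="last")  # the last occurrence is not flagged
all_flagged = ambassadors.duplicated(keep=False) # every repeated value is flagged

# drop_duplicates applies the same rules but removes the flagged entries.
one_per_country = ambassadors.drop_duplicates()
only_unique = ambassadors.drop_duplicates(keep=False)
```

With `keep=False`, only France and Italy remain, since every country that appears more than once is dropped entirely.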
In this case, Kobe playing as an SG appears two times, but Kobe playing as an SF could be considered a different player, if you want, because maybe it's a different season, or a different position he played. So we need to say which subset we are going to consider for duplicates: only check the Name column, or don't restrict the check at all, which is the default and compares entire rows. When that happens, these two are considered duplicates, and with the default rule this second occurrence is the duplicate one. If we pass keep=False, both are considered duplicates. And the last one is a completely different row, because the value in the position column is different. That's the way it works here. Moving forward with more cleaning of values, we're going to talk about string handling. This is a very neat feature of pandas: special types of columns have special attributes, depending on the column type. With df.info() we see this column is an object, which means a string. In pandas, all the string columns have this special attribute, which is str. All the datetime columns, something we're not going to cover but you need to know about, have a .dt attribute, and all the categorical columns have a .cat attribute. Those attributes, str, dt, and cat, have special methods associated with the domain of that column. All the methods under str are, of course, for string handling, and all the methods under dt are for date handling. In this case, we're going to review, not all of them, but a very good subset of the string methods we can apply. Something interesting is that all these methods are closely related to the ones in pure Python. If you have a pure Python string, there's a split method.
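The subset rule for duplicates in a DataFrame, discussed in the first part of the passage above, might be sketched as follows. The column names Name and Pos and the other two players are hypothetical; only the Kobe rows mirror the situation described.

```python
import pandas as pd

# Hypothetical roster: Kobe appears twice as SG and once as SF.
players = pd.DataFrame({
    "Name": ["Kobe", "Player B", "Kobe", "Player C", "Kobe"],
    "Pos": ["SG", "SF", "SG", "SF", "SF"],
})

# Default: whole rows are compared, so only the second (Kobe, SG) row is a duplicate.
row_dupes = players.duplicated()

# keep=False flags both (Kobe, SG) rows.
row_dupes_all = players.duplicated(keep=False)

# subset=["Name"] compares the name only, so every later Kobe row is a duplicate.
name_dupes = players.duplicated(subset=["Name"])

# drop_duplicates accepts the same subset and keep arguments.
last_per_name = players.drop_duplicates(subset=["Name"], keep="last")
```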
There is a contains method, or, I don't know if there is an actual contains; I think it's the in operator. But there is a strip, and there is a replace. So most of the methods under the str attribute in pandas have an analog in the standard string handling of Python. Starting at the beginning with this data (I'm going to delete this), what we're going to do is split the values by an underscore. That's what we have: we have split all the values on that underscore. Then we use the special parameter expand=True, and what it does is create a DataFrame out of the result. So we get a DataFrame with several columns, and this is what we have now. We can keep applying methods: for example contains, or contains with regular expressions, just for you to see the power of it. We can strip and replace, and we can even use regular expressions when replacing, so we could fix something like this question mark in a string with regular expressions, if you know how to handle them. And finally, something that is going to be very helpful when you're doing data cleaning is looking at the data from a visualization perspective. Data cleaning has a lot to do with a statistical understanding of your data: when a value is considered an outlier, for example, it might be invalid, and you want to clean it. But that's a lot more about statistics. In this case, I want to show you very quickly the matplotlib library, which I've been promising for some time now. So far, we've accessed it through pandas, by calling plot on a DataFrame. The matplotlib library is the one backing all those methods, and now we're going to see how to use it directly.
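A sketch of the str accessor methods mentioned above (split with expand=True, contains, strip, and regex replace). The underscore-separated strings are invented examples with a stray space and a stray question mark.

```python
import pandas as pd

# Invented records: year_sex_country_count, with some dirt in them.
s = pd.Series(["1987_M_US _1", "1990?_M_UK_1", "1992_F_US_2"])

# split on the underscore; expand=True spreads the pieces into DataFrame columns.
parts = s.str.split("_", expand=True)

# contains takes a regular expression by default; the question mark must be escaped.
has_question = parts[0].str.contains(r"\?")

# strip removes surrounding whitespace, like str.strip in plain Python.
countries = parts[2].str.strip()

# replace with regex=True removes the stray question mark.
years = parts[0].str.replace(r"\?", "", regex=True)
```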
The matplotlib library has two important APIs, as we're going to call them. One is the global API, which is the one I don't prefer, but it's the most common one; it's the one you're going to find around. The second one is the object-oriented API. They're just two different ways of doing the same thing. The global API is in part inspired by MATLAB. It's been around for a long time, and, sadly, most of the answers you find on Stack Overflow, in tutorials, and in books will be using this global API. The one I prefer, and I'm going to explain why in a second, is the object-oriented API. But I want to show you both, so you have a reference. If you follow me in preferring the object-oriented API, you will always have to translate from global to OOP. Why is it considered a global API? Well, we have imported matplotlib.pyplot as plt, so we have imported the whole Python module. Depending on how much you know about Python programming, that is going to make sense or not. We have imported the whole module, and now we're invoking plt.figure, then we set a title, and finally we're plotting two things: x squared and minus x squared. Why is this global? Because we're invoking functions at module level, and there is an object, the final plot, that is being modified by these very generic, global calls. By making this call right here, I'm modifying the final result of the plot. Let me show you a more complicated example, so you can see the problems with the global API. Look at this line; imagine we delete everything else. Let's actually delete everything.
What is this line doing? Which plot is it affecting? You do not know. There is no object-oriented way of saying: in this second plot, the subplot on the right, I want you to plot this thing. You're just saying it to the entire module, and the order in which you say it determines which plot it's going to land in. Again, it's a global API. So we start by saying: I'm going to create a figure, so from now on I'm going to start drawing on it. This is going to be the title. And by the way, it's going to have one row and two columns, and I'm going to start drawing in the first plot, this one right here on the left. So now I have kind of activated that plot, if you want; it's active. Every action that happens after this line is going to affect this plot. So then I plot x and x squared, I plot this vertical line, I put a legend, I set labels, etc. And at some point I just stop and say: now I want to switch plots, I want to start plotting here in this second one, so I change the call from the first line. The way it works is by saying: one row, two columns, second plot. So now every successive line will affect that plot. And again, you can see that understanding code that depends on the order in the sequence of lines is very hard. If you have to debug a report that has a plot that takes 100 lines, then you have to keep in your head what's happening, top down.
A different approach is the object-oriented approach, in which we're creating a figure and we're creating axes. So in this case we have, right here, one entire figure in red.
And in here, in purple, we have two axes: this is axes one, and this is axes two. We're going to create those using an object-oriented approach, and we're going to keep references to them. So later we're going to say: on these plots (these artists, sorry) I want to plot something. And that will be very explicit; it's going to be an object-oriented way. So the first thing is creating the figure and the axes. In this case we have just one axes, that's it, but you can have more. And then you say: on this axes I want to plot this thing, on this axes I want to plot that thing, etc. When you have multiple axes, as in this case in which we have four, we create one figure with four axes. We do it with the subplots method, passing the number of rows and columns. Now we say: on axes number one I want to plot this thing, on axes number two I want to plot that thing. So it's one, two, three, four. And now it's a lot more explicit; it doesn't depend on the order. I could change this order and it doesn't matter, the results are going to be the same: axes number four gets the yellow line regardless of the order we follow.
Now that we've cleared up the differences between both APIs: matplotlib has this very simple plot function, or method, depending on whether you're using the object-oriented or the global API, that will plot whatever you specify. In this case we're passing all the values in x and all the values in y. And we're also passing a given line style, using this type of syntax. You're saying: I'm plotting this thing in x, I'm plotting this thing in y, and I want you to use a straight line with this marker, the dot, and in green. So this is for when you are very familiar with it.
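
To make the object-oriented approach just described a bit more concrete, here is a minimal sketch. The data, the colors other than yellow and green, and the 2x2 layout are assumptions picked for illustration; they are not the exact notebook code. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10, 11)

# One figure holding a 2x2 grid of axes; we keep a reference to each one.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

# The order of these calls does not matter: every call names its target axes.
ax4.plot(x, -(x ** 2), color="yellow")
ax1.plot(x, x ** 2, "go-")  # format string: green, circle marker, solid line
ax2.plot(x, x, linestyle="--", marker="o", color="red")  # same idea with keywords
ax3.plot(x, -x, color="blue")

ax1.set_title("x squared")
```

Moving the ax4 line to the bottom of the block would give the same figure, which is the point of keeping explicit references. In a notebook the figure is displayed automatically; in a script you could call fig.savefig with a file name of your choice.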
If you're very familiar with matplotlib you can use this syntax. Otherwise, you can just pass linestyle, marker and color: specific keyword arguments for each one of those. So do we only have line plots in matplotlib? No, of course not. We have a huge variety of plots. And by the way, there is another thing here if you want to see something more advanced: grids. You can create these grids and put different things in them. And again, not only line plots. One good example is a nice scatter plot. Basically, we're plotting x and y to see a correlation, and there is also a color map, so given the value there is going to be a change in color. This kind of lets you plot three to four dimensions of your data: the value of x, the value of y, the size of the bubble, and the color of the bubble. So you're pretty much encoding four dimensions in just one figure. In this case we're just using two different scatter plots; there's more information here.
We can also plot histograms. We've very quickly seen that with pandas, where it's very simple with just plot and kind hist; you can look it up in our previous lessons, just go back to the index of the video. The histogram is extremely simple: it just takes the values you're plotting and how many bins you want, plus some more advanced arguments like the alpha level, etc. And similar to the histogram, you can also create kernel density estimate diagrams, which are very similar to histograms but simulate a continuous distribution, if you want. You can combine these plots: in this case we're plotting a histogram, plotting the lines, and changing the limits. But that's pretty much it. And you can also create bar plots. In this case we have plt.bar, and here we have two bars that are stacked; that's a different way to look at it.
And finally, for checking outliers you can always plot histograms or box plots, so box plots are also a nice feature to have in here. So this was all about data cleaning. Before we keep moving forward with this tutorial, I want to mention one more thing: there are notes here for a kind of task list that you can follow for data cleaning. Here we're identifying missing values in given positions with isnull and isna. And right here we're looking in more detail at some statistical properties of the data, in case we need to clean it. This is a little bit more advanced, and it's related to the concept of cleaning data given the domain. The statistical analysis can tell you that a value is an outlier for this distribution, even though the value itself might be valid. For example, a human being that is 90 years old: that's a valid age. But if you're analyzing data about high school students, a human that is 90 years old is going to be completely invalid, or it's going to be an outlier in that distribution. And you can treat it as invalid and clean it out, remove it for example. So that deals a little bit more with the whole statistical analysis you can follow here; it's a little bit more advanced for this scenario. So let's move forward with the rest of the videos.
Now it's time to get into more advanced features of pandas to import external data. We've seen already, in our real-life example, the way we can import data from CSV files and from SQL databases; we actually had those two lessons. The objective of this part of the tutorial is to show you how you can get into more advanced use cases of importing data. So we're going to start with CSVs and text files. And again, you've seen it already, but here we're going to give it an extra twist, so I'm going to show you more advanced features.
And that includes special use cases. Text files, CSV files: conceptually speaking, a CSV file is a text file. It's just human-readable text that is encoding information. The idea of a CSV file is that it's tabular: it's a plain text file that contains tabular data, with the fields separated. CSV stands for comma-separated values, but the separator can be anything; we'll see more examples later. So both CSV files and text files will be read with the same method.
To get things started, I want to show you the basic way we read data from external sources using Python, without even starting with pandas yet. You don't strictly need to know this, but it's usually productive for data scientists or data analysts to understand a little bit more about how file reading and writing works in computers, because there are multiple concepts involved here: the operating system, processes, your language. It's not the same thing to read a file with R, with Python, or with another language. And even though pandas can make it very simple to read and write data, you can get to more advanced use cases if you know the internals of, again, the operating system, processes and your language. So this is the way we read a file using pure Python: we use the open function. In this case we're using a context manager, which is just a safety feature, again related to the advanced usage of reading and writing files. It creates a file pointer, and with a file pointer you can then use the very simple API exposed by that pointer, which is something like readline, readlines, or read for a number of bytes or characters. Or you can even treat fp as an iterator and just do a for line in fp.
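
As a rough sketch of the pure-Python file API just mentioned: the file name and its contents below are made up, and the file is written to a temporary directory first only so the example is self-contained.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical content) to read from.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as fp:
    fp.write("first line\nsecond line\nthird line\nfourth line\n")

# The context manager closes the file for us, even if an error occurs.
with open(path, "r") as fp:
    first = fp.readline()      # read a single line
    some_chars = fp.read(6)    # read a fixed number of characters
    rest = fp.readlines()      # read all remaining lines into a list

# A file object is also an iterator over its lines.
first_two = []
with open(path, "r") as fp:
    for i, line in enumerate(fp):
        if i >= 2:
            break
        first_two.append(line.strip())
```

Note that every read continues from where the previous one stopped, which is why rest starts in the middle of the second line.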
Basically, we're going to do something like this: we'll start reading data from top to bottom until we hit a given limit; in this case we're doing it just for a couple of lines. It gets very difficult to process text files when you're reading them like this, because it's usually hard to parse the structure of the file. It's not the same thing to have a file that is separated by commas, by colons, by pipes, by spaces, etc. So you're going to see that once you want a little bit more advanced usage, or to get a little bit more fancy with your calculations and the way you parse the data, it's going to get harder. That's why we're going to use pandas. But first I'm going to show you the csv module, which is part of Python. This is the file that we're going to be reading: it's the exam review file, and I'm going to open it. Even though it doesn't look like a CSV, it is indeed a CSV. The difference is that here the separator is not the comma, it's the greater-than sign. That's what marks the boundary between different fields in our CSV file. So we're going to use the csv module, and the way to parse the data using that module is by passing a special delimiter. That's the type of work you might need to do when you're parsing data: it's not the same thing to have a delimiter that is a greater-than sign, and it's not the same thing to have numbers that are enclosed in quotes, for example. All those things change the way you work, and all of this is going to be abstracted away by pandas. So to get things started with pandas: pandas has multiple read_something methods that work for different sources.
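
The csv-module approach described above can be sketched like this. The rows are invented stand-ins, not the real exam review file, and they are read from an in-memory string so the example runs anywhere.

```python
import csv
import io

# Hypothetical data using ">" as the field separator.
raw = "name>score>passed\nAna>9.5>yes\nLuis>4.0>no\n"

reader = csv.reader(io.StringIO(raw), delimiter=">")
header = next(reader)   # first row holds the column names
rows = list(reader)     # remaining rows, each one a list of strings
```

Everything comes back as strings, so converting 9.5 to a float or "yes" to a boolean is still on you; that is part of the work pandas takes over.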
We already saw read_sql, and we've seen read_csv. There's also a read_html to directly parse information from a table: you literally just pass a website and it's going to read the information from a table. Or read_json, or readers for more advanced formats like Parquet or Stata, etc. Each file format will usually have a corresponding method in pandas; I've never had to write my own, to be honest. The same thing happens for something like Excel, which might need external modules. That's not directly provided by pandas, but by installing those modules you can easily incorporate Excel files in your day-to-day work.
The read_csv method already has a ton of parameters. This is the main characteristic of all these read_something methods: given the amount of possibilities you're going to have with these files, there exist a ton of different ways to customize the method invocation. Again, with CSV files we saw that multiple things can happen: CSVs that have a header or don't have a header, different delimiters, different enclosing of strings or numbers, blank lines, etc. And you're able to customize all that with the read_csv method. This is the reference of all the parameters you can pass to it. Something that I do very often, and I use pandas a lot, is still to type read_csv and pull up the documentation right here, to look into the parameters that I think I need to pass for my particular use case. So always keep an eye on the docs, because it's impossible to remember all the parameters of read_csv. In this case, what we're going to do is something very interesting: we're going to parse a CSV file that is not locally available on this computer. The CSV file is this one right here, and if I get the raw version of the source, it's this thing.
So this is the CSV file. What I could do here is download the file: just do File, Save, get the CSV file on my computer and upload it here, drag and drop it here. But pandas has this nice characteristic that it will read a CSV that is either local, as we did with the BTC market price file, or remote. It's automatically going to download the content of the file and keep it in memory for further usage. So that's a very neat feature. And again, this is the CSV file that we're using, and if it's a local file, it works in the same way.
A few features you've seen already: in this case, we can pass header equals None if we don't want to treat the first row as a header. And what about missing values? We can treat some values, like a question mark, an exclamation mark, a dash, etc., as not a number, not a value; it's a missing value. Now any of the values we have passed will be transformed into not-a-number, for an easier cleaning process. We can pass names, which is basically the column names for each column, and we can also specify column types, as you can see right there, so now the types are going to be float and object. We've done this already in one of our lessons: we are parsing the timestamp, and there you go. Putting it all together, we get to these advanced forms of reading CSVs, where we're passing column names, passing types, asking it to parse dates, passing null values, headers, etc. This is a pretty common thing to do. So what about the exam review file? If we try parsing it as is, we get this very ugly format. In this case, the parameter that specifies what the csv module calls delimiter is sep, for separator. We set the separator to the greater-than sign, and that just works as it needs to.
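
Putting those read_csv options together, a minimal sketch could look like the following. The three rows, the column names and the list of missing-value markers are assumptions made up for the example; the parameters themselves (sep, header, names, na_values, dtype, parse_dates) are the ones discussed above.

```python
import io
import pandas as pd

# Hypothetical data: no header row, ">" as separator, "?" marking a missing value.
raw = (
    "2017-04-02>1099.17\n"
    "2017-04-03>?\n"
    "2017-04-04>1133.08\n"
)

df = pd.read_csv(
    io.StringIO(raw),              # a local path or a URL works the same way
    sep=">",                       # field separator
    header=None,                   # the first row is data, not column names
    names=["timestamp", "price"],  # so we provide the column names ourselves
    na_values=["?", "-"],          # markers to be read as missing values (NaN)
    dtype={"price": "float"},      # explicit column type
    parse_dates=["timestamp"],     # parse this column as dates
)
```

The StringIO wrapper is only there to keep the sketch self-contained; in practice you would pass the file name or the remote address directly.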
There are a few more examples you can check here. The most important part is following the documentation to find the options for the particular use cases you're having. For example, something like skip_blank_lines, or what to do whenever there are empty rows at the beginning: if you have empty rows at the beginning, you can also say skiprows, so you don't need to parse those out and it's not going to break. That is all part of the read_csv method. And to finalize this part about CSVs, I'm going to tell you something that applies to pretty much every other data format. As you have a read_something method, there's going to be a to_something method, which is basically the process of writing. So you can do read_csv, or you can do to_csv. This CSV that we imported from the remote source, we can just do to_csv and it's going to store it locally. There are also multiple options to pass: the delimiter, or actually the separator, whether you want to include a header, whether you want to include an index, etc. They're pretty much the same as in the other one. But the idea is that for every read_something method there exists a to_something method that takes care of writing. So let's move forward with a few more data formats. And, interestingly, we're going to get to reading HTML pages directly in just a couple of minutes.
And now it's time to read data from databases. We have already done that in the real example with pandas part of the tutorial, but I want to show you a few more details so you understand how the data is being processed; this is a common scenario for me, importing data from databases. First, the libraries you will need: depending on what database engine you're using (Postgres, MySQL, Oracle, etc.), you will need to install different libraries. But once you have installed those libraries, the APIs are going to be the same.
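
Going back to to_csv for a moment, here is a small sketch of the writing side. The data frame is invented, and it is written to an in-memory buffer instead of a file only so the example leaves nothing on disk; passing a file name works the same way.

```python
import io
import pandas as pd

# Hypothetical data frame to write out.
df = pd.DataFrame({"city": ["Lima", "Oslo"], "population": [10, 1]})

buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)  # custom separator, no index column
text = buffer.getvalue()

# Reading it back needs the same separator.
df_back = pd.read_csv(io.StringIO(text), sep=";")
```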
There's actually a PEP from Python that defines the interface for database libraries, and pandas can work with pretty much any of these common SQL databases that comply with that interface. In this example we're going to use SQLite, because the database is right here: there's no server to connect to, etc., so it's extremely simple to get started. The example database we're going to use is a different one from our previous video. In the previous one we were using Sakila; in this case we're going to be using Chinook, which is smaller both in structure and in size, so it's going to be a little bit simpler. To get things going, I'll do the same thing that we did in our previous part about reading data from files, where I showed you how to read data using plain Python. So forget about pandas for a second. I told you, if we go back to the beginning of time, there was no pandas; this was the way we were reading files: open, fp, fp.readlines, etc. So now I want to show you what predates pandas, what the default way to read data was before pandas, which is with the regular interface from Python. The way it works is we import sqlite3 and create a connection. With this connection we have this common interface that, again, is shared by pretty much any other database you're used to. The default behavior is that we create a cursor and execute queries using that cursor. In this case we're going to execute a regular SELECT * FROM employees LIMIT 5; we want five records out of the employees table. Once you have executed a query, the results are sitting there waiting, and you can do a fetchall to get all the results of that query. And here are all the results. As you'll notice, the result is a list of tuples, so it's not extremely useful.
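
The cursor workflow just described, as a minimal sketch. It uses an in-memory SQLite database with a made-up employees table instead of the Chinook file, so the table layout and the rows are assumptions.

```python
import sqlite3

# In-memory database with a hypothetical employees table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO employees (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York"), ("Linus", "Helsinki"),
     ("Guido", "Haarlem"), ("Margaret", "Boston"), ("Dennis", "Bronxville"),
     ("Barbara", "Los Angeles")],
)
conn.commit()

# Execute the query, then fetch everything the cursor has waiting.
cur.execute("SELECT * FROM employees LIMIT 5")
results = cur.fetchall()  # a list of tuples, one per row

# Always release the cursor and the connection when done.
cur.close()
conn.close()
```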
Now, if you combine it with pandas, you can just create a data frame out of that info. And we're close; it's not perfect, but we're close. Before moving on, we close the cursor and the connection. Let me show you now how we work with pandas. As we have a read_csv method, we also have a read_sql method. The first parameter this method receives is the query that we're passing, and the second parameter is the connection object that pandas will use to talk to the database. So it gets as simple as writing the query, and now everything has been imported into a data frame, including column names and all that. If you want to get a little bit fancier, you can specify the index column, which is going to be used as the index, of course, and also what types to parse for specific columns, for example dates. So now we have pretty much all the work done. We're going from something very manual, processing things with a cursor, etc., which might also be slow, to using pandas to actually import the data from the database.
There is a caveat here that I'm going to tell you about; it's kind of a very deep detail of the way pandas works. The read_sql method is actually a wrapper for two other methods, read_sql_query and read_sql_table. When you're using read_sql, it's actually forwarding the work to either the query or the table variant. read_sql_query is the default behavior, what we've done so far: it just issues a query over the connection and reads the result for you. In contrast, read_sql_table reads an entire table: you just pass a name, and it's going to automatically give you all the information for it, all the column names, etc.
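
Here is a small sketch of read_sql with a plain sqlite3 connection. Again the table and its rows are made up for illustration, not taken from Chinook. read_sql_table is not shown, since it needs a SQLAlchemy connection.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (EmployeeId INTEGER PRIMARY KEY, LastName TEXT, HireDate TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Adams", "2002-08-14 00:00:00"),
     (2, "Edwards", "2002-05-01 00:00:00"),
     (3, "Peacock", "2002-04-01 00:00:00")],
)
conn.commit()

df = pd.read_sql(
    "SELECT * FROM employees",  # first parameter: the query
    conn,                       # second parameter: the connection object
    index_col="EmployeeId",     # column to use as the index
    parse_dates=["HireDate"],   # columns to parse as dates
)
conn.close()
```

Column names come straight from the query result, so there is nothing else to set up.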
So it's a lot simpler to read an entire table. The only thing to keep in mind is that to use this method you need to install the SQLAlchemy library and use a connection generated from it. So in this case we create an engine and a connection object, and now we can pass an actual connection object for pandas to use. Again, it's pretty much the same: if you find yourself doing SELECT * FROM this table, SELECT * FROM that table, it's a lot easier to just use read_sql_table, and that's going to do it for you.
As we saw that the read_csv method had a matching to_csv method, the same thing happens with read_sql: there is a read_sql and there is also a to_sql. What it lets you do is take a data frame and write it into a database table directly. So to_sql receives the name of the table this data frame is going to be written to, and a connection object. Something to keep in mind is that to_sql has an important parameter, which controls what happens if the table already exists. By default it's going to fail; it's just going to throw an error when you try to save data to an existing table. And this makes sense, because as data analysts we're usually reading data and processing it, not so much writing it, so we want to make sure we don't overwrite anything by mistake. But if you actually do want to write data, you can just set this if_exists parameter to something like replace or append. Usually we're writing to intermediate tables. Again, you can choose either to replace the whole content of the table (be careful here) or to append, which just writes at the end of the current table. So that's it for to_sql.
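
The if_exists behavior of to_sql can be sketched as follows, using an in-memory SQLite database. The table name and the data frame are invented for the example.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [9.5, 8.0]})

# First write: the table does not exist yet, so it is created.
df.to_sql("scores", conn, index=False)

# Second write with the default if_exists="fail": pandas raises an error.
try:
    df.to_sql("scores", conn, index=False)
    failed = False
except ValueError:
    failed = True

# Append adds the rows at the end of the existing table.
df.to_sql("scores", conn, index=False, if_exists="append")
appended_count = conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0]

# Replace drops the table and writes it again from scratch.
df.to_sql("scores", conn, index=False, if_exists="replace")
replaced_count = conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0]

conn.close()
```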
So this was the way to read data from databases. Of course, we're not touching on SQL itself and all that, which is a lot more advanced. This is just for you if you already know SQL: if you're already working with databases, you can pretty much copy and paste what we're doing here and you're going to get your data imported into Python. So let's move forward to read some HTML files.
And now, very quickly, I'm going to show you how to read tables, as data frames, directly from HTML web pages. To be honest, the method is simple, it's just read_html, but it depends a lot on the structure of the web page. If the page is not well structured, or the tables are not correctly created, you're going to have issues and you will have to do a ton of data cleaning. In my experience, whenever I try to parse a table from a well-structured site like Wikipedia, or some stats site, it usually works very well. And it's a very quick way of hacking. Whenever you have a question like, I don't know, I need to know the GDP of countries, then instead of looking for a GDP data set you can just go to the Wikipedia page, where there is usually a table, parse it directly, and you are done. So again, it's a relatively simple way to get some data for quick hacking and exploration. The way it's going to work: we have this HTML that was created just for testing purposes, to get started. Usually, of course, you will try to read something from a live website, so you're going to pass the URL to the read_html method, and read_html will download the content of the page and parse it. Let's suppose we already have the content, the HTML, and this is what it looks like. This is exactly the same HTML we have on top; I'm just displaying it here in the notebook. And what we're going to do is invoke the read_html method.
And the read_html method is going to parse the entire HTML and look for multiple tables, not just one. A site will potentially have multiple tables, even if you don't see them, because using tables is a common way to structure things in HTML. That's why it's going to parse multiple tables. In this case we stored them all in dfs, plural, as in multiple data frames, and we see that there is only one. So we're just going to get the first data frame, and it has correctly parsed what we had before. The same is going to happen with things like headings: if the table doesn't have a header, it's going to handle that automatically. So that's pretty much what we know already. In this case, what you're going to see is what I told you before about the data cleaning process: this table does not have a header like the previous one, which had a thead element. In this case, the header is just another row. That's why read_html is going to have issues, and you have to provide a little bit of extra information.
So let's see a more realistic example, where we parse data directly from a website. Let me tell you here that this is just for educational purposes: you always need to check whether the data is public, so that you can actually parse it. For Wikipedia, at least as far as I know, the content is Creative Commons, so you can get your hands on it. What we want to show you here is a very complicated table that has multiple headers, etc.; that's why we're using this example. So we get the URL and directly do nba_tables equals read_html. The only table on this page is this one, the large one, so that works. And now nba is going to be that table, and we see that all the players have been parsed. What about something else? Let's open this page right here on Wikipedia, for The Simpsons.
And here we will probably find several tables. See, we have one right here, this one. So I'm going to import it. We have 27 tables; again, you don't see them, but they are there. And the most important one, the one we care about, is this one right here. The problem you're going to have with this table is that it's using both colspans and rowspans. This column here spans one, two, three columns, and this row here spans at least three rows. Those spans result in this very ugly data frame, and you will need a little bit of extra cleaning. That's a problem you're going to find with HTML tables: usually there are things that are not well formatted for machines, because they are formatted for humans. For example, in this case we have this header repeated. When you parse this data, you're going to find that every 20 rows there is a header row, and you will have to clean it: every 20 rows, in this case, you will need to drop it. You would do something like df.drop. Let's see if we can actually see it; I haven't tried this, but let's just call head with 25, and you're going to see 25 records now. So here, at record 22, we find that header. What you will need to do is something like df.drop with a range starting at 22, going up to df.shape[0], that many rows, plus one, with a step of 20. I don't know if this is going to work; let's just run it. Nope, it didn't even run. Oh, this is the NBA data frame, actually. There you go. So maybe it works; you can check it. But what I'm saying is, again, there is some cleaning to do, because HTML pages are optimized for humans, not for machines, so it's usually going to take a little bit more time. The good news is that there is usually an associated service that you can query instead. For example, there is a Wikipedia API that you can use instead of a page.
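
The repeated-header cleanup attempted above can be sketched on a tiny made-up data frame. The column names, the values and the interval of three rows are all assumptions; the real table repeats its header at a different interval. Two options are shown: dropping by position, as in the demo, and dropping by value, which does not depend on the interval being regular.

```python
import pandas as pd

# Hypothetical parsed table where the header shows up again as a data row.
df = pd.DataFrame(
    [
        ["Alice", "10"],
        ["Bob", "12"],
        ["Player", "Points"],  # repeated header row
        ["Carol", "9"],
        ["Dan", "15"],
        ["Player", "Points"],  # repeated header row
        ["Eve", "11"],
    ],
    columns=["Player", "Points"],
)

# Option 1: drop by position, when the header repeats at a known interval.
header_positions = range(2, df.shape[0], 3)
by_position = df.drop(index=header_positions)

# Option 2: drop every row whose first cell just repeats the column name.
by_value = df[df["Player"] != "Player"].reset_index(drop=True)
by_value["Points"] = by_value["Points"].astype(int)  # now the column is numeric
```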
But again, sometimes it's just easier to pull the data directly from Wikipedia. So that's it. You can also write data to CSV or, of course, to HTML; that's pretty much the standard, as we've said. This is all we had for the reading data portion. We're going to move forward now with a few other methods, especially what we call data wrangling: we're going to do a little bit of grouping and keep moving forward with our tutorial.
We have decided, kind of last minute, to add a final source of external data, which is going to be Excel files. These are just regular Excel files, as you know them. We imagine that you might come from an Excel background, so you can just export the data you have in your Excel spreadsheets, load it into a Jupyter notebook and start working with it in pandas. That way you can try things out and kind of draw the parallels between Excel and what you do with pandas and Python. The first thing is that an Excel file is not a text file. If you try getting the content of it, it's not text, so it's not so simple to parse. That's why it's going to require external tools, which are already installed on Notebooks.ai; elsewhere, how you install them depends on your computer. Just keep in mind that there might be issues when importing data from Excel if there is low compatibility between the library you're using and the spreadsheet version you're using. But without getting into those details, there is a read_excel method, which pretty much takes care of everything for you. It has different parameters, like the sheet that you're reading from and, of course, the path, etc. So we're going to start by reading this file, the products file, which has three sheets: products, descriptions, and merchants. It's actually something we use in our data analysis from Excel to pandas course, to show how to merge data and all that.
From this file, what we're going to do is just read_excel. What you're going to see is that it reads the first sheet of the Excel file; a data frame corresponds to one sheet only. The first one is products, so that's what we are reading. There are different behaviors you can configure: you can change the way you parse headers, etc., or define a specific index; that's pretty much everything we have seen so far. Selecting specific sheets is simple: just pass the sheet name, and you can read either products, merchants, or whatever is available in the current Excel file.
There is another, more specific class that is a little bit more advanced: the ExcelFile class. Here it's not, as we were doing before, read_excel directly reading that Excel file into a data frame. Instead, you instantiate this ExcelFile class with the file name as the parameter, and now this object holds a reference to everything in the file. In this case we can ask, for example, for sheet_names, and it's going to tell you: products, descriptions, merchants. This is a little bit more for exploratory data analysis. Let's say you can't use Excel to actually see the contents of the Excel file; then this is going to be helpful. You first parse the Excel file, get the sheet names, and get a bit more of an understanding of it. And now we can say: from this file that we have previously instantiated right here, parse the products sheet, and that's going to get you that data frame. All the parameters we can pass are the same as for read_excel. Finally, there is also a to_excel method. It works pretty much the same way as to_csv: you decide whether you write the index or not, and you can also decide whether to pass a sheet name or just use the default one.
So as you can see, getting your data from an Excel file, just as from a CSV, into a DataFrame is extremely simple. There are more customizations available. Let's say the contents of your file are shifted, either in rows or in columns; you can change that with the startrow or startcol parameters, and that's going to work too. That's pretty much the only thing we need. If your writing process is a little bit more complicated, for example you want to write specific sheets in a multi-sheet Excel file, you can use what we call an ExcelWriter, which is also part of pandas. You instantiate the writer, and then you start the writing process, saying which sheet you want to write with each of your DataFrames. So again, reading and writing data from and to Excel files is relatively simple. It all depends on which libraries you have installed in your current environment, whether it's Windows or Linux/Mac. The documentation of pd.read_excel might have more details for your platform. If it's not in the docstring, it's going to be in the pandas documentation, but there might be a requirement for each of the platforms that pandas supports. So just check for your own platform, whether you're on Windows, Mac, or Linux, how to get those libraries installed.
So in case you're just getting started with Python, and you might come from another language, the objective of this quick section is to show you Python, ideally in under 10 minutes, although I think it's going to take a little bit more. This is a very, very quick reference of Python: again, just the high-level features of the language, how to use it, how to code functions, how to import modules, variables, data types, collections, etc.
You can just scroll through this notebook if you want to take less time. I will be providing an explanation on top of all the topics, but it's a very good reference of the entire language. So to get things started, Python is an old language. It has caught more attention in the past five to 10 years, but it's a very old language. It's even older than Java. It appeared in the early 1990s, and it was created by this person, Guido van Rossum. He's an important actor in our ecosystem; he used to be, and I think he still is, the one settling discussions when it comes to defining features of the language. Python is a high-level, interpreted, dynamic language, and this means a ton. If we read this entire sentence: it's an interpreted, high-level, general-purpose programming language. It's object oriented, and it also includes functional features, like functions as first-class objects. And of course it supports imperative programming. It has a wide variety of applications: you can do web development with Python, you can do scripting, it's used a lot for systems work and for configuring machines in general, and of course you can also do data science. It also has a couple of interesting features, like indentation for defining blocks, that make it a very good language to get started with programming. So if Python is your first language, you should be comfortable with it. It's a very good choice. For me, it wasn't my first language, and I wish it had been. But I have taught people programming with Python as their first language, and seriously, it's always been very good for them, because Python doesn't have weird things like you might have in JavaScript or Java. It's a very concise and consistent language, to be honest. So let's get started very quickly.
First of all, you can install Python on your own computer, or you can use Notebooks AI or Google Colab. But if you're installing it on your own computer, you might see that you can install either Python 2 or Python 3, and if you're reading tutorials online, you might see both Python 2 and Python 3. The reality is that Python 2 was deprecated in 2020, so you should not use it anymore. There are still ways to install Python 2, but it was deprecated, so you should stick with Python 3, which is the evolution of the language. It has a ton of fixes over Python 2 in the way things work in the language, things that used to confuse beginners, so that's no longer a problem. Python 3, again, is what you should use. You will find multiple tutorials that are using Python 2; you should try using Python 3 with them. Sometimes the code will break, but the changes to fix it are not very hard. So to get things started here, I will be drawing parallels with regular syntaxes. For example, this is the way you would define a function in JavaScript, and it's also very similar to C or Java-based languages: the function keyword, curly braces, etc. So I will be drawing parallels with these sorts of languages. To get things out of the way, a function in Python is defined in this way. The main characteristic of this language is that the way we define blocks is by using different indentation levels. So this is a valid function in Python: def is the keyword we use, then the name of the function and the parameters it receives. And the way to define the body of the function is by just indenting everything one level to the right. Usually, this is just four spaces. Another example is an if/else statement: if this thing happens, do that; else, do something else, right? This is JavaScript. In Python, again, it's defined by indentation.
If this thing happens, we indent one level to the right and do this; else, do something else. If there were another if statement here, say, if the language ends with, I don't know, "3", then do something else, print "Py3" for example. So we're indenting everything to the right every time we start a new block. A block finishes simply when you go back one level again, like this print in the first block, right? That's the way it works: by indenting our blocks. This is very good, first because we don't have debates about where we should place the curly braces, and also because it makes the code a lot more readable; indentation is obligatory for the code to even work. So that's just how it works. We make comments in Python just by using the number sign (#). And the way to define variables is just by specifying the name. In Python you don't need to declare variables; you declare and define everything in just one pass, you define a variable as you go. Python is dynamically typed, but it's also strongly typed, and this might cause confusion. Basically, you can assign any value you want to a variable, and you will see that collections are heterogeneous in terms of types. It is a very dynamic language. Talking about types, I'm going to show you the most important types that we have in Python. We have numbers, of course. For integers, we don't have as many variants as you might find in other languages, with different precisions, etc. There was also the concept of longs, but that changed after Python 2. In Python 3 we just use integers; that's the way we work. It's a type that's smart enough to save storage when needed, so that's good.
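To make the indentation rules and the variable behavior described above concrete, here is a small sketch. The function name, its parameter, and the sample values are hypothetical, chosen only to mirror the "language ends with 3" example; they are not taken from the notebook.

```python
def describe_language(language):
    # The function body is everything indented one level (four spaces).
    if language.startswith("Py"):
        # A nested block is indented one more level.
        if language.endswith("3"):
            return "Py3"
        # Back one level: we are in the outer "if" block again.
        return "some other Python"
    else:
        return "not Python"


# Variables are created by assignment; no separate declaration is needed.
value = 3
kind_before = type(value).__name__   # "int"

# The same name can later be bound to a value of a different type.
value = "three"
kind_after = type(value).__name__    # "str"

# Python 3 integers grow as needed; there is no separate long type.
big = 2 ** 100
```

Calling `describe_language("Python3")` follows both nested blocks, while `describe_language("Java")` goes straight to the `else` branch.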
We also have floats, which is the regular float type for floating-point arithmetic that you find in other languages. And of course it suffers from the strange behavior of floating-point arithmetic, like in this case. You can prevent that by using the decimal module, which, as you can see, doesn't suffer from this issue. So for numbers we have integers, floats, and also decimals. Strings are of type str, and they are defined literally, as in the string you can see right here; you can just type the string as it is. There was a difference in Python 2 between Unicode and strings. In Python 3 that has all been fixed, so with Python 3 it's all Unicode. There is still a conceptual difference between the text, which is the sequence of Unicode code points that make up the string, and the underlying encoding that turns it into binary. So in Python 3 we still have a way to differentiate between a binary string and a text-based string. You shouldn't worry about it for now; I just want you to know that if you're reading an older Python tutorial, for example, you might find a difference between Unicode strings and regular strings, which is no longer something we should be worrying about. If you have a string that is too long and spans multiple lines, you can always write it using three quotes, which can be double quotes or single quotes. So creating multi-line strings is extremely simple. Booleans: there is a bool type, and the two Boolean objects are unique, right? They are kind of singletons: the True and False objects. They are of type bool. There is also the concept of null in Python, which is None. We don't have null, we have None, but it serves pretty much the same purpose. In Python, everything is an object.
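The types mentioned above can be illustrated with a short sketch. The specific values (0.1 + 0.2, the sample text) are illustrative assumptions rather than the cells shown in the notebook.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so the sum is not exactly 0.3.
float_sum = 0.1 + 0.2

# Decimal, built from strings, keeps exact decimal arithmetic.
decimal_sum = Decimal("0.1") + Decimal("0.2")

# In Python 3, str is Unicode text; encoding it produces bytes.
text = "café"
raw = text.encode("utf-8")

# Triple quotes allow a string to span several lines.
multi = """first line
second line"""

# True and False are the only two bool objects; None plays the role of null.
flag = True
nothing = None
bool_type_name = type(flag).__name__
none_type_name = type(nothing).__name__
```

Note that `Decimal(0.1)` built from a float would carry over the float's imprecision, which is why the sketch passes strings.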
So even these strange objects, like None, have an associated class, if you want; everything in Python is an object. That goes for all the types you have seen. For example, we have this string, which is an age stored as a string. Its type is str. The int, str, float, and bool types, which are what type returns, also work as functions. So in order to cast a string into an integer, you do it using the int function, which is the same thing that you get with this, for example; this is the same as this, as you can see. Now, functions. Again, def is the keyword we use; we don't use function, we use def, and you can use "define" as a mnemonic. Then comes the name of the function; parameters are optional; and finally we have the return keyword. You should usually include a return; 99% of the time the function should return something, because that's going to be the result assigned once we invoke the function. This is pretty regular. If your function doesn't return anything explicitly, meaning you haven't written a return statement anywhere in your function, the function will still return something. The fact that you haven't included a return statement explicitly doesn't mean that the function is not returning anything; implicitly it is returning something, and it's returning None, right? So by default, if you don't include a return, Python will do this for you. Just so you know, a function always returns something. Specifying parameters and passing arguments is pretty standard. Python has some advanced features with parameters, like variable-length arguments, where we can pass as many arguments as we want to make it very dynamic, keyword arguments, named arguments, etc. As for the arithmetic operators, you know them already: division, modulus, and in this case we're doing a power operation. All this is pretty standard.
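As a rough sketch of the casting and function behavior described above: the variable and function names here are hypothetical, and the value "28" is an assumed example, not necessarily what the notebook uses.

```python
# Casting: the type names double as conversion functions.
age_as_string = "28"
age = int(age_as_string)


def add(x, y):
    # An explicit return hands the result back to the caller.
    return x + y


def greet(name):
    # No return statement here, so the call evaluates to None.
    print("Hello", name)


def total(*numbers, **options):
    # *numbers collects any positional arguments into a tuple,
    # **options collects any keyword arguments into a dict.
    result = sum(numbers)
    if options.get("double", False):
        result *= 2
    return result


greeting_result = greet("Ada")

# Arithmetic operators: true division, floor division, modulus, power.
quotient = 7 / 2
floor_quotient = 7 // 2
remainder = 7 % 2
power = 2 ** 4
```

`total(1, 2, 3, double=True)` shows both features at once: three positional values and one keyword option.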
And the same thing happens with all our Boolean operators: greater than, greater than or equal to, etc. There is type checking. This is where we see the strongly typed feature: even though Python is dynamically typed, the types are enforced. In this case, you cannot compare a 2 with this string; it doesn't make any sense, and Python is going to complain about it. So this is an example of an error in Python: the TypeError exception was raised. It's the same thing with Booleans and the not, and, and or operators. As we saw before, control flow is defined by the indentation, so every new block is defined with an indentation level. Python includes if, else, and also elif, which is very convenient. And this is an example: if this happens, elif, elif, etc. Python does not have a switch statement, for example. Loops: how are you going to loop through something in Python? Loops and lists, or collections in general, are very interconnected, because in reality, when you're looping in Python, you're not doing a regular counting loop. In Java you're going to have something like for (int i = 0; i < 10; i++). It's been decades since I've coded in Java, so I don't remember exactly, but there you go. We don't have this in Python. We have a way to mimic it, but in Python we always iterate over a collection. So what we're going to do is create a range of elements and iterate over it. The way it works is very close to what in other languages is a for-each. In this case, we have all these elements and we're going to do for name in names, and that's it. At any moment, name is going to be associated with an element in the list. While loops are part of the language, but they are usually discouraged in favor of for loops. If something can be coded with a for loop, it should be coded with a for loop and not a while loop.
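The following sketch puts the type-checking and control-flow points above into code. All names and sample values are hypothetical illustrations.

```python
def compare(number, other):
    # Comparing an int with a str raises TypeError in Python 3.
    try:
        return number > other
    except TypeError as error:
        return f"TypeError: {error}"


def size_label(n):
    # if / elif / else stands in for a switch statement.
    if n == 0:
        return "empty"
    elif n < 10:
        return "small"
    else:
        return "large"


# A for loop always iterates over a collection, like a for-each.
names = ["Ana", "Luis", "Marta"]
lengths = []
for name in names:
    lengths.append(len(name))

# A while loop needs its condition updated by hand,
# otherwise it would never stop.
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1
```

The `while` version works, but the `n -= 1` line is exactly the kind of bookkeeping that a `for` loop makes unnecessary.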
Because, as you might know already, a while loop might result in an infinite loop if you're not checking the conditions correctly. Now, the collections we have in Python, the fundamental ones, the primitive ones, the most important ones. First is the list. In Python we make heavy use of lists. It's just a heterogeneous data structure, so you can put anything in it. Actually, all these collections are heterogeneous; you can mix values as you want. In this case, we have the elements that we have added: one string, one integer, another string, and one Boolean. And let me say something here. Even though Python supports mixed types in collections, it doesn't mean that you should do it. To be honest, you should usually avoid mixing types in collections, because that means we don't know what we're putting in them, right? We should be consistent. So it's possible, but revisit your code if you have too many different types in a collection. Checking the length is done with the len function. Accessing elements is zero-indexed, and we use square brackets. So in this case: give me the first element, give me the second element. We can also index starting from behind, from the end, with minus one, minus two, minus three; minus one and minus two, again, give you different elements. You can check the operations associated with all these collections. Very quickly, for a list, l.append is going to append a new element, so the list now has that element at the end. And we can check if an element is part of the list: in this case it's True, in this case it's False. Tuples are similar to lists; they are also sequences, but the main difference is that they are immutable. There is no way to add new elements to a tuple, or remove elements from a tuple, once it has been created. So in this case, we have created a list with three elements.
A tuple, sorry, with three elements. We can access it, and we can check if something is in it, in the same way that we did with a list. But again, with a tuple you cannot modify it; a tuple never changes, and you can't add elements to it. Another important data structure is the dictionary. In Python, a dictionary is a key-value mapping. It's similar to an object in JavaScript or a hash table in Java; it's a key-value mapping type. In this case, we are going to associate values with names. The way I like to explain it is by first creating a tuple or a list. So let's say we're going to create a list out of all these elements. Give me one second; we're going to create a list. There we go, we copy these elements and associate them to our list. There you go. So this is a list, and we could very well store the information about our customers in a list, right? That works; I mean, I can get it done. The problem is that whenever I need to access information in this list, say I want the email for this customer, I have to remember the position where the email is located, so in this case it's going to be position number one. If this information grows, and instead of having four pieces of information for our user we have 100, then it's going to be very hard to access those individual values. That's why we create dictionaries. Dictionaries are collections of values. The important part is on the right; the important part is the value. But instead of being indexed just by position, we give the values arbitrary names, very explicit names: this is the name, this is the email, this is the age, and this is whether they are subscribed or not.
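The list, tuple, and list-versus-dictionary comparison discussed above can be sketched like this. The customer data and variable names are invented for illustration and are not the notebook's values.

```python
# A list is ordered, mutable, and can hold mixed types.
items = ["Hello", 3, "World", True]
count = len(items)
first = items[0]      # zero-indexed access
last = items[-1]      # negative indexes count from the end
items.append("new")   # the new element goes at the end
has_new = "new" in items

# A tuple supports the same reads but cannot be modified.
point = ("x", 3, True)
tuple_is_immutable = False
try:
    point[0] = "y"
except TypeError:
    tuple_is_immutable = True

# Storing a customer in a list forces you to remember positions.
user_as_list = ["Jane Doe", "jane@example.com", 30, True]
email_by_position = user_as_list[1]

# A dictionary gives each value an explicit name instead.
user = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "age": 30,
    "subscribed": True,
}
email_by_name = user["email"]
```

With four fields the positional version is manageable; with a hundred, `user["email"]` stays readable while `user_as_list[1]` does not.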
So once we create this dictionary, we can access those values by name: give me the email of this user, or is the age present in the user dictionary, or is the last name present in the user dictionary. So again, it's a way to store information while associating labels with it, in order to make things simpler for us later. Let me delete this, and I'll move on to sets. Sets are a very common data structure when you're learning about collections and data structures in general, but they're not so common in many languages; I mean, they're not very popular. In Python we use them often, because they have a very interesting feature. First of all, and this is something that I forgot to tell you about dictionaries, both sets and dictionaries are what we call unordered data structures: you never know the order of the elements. In recent versions of Python, there have been changes which make dictionaries ordered. But for now, I'm going to say you shouldn't rely on it; you should think of your dictionaries as completely unordered data structures. The same thing goes for sets. A set is a bag that contains elements, you know, a big bag; you keep throwing elements inside the set, and there is no order in it. What can you do with it? You're going to add elements to the set, for example, or you're going to remove elements from the set. And there is one important thing that makes the set so useful, and it's the membership operation. I'm going to write it down here, "membership operation", there you go, so you can access these notebooks later. With the membership operation, the process of checking if something, say 9, is in s, is extremely fast; it's what we call O(1).
And this is because, as you might have seen here, when I created this set, I included a couple of repeated elements: 3, 3, 3, right, 1, 1, 1, 7, 9. The resulting set doesn't have those repeated elements. These are two features of the set. The set will only contain unique values. And the way it's implemented behind the scenes means that checking for these unique values with the membership operation is extremely simple, or sorry, extremely performant. It's very fast, unlike, for example, a list. So keep it in mind: sets are very, very useful when you're checking for membership. So again, as I told you before, we're going to iterate over collections with the for loop. In this case, if we have a list, it's going to be for element in list. There you go. If you have a dictionary, in this case user, the default iteration is by key: we're going to get name, email, age, subscribed, and we have to extract the value out of the dictionary ourselves. We could also do for value in user.values(). There you go. Or you can iterate over both key and value with items(): key and value. There you go. So iteration in Python is very readable, to put it that way. And again, remember, we're always using the for loop, which assumes that you're iterating over a collection. We don't have the for (i = 0; i < 10; i++); we don't have that in Python, right? We can simulate it with for i in range(5), for example, and print. We can simulate it with the range function, which generates pretty much those elements. Something that you might have heard about Python is that it has a huge library of built-in modules that you can just import and they're just going to work. There are so many things already coded in Python that it makes it very simple for you to create something on top.
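Here is a short sketch of the set behavior and the iteration patterns described above. The set contents follow the repeated values mentioned in the talk as best they can be read from the transcript; the user dictionary is a hypothetical example.

```python
# Repeated values collapse: a set only keeps unique elements.
numbers = {3, 3, 3, 1, 1, 1, 7, 9}
unique_count = len(numbers)

numbers.add(10)      # add an element
numbers.remove(7)    # remove an element
has_nine = 9 in numbers   # membership test, fast on sets

# Hypothetical user dictionary for the iteration examples.
user = {"name": "Jane Doe", "email": "jane@example.com", "age": 30}

# Default iteration over a dict yields its keys.
keys = []
for key in user:
    keys.append(key)

# .values() yields the values only.
values = []
for value in user.values():
    values.append(value)

# .items() yields (key, value) pairs.
pairs = []
for key, value in user.items():
    pairs.append(f"{key}={value}")

# range() stands in for the C-style counting loop.
indices = []
for i in range(5):
    indices.append(i)
```

`range(5)` produces 0 through 4, so the loop body runs five times, just like `for (i = 0; i < 5; i++)` would.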
Do you want a library for, I don't know, security, cryptography, math, numeric processing (NumPy, right?), machine learning, web development? For creating games there is Pygame. Do you want to create a graphical user interface? Whatever you want to do, there is usually a library that has already been coded and will make your job easier. On top of that, there is the built-in standard library, which is already included with Python. It's not third party; it's created by the Python core team. It's a huge library, with so many modules. And the way it works is by importing the module, so this is the way we work with packages and modules. There are differences between modules, packages, and third-party libraries, and that is a little bit more advanced. But again, this gives us a random number generator; it's already built in, and you can check the docs right here. Exceptions are raised whenever you do something that doesn't work. In this case, we say if the age is greater than 21, but age is a string, not an integer, so this is going to fail. We can catch exceptions when they happen, and that's done with a try/except block. In that case, if anything here fails, this block is going to kick in, and you can catch the exception without the program failing. You can also be more explicit about the error you expect. So again, this is just an introduction. It might be useful if you're coming from another language, especially to keep this notebook as a reference. We're going to be using Python a lot, of course, and it's a great language if you want to do scripting, web development, and of course data processing, data analysis, visualizations, machine learning; Python is just great. So I hope this tiny review lesson helps you port your knowledge from other languages into Python. And that's it.
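A minimal sketch of the last two ideas, importing a standard-library module and catching an exception, might look like this. The dice roll and the variable names are illustrative assumptions; only the age-versus-21 comparison comes from the talk.

```python
import random  # part of the standard library, no installation needed

# Pick a random integer between 1 and 6, both ends included.
roll = random.randint(1, 6)

# age is a string here, so comparing it with an int fails.
age = "30"
is_adult = None
message = ""
try:
    is_adult = age > 21
except TypeError as error:
    # Being explicit about the expected error type avoids
    # hiding unrelated problems.
    message = f"Could not compare: {error}"
```

Catching `TypeError` specifically, rather than using a bare `except`, is the "more explicit" form: any other kind of error would still surface.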