Transcript for:
Introduction to Pandas

And now it's finally time to talk about pandas. Pandas is the most important library that we use for data analysis on a day-to-day basis with Python. It's a library that will help you through the entire process of your data analysis project. You're going to start by getting the data: step one is getting the data from multiple sources, like databases, Excel files, CSV files, etc. That's all going to get into pandas. Then you're going to be processing the data, right? So you're going to be combining, merging, and doing different types of analysis. You're going to be visualizing the data, so if you need a bar chart, you're going to be building it with pandas. You're going to be creating reports, you're also going to be doing simple statistical analysis, and you're going to be doing machine learning too, with the help of other libraries, but everything starts from the platform that the pandas library provides. It's, again, one of the most important libraries in the data analysis and data science ecosystem with Python. Pandas has recently released version 1.0, so we are talking about a very mature library. It's been around for a long time now. And again, it's the primary library that we use in Python for data analysis and data science.
So I'm going to do a quick introduction to the data structures that pandas has, and we're going to understand how they work. We're going to start building the foundations: I need you to be very familiar with the way the data structures from pandas work. And then we're going to move into other things like writing files, grouping data, etc. So to get things started, we're going to talk about the first data structure that pandas has, which is the Series. In reality, pandas has two main data structures that it uses all the time: the Series and the DataFrame. The DataFrame is the one you will probably be more familiar with; it looks just like an Excel table. But we're going to start first with the Series.
Okay, so just stay with me here, we're going to talk about the Series for a second. In this case, we have imported pandas, and we have also imported NumPy. As you might imagine, and as I told you before in the NumPy part of the story, NumPy is fundamental for data analysis, because every other library, pandas, matplotlib, they all sit on top of NumPy. And you can see it right here: we're going to be using some features from NumPy within this lesson too.
So this is a series in pandas, what you see right here. The concept of a series is this ordered sequence of elements, all indexed by a given index, of course. And you might think that this looks a lot like a Python list, right? In this case, we're storing the population of countries, in millions of inhabitants. The variable is g7_pop because we're getting the population of the Group of Seven; you can consult the Wikipedia page. But basically, we are storing populations here in this series. And again, it looks a lot like a list, but we're going to find a ton of differences. The first one is that the series has an associated data type. This is something we saw in NumPy, where an array couldn't hold different types of objects; we only had one type of object. In this case, it's float64, so all the numbers of the series will be of type float64. The underlying data structure that pandas is using to store these objects is a NumPy array. A second difference we see very quickly is that a series can have a name. So now when we display the series, we see that it has a name. Right now it might not make a ton of sense, but once this series is part of a data frame in the form of a column, the name is going to make a lot more sense.
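What is on screen at this point can be sketched roughly as follows. This is a minimal sketch, not the exact notebook from the lesson: the variable name g7_pop is inferred from the narration, the population figures are illustrative placeholders, and the series name is a hypothetical label.

```python
import numpy as np
import pandas as pd

# Illustrative population figures in millions of inhabitants.
# These are placeholder values, not the data shown in the lesson.
g7_pop = pd.Series([35.5, 64.0, 80.9, 60.7, 127.1, 64.5, 318.5])

# Unlike a plain list, a Series has one data type for all of its elements.
print(g7_pop.dtype)  # float64

# A Series can also carry a name (hypothetical label here), which becomes
# the column name once the Series is part of a DataFrame.
g7_pop.name = "G7 population in millions"
print(g7_pop)
```

Displaying the series after setting the name shows the name alongside the dtype at the bottom of the output.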
So moving forward, again, we saw that a series has a type, and again, this is because the data is backed by a NumPy array that you can always consult: you can check the values attribute of a series, and you're going to get the array that is backing that pandas series, so you can see that it's a NumPy array. Once you have this series, and we were just consulting g7_pop here, you can select elements as you could in a regular list. So for example: give me the first element, give me the second element, the last element, et cetera. And that's because a series inherently has an index, similar to a list. When you create a list in Python, say l = ['a', 'b', 'c'] (there we go, there was something wrong here, a missing quote), we don't see it in the list, but the idea is that there is an index here: this is zero, this is one, and this is two. In the pandas series, this is a lot more explicit: each element has an associated index shown with it.
And you might think that it's pretty much the same thing. Both the list and the series are sequences, ordered sequences of elements. But we're going to see that there is a fundamental difference, and it's that we can arbitrarily change the index of a series. When we created it, we didn't assign any index, so by default it was a range index from zero up to n minus one, where n is the number of elements. But you can arbitrarily say what the index of your series is. And in this case, the series now has the indices that we're setting right here. Why is this important? Because now we're going to be referring to these values not by a sequential position, but by a name, by a label, by the index, which has a meaningful name for us as humans. Okay.
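The steps described here can be sketched like this. It is a minimal sketch under stated assumptions: the variable name g7_pop is inferred from the narration, the figures are illustrative placeholders, and the country labels follow the usual Group of Seven membership rather than the exact cell shown in the lesson.

```python
import numpy as np
import pandas as pd

# Illustrative placeholder figures in millions, not the lesson's data.
g7_pop = pd.Series([35.5, 64.0, 80.9, 60.7, 127.1, 64.5, 318.5])

# The data is backed by a NumPy array, available through .values.
backing = g7_pop.values
print(type(backing))

# With no index given, pandas assigns a RangeIndex from 0 to n - 1.
default_index = g7_pop.index
print(default_index)

# With that default index, label 0 is also the first position.
first = g7_pop[0]
# .iloc selects strictly by position, so -1 is the last element.
last = g7_pop.iloc[-1]

# The index can be replaced with labels that mean something to a human.
g7_pop.index = [
    "Canada", "France", "Germany", "Italy",
    "Japan", "United Kingdom", "United States",
]
print(g7_pop["Canada"])
```

After the index is reassigned, values are looked up by label, and .iloc remains available for purely positional access.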
So now this thing looks a little bit more like a dictionary, we could say, than a list. We started out thinking that a series was similar to a list, but now we can think that a series is more similar to a dictionary. But wait, don't get me wrong here. The series has a fundamental trait, and it's that it's still ordered, something that didn't happen with dictionaries. Dictionaries in Python are traditionally not ordered; actually, since Python 3.7 they do preserve insertion order, but we shouldn't be thinking of them as ordered data structures. In this case, a series is ordered. So it has both advantages: it's ordered, so Canada is always before France, because that's how we decided to create it, but it also has names or labels or keys associated with the values, like a dictionary. So this is creating the series from scratch, right? With all these methods, you can see you can create a series by passing the index. It doesn't have to be a two-step process where you first create the series and then the index; in this case, you can do everything at once. And the indexing is now going to be done by those indices, right?
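Creating the series and its index in one step might look like the sketch below. The variable names, the series name, and the population figures are assumptions for illustration; only the idea of passing the index at construction time, and Canada coming before France, come from the narration. The dictionary-based constructor is shown as one further way to build a series, not necessarily the one used in the lesson.

```python
import pandas as pd

# One-step creation: values, index, and name all passed to the constructor.
# The figures are illustrative placeholders in millions of inhabitants.
g7_pop = pd.Series(
    [35.5, 64.0, 80.9, 60.7, 127.1, 64.5, 318.5],
    index=[
        "Canada", "France", "Germany", "Italy",
        "Japan", "United Kingdom", "United States",
    ],
    name="G7 population in millions",
)

# Dictionary-like access by label.
canada = g7_pop["Canada"]

# List-like access by position: the order given at creation is preserved.
first_label = g7_pop.index[0]
first_value = g7_pop.iloc[0]

# A Series can also be built from a dict; the keys become the index.
from_dict = pd.Series({"Canada": 35.5, "France": 64.0})

print(canada, first_label, first_value)
print(from_dict)
```

The same element is reachable both ways, which is the point of the comparison: label lookup like a dictionary, stable ordering like a list.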