PySpark is an interface for Apache Spark in Python, and it is often used for large scale data processing and machine learning. Krish Naik teaches this course. So we are going to start an Apache Spark series, and specifically we will be focusing on how we can use Spark with Python. We are going to discuss the library called PySpark, and we will try to understand why Spark is actually required. We will also cover MLlib, Spark MLlib, which shows how you can apply machine learning in Apache Spark itself with the help of this Spark API for Python, the PySpark library. Apart from that, once we understand the basics of the PySpark library, we'll also see how we can pre-process our data sets, how we can use PySpark DataFrames, and how we can use PySpark on cloud platforms like Databricks and Amazon AWS. So all those kinds of clouds we'll try to cover.
And remember, Apache Spark is quite handy. Let me just give you some of the reasons why Apache Spark is so good. Understand, suppose you have a huge amount of data; suppose I have 64 GB or 128 GB of data. We may have standalone systems with 32 GB of RAM, probably 64 GB of RAM; the workstation that I'm working on right now has 64 GB of RAM, so at most it can directly load a data set of around 32 GB or 48 GB. But what if we have a data set far bigger than that? That is the time, guys, when we don't just depend on a local system; we will try to pre-process that particular data, or perform any kind of operation, on distributed systems. A distributed system basically means that there will be multiple systems where we can actually run this pipeline of jobs, or processes, or do whatever activities we really want. And Apache Spark will definitely help us do that. It has been pretty amazing, and yes, people wanted these kinds of videos a lot.
So how we are going to go through this specific playlist is that we'll first of all start with the installation. We will be using PySpark, because when you're working with Python, the Spark API we use is the PySpark library. And yes, we can also use Spark with other programming languages like Java, Scala and R. We'll try to understand things from the basics: how do we read a data set, how do we connect to a data source, how do we play with the DataFrames? In Apache Spark, that is, in PySpark, they also provide data structures like DataFrames, which are pretty similar to the pandas DataFrame, but different kinds of operations are supported there, which we'll be looking at one by one as we go ahead. Then we will get into MLlib, the Apache Spark machine learning library, which is called Spark MLlib. It will help us to perform machine learning tasks, where we will be able to do regression, classification and clustering. And finally, we'll see how we can do the same operations in the cloud, where I'll show you some examples where we have a huge data set and we do the operations on a cluster of systems, that is, on a distributed system, and we'll see how we can use Spark there. So all those things will get covered.
Now, some of the advantages of Apache Spark, and why it is so famous: it runs workloads up to 100 times faster. If you know about big data, guys, when we talk about big data we're basically talking about huge data sets, and if you have heard of the terminology called MapReduce, Apache Spark is much faster, around 100 times faster than MapReduce as well. Another advantage is ease of use: you can write applications quickly in Java, Scala, Python or R. As I said, we will be focusing on Python, where we will be using the library called PySpark. Then, you can also combine SQL, streaming and complex analytics. When I talk about complex analytics, I'm basically talking about MLlib, the machine learning libraries, which work very well with Apache Spark. And Apache Spark can run on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, on different types of clouds like AWS and Databricks. And it actually runs in cluster mode; cluster mode basically means in a distributed mode. So these are some of the examples.
Now, with respect to which version of PySpark we will be using: we'll be working with PySpark 3.1.1. If you just go and search for it, you can see SQL and DataFrames, Spark Streaming, and MLlib, which is the machine learning library. And if I go and see the overview, you can see that Apache Spark is a fast and general purpose cluster computing system; it provides high level APIs in Scala, Java and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. In short, it is basically meant to work with huge amounts of data, and that is pretty handy. Now, if I search for Spark in Python, a page opens up that discusses how to install it. In this video we'll install the PySpark library; if you really want to use the Spark functionality with Python, you basically use this specific library. So let's proceed and quickly see how we can install it and check out what things we can actually do. So let's begin.
Please make sure that you create a new environment when you're working with PySpark. I have created a new environment called myenv here. First of all, I will install the PySpark library, so I'll just write pip install pyspark. In this video we'll focus on the installation, on reading some data sets, and on seeing what things we can actually do. After doing this, you can see that PySpark has been installed. In order to check whether the installation is fine, I'll just write import pyspark. This looks perfectly fine, it is working; we are able to see that PySpark is installed properly.
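Just for reference, the setup from this part looks roughly like this (a minimal sketch; the environment name is just an example):

    # In a fresh environment (conda/venv), install PySpark:
    #   pip install pyspark

    # Then verify the installation inside Python / Jupyter
    import pyspark
    print(pyspark.__version__)   # this series uses 3.1.1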
You may face some kind of problems with PySpark, and that is the reason I'm telling you to create a new environment. If you're facing some issue, just let me know what error you're getting, probably by writing it in the comment section. Now, let's do one thing: I'll just open an Excel sheet and try to create a small data set. I'll say Name and Age, and suppose the name I'm going to write is Krish with age 31, then Sudhanshu with 30, and probably one more name like Sunny with 29. With these three records we'll just try to see how we can read this specific file. I'm going to save it in the same location where my Jupyter notebook is; I created a folder, guys, but you can save it in any location where your notebook file is opened, so that part is not necessary, and I'm just making sure that you don't see any of my other files. I'm saving it as test1.csv, and let's keep that particular file saved. Now, if I want to read it with pandas, we write pd.read_csv, and I basically use this particular data set called test1.csv. So when I execute this, you will be able to see that information.
Now, when I really want to work with PySpark, always remember that first of all we need to start a Spark session. In order to start a Spark session, let me create some more cells; just follow these particular steps with respect to creating a Spark session. I'll write from pyspark.sql import SparkSession, and I'll execute this; you can see that it is executing fine. Then I'll create a variable called spark, and I'll use SparkSession.builder, then appName, and here I'll just give my session name, something like Practise since I'm practising these things, and then getOrCreate. When I execute this, you'll be able to see a Spark session get created. If you're executing it for the first time it will probably take some amount of time; otherwise, if you have executed it multiple times, it will work quickly. Here you can see that when you're executing locally there will always be only one cluster, but when you are actually working in the cloud you can create multiple clusters and instances. The Spark version that we will be using is v3.1.1. You can also see the master here; when you are working with multiple instances you will be seeing the master and cluster one, cluster two, all that kind of information. Okay, so this is with respect to the Spark session.
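For reference, the session creation from this step looks roughly like this (a sketch; 'Practise' is just the app name used here):

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session; the app name is arbitrary
    spark = SparkSession.builder.appName('Practise').getOrCreate()

    # Displaying the session shows the version (3.1.1 here) and the master,
    # which is a single local one when you are not on a cluster
    spark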
Now let me create df_pyspark, where I will try to read a data set with Spark. In order to read a data set, I can write spark.read. and there are a lot of options, like csv, format, jdbc, parquet, schema, table, text. Here we are going to take csv, and I'm just going to write the file name. If I try to execute it, I'm getting an error saying that this particular file does not exist. Let me see, I think this file is present... ah, here I can see test1.csv; sorry, I did not write the .csv extension. With test1.csv this has now worked. Now if I go and see df_pyspark, it is showing two string columns, _c0 and _c1. I've created this particular CSV file with Name and Age, but it is just taking them as default columns, so it is saying _c0 and _c1. If you really want to see your entire data set, you can write df_pyspark.show(), and here you will be able to see Name and Age in the data; but I really want Name and Age to be my column names, whereas when I'm directly reading the CSV file we are getting _c0 and _c1. In order to solve this, there is a different technique: spark.read.option. There is something called option, and inside this option you basically give a key and a value. So you can just write header, comma, true, and whatever values are in the first row will be considered as your header. Then if I chain .csv with test1.csv, I'm just going to read this test1.csv data set. Once I execute this, you will be able to see that I'm now getting Name as string and Age as string. But let's see our complete data set: if I execute show now, I'll be able to see the entire data set with the proper columns.
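Put together, the reading step from this first session looks roughly like this (a minimal sketch using the test1.csv file created above):

    # Read the CSV, treating the first row of the file as the header
    df_pyspark = spark.read.option('header', 'true').csv('test1.csv')

    df_pyspark.show()   # without the header option the columns show up as _c0, _c1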
Okay, so let me just quickly save this into my df_pyspark variable. Now let's go and see type(df_pyspark). When I execute this, you will see that when I was reading the file with pandas, the type was pandas.core.frame.DataFrame, but when you are reading this particular data set with Spark, it is of type pyspark.sql.dataframe.DataFrame. So that one is a pandas DataFrame and this one is a pyspark.sql DataFrame. Most of the APIs are almost the same and the functionalities are similar, and a lot of things we will be learning as we go ahead. I don't know whether head will work... let's see; yes, head is also working. So if I use .head(), you will be able to see the row information shown over here. Now, if I want to see more information regarding my columns, I can use something called printSchema. This printSchema is just like df.info in pandas; it tells you about your columns, like Name is string and Age is string. So these are some basic operations that you have done right after installation. Again, the main thing I'm focusing on is: just try to install PySpark and keep it ready, because in my next session I will show you how we can change the data types, how we can work with DataFrames, how we can do data pre-processing, how we can handle null and missing values, how we can delete columns, how to drop columns, and various other things; all of that we'll be discussing there.
So guys, we will be continuing the PySpark series, and in this tutorial we are actually going to see what PySpark DataFrames are. We'll read a data set, check the data types of the columns (we have basically already seen the PySpark printSchema), then we'll see how we can select columns and do indexing, we'll see the describe functionality that we have, similar to pandas, and then we'll see how we can add new columns and drop columns. Now, this is just part one, so let me write it down as part one, because after this there will also be one more part. Why is this video important? Because in PySpark, if you're planning to apply MLlib, the machine learning libraries, you really need to do data pre-processing first. Probably in part two we'll see how to handle missing values, how to filter rows, how we can put a filter condition, and so on. So let's proceed. Before going ahead, we have a data set called test1. I have taken three columns, Name, Age and Experience, and then I have records like Krish, 31, 10, and similarly Sudhanshu and Sunny. So this is a small data set which I have saved in the same location. Now, as usual, the first step with respect to PySpark is to build the Spark session. I will write the code line by line, so please make sure that you follow along with me; it will definitely be helpful. I'm going to write from pyspark.sql import SparkSession, and then I'll create a variable for my session: spark = SparkSession.builder.appName(), and here I'm just going to give my app name; I can call it 'Dataframe' since we are practising DataFrames, followed by the getOrCreate function. This is how you start a session. Once again, if you are executing it for the first time it will take some time; otherwise it is good to go. So here is my entire Spark session; it is running in memory, you can see the version it is running, and obviously when you're running locally you basically have one master node. And the app name is Dataframe. So to begin with, we will try to read the data set again.
Now, reading the data set: I have already shown you multiple ways. One is read with an option, and since this is a CSV file we'll first see how to read it that way, and then I'll show you another way of reading it. I'll write spark.read.option, and in this option we basically give a key and a value, so I'll set header equal to true so that it considers my first row as the header, and then .csv, and inside csv I'll give my data set name, which is test1.csv. When I execute this you'll be able to see that it is a DataFrame, and it has features like Name, Age and Experience. If I want to see the complete data set I'll just write .show(), and here is my entire data set, very clearly visible. Let me just save this in a variable called df_pyspark. Now, the first thing: how do we check the schema? The schema basically means the data types, like how we write df.info in pandas; similarly we can write it here. So I have written df_pyspark.print... I think printSchema should work... and I get an error, 'NoneType' object has no attribute 'printSchema'. Sorry, that is because I had written .show() and saved that result into the variable, so let me remove the .show(), execute it once again, and now if I write printSchema, you will be able to see Name, Age and Experience. But by default it is taking everything as a string, even though in my Excel sheet the values should be a string, an integer and an integer. Why is it taking strings? The reason is that, by default, unless we give one more option to the CSV reader, an option called inferSchema, and make it true, it will consider all the features as string values. So I'll execute this with inferSchema, and now if I go and see df_pyspark.printSchema(), you will be able to see that I'm getting Name as string, Age as integer and Experience as integer, and nullable = true, which basically means the column can have null values. So this is one way of reading it. One more way, which is pretty simple, is that I can include both header and inferSchema in one call itself. So I'll write df_pyspark = spark.read.csv, and inside csv I will first provide my test1.csv file, and then I will write header=True, and I can write inferSchema=True. When I write it like this and then write df_pyspark.show(), you will be able to see my entire data set. And if I go and execute printSchema again, it will give me the same thing as before: Name is string, Age is integer, Experience is integer. Perfect.
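The two reading styles from this part look roughly like this (a sketch; header and inferSchema are the standard CSV reader options):

    # Style 1: set the options one by one
    df_pyspark = spark.read.option('header', 'true') \
                           .option('inferSchema', 'true') \
                           .csv('test1.csv')

    # Style 2: pass both options to csv() directly
    df_pyspark = spark.read.csv('test1.csv', header=True, inferSchema=True)

    df_pyspark.printSchema()   # Age and Experience now come out as integers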
So, what have we done? If I go and check the type of this, it is basically a DataFrame; pandas also has a DataFrame. So if somebody asks you in an interview what a DataFrame is, you can basically say that a DataFrame is a data structure, because inside it you can perform various kinds of operations; this is also one kind of data structure. So far we have had an introduction to the DataFrame, we have read the data set, and we have checked the data types of the columns; in order to check the data types of the columns we just wrote printSchema. Now, one more thing after this: let's see selecting columns and indexing. First of all, let's understand which columns are present and how you can get all the column names. In order to get the column names, you can just write .columns, and when you execute it you will get the column names like Name, Age, Experience; this is perfectly fine. Now, suppose I want to pick up some head elements; I can do that too, because pandas also has head. Suppose I want to get the top three records; I will get them in a list format, whereas in pandas we usually get them in a DataFrame format. So here you will see the combinations of Name, Age and Experience: this is my first row, this is my second row, and this is my third row.
Now, coming to the next thing we are going to discuss: how do I select a column? I probably want to pick up a column and see all its elements, like how we do it in pandas. First of all, df_pyspark.show() lets me see all the columns, but what if I really want to pick up only the Name column? How do I do it? In order to pick up the Name column, there is a very simple function called select, so I'll write df_pyspark.select and give my Name column. Once I execute this, you'll see that the return type is a DataFrame, and Name is a string. Now, if I add .show(), I will be able to see the entire column. And if I remove the .show() and check the type of this, it is a DataFrame, pyspark.sql.dataframe.DataFrame, not a pandas DataFrame. Pretty simple. Now, suppose I want to pick up multiple columns, say Name and Experience; what I'll do is make one change: initially I was providing one column name, and now I will provide another column, Experience. Once I execute this, you can see that I'm getting a DataFrame with two features, Name and Experience, and if I write .show(), you will see that all my elements are present inside this particular DataFrame. Pretty simple, that's how you select multiple columns. And yes, slicing definitely will not work here; I tried it and it was not working. Whenever you have any kind of concern, always try to check the PySpark documentation. So this is one way to select columns and see the rows.
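As a small sketch, the selection patterns mentioned here look roughly like this:

    df_pyspark.columns                                 # ['Name', 'Age', 'Experience']
    df_pyspark.head(3)                                 # first three rows as Row objects

    df_pyspark.select('Name').show()                   # one column
    df_pyspark.select(['Name', 'Experience']).show()   # multiple columns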
Now, there is also another way: if I write df_pyspark['Name'] and execute it, you'll see the column Name, and the return type will be Column, because in pandas we directly pick a column like this. But when we have this kind of Column object, we are just able to understand that this particular feature is a column, nothing more; we won't be able to get the data, there is no show function on it, and it will just give an error. So usually, whenever we want to pick up any columns and actually see them, we need to select them using this particular select operation, which is a function. Okay, so that is done. Now let's see how we can check the data types. There is a functionality called dtypes, and here you will see Name is string, Age is int and Experience is int. Again, dtypes is pretty familiar because we also use it in pandas; most of the functionalities are similar to pandas, guys. So, what have we done so far? We have covered the PySpark DataFrame, reading the data set, checking the data types of the columns, and selecting columns and indexing.
Next, we can also check out the describe option, similar to pandas. So let's see df_pyspark.describe(); if I execute this, it will give you a DataFrame with a summary column and the related information. Now when I write .show(), you will be able to see all of it, in the form of a DataFrame. You may be wondering why null values are coming for the mean and the standard deviation: understand that it also takes the string column, and for values that have a string data type you obviously don't have a mean or a standard deviation, while min and max are still taken on the string values, so you will see a name like Krish and then Sunny there, and the remaining information is present as usual. So this is basically the same as the describe option that we have seen in pandas. Okay, so the describe option is also done.
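A quick sketch of those two checks:

    df_pyspark.dtypes             # [('Name', 'string'), ('Age', 'int'), ('Experience', 'int')]
    df_pyspark.describe().show()  # count/mean/stddev/min/max; mean and stddev are null for strings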
So let's go and see adding columns and dropping columns. Adding and dropping columns is very simple, guys. I'm just going to write the comment 'adding columns in a data frame' over here, and this DataFrame is a PySpark DataFrame. In order to add a column we have an amazing function called withColumn. If I look at its documentation, it returns a new DataFrame by adding a column, or replacing the existing column that has the same name. The first parameter I'm going to give is my new column name; suppose I pick Experience, so I'll call it 'Experience After 2 years'. What value should it have? After two years, a candidate who has 10 years of experience will have 12, so for the value I'll write df_pyspark['Experience'] + 2, because after two years the experience will increase by two. I'm just taking one way of doing this; you can put any values you want, guys, it is up to you. These are the only two things that are required, and if I execute it you'll see the operation happen, and now this DataFrame has four features. If I want to see the complete data set, I can add .show(), and once I execute it you'll see that 'Experience After 2 years' is nothing but 12, because 10 plus 2 is 12; very simple. That is what withColumn does, and you can do different things with it. So this is how you add a column to a DataFrame. And again, guys, this is not an in-place operation; you basically need to assign it to a variable in order for it to get reflected, so if I want it reflected, I really need to assign it like this. And here... sorry, first let me remove the show, because keeping the show there will not give a proper result. Okay, 'NoneType' object has no attribute 'withColumn'; sorry, there was a problem because I had replaced the variable completely, so I'll read the data set again, and now when I execute it, it is fine. Now if I write .show(), you will see all the elements properly.
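A minimal sketch of that step (the new column name is just the example used here):

    # withColumn returns a NEW DataFrame; assign it back to keep the change
    df_pyspark = df_pyspark.withColumn('Experience After 2 years',
                                       df_pyspark['Experience'] + 2)
    df_pyspark.show()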
That was with respect to adding columns to the DataFrame. Now I probably need to drop columns also, so let's see how we can drop them. Dropping columns is pretty simple with the usual drop functionality: you can give a single column name or a list of column names. Suppose I want to drop 'Experience After 2 years', because who knows what will happen after two years; to drop it, just execute it like that and add .show(), and here you will find the data set without that specific column. Again, this is not an in-place operation, you need to assign it to a variable; so let me assign it, and please make sure that you remove the .show(), since show is just a display function. Now if I write .show() on the result, you will see all the elements. Next, let's see how to rename a column. We are just doing all this, guys, because you really need to be very good at data pre-processing. There is another function called withColumnRenamed, where you just give your existing column name and the new column name: suppose I have my existing column 'Name' here, and I'll rename it to 'New Name'. Just execute it, and now if I write .show() and look at the elements, you will see that instead of Name there is a column called New Name.
about renaming columns, right? Yes, this is just the part one or data frames, the part
two, we'll try to do something called as filter operation. And in filter operation, we'll
try to see various operation because it will be pretty much amazing, you'll be able to
Now, this is tutorial three, which is part three with respect to DataFrame operations. In this particular video we are going to see how we can handle missing values and null values. In short, we'll see how to drop columns, how to drop rows, how to drop rows based on null values, what the various parameters of the drop functionality are, and how to handle missing values with the mean, median or mode. So I'm just writing it as mean, median and mode. The main thing I really want to show you is how we can handle missing values; this is pretty important, because in pandas, and also in sklearn, we have some kind of inbuilt functions for this. So let's proceed. Whenever we are working with PySpark we really need to start a Spark session, and I hope by now you are all familiar with that. I'll write from pyspark.sql import SparkSession, then I'm going to create a variable called spark, and here I'm going to write SparkSession.builder.appName; let me just keep the app name as 'Practise', because I'm just practising things, then .getOrCreate(), and just execute this. It will take some time to get executed... yes, it has got executed fine.
Now, for this I've just created a very simple data set which looks like this: I have columns Name, Age, Experience and Salary, these are all the candidate names, and there are some values which are left blank; you can see some values I've left blank. We'll see how to drop the null values or how to handle these particular missing values. So let's proceed. First of all, in order to read the data set I'll just write spark.read.csv, and here I'm going to use the CSV file name, which is test2.csv; it is saved in the same location as this particular notebook, and anyhow I will be providing you this data as well. I'm going to use header=True, and there is also inferSchema=True, so that I get the data types properly. When I'm reading this you'll see the DataFrame that I'm getting, and if you want to see the entire data set, that will be .show(); this is your entire data set, and here you have null values in it, perfect. Let me do one thing and save this in a variable, so I'll write df_pyspark, and if I now check .show(), this is my entire data set. Okay, perfect, we are pretty much good here, and we have read the data set. Now, first, how do we drop columns? Dropping a column is very simple, guys: suppose I want to drop the Name column, then I just use drop and provide my column name, so I'll write df_pyspark.drop('Name'), and I can go and check .show(); then you will see all the features that are present except Name. This is pretty simple, and I also showed it to you in the previous session. That is how you drop a feature or a column, but our main focus is dropping the null values.
So right now, df_pyspark.show() gives my data set; now let's see how to drop rows based on the null values. For this I'll use df_pyspark.na; there is something called na, and under it you have drop, fill and replace. First of all I'll start with drop. Inside this particular drop, always remember that if I don't give anything and just execute it, then wherever there are null values, those rows will get deleted. So here we can see that the last few rows with nulls are no longer present; a row like Shubham's, where every value is present, remains, and all the rows containing nulls have been removed, perfect. In short, whenever you use .na.drop(), it is just going to drop those rows wherever NaN or null values are present. If I go and look inside drop, there are a few main parameters: one is how, one is thresh, and one is subset. Let's try to understand these parameters. First I'll start with how: suppose I write df_pyspark.na.drop(how=...); the how parameter can have two values, 'any' or 'all'. With 'any', it drops a row if it contains any nulls, even if just one of the values is null; by default such a row is going to get dropped. When do we use how='all'? That basically means: only if all the values in a record are null will it get dropped; if a row still has even one non-null value, it will not be dropped. So let's see whether this does anything; it is definitely not going to drop anything here, because I know at least one non-null value is always there in every row. If I'm using how='all', it drops only those records which are completely null, and by default the how value is 'any'; 'any' basically says that whether there is one null or two, we are just going to drop those specific records. Pretty simple, that was what how does.
Now let's go ahead and try to understand thresh. What is this threshold? I know how='any', but there is one more option called thresh. With thresh, suppose I keep the threshold as two and execute it; you'll see that the last row has got deleted. Why has it got deleted? Because we kept the thresh value as two, which says that at least two non-null values should be present in a row. Here, one row still has two non-null values, so it stays, but this last row has just one non-null value, so it got deleted. A row with two non-null values, say 34 and 10, has not got deleted, and the same over here: a row with 34, 10 and 38,000 has three non-null values, while another one has two non-null values. So whenever we give a thresh value of two, it goes and checks whether at least two non-null values are there in that specific row; if they are, it keeps that row, otherwise it deletes it. You can also check it out with thresh=1: if I go and set one, you can see that all the rows are there, because it just checks whether at least one non-null value is present, and it is. If I make it three, let's see what happens: now only the rows with at least three non-null values remain and the rest have been deleted; the rows that had only two non-null values are gone, while the one with 34, 10 and 38,000 stays. So that is the understanding with respect to thresh.
Now let's go ahead with another one called subset; this is the third parameter inside my drop function. And remember, these parameters are pretty simple if you have worked with pandas, because it is the same kind of thing. In subset we can provide columns: suppose I remove thresh, because I don't want to keep any threshold, and let's say I just want to drop null values only for a specific column, probably only the Experience column; then I can basically give that as the subset. From the Experience column you can see that wherever there were null values in the records, that whole record has been deleted. Like this you can apply it to other columns too: suppose you want to apply it on Age, you can also do that, and wherever there were null values in the Age column, that whole record got deleted. So this is with respect to subset; I hope you are getting an idea, guys. This is pretty good, because it is the same thing we are used to doing in pandas, and it is very, very handy when you are working with missing data.
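A compact sketch of those drop variants (assuming the df_pyspark read from test2.csv above):

    df_pyspark.na.drop().show()                                   # drop rows containing any null
    df_pyspark.na.drop(how='all').show()                          # drop only fully-null rows
    df_pyspark.na.drop(how='any', thresh=2).show()                # keep rows with >= 2 non-null values
    df_pyspark.na.drop(how='any', subset=['Experience']).show()   # drop rows where Experience is null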
Let's go to the next thing: filling the missing values. In order to fill the missing values, again I'll be using df_pyspark.na.fill. This fill takes two parameters: one is value and the other is subset. Now, suppose I give a value like 'Missing Value' and write .show(); then wherever there is a null it is going to be replaced with 'Missing Value', so here where the null value was, you now see 'Missing Value' instead. Suppose you really want to perform this missing value handling only on a specific column; then you can also give your column name, and that acts as the subset. I can give multiple columns as well, like Experience comma Age inside a list; when I give it like this, the same functionality will happen on both columns. Pretty simple.
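Roughly, the fill calls look like this (note that fill only touches columns whose type matches the value you pass, so a string value fills string columns and a number fills numeric ones; the 0 below is just an example placeholder for the numeric columns):

    df_pyspark.na.fill('Missing Value').show()            # replace nulls with a placeholder
    df_pyspark.na.fill(0, ['Experience', 'Age']).show()   # fill only the listed columns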
Now, the next step: we are going to take a specific column and handle the missing values with the mean of that column, or the median of that column. Right now, if I check df_pyspark.show(), this is my entire data set. What I'm going to do is take this particular Experience column and replace the null values with the mean of the Experience column itself. In order to do this I'm going to use an inbuilt function; and guys, if you know about the imputer, we use that with the help of sklearn too, and in PySpark we also have an Imputer. I'm just going to copy and paste the code here to make it very simple: from pyspark.ml.feature import Imputer. Here I'm giving my input columns, that is Age, Experience and Salary, because I probably want to apply it to every column, and for the output columns I'm naming each one with '{}_imputed'.format(c). Then I'm keeping the strategy as mean; you can also change the strategy to median or mode. I'll execute this, it has executed fine, and then we just do fit and transform: imputer.fit(df_pyspark).transform(df_pyspark). Once I execute this, guys, you'll see that it creates additional columns with _imputed in the name, so here you can see Age_imputed. In short, we have applied the mean here, which basically means the null values have been replaced by the mean: over here this null value is replaced by 28, and similarly these two null values are replaced with 10 and 5 in the Experience_imputed column. So wherever there is a null value, it is being replaced by the mean of the Experience column, the mean of the Age column, or the mean of the Salary column. And if you really want to go ahead with the median, just change this mean to median and execute it; now you'll see the median values. So here are the columns which had null values, and here are all the columns with the imputed values, with respect to the mean or the median.
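The Imputer code pasted in the video looks roughly like this (a sketch; the column names assume the test2.csv headers Age, Experience, Salary):

    from pyspark.ml.feature import Imputer

    cols = ['Age', 'Experience', 'Salary']
    imputer = Imputer(
        inputCols=cols,
        outputCols=["{}_imputed".format(c) for c in cols]
    ).setStrategy("mean")        # or "median" / "mode"

    # Fit on the data, then add the *_imputed columns
    imputer.fit(df_pyspark).transform(df_pyspark).show()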
So guys, today we are in tutorial four of PySpark DataFrames, and in this particular video we are going to discuss the filter operation. The filter operation is pretty important as a data pre-processing technique: if you want to retrieve some of the records based on some conditions, some kind of Boolean conditions, we can definitely do that with the help of the filter operation. Now guys, please make sure that you follow this particular playlist with respect to PySpark; I will be uploading more and more videos as we go ahead. And remember one more thing: there were a lot of complaints from people telling me to upload SQL with Python. Don't worry, in parallel I'll start uploading SQL with Python; I'm extremely sorry for the delay, I was busy with some work, but I'll make sure I upload all the videos, so SQL with Python will also get uploaded in parallel. So let's proceed. First of all, let me make some cells. Today I have taken a small data set, which is called test1.csv; here I have columns like Name, Age, Experience and Salary, and I'm just going to use this to show you some examples with respect to the filter operation. Initially, whenever you want to work with PySpark, you have to make sure that you have installed all the libraries. So I'm going to use from pyspark.sql import SparkSession, which will actually help us create a Spark session; that is the first step whenever we want to work with PySpark. We'll be using SparkSession.builder.appName, I'm just going to give my app name as 'Dataframe', and then the getOrCreate function, which will quickly create a Spark session; I think this is pretty familiar to every one of you by now. Let's try to read the specific data set. Over here I'm going to create a variable, df_pyspark, and I'm going to use the spark variable: spark.read.csv, and here I'm just going to give my data set test1.csv and make sure we have the options header=True and inferSchema=True selected; I think I have explained all this to you already. Then if I write df_pyspark.show(), you'll be able to see your data set; so this is my entire output. Now guys, as I said, we will be working on the filter operation, and I will try to retrieve some of the records based on some conditions. Remember, filters are also available in pandas, but there you write them in a different way; let me just show you how we can perform the filter operation using PySpark. So, filter operations: let me make this a markdown heading so it looks big, looks nice, and let me make some more cells. Perfect.
Now, the first step: how do I do a filter operation? Suppose I want to find the salary of the people who earn less than or equal to 20,000. There are two ways we can write it. In the first way I'll just use the filter operation, so you have .filter, and here you just have to specify the condition that you want; suppose I write 'Salary<=20000'. Remember, Salary should be the exact name of the column. When I write .show(), you will see the specific records: less than or equal to 20,000, there are these four people, and you can see all of them along with the Experience and the other columns. Now, that is one way. Probably, after putting this particular condition, I want to pick up only two columns; so what I can do is take this and then write .select, and here I'm going to specify Name and Age, then .show(). This is how you can do it: over here you can see that Name and Age are there, and you are able to get that specific information after the filter. You can do less than, greater than, whatever you want. What if I want to put two different conditions? Let's see that too. I'll write df_pyspark.filter and specify my first condition. Also guys, the condition that I'm writing can be written in another form, like this: if I write df_pyspark['Salary'] <= 20000, I will also be able to get the same output, and here you can see the same output. Now, suppose I want to write multiple conditions; how do I do it? It's very simple: I'll take this, which is my first condition, and I can use an AND operation, or an OR operation, or whatever kind of operation I want. Probably I want df_pyspark['Salary'] less than or equal to 20,000 AND df_pyspark['Salary'] greater than or equal to 15,000, so I'll get all those specific records. And again, I'll put each condition inside its own brackets; make sure that you do this, otherwise you will be getting an error. So I have written it something like df_pyspark.filter((df_pyspark['Salary'] <= 20000) & (df_pyspark['Salary'] >= 15000)); if I execute it, you will see the records between 15,000 and 20,000. You can also write OR, and then you will get all the different values. So this is the kind of filter operation that you can specify; remember, this will be pretty handy when you are retrieving some of the records from any kind of data set, and you can try different things. This is one way where you directly provide your column name and put a condition, and internally the PySpark DataFrame understands it and gives you the output. So yes, that was about this particular video, I hope you liked this particular filter operation; just try to do it from your side. Okay, one more operation is pending.
I can also use another operation, which is called the NOT operation; let's see how this NOT operation comes in. Basically, we call it the inverse condition operation. Inside the filter I can put a NOT condition like this, using the tilde: I'll say NOT of df_pyspark['Salary'] <= 20000, so anything that is greater than 20,000 will be given over here. So that is the inverse, the NOT, filter operation.
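A compact sketch of the filter patterns from this part:

    # Simple condition (string form or column form, both work)
    df_pyspark.filter("Salary<=20000").show()
    df_pyspark.filter(df_pyspark['Salary'] <= 20000).show()

    # Keep only some columns after filtering
    df_pyspark.filter("Salary<=20000").select(['Name', 'Age']).show()

    # Multiple conditions: wrap each in parentheses, combine with & (and) or | (or)
    df_pyspark.filter((df_pyspark['Salary'] <= 20000) &
                      (df_pyspark['Salary'] >= 15000)).show()

    # Inverse (NOT) condition with ~
    df_pyspark.filter(~(df_pyspark['Salary'] <= 20000)).show()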
Guys, we will be continuing the PySpark series, and in this particular video we are going to see the groupBy and aggregate functions. I have already created around four tutorials on PySpark, and this is basically the fifth tutorial; again, it is a part of DataFrames. Why should we use groupBy and aggregate functions? Again, for doing some kind of data pre-processing. For this particular problem I have created a data set which has three features: Name, Departments and Salary, and you have some data like Krish, Data Science, and a salary, something like this. In short, there are some departments where Krish and other people teach, and based on the different departments they get different salaries. So let's see how we can perform different groupBy and aggregate functions and how we can pre-process or retrieve some kind of results from this particular data. To begin with, we are first of all going to do from pyspark.sql import SparkSession; as usual, we have to create a Spark session. After this I'll create a spark variable, so I'll use SparkSession.builder.appName; I think everybody must be familiar with this by now, but again I'm showing it, so let me write the app name as 'Agg', then .getOrCreate(). So now I've actually created a Spark session; this will probably take some time. Now if I go and check out my spark variable, here is your entire information with respect to this particular Spark session. Now let's go ahead and read the data set. I will write df_pyspark, and then spark.read.csv; the CSV file name is test3.csv, and remember I'll be giving this particular CSV file in the GitHub repo as well. Then I'll be using header=True comma inferSchema=True. This is my df_pyspark, and in the next statement I will write df_pyspark.show(). Here you will see that I am able to see the whole data set: I have Name, Departments and Salary and all that information. If I really want to see the schema or the columns, that is, which columns there are and what data type each belongs to, I can definitely use df_pyspark.printSchema(), and here you can see Name is a string, Departments is a string, and Salary is basically an integer.
Okay, now let's perform some groupBy operations. First we'll start with a groupBy: probably I want to group by Name and see what the total or mean salary would be. Let's take a specific example. I'll write df_pyspark.groupBy; suppose I want to go and check who has the maximum salary out of all the people present in this particular data set. So first of all I will group by Name; if I execute this, you can see that we get a return type of GroupedData at some specific memory location. And you should always know, guys, that groupBy and aggregate functions work together: first we need to apply a groupBy, and then we need to apply an aggregate function. If you really want to check the aggregate functions, just press dot and press tab, and here you will see a lot of different functions like agg, avg, count, max, mean and many more. Now I'm just going to use .sum(), because I really need to find who, out of all these employees, is getting the maximum total salary. So I'll apply the sum, and if I execute it you will see that we are getting a sql DataFrame which has Name and sum(Salary). This is quite important: it is sum(Salary) because I really want the sum of the salary, and remember, we cannot apply sum on a string, so it is not applied there; it just gives you Name because we have grouped by Name, and the .sum() gets applied only on this particular Salary column. Now, if I write .show(), you will see that Sudhanshu has the highest total salary of 35,000, Sunny has 12,000, Krish has 19,000, and Mahesh has 7,000. If you go and see the data, Sudhanshu is present here, here, and in Big Data, so overall his salary should come to 35,000 if you compute it; similarly you can go and compute my salary over here, and you can also compute Sunny's and Mahesh's. So this is just one example, and here I will just write: we have grouped to find the maximum salary. From this entire observation we can see that Sudhanshu has the highest salary.
Now let's go one more step ahead: this time we'll group by Departments to find out which department pays out the highest total salary. That's just one possible requirement; different requirements will come up, and I'm only showing some examples. I copy the previous line, group by Departments instead, and call .sum().show(). When I first execute it I get an error because 'Department' is the wrong column name; the column is actually called Departments, so let me fix that. Now looking at the result, IOT pays a combined salary of around 15,000 to all of its employees (combined because we are summing), Big Data comes to around 15,000 as well, and Data Science to around 43,000. If you add up the individual Big Data salaries yourself, you can confirm that the total really is about 15,000.
If you want the mean instead of the sum, you can do that too: copy the same line and replace .sum() with .mean(). This gives the mean salary per department; for IOT, for example, it is 7,500, because the mean depends on how many people work in that department. One more thing I can check is how many employees work in each department, by using .count() (it's a method, so remember the parentheses). Here you can see there are two people in IOT, four in Big Data, and four in Data Science, so two plus four plus four gives ten employees in total.
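Here is a small sketch of those per-department aggregations, again assuming the df_pyspark DataFrame from above:

```python
# Total, average, and head-count per department.
df_pyspark.groupBy("Departments").sum("Salary").show()
df_pyspark.groupBy("Departments").mean("Salary").show()
df_pyspark.groupBy("Departments").count().show()
```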
There is one more way: I can apply an aggregate function directly, without an explicit groupBy. These are all just examples, and you can combine groupBy with different aggregations however you like. Using df_pyspark.agg(), I pass a key-value pair: for the Salary column I ask for the sum, which gives the overall salary paid out, the total expenditure. That total comes out to around 73,000. So we can apply aggregate functions directly as well; the functions above are also aggregate functions, just applied after a groupBy.
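A minimal sketch of that direct aggregation, with the same assumed DataFrame name:

```python
# Aggregate directly, without an explicit groupBy:
# the total salary paid across the whole dataset (about 73,000 here).
df_pyspark.agg({"Salary": "sum"}).show()
```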
Now suppose that instead of the total I want to find the maximum salary each person is getting, that is, the single highest-paid record per person. In that case, instead of .sum() I write .max().show(). Here you can see that Sudhanshu's maximum is 20,000, Sunny's is 10,000, Krish's is 10,000, and Mahesh's is 4,000. Notice that for Krish the value picked up is the 10,000 record from Data Science: max does not keep both of a person's records, it keeps only the largest one per group when we group by name. Let's see whether min works the same way: grouping by name and applying .min() shows the minimum salary per person, for example Sudhanshu's minimum is 5,000 and Krish's is 4,000, so we can get that information as well. Now let's see what other operations are there; average is also available, and if I write .avg() it behaves just like mean, giving the mean salary per group. Again, you can explore the different functions yourself and see why all of these operations are needed.
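A short sketch of those per-person aggregations:

```python
# Per-person maximum, minimum, and average salary.
df_pyspark.groupBy("Name").max("Salary").show()
df_pyspark.groupBy("Name").min("Salary").show()
df_pyspark.groupBy("Name").avg("Salary").show()   # avg is the same as mean
```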
Understand one thing: you will end up doing a lot of data pre-processing and a lot of data retrieval, and these groupBy and aggregation operations are exactly the kind of tooling you'll use; you can check them out and apply whichever ones you need. Spark MLlib also has excellent documentation with plenty of examples; if you click on Examples in the documentation you can see how different kinds of things are done. With respect to Spark ML there are two different APIs: the RDD-based API and the DataFrame-based API. The DataFrame API is the more recent one and the one most widely used everywhere, so that is what we'll focus on; that's also why we spent so much time learning DataFrames in PySpark. We'll learn through the DataFrame API and see how to solve a machine learning use case with it. Now let's go through one very simple example. And always remember, the documentation is very well written; you can go there and read all of this yourself. So let's proceed and see what we can do.
In this example I'm going to take a simple machine learning problem statement. Let me open a dataset; I know it doesn't have many records. It has the columns name, age, experience, and salary, and this is just a simple problem meant to show you how powerful Spark is with its MLlib libraries; it's only a demo. From the next video onwards I'll explain the regression algorithms in detail and how to implement them; the theory videos are already uploaded. After this tutorial 5, this will be tutorial 6, and before I upload the linear regression implementation, please make sure you watch the maths intuition video. I've added that video to the same playlist; it appears after tutorial 26, since I also included it in my machine learning playlist. After this, you'll also find the in-depth linear regression implementation video once it's uploaded. So let's proceed. Here is my entire dataset. What I have to do is predict salary based on age and experience: a very simple use case, with no heavy pre-processing, transformation, or standardization. I'll just take these two independent features and predict the person's salary from age and experience. It's a perfect first example; again, I'll show the line-by-line implementation in detail in the upcoming videos on linear regression, and if you look closely, this problem itself is a linear regression example. So let's get started.
As usual, the first step is to create a Spark session: from pyspark.sql import SparkSession, then SparkSession.builder.appName(...).getOrCreate(). Let me execute it; I think you're familiar with this by now. Next I read the dataset, test1.csv, with header=True and inferSchema=True. When I call .show() on the resulting DataFrame, you can see all the features (I'll share this dataset with you, don't worry). If I call printSchema() on it, you can see the schema: the columns are Name, Age, Experience, and Salary. And calling .columns on the training DataFrame lists the column names.
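A minimal sketch of this setup; the app name and the variable name train_data are assumptions here, since the exact names in the video differ slightly:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for this example.
spark = SparkSession.builder.appName("Practice").getOrCreate()

# header=True keeps the column names, inferSchema=True detects int vs string types.
train_data = spark.read.csv("test1.csv", header=True, inferSchema=True)
train_data.show()
train_data.printSchema()     # Name: string, Age: int, Experience: int, Salary: int
print(train_data.columns)
```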
Always remember that in PySpark we use a somewhat different mechanism for this kind of data preparation. Normally, with the machine learning algorithms available in scikit-learn, we divide the data into independent and dependent features, the X and y variables, and then do a train-test split. In PySpark the technique is a little different: we first need a way to group all the independent features together. For that we use a class called VectorAssembler. The VectorAssembler makes sure that all my feature columns, here Age and Experience, which are my two independent features, are grouped together into a single vector like [age, experience] for every record. I then treat that grouped vector as one new feature: this new column is my independent feature, so my model input will look like [Age, Experience] for each row, and that is exactly what gets fed to the algorithm. To build this grouping, PySpark provides VectorAssembler in pyspark.ml.feature, and we give it two things. One is inputCols, the list of columns to group; here that's Age and Experience. We don't include Name because it's a string, not a predictor; if there were categorical features we would first convert them into a numerical representation, which I'll show in the upcoming in-depth videos on linear and logistic regression. The other is outputCol, the name of the new grouped column, which I'm calling 'Independent Features'. That is what I'm doing here.
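Here is a small sketch of that step, assuming the column names Age and Experience and the output name 'Independent Features' used above:

```python
from pyspark.ml.feature import VectorAssembler

# Group the two numeric predictors into a single vector column.
feature_assembler = VectorAssembler(
    inputCols=["Age", "Experience"],
    outputCol="Independent Features",
)

# transform() appends the new vector column to the DataFrame.
output = feature_assembler.transform(train_data)
output.show()
```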
When I execute this I get my feature assembler object, and then I call .transform() on my training data. When I run output.show(), you can see all the original features plus a newly created column called Independent Features. For each row it holds the pair of Age and Experience values, so in short I've combined those two columns into a single vector column; this will be my input feature, Salary will be my output feature, and we'll train the model on that. If I look at output.columns I now have Name, Age, Experience, Salary, and Independent Features. Out of these I'm only interested in two: Independent Features and Salary. Salary will be my output, the y variable, and Independent Features will be my input. So I call output.select() on those two columns and store the result in finalized_data. If I then call .show() on it, you can see the whole thing: Independent Features is the independent column and Salary is the dependent (label) column.
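A quick sketch, using the names introduced above:

```python
# Keep only the assembled feature vector and the label column.
finalized_data = output.select("Independent Features", "Salary")
finalized_data.show()
```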
The first step now is a train-test split, just like in scikit-learn. To do it, I call a method on finalized_data called randomSplit. Remember, I'll explain all of this line by line when we implement a bigger project; since this is just an introductory session, I only want to show you how things are done. So I add a comment, 'train test split', and I import LinearRegression, much like importing a class from scikit-learn, with from pyspark.ml.regression import LinearRegression. Then I do a random split of 75/25 percent, which means the training set gets 75 percent of the data and the test set gets 25 percent. After that I create the LinearRegression; there are two important parameters to set. One is featuresCol, the column holding the grouped independent features, so I pass 'Independent Features'; the other is labelCol, the output feature, so I pass 'Salary'. Once both are provided and I call .fit() on the training data, I can look at the model's coefficients and its intercept. Then I can evaluate the model on the test data using the evaluate() function; the result exposes a predictions DataFrame that holds the predicted values alongside the actual Salary values. If I want other metrics, I press Tab on that result and see things like meanAbsoluteError and meanSquaredError, whose values tell me how the model is performing. That was a very simple example; don't worry, I'll explain everything in depth in the upcoming videos, and the next video is the in-depth linear regression implementation.
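Putting the pieces together, here is a compact sketch of the whole flow on this small salary dataset; the variable names are assumptions, but the column names match the ones above:

```python
from pyspark.ml.regression import LinearRegression

# 75/25 train-test split on the two-column DataFrame built above.
train_df, test_df = finalized_data.randomSplit([0.75, 0.25])

# Features come from the assembled vector column, the label is Salary.
lr = LinearRegression(featuresCol="Independent Features", labelCol="Salary")
model = lr.fit(train_df)

print(model.coefficients)    # one coefficient per input feature (Age, Experience)
print(model.intercept)

# Evaluate on the held-out data; the summary exposes predictions and error metrics.
results = model.evaluate(test_df)
results.predictions.show()
print(results.meanAbsoluteError, results.meanSquaredError)
```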
Now let me help you understand what exactly the Databricks platform is. It's an amazing platform where you can use PySpark and work with Apache Spark, and one more great thing about it is that it also provides cluster instances: if you have a huge dataset and want to distribute the processing in parallel across multiple clusters, you can definitely do that with Databricks. There are two ways to use the platform: the free community edition, and the paid version, which runs on a cloud such as Azure or AWS in the backend. Databricks also helps you implement MLflow, which ties into CI/CD-style pipelines for machine learning, so you can run those kinds of experiments as well; altogether it's an excellent platform. On my YouTube channel I'll show you the community edition, and in upcoming videos we'll also try executing things with both AWS and Azure. When we use AWS or Azure, we'll create multiple instances on the cloud platform and also pull data from an S3 bucket, the storage service in AWS, so I can show you how to work with really huge datasets; all of that will come as we go ahead.
Now let's understand what Databricks is: it is an open and unified data analytics platform for data engineering, data science, and machine learning analytics. Databricks helps with data engineering, which here mostly means working with big data, and it also lets us run machine learning algorithms, so pretty much any data science problem statement can be handled on it. It supports three cloud platforms: AWS, Microsoft Azure, and Google Cloud. If you want to get started, we'll begin with the community edition: go to the Databricks site, choose 'Try Databricks', and enter your details to register for free. Once you're registered and getting started for free, you'll see two options: on the right-hand side the community edition, which is the free one we want, and on the left-hand side an option where you pick one of those three cloud platforms. For now I'll show the community edition, which is very simple and easy, so let's go there. This is how the community edition looks; if you ever want the cloud version instead, you can just click Upgrade. This is the community edition URL, which you get when you register, so if tomorrow you decide you want to work with the cloud, you only have to click that Upgrade button.
In the workspace you'll see things like 'Explore the Quickstart Tutorial', 'Import & Explore Data', 'Create a Blank Notebook', and more. In the community edition you can create a new notebook, create a table, create a cluster, create new MLflow experiments (we can hook an MLflow experiment up to a database in the backend as well), import libraries, read the documentation, and do plenty of other tasks. The first thing to do is create a cluster, so I click 'Create a Cluster'. You can give it any name; I'll call it 'pyspark cluster'. By default the runtime selected is 8.2, which ships with Scala 2.12 and Spark 3.1.1, so we'll be working with Spark 3.1.1; if you remember, that's the same version I installed locally. By default you get one instance with about 15 GB of memory and a fixed configuration; if you want a bigger configuration you have to click through and upgrade. And remember, in the free edition an instance only stays up while it's in use: if it sits idle for around two hours it gets terminated. So here you can see one driver with 15.3 GB memory, 2 cores, and 1 DBU; a DBU is a Databricks Unit, and if you click on it you can read exactly what it means. You can also select a cloud and work with that. Everything is fine so far, so let's create the cluster.
Once you click Create, you'll see the cluster being created. You also get a lot of tabs here, such as Notebooks, Libraries, Event Log, Spark UI, and Driver Logs; it's not as if you can only run Python here, there are lots of options. For example, if I go to Libraries and click 'Install New', I get options to upload a library, install from PyPI, or install from Maven (which you'd use with Java), and there are different workspaces. Suppose you select PyPI and want libraries like TensorFlow, Keras, or scikit-learn: you can type them comma-separated and install them. By default I'm only going to work with PySpark, so I'm not installing any extra libraries. Let's see how long the cluster takes; while it is being created, let's go back to the home screen. Apart from this you can also upload datasets, and the platform gives you an environment for storing data much like Hadoop does. Now the cluster has been created and you can see 'pyspark cluster' in the running state. Remember, this cluster has only one instance; if you want multiple clusters you have to use the cloud-backed version, which is chargeable. Next I'm going to click on the option to import and upload data.
On the data page you can upload files, or bring data in from an S3 bucket; I'll show those options later. You also have DBFS, the Databricks File System, which is where uploaded files get stored. Then there are other data sources such as Amazon Redshift and Amazon Kinesis (Kinesis is used for live streaming data), Cassandra (a NoSQL database), JDBC, Elasticsearch, and so on, plus a set of partner integrations for things like real-time capture into a data lake and more; you can definitely have a look at those. For now I'm just going to click Upload and upload a dataset: I go to my PySpark folder and upload the test1 file. Now you can see the dataset has been uploaded, and it offers 'Create Table with UI' or 'Create Table in Notebook'. If I click the notebook option, it shows the generated code for creating a table in the UI. But I don't actually want to create a table; I just want to run some of the PySpark code we've already learned. So I remove the parts I don't need. Now let me read the dataset.
For reading the dataset, the generated cell already shows the file path (it's a CSV file) along with the infer_schema and header options, so let me strip it down and read the file myself. In a Databricks notebook the spark session is already available, so I write spark.read.csv(file_location, header=True, inferSchema=True) and hope it works on the first try. The first time I execute it, it asks to launch and run the cluster, so I click that, and then I get an error: 'failed to create cluster: request rejected since the total number of nodes would exceed the limit of 1'. Why is that? Looking at my clusters, there's only one, but a couple of entries had been created from the examples, so it wasn't letting me run more than one at a time. I removed one of them, reloaded, and executed again, and now it runs. You can also press Shift+Tab to see hints, just like in a Jupyter notebook. Now the cell runs absolutely fine, and it shows that df is a pyspark.sql.DataFrame.
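For reference, a minimal sketch of that cell; the exact DBFS path is whatever the upload dialog reported for your file, so the path below is only an assumption in the usual /FileStore/tables form:

```python
# In a Databricks notebook, `spark` is already created for you.
file_location = "/FileStore/tables/test1.csv"   # assumed upload location

df = spark.read.csv(file_location, header=True, inferSchema=True)
df         # pyspark.sql.DataFrame[...]
df.show()
```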
Now let me execute a few more things. Using the Tab completion, I call df.printSchema(), and you can see all the column types. In short, this is now running on my cluster instance; I could just as easily load a huge dataset, even a 50 GB one from an S3 bucket, and I'll show how to do that from S3 in upcoming videos. What I am going to do going forward is run these kinds of problem statements through Databricks so that you learn it properly. One more thing: df.show() displays the whole dataset, and if I only want one column I can write df.select('Salary').show() to select and display just the salary column. Everything you want to do, you can do here, and remember you have around 15 GB of memory to work with. The notebook also has the same kinds of options you're used to in Jupyter. This is running on a cluster with 15.25 GB of memory and 2 cores, with Spark 3.1.1 and Scala 2.12, and you can see all of that information in the cluster details. So what I'd like you to do is set up this environment for yourself and keep everything ready; from the upcoming videos we'll see how to implement problem statements and different algorithms.
I already gave you the introduction to Databricks in the last class; I hope you've created your account and started using it. If you don't know how to create an account, please watch tutorial 7 of this playlist; the playlist link is in the description. This is my Databricks community account. Remember, in the community edition we can only create one cluster; I'll also show you the upgraded version in the future (I'll probably buy it) and how you can create multiple clusters there, but for that you also need a cloud such as AWS or Azure. Now, which dataset am I going to use? This is the tips dataset: it records what tip people gave at a restaurant along with the total bill and some other attributes. Based on all of these parameters, the problem I'm going to solve is predicting the total bill the person is going to pay. Since there are many features, this is a multiple linear regression problem statement.
So let's proceed. First I click Browse and upload the dataset. I have tips.csv in a folder on my machine, and I'll share it with you, so don't worry; let me quickly upload it. You can see the dataset gets uploaded into the DBFS directory, under something like /FileStore/tables. If you click on DBFS, then FileStore and tables, you'll see tips.csv there; I've also uploaded some other datasets in previous videos, but here we're only using tips.csv. Now remember, the first step in Databricks is to create a cluster. By default the community edition lets you create just a single cluster; with the paid, upgraded version you can create multiple clusters if you have access to AWS. So I go to Clusters and create a new one, which I'll call 'linear regression cluster', keep the 8.2 runtime with Scala, and create it. Everything else is much the same: the instance gives you 15 GB of memory, 2 cores, and 1 DBU, which I already discussed in the previous session. Creating the cluster takes a little time. And remember, if you need any libraries, just click on Libraries and install the ones you want, for example Seaborn, Keras, or TensorFlow; you can type them here along with versions and they'll be installed. Right now I don't need any extra libraries, since my main aim is just to use PySpark. After about a minute the cluster is created.
Now go back to the home screen. You can create a blank notebook; I've already created one with the basic code written, so I'll open it and start the process. First I have a file_location variable pointing at tips.csv, and the file type is CSV. Then I read it with spark.read.csv(file_location, header=True, inferSchema=True), and df.show() lets me check the whole dataset. I'll execute it in front of you and write the rest line by line; it will definitely help if you type along with me so you understand the code better. When I execute it, the cluster starts running; you see 'waiting to run' and then the command running (let me zoom out a little so you can see properly). The first time you start a cluster it takes a while. The Spark job runs, and now you can see my tips dataset, the one uploaded at that file location: total_bill, tip, sex, smoker, day, time, and size. Next, df.printSchema() (Tab completion helps here) loads the schema: total_bill and tip are doubles; sex, smoker, day, and time are strings; and size is an integer.
You may be wondering why I'm doing this in Databricks at all: it's simply to show you how the same code runs on a cluster. Right now I have a single cluster, which means at most around 15 GB of RAM, but if you were working with 100 GB of data, this kind of processing would be split across multiple clusters, and that's how you'll be able to work with genuinely big data later on. Now let's identify the features: my independent features are tip, sex, smoker, day, time, and size, and my dependent feature is total_bill. Based on these features I need to build a linear regression model that predicts the total bill. So first I write df.columns to check which columns I have; these are exactly my columns. One thing to note about these features: columns like sex, smoker, day, and time are categorical, and categorical features need to be converted into numerical values before a machine learning algorithm can understand them. So let's see how to handle categorical features.
I'll add a comment: handling categorical features. In plain scikit-learn we would use one-hot encoding, ordinal encoding, or some other kind of encoding, and we can do the same sort of thing in PySpark. For this, PySpark provides a class called StringIndexer, so I write from pyspark.ml.feature import StringIndexer. The StringIndexer converts string categorical features into numerical ones; what it produces is essentially an ordinal encoding, so a column like gender with male/female values becomes zeros and ones. Most of the categorical columns here will be handled with this ordinal-style encoding. You may be wondering about one-hot encoding; I'll show that in upcoming videos with different machine learning algorithms, because it's better to learn one thing at a time. Now let's see how to convert the categorical features sex, smoker, day, and time; if I run df.show() again you can confirm that time is indeed categorical as well. So let's go ahead and apply the StringIndexer.
I've imported the StringIndexer, so now I create an indexer object from it, and the first thing I need to tell it is which categorical column to convert. If I press Shift+Tab on StringIndexer, you can see it takes an input column and an output column; it also has options for multiple input columns and multiple output columns, and I'll try both forms. First the single-column form: I pass inputCol='sex' and outputCol='sex_indexed'. What this does is take the sex column and convert it into an ordinal encoding with the help of the StringIndexer. In the next step, rather than overwriting df and having to re-run the earlier code, I'll create another DataFrame, df_r. I call indexer.fit(df) and then transform(df), which works very much like fit_transform in scikit-learn. Then df_r.show() shows that one more column, sex_indexed, has been created holding the ordinal-encoded values. Let me execute it; it runs fine, although it takes a moment. Now you can see the extra sex_indexed column: wherever the value is female it is 1.0, and wherever it is male it is 0.0. So we've handled that column and converted this categorical feature into an ordinal encoding.
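A minimal sketch of that single-column case (df is the tips DataFrame; df_r is just the name used here for the result):

```python
from pyspark.ml.feature import StringIndexer

# Ordinal-encode the 'sex' column into a new 'sex_indexed' column.
indexer = StringIndexer(inputCol="sex", outputCol="sex_indexed")
df_r = indexer.fit(df).transform(df)
df_r.show()   # male -> 0.0, female -> 1.0 (indices are assigned by label frequency)
```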
We still have several categorical features left, so I'll use the indexer again, this time specifying multiple columns. The first one is already done, so instead of sex the inputs will now start with smoker, and, as I showed you, instead of inputCol I have to write inputCols when passing multiple columns. Looking at the data, day and time are the other two features, so the input list is smoker, day, and time, and correspondingly I create three new output columns: smoker_indexed, day_indexed, and time_indexed. Then I call .fit() on df_r, my new DataFrame, followed by transform, and show the result. When I first execute it, it complains about an invalid parameter value for the output columns, because I had used the singular outputCol with a list; changing it to outputCols fixes the issue, and now it executes perfectly fine. All the features are now available: sex_indexed, smoker_indexed, day_indexed, and time_indexed, with ordinal encodings like 0, 1, 2. We've converted all the string categorical values in these features into numerical values, so the model will definitely be able to understand them.
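And a sketch of the multi-column form described above (note the plural inputCols/outputCols):

```python
# Ordinal-encode the remaining categorical columns in one go.
indexer = StringIndexer(
    inputCols=["smoker", "day", "time"],
    outputCols=["smoker_indexed", "day_indexed", "time_indexed"],
)
df_r = indexer.fit(df_r).transform(df_r)
df_r.show()
```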
Now that that's done, let's move to the next step, which is easy because the dataset is already prepared: we use the VectorAssembler again. Always remember that in PySpark, once we have all these features, we need to group all the independent features together and keep the dependent feature separate. So I write from pyspark.ml.feature import VectorAssembler; this will help group the independent features together. When I initialize it, the first parameter to provide is inputCols. Which input columns do I have? Let me quickly add a cell above and print df_r.columns so I have all the column names in front of me. The first input I'm definitely going to provide is tip, since tip is a numerical independent feature; I'll also specify size, and then the indexed columns sex_indexed, smoker_indexed, day_indexed, and time_indexed. Those are all my independent features, and we need to follow this order. Since they'll be grouped together, I also have to name the resulting grouped column, so I set outputCol='Independent Features'. Pretty simple. Then I create a variable called feature_assembler equal to this VectorAssembler with its input and output columns, so that I can transform the data. The next step is output = feature_assembler.transform(df_r), because the transformation has to be applied to df_r. Let me execute it.
Once it runs, you can see the entire output: all the original columns are there, plus the new Independent Features column. Why do we need to create this grouped independent-features column? Because that's how PySpark expects the input: we always create one vector column that lists all the independent features together. If I call output.show(), the new column is there, but since all the features are grouped it's hard to read everything on one screen, so instead I run output.select('Independent Features').show(). Now you can see the assembled vectors, shown in the same order as the input columns: tip, size, sex_indexed, smoker_indexed, day_indexed, time_indexed. So these are my independent features, collected into a single column that holds the list of feature values for each row, and that's the first major step done.
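A sketch of that assembly step for the tips data, with the column order used above:

```python
from pyspark.ml.feature import VectorAssembler

# Collect the numeric and indexed categorical predictors into one vector column.
feature_assembler = VectorAssembler(
    inputCols=["tip", "size", "sex_indexed",
               "smoker_indexed", "day_indexed", "time_indexed"],
    outputCol="Independent Features",
)
output = feature_assembler.transform(df_r)
output.select("Independent Features").show(truncate=False)
```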
Now for the next step. I know what my output DataFrame contains; if I run output.show() I can see all the features are available. And we know which one is the output: total_bill is my dependent feature, and the Independent Features column holds my independent features. So out of this whole output I'm going to select just those two columns and call the result my finalized data: finalized_data = output.select('Independent Features', 'total_bill') (let me just confirm the column name is exactly 'Independent Features'). If I execute that and then run finalized_data.show(), I can see the two important columns: Independent Features, with all my independent features, and total_bill, my dependent feature. Very simple so far. The next step is to apply the linear regression.
I'll copy and paste a little code for the linear regression. First, from pyspark.ml.regression import LinearRegression. Then I take the entire finalized data and do a random split of 75/25 percent. In the LinearRegression I provide the two parameters it needs: featuresCol='Independent Features', and labelCol='total_bill', which is my dependent feature. Then I call .fit() on the training data, and my regressor model gets created; this takes a moment. Here you can see a lot of useful information about the train and test data. And notice that the grouped Independent Features column is shown with the type UDT, a Spark user-defined type used for vectors; that's nothing to worry about. Now that I have my regressor, I can look at regressor.coefficients: since this is a multiple linear regression with around six input features, there are six different coefficients; in a linear regression you always get one coefficient per feature. You also get an intercept, regressor.intercept, which here comes out to about 0.923.
So we have both pieces of information; now it's time to evaluate on the test data. I call evaluate on the test data and store the result in a variable I'll call pred_results, which will hold the predictions. The first time I run it I get 'test is not defined', simply because the split variable was named differently; you'll hit small errors like that, and it's fine. Now, to see the predictions, I write pred_results.predictions.show(), and I get the full prediction table. In it you have the independent feature vector, the actual total_bill value, and the prediction column with the predicted value, so you can compare actual against predicted row by row and judge how good the model is just by looking at total_bill next to prediction. It looks pretty reasonable, so I'll add a comment marking this as the final comparison.
Now let's look at some of the other information we can check. There's quite a lot; for example, you probably want the R-squared value. If I press Tab on the regressor I see coefficients and intercept, but R-squared actually lives on the evaluation result rather than on the regressor itself; I had to check the documentation page for a second to confirm that. So using pred_results I can read off the R-squared, and similarly pred_results.meanAbsoluteError for the mean absolute error and pred_results.meanSquaredError for the mean squared error. Those three values are the performance metrics I can check here: R-squared, MAE, and MSE. And whenever you face any kind of problem, make sure you check the Apache Spark MLlib documentation.
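For reference, here is a small sketch of the split, fit, and evaluation for the tips model, assuming the finalized_data DataFrame built above; evaluate() returns a summary object that carries the predictions and the error metrics:

```python
from pyspark.ml.regression import LinearRegression

# 75/25 split, train on the larger part, evaluate on the held-out 25%.
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])

regressor = LinearRegression(
    featuresCol="Independent Features", labelCol="total_bill"
).fit(train_data)

pred_results = regressor.evaluate(test_data)
pred_results.predictions.show()          # actual total_bill next to the prediction column
print(pred_results.r2)                   # R-squared
print(pred_results.meanAbsoluteError)    # MAE
print(pred_results.meanSquaredError)     # MSE
```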
That's how you can work through this entire problem statement. Now I'll give you one assignment: Google how you can save this trained model to disk, much like you would pickle a model elsewhere, so it can be reused later. It's very simple; you essentially just use the model's save method, but have a look yourself and see how it's done. That was all for this video; I hope you liked it. Try solving some other problem statement the same way. In the upcoming videos I'll also show how to do one-hot encoding, and we'll learn that too. So I hope you liked this video. Please do subscribe to the channel if you haven't already. Have a great day. Thank you. Bye bye.
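As a hedged hint for that assignment, here is one possible way to persist and reload the fitted model; the path is just an example, and Spark ML models use their own save format rather than pickle:

```python
# Persist the fitted model to a directory, then load it back later.
regressor.write().overwrite().save("/FileStore/models/tips_lr_model")   # assumed path

from pyspark.ml.regression import LinearRegressionModel
loaded_model = LinearRegressionModel.load("/FileStore/models/tips_lr_model")
```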