Hey, hello, welcome back to my YouTube channel. It's Ranjan, and this is the 10th video of the Apache PySpark playlist. In my last nine videos, I have covered what PySpark is, why we use PySpark, the difference between Pandas and PySpark, and the basic fundamental units: RDD, Dataset, and DataFrame.
If you have not watched those nine videos, do have a look first, so that you get better clarity on the next videos. In this video, we will be applying machine learning algorithms on a Spark SQL DataFrame, not on RDDs, because the RDD-based API is almost obsolete.
But once you have learned the machine learning algorithms on the SQL DataFrame, it would be very easy to apply them on RDDs as well. So first of all, I will import some basic libraries. Here it is SparkSession, which is imported from pyspark.sql; this SparkSession is used to create a Spark session when working with SQL DataFrames. Next come the basic machine learning classes, starting with the VectorAssembler.
It is a transformation technique, so we can call it a data pre-processing technique. I will show you how we apply the VectorAssembler. Then there is linear regression, and there is random forest regression.
So I will be showing you two algorithms, linear regression and random forest regression, and this is my RegressionEvaluator. It will be my metric: it gives me the performance of the model, and by using it I will be able to evaluate my model. Let me show you something. When you open the latest Spark documentation, Spark 3.0, you will see that there are two packages: first the PySpark ML package and second the PySpark MLlib package, the machine learning library. If you look, almost all the algorithms exist in both packages, but the MLlib ones are the legacy API: if you are working on RDDs, you have to import from the machine learning library, but if you are working on a DataFrame, you have to import from the machine learning package. So this is my machine learning package. If I had to import from the machine learning library package instead, I would type pyspark.mllib and press Shift+Tab, and it would show me clustering, common, evaluation, feature, linear regression, random forest regression, stats and so on. So in the case of RDDs I would use the algorithms from the machine learning library package, but since I am using a DataFrame, I will use the machine learning package only, so I will remove this. Now I will create the Spark session.
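Here is a minimal sketch of those imports, assuming the DataFrame-based pyspark.ml API described above:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
```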
So I am creating a Spark session and assigning it to a variable named spark. I am giving it an app name, pyspark ML algorithm, although the app name is not strictly required, and then getOrCreate. Basically, I want to be able to read and write my DataFrame through this session.
This is my DataFrame, which is created by reading this CSV file. Let me show you the CSV file I am using; you can get this file from Google, and it is on Kaggle as well. I will also upload the file to Google Drive and GitHub.
If you look at it, there are some scores; these are my inputs, and here is my output, chance of admit. If my GRE score is 337, TOEFL score 118, along with the university rating, SOP, LOR, CGPA and research, then on the basis of these inputs my model will predict my chance of admission. If the output were 0 or 1, it would be a classification problem, but here I am taking a regression example, so the chance of admit column contains continuous values. These are the values, so basically it is a regression example, and I have only 500 records.
It is 500 rows. The dataset I am using already has header details like GRE Score, TOEFL Score, University Rating and so on, so that's why I am putting header equal to true.
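A minimal sketch of the session creation and CSV read described above; the file name admission_data.csv is only a placeholder for wherever you saved the downloaded file:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; the app name is optional
spark = SparkSession.builder.appName("pyspark ML algorithm").getOrCreate()

# Read the admissions CSV; header=True uses the first row as column names
df = spark.read.csv("admission_data.csv", header=True)
```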
Now I will check the type of the DataFrame; it is a SQL DataFrame. And if I want to see the data, I will type show, which gives me the information I have; by default, it will show me 20 rows. Now I will look at the schema.
That tells me the data types of all the columns, and I am seeing that everything is string; by default, Spark is reading every column as string.
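A quick sketch of these inspection calls:

```python
type(df)          # pyspark.sql.dataframe.DataFrame
df.show()         # prints the first 20 rows by default
df.printSchema()  # every column comes back as string by default
```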
To apply any machine learning algorithm in PySpark, or really in any setting, we have to convert these columns into float; by doing so we will be able to run numerical computations. And in case I want to see which columns are in my DataFrame, I will type dataframe.columns. I also need to apply some SQL-style queries, so first I will import col, which lives in pyspark.sql.functions. I will import this, and I will show you an example that just prints all the columns one by one.
So what will I do? For c in my DataFrame's columns, that is, for c in dataframe.columns, print c. It works like this, and it gives me all the columns. But in case I want to apply the col function on each column,
it would look like this; now each item is a Column object. Next I am applying a SQL-style query. We all know that in SQL we write SELECT *,
and we have to specify the column names; that gives us all the columns. If you don't know SQL, let me explain: there is a command in SQL called SELECT, and if you do SELECT * FROM table, it gives you every row that exists in the table.
In the case of the SQL DataFrame, what I have to do is take the DataFrame and use its .select function; inside select I apply a star, and under the star I apply this col function, where c is my loop variable, and I am using a comprehension. What it does is select everything, so the result will look similar to before, because here I am selecting every column, but in later cases I will show you how to apply filters.
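A small sketch of the column listing and the select-with-comprehension pattern just described:

```python
from pyspark.sql.functions import col

# Print every column name one by one
for c in df.columns:
    print(c)

# SELECT * written with a comprehension over col()
df.select(*(col(c) for c in df.columns)).show()
```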
So first I will run this, and it gives me all the information. In the next step, what I will do is convert each and every element into float, because as I have shown you, everything is currently string.
So what will I do here? I will add .cast('float'); everything else stays the same.
Now it has been converted into float: you will see that every value has a decimal, so it is float. But here I have not yet saved that information into my DataFrame.
So what I will do is save it into a new DataFrame. My new DataFrame now looks like this: earlier it was string, and now it is float everywhere. Next I am importing some more functions. I have already imported col here; now I am importing count, isnan and when. col, as I have shown you, gives a column; count will count the number of occurrences; and isnan checks whether a value is NaN or not.
when is equivalent to WHERE in SQL; it lets us define a condition. Now I will check which values are null in which column. What I will do is take the new DataFrame and apply select on it; select acts as a filter and gives me whatever I am trying to search for. Inside it I apply count, so it counts the number of nulls in my dataset, and when is the way to apply the condition: for each column c, whenever a value is null, it increases the count. Here again I am using a comprehension, so it gives me the counts for every column, and the nulls turn out to be distributed quite randomly, with some added in each column. To keep the header simple, I use .alias: alias is a function that keeps the original column name as it is, so instead of a long count-when-null expression as the header, it just shows the column name c. With this I can see that in GRE Score I have 15 null values and in TOEFL Score I have 10 null values. To remove these null values I have to use the imputation functionality that exists in PySpark.
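A minimal sketch of the cast-to-float step and the per-column null count, assuming the raw DataFrame is called df:

```python
from pyspark.sql.functions import col, count, when

# Cast every column to float and save the result in a new DataFrame
new_df = df.select(*(col(c).cast("float") for c in df.columns))

# Count nulls per column, keeping the original column names via alias()
new_df.select(*(count(when(col(c).isNull(), c)).alias(c) for c in new_df.columns)).show()
```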
So what I will do is import Imputer, which is in pyspark.ml.feature; this comes under feature transformation, so I am applying a transformation. I will create a new variable, and in it I will use the Imputer. For the Imputer I have to define the input columns, that is, which columns to impute, and the output columns, that is, which columns the imputed values should be written to. I am defining the output columns to be the same as the input columns.
So I have created my variable. Now I have to build a model-like structure, and for this I have to use fit.
What I will do is call imputer.fit and pass my new DataFrame, the one we just created. Then I have to call transform, so I am creating a new variable, which will be my imputed data,
and this is model.transform with the new DataFrame passed as the parameter. Now the null values will be imputed, and next I will check whether any null values are still there or not.
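Before that check, here is a sketch of the imputation step just described, assuming every column is imputed in place (output column names equal to the input names):

```python
from pyspark.ml.feature import Imputer

# Fill nulls (mean strategy by default) in all columns of the float DataFrame
imputer = Imputer(inputCols=new_df.columns, outputCols=new_df.columns)
imputer_model = imputer.fit(new_df)
imputed_data = imputer_model.transform(new_df)
```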
I am applying the same null check here, and you will see that all the counts are zero. Perfect. Now what I am doing is dropping chance of admit, because this is my output variable and I don't need it among the inputs.
So what I will do is drop chance of admit from my features. Now features.columns contains only the remaining columns; I have dropped chance of admit because I don't want it there. Now it is time for the VectorAssembler.
It is like a transformation: it takes some input and produces some output. Basically, it combines a given list of columns. These are all my columns,
and these are all my input columns. It takes this list of columns and creates a single vector column, a tuple-like structure. It is very useful for combining raw features, and it creates a new single feature vector.
I will show you how it looks. What I have to do is define a new variable here, assembler, and in assembler I am using the VectorAssembler, where I define my input columns, which are features.columns; so I am taking all the feature columns as input.
So inputCols equals features.columns, and the output column will be 'features'; this is the new column I am creating here. Now it behaves like a model-like structure.
I have to call transform on it, and in the transform I am passing the imputed data as the parameter; this was my imputed data from before. Now my output has been created, and I will show you what the output is. If you look, these are my input columns, and now I have a new features column.
So this is my features column. What it has done is convert the inputs into a single feature vector: if you look at the first row, the first value is 337, the second is 118, the third is 4, and the remaining values follow in the same way; for the next row it starts with 324, 107, 4 and so on. What it is doing is simply combining all the input columns into one feature vector. So my features vector has been created, and when I applied the VectorAssembler I had already dropped chance of admit.
So if you look at my features column, chance of admit won't be there; it goes only up to research. I will show you.
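A sketch of the feature assembly described above; the column name "Chance of Admit" is assumed to match the CSV header:

```python
from pyspark.ml.feature import VectorAssembler

# Drop the target so only the input columns go into the feature vector
features = imputed_data.drop("Chance of Admit")

assembler = VectorAssembler(inputCols=features.columns, outputCol="features")
output = assembler.transform(imputed_data)
output.select("features").show(5, truncate=False)
```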
If I want to see the complete details, I can convert this to pandas with toPandas; converting it gives me a pandas DataFrame. What I am doing here is that I only want to see the features column, so after converting with toPandas I will apply .values.
values exists on the pandas DataFrame and shows me the raw values. Here you will see the first vector starts with 337 and ends with 9.65 and then 1.0. Chance of admit is not there in the features, because I have already dropped that column.
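A tiny sketch of that pandas inspection (only sensible on small data, since toPandas() collects everything to the driver):

```python
# Look at the raw assembled vectors as a NumPy array
output.select("features").toPandas().values
```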
So these are all my input columns. Instead of carrying all these columns into the model, I will take only the features column further, and that will be my final data. In final data, I am selecting only two columns, features and chance of admit. It will look like this: these are my features and this is chance of admit, so you can say this is X and this is y, where X is the collection of inputs that I have combined into a single feature vector. Now I am splitting into train and test: I am defining the data here and using randomSplit, so the split is random each time, and I am giving 0.7 for train and 0.3 for test. It will be like this: this is my train data and this is my test data. First I am applying the linear regression algorithm, so I am creating my variable, and here I am using the LinearRegression that I imported. In it I have to define the features column, that is, which column holds my input features; here I give the features column that I created above. And I have to define the label column, which means your class, your output.
Here I am defining chance of admit as my y. Now I am creating a new variable, which will be my model; in it I am calling fit on the linear regression, fitting the model on the train data. I will run this, and then I can see what my coefficients and my intercept are.
These are basically the m values and this is my c, as in y = mx + c, the equation of a line. These are my different m values; there are 7 coefficients because I have 7 inputs in my data.
If you count them, 1, 2, 3, 4, 5, 6, 7, that's why it is giving me 7 coefficients: m1, m2, m3, m4, m5, m6, m7, and this is my intercept. Now I will look at the summary to see the accuracy, the performance metrics, on the training data.
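A sketch of the final data selection, the train/test split, and the linear regression fit described so far, with the same assumed column names:

```python
from pyspark.ml.regression import LinearRegression

# Keep only X (the assembled vector) and y (the label)
final_data = output.select("features", "Chance of Admit")
train_data, test_data = final_data.randomSplit([0.7, 0.3])

lin_reg = LinearRegression(featuresCol="features", labelCol="Chance of Admit")
linear_model = lin_reg.fit(train_data)

print(linear_model.coefficients)  # the seven m values
print(linear_model.intercept)     # the c in y = mx + c
```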
Here I will show you the metrics on the training data, that is, from when the model was trained on the train data. This is the root mean squared error, which is around 0.06, and the R2 score is around 82 percent. These metrics come from the training data. Now I will predict on my test data: here I am using the linear model that I created above, and I am applying transform on the test data. If you look at the output, it gives me the prediction, what our model has predicted, along with the actual output and the features. For example, where the actual value was 0.54 it predicts 0.56, and where it was 0.57 it predicts 0.51; so it gives me both the prediction and the actual value. Now I will compute the accuracy on my test data. I am using the RegressionEvaluator from pyspark.ml.evaluation: I create a new variable, my predict evaluator, and in it I use RegressionEvaluator, defining the prediction column equal to prediction (this was my prediction column), the label column equal to chance of admit, and the metric name r2, which is R squared. Then I print the R2 score, which is predict evaluator dot evaluate on the predictions. It gives me an accuracy of around 78 percent, so without any data pre-processing it is giving me good accuracy. I have not covered data pre-processing on the PySpark DataFrame here; I will cover that topic in a separate video. In this video I have applied machine learning algorithms on a PySpark DataFrame without any data pre-processing. Now it's about random forest regression, which is pretty much the same.
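A sketch of the training summary, the test-set predictions, and the R2 evaluation just described:

```python
from pyspark.ml.evaluation import RegressionEvaluator

# Metrics computed on the training data
print(linear_model.summary.rootMeanSquaredError)
print(linear_model.summary.r2)

# Predictions on the held-out test split
predictions = linear_model.transform(test_data)
predictions.select("prediction", "Chance of Admit", "features").show(5)

# R2 on the test data
predict_evaluator = RegressionEvaluator(predictionCol="prediction",
                                        labelCol="Chance of Admit",
                                        metricName="r2")
print(predict_evaluator.evaluate(predictions))
```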
Here I am creating a new variable for the random forest regression; I am using the RandomForestRegressor, and in this function I am passing the features column equal to features and the label column equal to chance of admit. So it is the same as we used in linear regression. First I will fit the model on my train data; I have run this, and now I will predict. It has predicted, and now I will show what the predictions are.
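A sketch of the random forest version, reusing the same split and the same evaluator as above (names assumed as in the earlier sketches):

```python
from pyspark.ml.regression import RandomForestRegressor

rf = RandomForestRegressor(featuresCol="features", labelCol="Chance of Admit")
rf_model = rf.fit(train_data)

rf_predictions = rf_model.transform(test_data)
rf_predictions.select("prediction", "Chance of Admit").show(5)

# Same RegressionEvaluator as before, now on the random forest predictions
print(predict_evaluator.evaluate(rf_predictions))
```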
So these are my predictions: this was my actual data and this is my predicted data. Now I will calculate the metrics on the test data, so I will run the evaluation: my root mean squared error comes out around 0.058 and my R squared score comes out around 82 percent. So in linear regression I was getting 78 percent and in random forest I am getting 82 percent, and this score comes without any data pre-processing. If you want to apply data pre-processing and some other transformations, I will show you where to look: you can go to any module in the documentation. I have clicked on the machine learning package, and in it I can use anything; I have used the VectorAssembler, and I will show you, it looks like this. You can refer to this documentation, where they have explained each and every bit of it, and you can apply a StandardScaler transformation, one-hot encoding to convert categoricals into numeric, a Normalizer, a MinMaxScaler. Basically these are the same algorithms that exist in the scikit-learn library, so you can check each and every function in this package.
And in my next video I will be taking a classification example and solving it using PySpark. So that's all in this particular video, and I hope I was able to explain this topic with good clarity. If you really think this video was a good explanation, I am expecting your feedback and your like on this video. For any suggestions or doubts, feel free to let me know by posting a comment; such gestures really keep me motivated to upload videos more frequently. Do share this video with your friends and colleagues, and don't forget to press the bell icon to get notifications of my latest videos in your inbox. See you in the next video; till then, goodbye, enjoy, happy learning, keep rocking.