Hello everyone, I am Siddharthan. On this YouTube channel, I teach about artificial intelligence and machine learning. Recently I planned to make a complete machine learning course with both conceptual topics and a hands-on part in Python. This will be a 60-hour-long course with five parts; each part will be around 12 hours, and there will be several use cases and projects in each of these parts. I have tried to include as many important topics as possible. All these videos are already present individually on my YouTube channel, and I wanted to combine them together so that it is easier for a person who is just starting to learn about machine learning. I hope you have a great time learning this, and all the best to you. Let me quickly tell you what topics will be present in this first part. We have about 10 modules in total in this machine learning course. The first module is machine learning basics, where I explain the difference between artificial intelligence, machine learning, and deep learning. The next topic is the different types of machine learning, such as supervised learning, unsupervised learning, and reinforcement learning, and we will discuss each of these topics individually in detail. I will also explain what is meant by deep learning, how it differs from machine learning, and what its different applications are. So this is more of a theoretical part, where we understand the very basics of AI and ML. The second module is again a very important one: Python basics for machine learning. All the use cases and programming that we do in this course will be in Python, so it is important for us to understand some of the basics of Python. There will be a lesson on how to use Google Colaboratory, which is the platform we use for our coding in this course. I have given you a basic understanding of how to
use Google Colaboratory and what its features are. Then there are topics such as the various data types (list, dictionary, tuple, etc.), how to use loops in Python, and how to create functions in Python; those topics will be covered in the second module. The third module is again an interesting one, where we discuss some of the important libraries that we need in machine learning, such as NumPy, Pandas, Matplotlib, and Seaborn. NumPy is a library that supports the NumPy array, on which we do several mathematical operations; Pandas provides the DataFrame, which you can think of as a table; and Matplotlib and Seaborn are mainly used for data visualization, where we build plots and graphs. These are very important libraries, widely used in data science applications, where we need plots to understand the data better. The fourth module will be data collection and pre-processing, where I will explain where you can collect data, what reliable sources we have for data collection, and how to do the various data processing steps, like handling missing values, handling an imbalanced dataset, train-test split, label encoding, and so on. I have also explained how to handle textual data. So these are the four modules that we have. Once these modules are completed, there will be three use case videos. The first use case will be rock versus mine prediction, where we train a machine learning model to predict whether an object is a rock or a mine. The second use case will be predicting whether a person has diabetes or not. And the third one is an interesting one on textual analysis, where we try to build a machine learning system that predicts whether a mail is a normal mail or a spam mail.
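As a small taste of the pre-processing steps mentioned above (handling missing values, label encoding, train-test split), here is a minimal, hypothetical sketch using Pandas and scikit-learn; the column names and values are invented purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Hypothetical dataset: one missing numeric value, one text label column
df = pd.DataFrame({
    "age":    [25, 32, None, 47, 51, 38],
    "income": [40, 55, 48, 80, 75, 60],
    "label":  ["cat", "dog", "cat", "dog", "dog", "cat"],
})

# Handle the missing value by filling it with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# Label encoding: turn the text labels into numbers (cat/dog -> 0/1)
df["label"] = LabelEncoder().fit_transform(df["label"])

# Train-test split: hold out some rows to evaluate the model later
X = df[["age", "income"]]
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```

Each of these steps is covered properly in the fourth module; this is just to show the general shape of the code.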
So these are the topics that we will be covering in this first part of the machine learning course. Once this part is completed, I will upload videos for the upcoming parts, where you have more advanced and complex topics in machine learning, such as the different machine learning models and processes like cross-validation, hyperparameter tuning, etc. Again, all the very best; I hope you have a great time. I will also create a GitHub repository and put in it all the code and Jupyter notebook files used in these videos, so you can refer to that as well. Most importantly, I will give you timestamps for all the individual topics, so if you are interested in a specific topic, you can skip to that part. So let's get started: artificial intelligence, machine learning, and deep learning. When you start learning about AI or machine learning, this is one of the most basic and inevitable questions you need to answer, and interviewers often ask this question to find out whether someone really knows the topic or is just making things up. So let's try to understand the relationship between these terms, and then we shall try to understand what the three concepts mean. First of all, there is a picture that clearly represents the relationship between these terms. As you can see in this Venn diagram, artificial intelligence is the broader field, machine learning is a subset of artificial intelligence, and deep learning is a further subset of machine learning. So this is the relationship between them. Now let's try to understand these topics separately. People sometimes think that all three terms mean the same thing, but that's not the case: as I said earlier, ML is a subset of AI, and deep learning is a subset of ML. So what is artificial intelligence? Artificial intelligence is a branch of computer science that is concerned with building
smart and intelligent machines. So what do intelligent machines and non-intelligent machines mean? Let's try to understand this with some examples. Examples of non-intelligent machines can be a watch or a bike, because these machines cannot think; they cannot make decisions or do new things. They are given a set task, and they just do that work repeatedly. And what are intelligent machines? Examples of intelligent machines are autonomous cars, such as a Tesla car, and also Google Assistant, which we encounter in our daily life. A Tesla car is more intelligent than a normal car because it doesn't require input from the driver and can drive autonomously, and Google Assistant is something that uses AI for its function: when we talk to Google Assistant on our phone, we don't feel like it is a piece of computer software; we feel as if it is a human being. These are intelligent machines, because they can think and give you an answer, they can do new things, and they can make decisions of their own, which is not possible in the case of non-intelligent machines. So artificial intelligence is all about making these intelligent machines. Now let's try to understand machine learning. What is machine learning? Machine learning is a technique to implement artificial intelligence, where systems can learn from data by themselves without being explicitly programmed. So machine learning is all about the data. Let's consider an example: I want a system to detect whether an image is an image of Iron Man or Captain America. This is the task I set for the system. When we take a machine learning approach, what we do is feed the machine learning algorithm numerous images of both Iron Man and Captain America, and we tell the algorithm that these images are of
Iron Man and these images are of Captain America. Now the algorithm, without further input from us, finds the patterns in the images, and when you give it a new image, it can predict correctly whether that image is of Iron Man or Captain America. This is all about machine learning: we just give the algorithm a lot of data to learn from. It is similar to a child seeing something and learning from it. We just give it data and we don't need to do anything else; it can find the patterns and learn by itself. This is called machine learning, and this is how we implement AI. Now let's try to understand deep learning. Deep learning is a subfield of machine learning that uses a special type of algorithm called artificial neural networks to learn from the data. This is a pictorial representation of an artificial neural network. These artificial neural networks are modeled on the human brain, on the neurons present in our brains. There are numerous neurons in our brain that are interconnected; each neuron processes information and sends its output to another neuron. The same concept is used in artificial neural networks: they are nothing but mathematical models that are connected like the neurons in our brain. There are different layers in an artificial neural network: as you can see here, the first layer is the input layer, then there are several hidden layers, and then there is the output layer. We will discuss this in more detail in future videos, but in this video the idea is to give you a short idea of what these three terms mean. So first we saw what is meant by artificial intelligence; then we saw what is meant by machine learning, which is a technique to implement artificial intelligence; and deep learning is a subset of machine learning that uses artificial neural networks. So that's it
about the difference between these three terms: artificial intelligence, machine learning, and deep learning. In this video, I would like to explain the different types of machine learning. First of all, I'll explain what is meant by machine learning with an example, and then we will look into all the different types. Machine learning is a technique to implement artificial intelligence, where systems can learn from data by themselves without being explicitly programmed. The ultimate goal in machine learning is to make intelligent machines, and the way we do that is by making the machine learn from data. We don't do explicit programming, which means we don't tell the machine exactly what it has to do; it has to find those ways by itself. I will try to explain this with an example. We want a machine to see an image and recognize whether the image represents a dog or a cat. This is the goal for the system we are building. In machine learning, what we do is make the machine learning model learn from the data; here the data will be several images of dogs and cats. We feed these images of dogs and cats to our machine learning model, and with the help of these images it tries to find patterns in them, and when you give it a new image, it can recognize whether the image represents a dog or a cat. This is how machine learning works: it basically learns from the data. Now let's discuss the different types of machine learning. There are three main types of machine learning: the first is supervised learning, the next is unsupervised learning, and the third is reinforcement learning. In supervised learning, there is some supervision of the machine learning algorithm by the programmers, by us, while in unsupervised learning there is no supervision for the machine. Reinforcement learning is a completely different type, and it is not related to either of these
supervised or unsupervised approaches. So let's try to understand this in more detail, first of all supervised learning. In supervised learning, the machine learning algorithm learns from labeled data. We have already seen that in machine learning the model learns from the data, right? So what is meant by a labeled dataset? Let's say that a machine learning model has to see an image and recognize whether the image represents an apple or a mango. What we do is take several images of apples and mangoes, and we tell the machine that these are the images of apples and these are the images of mangoes. These names, "apples" and "mangoes", are called labels, and we feed this labeled dataset to our machine learning model. Now our machine learning algorithm tries to find the patterns in these images, and once it has learned from the data, when you give it an unknown image, it can correctly recognize whether the image represents an apple or a mango. This is how supervised learning works: it is known as supervised learning because we are giving supervision in terms of labels. Now let's discuss unsupervised learning. In unsupervised learning, the machine learning algorithm learns from unlabeled data; here we won't tell it what the data represents, so we won't give any labels. Let's consider a similar example. We give several images of apples and mangoes to our machine learning model, and we don't tell it which images belong to apples and which belong to mangoes. We feed all these images, without telling it what they are, to our machine learning model, and what it does is again try to find the patterns; it tries to group all these images, and it will group the images into group one and group two. All the apples will be grouped in one group, and all the mangoes will be grouped in another group. So when you give it a new image of an apple or a mango, it
tells you which group it belongs to. This is called unsupervised learning, because we are not giving any supervision in terms of labels. And the third one is reinforcement learning. Reinforcement learning is not similar to supervised or unsupervised learning; it's quite different from both of the other types. So let's try to understand this in more detail. This is the definition of reinforcement learning: reinforcement learning is an area of machine learning concerned with how intelligent agents take actions in an environment to maximize their rewards. It can be a bit difficult to understand, but I will try to break this definition down into simple steps. There are four main aspects in reinforcement learning: environment, agent, action, and reward. There will be an environment, and what we need to do is build an agent that acts in that environment. The agent tries to take some actions in that environment, and for each action it gains some reward. Let's try to understand this with an example: we want to make a computer program that can play chess like a human being. Here our chessboard becomes the environment, and our computer becomes the agent. In the environment of the chessboard, our agent, which is the computer, tries to take actions; the actions represent the moves the computer makes in the chess game. For each step it gets a reward, and the ultimate reward is winning the chess game. For each step that takes it closer to winning, it gets a positive reward, and if it makes a bad move, it gets a negative reward. In this way, the machine tries to learn how to play the game. There are several applications of reinforcement learning: for example, several game-playing artificial intelligence systems are based on reinforcement learning, and many autonomous systems
like self-driving cars and autonomous drones are based on reinforcement learning. So these are the different types of machine learning. First we discussed what is meant by supervised learning, where we basically give the machine learning algorithm a labeled dataset; in unsupervised learning we give an unlabeled dataset; and in reinforcement learning we try to make an agent that acts in an environment to maximize its rewards. So these are the three main types of machine learning. Now, what are the different types of supervised learning? This is the agenda for today's video. I want this channel to be more interactive, so from now on I will give a link for each video in the description, and at that link you will find a Google Form containing MCQs for that particular topic. For example, in the description of this video you will find a Google Form link that contains MCQs on the topic of supervised learning. Once you complete watching these videos, you can try to answer those MCQs. So let's get started with supervised learning. Supervised learning is a type of machine learning in which the machine learning algorithm learns from a labeled dataset. The most important thing to note here is that the algorithm learns from labeled data. So what is this labeled data? In machine learning, generally we feed the machine learning algorithm a lot of data and we tell the algorithm that this data represents this label, and the algorithm tries to map the labels to the data so that it can recognize it. Let's try to understand this more deeply with an example. We want our machine learning model to see an image and recognize whether it represents an apple or a mango. This is the task for our machine learning algorithm. In the case of supervised learning, what we do is feed in the images of apples and mangoes, and we tell the machine that these images represent apples and these images represent mangoes, and we feed
these images to our ML model. What it does is try to find the relationships in these images, and it maps them to the labels, which are apples and mangoes. Now it knows how an apple looks and how a mango looks, and once it has learned from the data, when you give it an unknown image, it can predict correctly whether it is an apple or a mango. This is how supervised learning works. The important point to note here is that we are training the machine learning model with data that is labeled; in the case of unsupervised learning, we don't give the machine learning algorithm the labels. That is the difference between supervised and unsupervised learning. Now let's discuss the types of supervised learning. There are two main types of supervised learning: one is classification and the other is regression. So what is meant by classification and regression? Classification is about predicting a class, a discrete value. There are just these classes or labels, not continuous values like numbers; the problem statement could be to predict whether something is male or female, true or false, and so on. There will only be classes. In regression, we try to predict continuous values, for example quantities like salary, age, or price. Say, for example, we need to predict the salary of a person from their work experience; the salary will be a continuous number, so those kinds of problem statements are handled using regression models. In classification, we just predict whether something is true or false, or male or female, as in the example we saw before where our model classifies images into apples or mangoes, whereas in regression we find a particular number. Let's try to understand this with another example, first of all classification. We want our machine learning model to see an image and recognize whether the image is of a dog or a cat. So what we will do is
give the labels and the data to our machine learning model, so it maps the labels to the images, and then it can tell whether an image is a dog or a cat. This represents classification, because we are just classifying the image into either dog or cat. There are no middle values, no decimal values here; it is just binary in this case. Now let's discuss an example of regression. Let's say that we need to predict the rainfall, in centimeters, for a given temperature, pressure, and the other factors on which rainfall depends. What we do is train our machine learning model with this data: for example, we tell the model that for this temperature there will be this much rainfall, and so on for different cases, and once the model has learned from the data, when we give it a new temperature value, it can tell us how much rainfall to expect. The rainfall in centimeters will be a continuous value; it can be a decimal value. This is called regression. So in classification we try to predict a class or type, but in regression we try to find a number; that is the difference between classification and regression. There are lots of different applications of classification and regression, which we will see in our later project videos. Now let's see some of the most important algorithms for classification and regression. Decision tree classification, random forest classification, and k-nearest neighbors classification are some examples of classification algorithms, and regression algorithms include linear regression, polynomial regression, and support vector regression. It's okay if you don't understand what these algorithms mean; we will be working on each type of algorithm once we start the hands-on part, and I'll explain them in more detail once we
start those videos. In the earlier videos, we covered all the types of supervised learning, so if you haven't watched those videos, do check them out. Now let's discuss what is meant by unsupervised learning and what the different types of unsupervised learning tasks are. In unsupervised learning, the machine learning algorithms learn from unlabeled data. This is the difference between supervised and unsupervised learning: in supervised learning we use labeled data, so we tell our machine learning model that this data represents this item, and so on, whereas in unsupervised learning we won't say those things; we train our model with unlabeled data. Let's try to understand this with an example. We have several images of apples and mangoes, and once we feed this data to our machine learning model, it can group the data, grouping the images based on similar patterns: it can put the apples in one group and the mangoes in a second group. What happens here is that we are not telling the model that these images represent apples and those represent mangoes; we are not giving that label, whereas in supervised learning we would tell the machine that these images represent apples. That's the difference between supervised and unsupervised learning. The model automatically finds the patterns in those images, and it groups the similar items together in one group and the other items in another group. This is the idea behind unsupervised learning. And one more thing: the reason we call this unsupervised learning is that in supervised learning we are giving supervision to our machine, like a supervisor who gives the machine the labels, but here we are not giving any labels or supervision, which is why this is called unsupervised learning. So what are the types of unsupervised learning?
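As a rough sketch of the automatic grouping described above, here is a minimal, hypothetical example using scikit-learn's KMeans on made-up two-dimensional feature data; the numbers are invented just for illustration, standing in for measurements extracted from the fruit images:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up feature vectors: imagine two measurements per fruit image.
# The first four points sit near (1, 1), the last four near (8, 8).
X = np.array([
    [1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [1.3, 1.1],   # "group one"
    [8.0, 8.2], [7.9, 8.0], [8.1, 7.8], [8.2, 8.1],   # "group two"
])

# Ask KMeans to find two clusters -- note that we never provide labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)

# A new point near (1, 1) is assigned to the same group as the first four
print(kmeans.predict([[0.9, 1.0]]))
```

Notice that the algorithm only ever sees the numbers; it never knows which group is "apples" and which is "mangoes" — it just discovers that there are two groups.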
There are two types of unsupervised learning tasks: the first task is clustering, and the other task in unsupervised learning is association. So what are clustering and association? Clustering is an unsupervised learning task that involves grouping similar data points; this is the example we have just discussed, the apple and mango example, where we group the similar data points. In association, another unsupervised learning task, we try to find important relationships between data points: in a big dataset, we try to find which data points are associated, which are similar. Let's try to understand these two tasks in more detail, first of all clustering. Let's say that we get a project from a mobile network company: they want us to suggest some ways they can increase their user base and their revenue, and they give us their user data. We feed it to a clustering algorithm, an unsupervised learning algorithm, and this model clusters the data into two clusters. One possibility is that people who have high call duration may have very low internet usage, and people who have high internet usage may have low call duration. Now what we can suggest to the network company is that they can give offers on internet usage to those people who have high call duration and low internet usage, and give offers on call duration to people who have high internet usage and low call duration. This way, people tend to use both of these features more, and this is one way they can increase their revenue, since people may opt for both plans. So this is one clustering example, where the machine learning algorithm can cluster the customers based on the user data. Now let's try to understand association. Let's consider that there
is a supermarket, and there are several customers who buy particular products: one customer buys bread, milk, fruits, and wheat, and another kind of customer buys bread, milk, rice, and butter. The important association between all these customers is that if someone buys bread, that customer very likely also buys milk. This is one of the important relationships we have found, and it can be used really well: when a customer buys bread, we can suggest that they buy milk, since such a customer is most probably going to buy milk as well. This is one place where we can use association. I would like to give you another interesting example of this. We are all used to the famous OTT platforms like Netflix and Amazon Prime, and those OTT platforms use this kind of algorithm to suggest movies. Let's say that I am watching an Avengers movie. Netflix can then suggest other superhero movies to me, because someone who has watched Avengers may have watched other superhero movies; it associates those user behaviors and can suggest movies watched by similar users. So this is one of the interesting applications of unsupervised learning. These are some interesting examples; now let's see some important unsupervised learning algorithms. Apart from these five algorithms there are several other unsupervised learning algorithms, but these five are very important. We have k-means clustering and hierarchical clustering, which are examples of clustering algorithms, and there is another algorithm called principal component analysis, which is used to reduce the dimensions of our data. Say, for example, we have a dataset that contains 1000 rows and 100 columns, and we want to reduce this dimension; the columns represent the features, so we can use this principal
component analysis algorithm to find which columns or features are very important for our application. That is where principal component analysis is used, and it is a type of unsupervised learning algorithm, where we don't give the machine learning algorithm any labels. And there are two other algorithms, Apriori and Eclat; these two are examples of algorithms for the association task. So these are some of the important unsupervised learning algorithms. In this video, we are discussing what is meant by deep learning, what is meant by a neural network, what the important applications of deep learning are, and also some important events that made deep learning so popular. These are the topics we'll be discussing, so let's get started. First of all, what is meant by deep learning? Deep learning is a subfield of machine learning that uses artificial neural networks to learn from the data. We have already seen the difference between artificial intelligence, machine learning, and deep learning, so we know that machine learning is a subfield of artificial intelligence. In machine learning, we basically take a lot of data and feed this data to our machine learning algorithm to make predictions; in deep learning, what we do is feed it to specialized algorithms called artificial neural networks. This is the difference between machine learning and deep learning. Now let's try to understand what inspired artificial neural networks. This is a diagram of the neurons present in the human brain. Each neuron has a cell body containing a nucleus, and this is where information is processed in our brain. Once the information is processed, it passes through the neuron body, through the axon, and from there it is transferred to another neuron, and the pathway goes on like this. This is how information is processed in our brain and transferred to some part of our body.
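As a very rough illustration of how this neuron-to-neuron processing is turned into mathematics, here is a minimal, hypothetical sketch of a tiny network with an input layer, one hidden layer, and an output layer, written with NumPy. The weights are random, so this network hasn't learned anything yet; it only shows how information flows forward through the layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Each "neuron" applies a simple mathematical function to its input
    return 1.0 / (1.0 + np.exp(-z))

# A tiny network: 4 inputs -> 3 hidden neurons -> 2 outputs
W_hidden = rng.normal(size=(4, 3))   # every input connects to every hidden neuron
W_output = rng.normal(size=(3, 2))   # every hidden neuron connects to every output

def forward(x):
    hidden = sigmoid(x @ W_hidden)       # input layer -> hidden layer
    output = sigmoid(hidden @ W_output)  # hidden layer -> output layer
    return output

# One made-up input (think of it as 4 pixel values) pushed through the network
x = np.array([0.2, 0.9, 0.1, 0.5])
print(forward(x))  # two values between 0 and 1, one per output neuron
```

In a real network the weights would be adjusted during training, and there would be far more neurons and layers, but the layer-by-layer flow is the same idea.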
So this is the exact principle that inspired artificial neural networks. As you can see here, this is a diagrammatic representation of an artificial neural network. Basically, we have individual neurons connected to each other, which form the neural network, and each neuron has a mathematical function assigned to it; this neuron processes the data. Say, for example, we want to recognize what an image represents, so we basically want to do an image recognition task. We feed the image to this neural network, and in the input layer the image will be split into its respective pixels. There will be a lot of pixels, and each of these pixel values is given to the neurons in the input layer, where the information is processed and then transferred to the hidden layer; again some processing happens in the hidden layer, and then it is transferred to the output layer, where the prediction for the image is made. This is how a neural network works. As I said, there are three main layers in a neural network: first is the input layer, then the hidden layers, and finally the output layer. There can be any number of hidden layers in a neural network, depending on the task we are doing. Each neuron has a mathematical function, as I have told you, and processes information, and each neuron in the input layer is connected to each neuron in the hidden layer; this is how the information is passed and the respective prediction is made. That's all about artificial neural networks. Now let's try to understand the main difference between machine learning algorithms and deep learning. More than a difference, it is really one main advantage deep learning has over machine learning, and that is feature extraction. So what is meant by feature extraction? Let's say that we want a machine
learning model to predict whether an image represents a car. When we give this task to a machine learning model, we need to tell the model which features are important for a car: for example, if it is a car, it should have four wheels, it should have a certain shape, and so on. We need to give those features to our machine learning model; we have to manually tell it that these features are important. But we don't need to do that in the case of deep learning, because neural networks are so much more powerful than ordinary machine learning algorithms that they can determine those features by themselves. That is the main advantage of deep learning over machine learning: we don't need to extract the features manually. Now let's try to understand the events that made deep learning so popular. There is a famous deep learning company called DeepMind, based in the United Kingdom, which was started around 2010, so deep learning existed from that point of time and even before. In 2014, Google acquired this company. DeepMind basically made game-playing artificial intelligence systems. There is a famous game called Go; it is a board game like chess, but it is much more complicated and much deeper than chess, because in chess there is a limited number of moves one can make, while the possibilities in Go are far greater: there are many more moves one can make based on the configuration of the game. In 2016, they made a deep learning system that can play this game of Go, and they challenged the world champion Lee Sedol, an 18-time world champion, to a five-match tournament. That game-playing system was built on deep learning; they had been developing it for five or six years, and they challenged him, and what happened is that in the tournament of five games, AlphaGo, which is the system
deepmind build has beat lee sedol for forest one okay so it has won four matches and lee sedal won one matches and that is where people started to look at deep learning and realize that deep learning is so much powerful than any other algorithms in machine learning okay so after that several researches were made and several modifications have been done to the neural networks and several different types of neural networks have been uh you know invented after that so this is the point from which deep learning got so much you know popular and it was used in several kinds of fields after that okay let's try to understand one such example for this so diabetic retinopathy so diabetic retinopathy is a condition where a patient may lose his eyesight lose especially due to diabetes meditates okay so this deep mind developed a system uh based on deep learning that can determine whether a person has diabetic retinopathy from the eye scans okay so how this is basically made is the deep learning model will be trained with several normal eye images which does not have any dc's and the model will be again trained with several images which has diabetic retinopathy okay so once it trained when a new image is fed to this model it can predict whether the person has diabetic retinopathy or not so the interesting thing that happened here is it doesn't only predict whether the person has diabetic retinopathy or not it also predicted the gender of the patient whether you know the image of the is of a male or a female it also predicted whether they have some some other medical conditions or not so this is one of the fascinating thing that happened where the deep learning also predicted several other things rather than only the patient has diabetic retinopathy or not okay so these events led to the boom of deep learning after that it was used in several other fields now let's discuss about some of the important applications of deep learning first one is healthcare okay so the example which 
we have seen now is an example of healthcare applications apart from this deep learning is used in several diagnostic departments where it is used to predict whether a person has a specific image specific disease or not based on their scans images and other data so another example where deep learning used is the field of autonomous cars so autonomous cars like this lab doesn't need much driver input to drive the car right so they can drive the car by themselves and it is powered by deep learning models then we have computer vision so computer vision is one of the important application of deep learning so it is based on image processing techniques where the neural network is trained with several images so one such example is face recognition system in our phones so it is based on computer vision then there is natural language processing so natural example of natural language processing is chatbots which we most of us would have come across and other virtual assistants like google assistant siri and alexa all these technologies are powered by deep learning or neural networks okay so these are some of the important applications of deep learning videos the first video in our second module which is python basics for machine learning so in our machine learning course the first module we have discussed about the machine learning basics and this module is all about python basics for machine learning and now let's see how we can access google collaboratory so to use google collaboratory you don't need to install any software so you don't need to install any python software or other applications you just need to have a good web browser so you know google chrome is better suited for this one so just go to google and search google collab so here you will see this web page called as research.google.com so go to this welcome to collaboratory so this is where we are going to do our python programmings in most of our projects in this channel so here you can give this new notebook 
You can also see some other options here. Your Google Drive will be connected to your Google Colaboratory account, so you can access your Colaboratory files from Google Drive as well. I'll create a new notebook from here: if you are starting a new project or program, you can go to "New notebook" and it will take you to this page. This is the interface of Google Colaboratory. First of all, let's change the name of this file. You can see it shows "Untitled0", which is the name of the file; let me change it to "Google Colaboratory basics". We call it "Google Colab" for short. You can see the file extension here, .ipynb, which denotes a Python notebook, the same notebook file format used by Jupyter Notebook. You can also download Colaboratory files: there is a download option where you can save the file either as a notebook (.ipynb) or as a plain Python file (.py), and you can open these notebooks in Jupyter Notebook as well.

After this we need to connect our system; you can see the "Connect" option here. Google Colaboratory is basically a cloud-based application where we can run Python programs. When we connect, our Python environment is connected to Google's backend servers, and our code runs on those servers. You will be allocated RAM and a CPU; you can check the details of your RAM here. It says about 12 GB of RAM and about 100 GB of storage, and 12 GB of RAM is really good for doing several machine learning and deep learning projects.

Now let's see the different features. This box is called a cell, and in a cell we run our code. You can also create text cells, where you can give a description of your code. Here, since we are going to check the specifications of the system allocated to us, I'll title this text cell "System specifications". To run one cell and go to the next, you can press Shift+Enter. To see the system allocated to us, I'm going to use a system command: !cat /proc/cpuinfo. Google Colab basically runs on Unix, so you can run Unix commands here, and whenever you run a system command you need to precede it with an exclamation mark; if you are running Python code, you should not include the exclamation mark. That is one important thing to note. When you run this cpuinfo command, it shows the details of the CPU allocated to us. You can see the processor details here: we have an Intel Xeon processor, this is the first processor with index 0, this is the second one, and you can go through all the details. You can also check the RAM allocated to you using the command !cat /proc/meminfo. Let's run this: press the run button, or press Shift+Enter to execute the cell. You can see 13.3 here, meaning about 13.3 GB of RAM is allocated to us. So this is how you can check the system specifications.

Now go to the Files panel. In Files there is an option called "Upload to session storage", which lets us upload files. This is an example dataset file, the Boston house price data, and I'll upload it to my Colaboratory environment. You can use "Upload to session storage", or right-click in the panel and use the upload option there. You can then click the file to view it in a preview. So this is how you upload a file to Google Colaboratory.
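The two hardware checks above can also be reproduced in plain Python by reading the same virtual files directly. This is just a sketch, assuming a Linux host (which is what Colab runs on); in a notebook cell you would simply type the ! commands as shown.

```python
import pathlib

def read_proc(name):
    # Colab runs on Linux, so hardware details live in the /proc
    # virtual filesystem: the same files that "!cat /proc/cpuinfo"
    # and "!cat /proc/meminfo" print. Guarded so it returns an
    # empty string on non-Linux machines instead of crashing.
    path = pathlib.Path("/proc") / name
    return path.read_text() if path.exists() else ""

cpu_info = read_proc("cpuinfo")   # model name, core count, flags, ...
mem_info = read_proc("meminfo")   # the "MemTotal:" line gives total RAM
print(cpu_info[:300])
print(mem_info[:300])
```

In the notebook itself, the ! form is shorter and is what you will normally use.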
In machine learning we often deal with datasets, so this is how you upload a file to Google Colab and start working. There is another option called "Mount Drive": if you click it, your Google Drive will be connected to your Colab session and you can access all the files in your Drive from Colaboratory. The importance of this is that in some cases we may need to work with large datasets, say 1 GB, 2 GB, or even more, and in those cases uploading directly to Colaboratory takes a lot of time. So what we do instead is upload the file to our Google Drive account, mount the drive, and then access the file from Colaboratory. That is how you access files from your Drive in Colab.

An interesting and very important thing about Google Colaboratory is that most of the machine learning and data science libraries in Python come pre-installed, so we don't need to install them separately. Still, one or two libraries may be missing, so I'll show you how to install libraries in Colab. I'll make a text cell here titled "Installing libraries". Let's say we want to install the pandas library; we know pandas is used in Python to make data frames. If you want to install any library in Colab, just go to Google and search for "pandas PyPI" (PyPI is the Python Package Index). You will see the pandas PyPI page, which shows the command to install the library: pip install pandas. Copy this command, come back to Colaboratory, paste it, and precede it with an exclamation mark, since it is a system command. Run it with Shift+Enter. Here you can see "Requirement already satisfied", which means the library is already installed in our Colab environment. If some library is not installed, this is how you install it.

Now we can import the pandas library: import pandas as pd. After importing it, we can load the dataset file into a pandas data frame. I'll copy the path of the uploaded file; I'm just giving you a demonstration of how to run Python programs, and this is an example of Python syntax. Having copied the path, let's load this file into a pandas data frame. I'll name the data frame df: df = pd.read_csv(), and inside the parentheses, in quotes, we paste the path of our dataset file. This pd.read_csv() function reads the CSV file and loads it into a pandas data frame. I imported the pandas library under the short name pd, which is why I'm using it as pd here. I'll run this, and it loads the data from the CSV file into a data frame; you can see our dataset file is a CSV file, which means comma-separated values. Now you can print a sample of the data frame using df.head(); the head() function prints the first five rows of the data frame. So these are all Python programs: importing pandas, loading a CSV into a data frame, and so on.

Let's also run some simple statements, like print; we know print is a built-in function in Python. In the coming videos I have explained several Python programs, the important data types, and other things that are important to know in Python for machine learning. So let's try to print something: I'll print "machine learning", press Shift+Enter, and it prints that line. That is how you use Google Colaboratory. Remember that we have already uploaded a file to this environment; next, let's list the files present here.
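The read_csv and head() steps above can be sketched as a self-contained example. Since the actual Boston house price file lives in the Colab session storage, the CSV here is built in memory; the column names are made up for illustration, and in the notebook you would pass the copied file path to pd.read_csv() instead.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the uploaded CSV file.
# In Colab you would instead write something like
# df = pd.read_csv("/content/<your uploaded file>.csv")
csv_text = """CRIM,RM,PRICE
0.006,6.575,24.0
0.027,6.421,21.6
0.027,7.185,34.7
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # head() shows the first five rows (here, all three)
print(df.shape)    # (number of rows, number of columns)
```

The point is only that read_csv turns comma-separated values into a DataFrame; the path version works identically once the file is uploaded.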
To list the files, type !ls. This lists all the files present in our environment: you can see boston_house.csv and sample_data, which are the files and folders in the Files panel. So that was a basic introduction to what Google Colaboratory is and how to use it.

In the previous video we saw how to use Google Colaboratory for Python programming. In this video we are going to discuss the most basic concepts in Python, such as constants and variables, data types, the print function, and the input function. If you have any doubts about how to use Google Colaboratory, you can watch the previous video; its index is 2.1. Let's get started. When it comes to machine learning and data science, two programming languages are widely used: Python and the R programming language. R is mostly statistics-focused, whereas Python is preferred over R because Python is a general-purpose language that is also used in other applications, such as web development. It also has several ready-made libraries for machine learning, and it is very easy to understand and use. These are the reasons we use Python.

Now let's try to understand the basic concepts in Python. I have connected my Google Colaboratory session, so first let's discuss the print function. I'll make a text cell here. I will be giving the link for this Colab file in the description of this video, so you can download it from there, and once you finish watching this video, do practice these codes in Google Colaboratory. So, the print function. If you are an engineering student, in your first-year C programming class you would have come across the function printf. In C, printf is used to print some text, some message, on the screen, and Python's print is similar to that: in C we use the keyword printf, and in Python we use print.

Now let's try to print a string; let me print "machine learning". As you can note here, I have enclosed the text machine learning in quotes. You can use either double quotes or single quotes; all the strings in your code should be enclosed in quotes. Strings are nothing but text and sentences. You can use double quotes or single quotes, but you cannot start with a single quote and end with a double quote. For example, you can open with a single quote and close with a single quote and it will print the text on your screen, or you can use double quotes on both sides; this tells the Python interpreter that it is a string. But if you open with a single quote and close with a double quote, it throws an error, as you can see here. We have to use matching quotes. To run this particular cell you press Ctrl+Enter or Shift+Enter; Shift+Enter runs the cell and automatically moves to the next one.

Now let's see how to join two pieces of text: print, parentheses, and again the text in quotes, "machine learning " + "projects". This concatenates the two strings "machine learning" and "projects"; concatenate means join. As you can see, it prints "machine learning projects". This is how you join several strings in a print function. Note that I have included a space here: if you print this without a space, there won't be any space between the words, which is why I added one. Now let's try to print some numbers: print(8). This prints the number 8, and as you can see, I haven't enclosed it in quotes, because only strings need to be enclosed in quotes.
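The quoting and concatenation rules described above can be tried in a single cell:

```python
# Either quote style works, but the opening and closing quotes must match.
print("machine learning")
print('machine learning')

# "+" joins (concatenates) strings; the trailing spaces keep
# the words from running together in the output.
joined = "machine " + "learning " + "projects"
print(joined)

# Numbers are written without quotes.
print(8)
```

Mixing a single opening quote with a double closing quote (like 'machine learning") is the case that raises a SyntaxError.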
We can also do some arithmetic operations. If you write print(8 + 3), what do you get? As we saw, a plus sign between two strings joins them, but in the case of integers, plus adds the two numbers, so it prints their sum. These are some basic uses of print; beyond this, there are a lot of places where we need to print something to the screen. Sometimes we need to print an entire dataset, and there are several other cases where print is very useful.

Let me put all of this into a single section. To make a section heading in a text cell, you just need to precede the text with a hash (#). This creates a section, and if you click the down arrow it collapses all the cells under that section; pressing it again expands them. So we now have five cells under the heading "Print function".

Now let's discuss some very basic data types in Python. The three basic data types are integers, represented as int; floating points, which are nothing but decimals, represented as float; and strings, represented as str. Strings are text and sentences. Apart from these, there are several other data types, but these are the most basic ones. Examples of integers are numbers like 8, 10, or 19: numbers without any decimal point. Let me show you something. Here I have used the keyword type and enclosed the number 8 in the parentheses. If you are ever not sure of the type of some data, you can use type() and it will show you the data type; as you can see, it shows int. Let's try it with a floating point. As I told you, floating points are nothing but decimals, so let me put 5.3 here, and it shows float. Now with a string: type("english"), remembering that strings need to be enclosed in quotes, and it shows str. These are the most basic data types in Python. Apart from these there are other data types like lists, sets, dictionaries, and so on, which we will discuss in a later video, but for now you just need to know these three.

Now let's discuss another topic: constants and variables. I'll first create a section for the data types, so we have two sections, "Print function" and "Basic data types". As the name suggests, variables are things whose values can be changed, while the values of constants cannot be changed. Constants are not much used in Python, but we have a lot of uses for variables, so let's understand variables better. Let me create a variable named superhero, and let's give it a value; let me put "iron man" here. Basically, a variable is like a container in which we store some value, and for variables that value can be changed, whereas for constants it cannot. That is the important point to note here: superhero is the container, the variable, and "iron man" is the value in that container. Now if you print superhero, it won't print the word superhero on your screen; it will print iron man, because that is the value we gave to this variable.
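The type checks and the variable-as-container idea above look like this in a cell:

```python
# type() reports the data type of any value.
print(type(8))          # integers -> int
print(type(5.3))        # decimals -> float
print(type("english"))  # text -> str

# A variable is a named container: printing the name
# prints the stored value, not the word "superhero".
superhero = "iron man"
print(superhero)
```

Running it shows int, float, and str for the three values, then iron man for the variable.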
We can use superhero as the variable name, or you can also use an underscore. If you have multiple words in a variable name, say you want to call it marvel superhero, running it with a space will throw an error, because a variable name must be a single word or words connected with underscores. Let me try to run marvel superhero: it throws an error, because it is not a single word. If you want multiple words treated as a single name, just put an underscore between them: marvel_superhero. Now it works fine. I'll print it, and we get iron man.

As I told you, for variables we can change the value. Let's try that: marvel_superhero, the same variable name, equals "captain america". Let me print it now, and as you can see, we can change the value of a variable; in the case of a constant we cannot. So that's constants and variables.

We can also assign values to multiple variables in a single line of code. Let me create some variables. You can also use digits in variable names, as in hero1, but there should not be any gap inside the name. I'm creating hero1, hero2, and hero3: three variables. We enclose the strings in quotes: the first value will be "iron man", the second "captain america", and the third, to include a DC superhero, "batman". What happens here is that hero1 takes the value iron man, hero2 takes captain america, and hero3 takes batman. Now let's print them: print(hero1), print(hero2), print(hero3). Note that I am not using quotes here, because these are variables, not strings; we only put quotes around strings, not around variable names. Printing them, hero1 is iron man, hero2 is captain america, and hero3 is batman. Here you can include a space before or after the comma, or leave it out; it doesn't make any difference in Python. In some programming languages you would encounter an error, but Python doesn't give errors for extra spaces, so this code also gives the correct output.

Now let's see how to give multiple variables the same value. A variable name doesn't have to be a full word; it can also be a single letter. Here we take three letters, x, y, and z: x = y = z = 23, and I'm going to print all of them: print(x), print(y), print(z). This single line gives the value 23 to x, y, and z, so all of them print 23. That is how you give a single value to multiple variables. Another important point to note is that you cannot assign to a capital X and then try to print a lowercase x. Python is case sensitive, and you have to use the same capitalization throughout; that is one of the important points. So those are some basics of variables and constants.

Now let's look at another function: the input function. Previously we discussed the print function, and now we are going to discuss input. In C programming you would have come across the function scanf: printf is an output command through which we get some output, while scanf is an input command through which the user gives input to the program. Python's input is similar. Let me explain
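The variable rules just described (underscores in names, reassignment, multiple assignment, one value shared by several names) fit in one short cell:

```python
# Multi-word names need underscores: "marvel superhero" is a SyntaxError.
marvel_superhero = "iron man"
marvel_superhero = "captain america"   # variables can be reassigned

# Three assignments in one line, left to right.
hero1, hero2, hero3 = "iron man", "captain america", "batman"

# One value given to three names at once.
x = y = z = 23

print(marvel_superhero)
print(hero1, hero2, hero3)
print(x, y, z)
```

Remember Python is case sensitive: after this cell, x exists but X does not.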
it to you with an example. I'm creating a variable called number_1: number_1 = input("Enter the first number"). I'm using the keyword input, opening a parenthesis, and giving a prompt. Then number_2 = input("Enter the second number"). What happens is that this asks the user for input: say I give the number 23 and press Enter, then give the second number, also 23. If you have practiced C programs, you'll recognize that this is very similar to the scanf function. We use the input function whenever we need to get some data or value from the user.

Now let's do a bit more. We are going to get two numbers from the user, add them, and print their sum. I'm creating another variable called sum: sum = number_1 + number_2. This takes the two numbers from the user and computes their sum, and then we print it: print(sum). What do you expect the output to be? If I give the two numbers as 23 and 23, the sum should be 46, right? But we won't get 46. Let's see: I give 23 as the first number and 23 as the second, expecting the two numbers to be added to give 46, but as you can see, the output is not the expected value. We get 2323. Why is that? This is one interesting thing to note. When you use the input function, it treats the value given by the user as a string: it doesn't know whether it is an integer or a float value; it just assumes it is a string. So when we give 23, it thinks this 23 is a string and the other 23 is also a string, and it simply concatenates the two, just as we saw plus join two strings in the print examples. That's why it joined the two values, which are strings as far as the program is concerned, and printed 2323.

How can we rectify this? Wrap the input call in int(), with another pair of parentheses: this converts the string value to an integer. Let's run it again. I give 23 and 23, and now the output is 46. So the important point to note is that the input function assumes the value is a string, and we need to convert the value to what it actually is; here we expect an integer, so I use int(). You can change the data type using this method.

For example, let's look at changing data types in Python more generally. Say num = 5, and let's print the type of num: it outputs int. Now you can change this integer to a float by writing the keyword float and putting the variable name inside the parentheses. The value of num is 5, and float(num) turns this integer 5 into a float value. Let's print it; I'll include it inside the print function itself. As you can see, the integer 5 is converted to the float value 5.0. That is how you change data types in Python, and it is the same thing I used above when converting a string to an integer. So these are some very basic things you need to know in Python.

Now let's do a fast recap. First of all, we saw some basic uses of the print function:
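The "2323 versus 46" behaviour above can be demonstrated without an interactive prompt by putting the logic in small helpers. The names add_as_entered and add_as_numbers are made up for this sketch; in the notebook you would call input() directly and get the same effect.

```python
def add_as_entered(raw1, raw2):
    # input() always returns strings, so "+" concatenates:
    # "23" + "23" gives "2323", not 46.
    return raw1 + raw2

def add_as_numbers(raw1, raw2):
    # int() converts each string before adding: 23 + 23 gives 46.
    return int(raw1) + int(raw2)

print(add_as_entered("23", "23"))   # 2323
print(add_as_numbers("23", "23"))   # 46
```

The fix in the notebook is exactly the second form: int(input("Enter the first number")).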
We saw how to print strings, and that strings must be enclosed in double or single quotes; how to join two strings; and how to print numbers and the sum of two numbers inside the print function. Then we saw the basic data types of Python, which are integer, float, and string, and how to find the data type of a value using the type keyword. Then we discussed constants and variables: we saw that we can change the value of a variable, how to give multiple values to multiple variables, and how to give the same value to multiple variables. Then we discussed the input function, which is used to get input from a user, and finally how to change the data type of a value. I'll give the Colab file link in the description of the video. In the next video we will discuss more basics of Python, like further data types and operators.

This is the third video in the second module of this course, Python basics for machine learning. Let's get started with today's video. I'll be doing my Python programming in Google Colaboratory; I have already made a video on how to access Google Colaboratory, so you can check that out. These are the five basic data types in Python: integer, floating point, complex, boolean, and string. Integers are whole numbers, numerical values without a decimal point, and floating points are decimal numbers. A complex number is one that has both a real part and an imaginary part. Boolean contains only two kinds of values, True and False. And string is text or statements. These are the five basic data types in Python; apart from them, we have more complex data types such as lists, tuples, and so on, which we will discuss in a later video.

Let's go through each of these basic data types, starting with integers. In Python you can write comments using the hash symbol (#), so I'll make a comment here: integers. I'm going to create a variable a; in the previous video, with index 2.2, I explained what is meant by a variable. We are declaring a variable a and setting a = 8, and now let's print it; we know that 8 is an integer. To run the cell you press Shift+Enter, which executes the cell and moves to the next one. You can see it has printed 8, because we gave a the value 8. Now, say in some cases you are not sure what the data type of a particular variable is. You can use the keyword type: type is a keyword in Python that tells you the data type of a particular value. Let's put a in the parentheses: type(a) returns the data type of a. Run it with Shift+Enter, and you see int, which represents integer values. So that's how you work with integers: you can give an integer value to a variable and print it.

Now let's discuss floating points. I'll make a comment here, floating point, take a variable b, and give it a value like 2.3. We know that 2.3 is a decimal value, and floating points are nothing but decimal values. Now let's print b and also find its data type with type(b). Running this, you see 2.3, since we printed b whose value is 2.3, and the data type of b is float. So the short form for integer is int, and for floating point it's float. Now let us discuss the third data type.
about the third data type which is complex numbers so complex numbers so we know that complex numbers has both a real term and imaginary term so let's name this variable as c and i'll give a complex number as 1 plus 3 j here uh one is the real term and j is the imaginary term so j square will be equal to minus 1 so that is meant by complex number where you know we can also use i so i or j is you know square of this j is equal to minus 1 that is the imaginary path and you can print c okay so let's also try to print the data type of c okay so this prints the complex number which is one plus three j and also it prints the data type of c which is complex so these are the first three basic data types in python as we have discussed about these three data types there is another interesting thing that we can do we can convert one data type to another okay so i'll mention a text here as conversion of one data type to another so here i'm going to convert two data types so let's try to convert an integer into a floating point and let's convert a floating point into one integer so this will be integer to float so let's name the variable as okay so i'll give some other variable name which is x so let's say that x is equal to 10 and let's try to print x first and let's also try to print the type of x the data type of x so i'll run this okay so you can see here we have printed x the value of x is 10 and it is an integer now let's try to convert this to a floating point so let's name this variable as y and to convert integer into a floating point we need to mention the keyword float and inside this parenthesis mention the term which you want to convert here x is an integer right and let's convert this x into a floating point so you need to put x in this parenthesis so this will convert okay one second so this will convert x to a floating point value and that floating point will be value will be stored to y and let's try to print both y and the data type of y so type y here you can 
see that now we get the value 10.0. We know that 10 is an integer and 10.0 is a floating point, or decimal, value, so we have successfully converted the integer using the float keyword, and printing the data type of y gives float. That is how you convert an integer to a floating point value. Now let's see how to convert a floating point value to an integer: float to int. I'll take the variable x again and give it some random floating point value, 5.88, then print x and also the data type of x. The value of x is 5.88 and the data type is float. Now let's convert it: previously we used the keyword float, and in this case we use the keyword int, which converts the value to an integer. Mention x in the parentheses, and once it is converted, print y and the data type of y. You can see that the decimal value 5.88 is now converted to the integer value 5, and the data type is int. There is one main thing you need to take note of in this case: when you convert a floating point value into an integer, it does not round the value. It does not convert 5.88 to 6; it simply removes the decimal part and gives only the integer part. So 5.88 won't become 6: the .88 is dropped and we get only 5. That is what happens when you convert a floating point value to an integer. Now we have the fourth data type, which is boolean; I'll mention a comment here: boolean. As I told you earlier, boolean has two values, one is True and the second is False, so I'll note them here. Now let's create a variable a and set a = True; this is a boolean value. There is another main point here: the T should be in uppercase.
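Put together in a single Colab cell, the steps covered so far might look like this (a sketch following the narration; the exact variable names in the video may differ):

```python
# complex number: 1 is the real term, 3j the imaginary term
c = 1 + 3j
print(c, type(c))       # (1+3j) <class 'complex'>

# conversion of one data type to another
x = 10                  # integer
y = float(x)            # int -> float
print(y, type(y))       # 10.0 <class 'float'>

x = 5.88                # floating point
y = int(x)              # float -> int: truncates, does not round
print(y, type(y))       # 5 <class 'int'>

# boolean: note the uppercase T in True
a = True
print(a, type(a))       # True <class 'bool'>
```

Running the cell prints each value along with its data type.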
It shouldn't be in lowercase; writing true in lowercase would give us an error, because True is a predefined value and has to be written exactly as it is, with the T in uppercase. Now let's print a, and let's print the data type of a as well, so type(a). This prints True, and we get the data type as bool; bool represents a boolean. Let's create another variable b, with b = False, and print b and the data type of b as well. You can see that this is how you assign a boolean value to a variable, print it, and find its data type. The final data type we have is string, but before that let's discuss one more thing about booleans: where we can use them. I'll create a variable a and set a = 7 < 3, so we are just checking whether 7 is less than 3. You know that 7 is not less than 3, so this statement is actually false. When you compare two values like this, the code returns a boolean value. Now try to print a, and let's also print the data type of a (careful: the p in print shouldn't be in uppercase, or you'll get an error). You can see that 7 is not less than 3, so it gives the boolean value False; we printed the value of a, which is False, and we also found the data type of a. Similarly, I'll just copy this and change the symbol: let's say 7 > 3, and this condition is true. Let's check whether the condition holds; of course 7 is greater than 3, so this returns the value True. And yes, it is True, and it is a boolean value. We use these kinds of
conditions in loops, in if statements and so on, and in those cases we get a boolean value; that is one main application of the boolean data type. The final data type we are going to discuss is string. Strings are nothing but text: words and statements. Let's try to print a statement; I'll print "Machine Learning". Machine Learning is a string; these two words represent a string, and printing it will display the string. The main thing about strings is that they should be enclosed in quotes. Here I have used double quotes; instead you can also use single quotes. I'll copy and paste this, replace the double quotes with single quotes, and run it; this gives the same result. The main rule is that if you start with a single quote you should end with a single quote, and if you start with double quotes you should end with double quotes; apart from that the code doesn't change, so you can use either double quotes or single quotes. Now let's discuss some basic operations that we can use on strings. I'll give a variable name here, my_string. As I have told you, strings should always be enclosed in quotes, whereas we don't have to enclose integer and floating point values in quotes; boolean is a different data type as well, so we do not enclose it in quotes. Only strings are enclosed in quotes, either single or double. I'll give it the same value, Machine Learning, and print my_string; we have created the variable my_string and we are printing it. Let's also find the data type using the type function: type(my_string). When we run this, you can see our string Machine Learning has been printed, and we found the data type of the string; str represents
string. Now let's try another thing: print "hello" followed by the star symbol and a number. We know the star represents the multiplication symbol in Python. When you use this line of code, print("hello" * 5), your string is printed five times, because you mentioned 5; if you mention 4, your string is repeated four times. That is how you can replicate a string with a single line. Now let's see how we can slice a string; this process is called slicing, and I'll make a comment here: slicing. Let's create a new string, again naming the variable my_string, and give it the value programming. The word programming has 11 letters in total, and I'm going to slice this string. Slicing means taking only a particular portion of the string, of this particular word. So let's print my_string, but I don't want to print the entire thing, only a part of the word, and for that you mention index values such as [1:5]. To understand that, let's look at indexing in Python. Indexing, the numbering in Python, starts with 0.
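The string basics described above might look like this in code (a sketch; the strings follow the narration):

```python
print("Machine Learning")   # double quotes
print('Machine Learning')   # single quotes give the same result

my_string = "Machine Learning"
print(type(my_string))      # <class 'str'>

print("hello" * 5)          # replication: hellohellohellohellohello

my_string = "programming"   # 11 letters; indices run p=0, r=1, o=2, ...
```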
So the index of the first letter p is 0, the index of the second letter r is 1, o is 2, and so on: 0, 1, 2, 3, 4, and it goes on. We can refer to the letters by their index, so 1 means the second letter, r, because indexing starts with 0 and 1 represents the second letter. I want to print the letters from the first index, r, up to the fifth index. Let's see which letter has index 5: p is the zeroth one, then r, o, g, r, a, counting 0, 1, 2, 3, 4, 5, so a is at index 5. When you use the slice [1:5], the values from index 1 are printed all the way up to index 5, but the letter at index 5 itself, which is a, won't be printed. Let's run this and try to understand it. We got the value rogr: the value at index 1, which is r, is printed, but the value at index 5 is not, so you get r, o, g, r; the index-5 value is left out. This is how you slice a string, and the main thing to note is that you get the values from the first index up to the last index minus one. I'll just make a note here: the values from index 1 to 5 - 1, that is 4, will be sliced. In every slicing case the second index is not included; that is what you need to take note of. There is another feature called step: we can also slice the letters with a step value. Let's print my_string[0:10:2], taking a step of 2. I'm going to print from index 0, which is p, all the way up to index 10; counting 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, the letter at index 10 is g. Let's try printing with this step: what happens when you mention 2 here is that p will be printed and r will be, you
know, skipped; o will be printed, g will be skipped, and so on. A step of 2 means every second letter is skipped. Let's run this. You can see that the index-10 value is g, and as I mentioned, the end index is not included in the slicing, so we get the values from index 0 to index 9, with every second letter removed: taking p as the 0th index, r is removed and o is printed, g is removed and r is printed, a is removed, and so on. That is how you slice a string with whatever step you want. There is another thing you can do with strings, called string concatenation. Concatenation is nothing but joining two strings. Let's name a variable word1 and give it the value Machine (I'm mentioning single quotes here), and take the second variable word2 = 'Learning'. Now let's print word1 + word2: when you use the plus symbol, the two words are joined, and this process is called string concatenation. You can see there is no space between them; you don't get a space automatically when you use plus. If you want a space, you can simply mention a space inside the string, and running it again gives a space between them. That is how you concatenate, or join, two strings. These are some basic details of the data types in Python and the basic operations we can perform on them. I'll give you a quick recap of what we have done here. We discussed the five basic data types in Python, which are integer, floating point, complex, boolean and string. We assigned an integer value to a variable a and printed
it, and we also found the data type of a; then we tried the same with a floating point value, and we also created a complex number. Then we saw how to convert one data type to another using the respective keyword: in this case we used the float keyword to convert an integer value to a float, and the int keyword to convert a floating point value to an integer. We discussed the two boolean values, True and False, and saw how boolean values are used in conditions. Then we discussed strings: how you can replicate a string, how you can slice a string by mentioning indexes, and how you can slice with a step value; finally, we saw how you can concatenate two strings. Those are some basic details about the data types in Python. Hello everyone, this is Siddharthan. In the previous video we discussed the basic data types in Python, which included integer, floating point, string, boolean and complex numbers. In this video we are going to discuss the remaining special data types in Python. These data types are list, tuple, set and dictionary. Basically there are two kinds of objects in Python: immutable objects and mutable objects. Immutable objects are those whose value cannot be changed once created, and mutable objects are those whose value can be changed after creation. Examples of immutable objects are integers, floating points, strings, boolean values and tuples; mutable objects are lists, sets and dictionaries. In this video we are going to discuss these four data types: list, tuple, set and dictionary. First of all, let's discuss lists. Lists should be enclosed in square brackets; I'll mention that in a comment here. Lists are
nothing but something like arrays in the C programming language: they can store multiple values. All the data types we are going to see today can store multiple values, whereas the previous data types we have seen, integer, float and string, store only one value. So list, tuple, set and dictionary can store multiple values. I'll create a list and name it my_list: we need to put all the elements in square brackets, so let's say 1, 2, 3, 4 and 5. This is my list, and let me print it with print(my_list). I'll also check the data type: if you are not sure about the data type of a particular value, you can use the type function and mention the variable you want to check, so I'll put my_list here, and this returns the data type of that variable. As you can see, our list is printed in square brackets, and checking the data type of my_list tells us it is of the list data type. That is how you check the data type. Now let's discuss some important properties and operations that can be carried out on lists. Lists can store multiple data types: here we have only integer values, but it's not the case that a list may hold only integers or only floating points; we can have integer values, floating point values, strings and booleans in a single list. I'll mention this: a list can have multiple data types (whereas, for example, an array in C holds only one type of data). So my_list = [2, 3, 1.8, 'English', True]: 2 and 3 are integers, 1.8 is a floating point, English is a word so it is a string value, and True is a boolean. We have multiple data types here; now let me print my_list. As you can see, we
have different data types in the list, and multiple data types are supported in a single list. Now, as I told you earlier, lists are mutable. Mutable means they can be changed once they are created; they are changeable. So let's see how we can add elements to a list. First of all, I'll copy this list here, my_list. You can use the method append to add an element to a list: my_list.append is the function used to add an element. Let's say we want to add 6 to our list; this adds the element 6 at the end of the list. Now let's print our list with print(my_list). As you can see, a new value, 6, has been added to our original list; that is why we use the append method. Each of these elements has a particular index in the list. As I told you in the previous video, indexing in Python starts from zero, so the index of the first element is 0, the index of the second element is 1, the third is 2, and it goes on. You can print individual elements of a list using their index; let's see how. Print my_list and specify the index in square brackets: when I mention the index 0, it prints the first element, which is 2. Now let's print the third element, which is 1.8: in my_list, the index of 1.8 is 2, counting 0, 1, 2. Let's print and see; as you can see, it prints the individual elements once we mention their index. That is how you access specific elements in a list using their index. Another important property of lists is that they allow duplicate, or repeated, values, whereas a set does not allow duplicates; in the case
of a set, duplicate or repeated values are removed; we'll see that when we discuss sets. Here, let's create a list, list1: we put the elements in square brackets, say 1, 2, 3, 4, 5, and again 1, 2 and 3. Let me print this with print(list1) and run it. As you can see, it allows the duplicate values; here we have the duplicates 1, 2 and 3. That is one of the main properties of lists. Now, we can check the number of elements in a list using one function. Let's count how many numbers there are: 1, 2, 3, 4, 5, 6, 7, 8, so we have eight numbers in total in this list. You can check the size of the list using the length function: the short form for length is len, and inside the parentheses you mention the list name, list1. This counts the number of elements present inside list1, and when I print it, you can see it has printed the number of elements in this list, which is 8. Now let's see how we can initiate an empty list. Say list2 = []: you mention the opening square bracket and the closing square bracket, and this is nothing but an empty list. We can print this empty list; as you can see, it is a list, but it does not contain any values. You can add values to it using list2.append, which we have already seen; let's append the value 5 to it and then print(list2). This is how you create an empty list and add a value to it. This feature is very important, because in a lot of cases we create an empty list and add values to it one by one. So that is how you initiate a list. As I told you earlier, lists are mutable, they are changeable; hence we can delete an element present in a list.
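The list operations discussed so far can be sketched as follows (the element values follow the narration):

```python
my_list = [2, 3, 1.8, 'English', True]  # multiple data types in one list
my_list.append(6)                       # add an element at the end
print(my_list)                          # [2, 3, 1.8, 'English', True, 6]
print(my_list[0], my_list[2])           # 2 1.8 -- indexing starts at 0

list1 = [1, 2, 3, 4, 5, 1, 2, 3]        # duplicate values are allowed
print(len(list1))                       # 8

list2 = []                              # initiate an empty list
list2.append(5)                         # then add values one by one
print(list2)                            # [5]
```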
So let's delete an item in a list; we cannot do that in the case of tuples and other immutable data types. Say list2 is equal to the elements copied from the earlier list: I'll copy those elements, put them in this list, and print it with print(list2). You can delete elements present in a list using the del statement, d-e-l: mention the list name and the index number you want to delete. Say I want to delete 1.8: the index of 1.8 is 2, because indexing doesn't follow the normal numerical order, it starts with 0, so counting 0, 1, 2, the index of 1.8 is 2. So write del list2[2], mentioning the index in square brackets, and then print the new list2. The first line prints the list without deleting any item, then we delete the third item using the index 2, and then we print the resulting list. When I run this, you can see the third element, the one with index 2, has been deleted from the list. That is how you delete items in a list. One more interesting feature of lists is that you can join two lists. Let me create a list, list3 = [1, 2, 3, 4, 5], and another list, list4 = [6, 7, 8, 9, 10]. Now we can join these two lists: I'll create another list, list5, with list5 = list3 + list4.
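Deleting an item and joining two lists, as described, might look like this (a sketch; the element values follow the narration):

```python
list2 = [2, 3, 1.8, 'English', True, 6]
del list2[2]                 # deletes the element at index 2 (the value 1.8)
print(list2)                 # [2, 3, 'English', True, 6]

list3 = [1, 2, 3, 4, 5]
list4 = [6, 7, 8, 9, 10]
list5 = list3 + list4        # + joins the two lists
print(list5)                 # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```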
So the important point to note here is that when I add these two lists, it won't add the elements element-wise; it joins the two lists, that is, it concatenates them. Let's print list5. As you can see, list3 and list4 are joined together and stored in list5; that is how you join two lists. These are some important features and properties of lists. Now let's discuss tuples. Tuples are similar to lists, except for the fact that they are immutable objects: once you create a tuple, you cannot change its elements, hence it is called an immutable data type. Let's create a tuple, tuple1. Tuples should be enclosed in parentheses, or round brackets, whereas lists are enclosed in square brackets. Let's put the elements 2, 3, 4 and 5. Let me print my tuple with print(tuple1), and let's also check the data type, as you know, using the type function: type(tuple1) prints the data type of the variable tuple1. As you can see, tuples are always enclosed in round brackets, or parentheses, and the type function has printed our data type as tuple. Now let's discuss some properties of tuples. Similar to lists, tuples also allow multiple data types in a single tuple: you can have integers, floating points, strings and so on in a single tuple. Let's create tuple2, again in parentheses, with 1, 2, 3.5, a string, Machine Learning, and a boolean, False (boolean data types are nothing but the True and False values). Let's try to print tuple2. It runs properly, because a tuple allows multiple data types: integers, floating points, strings and so on. Now let's see how we can convert a list into a tuple. First of all, let's create our list, my
list: let's say my_list has the elements 3, 4, 5 and 6, enclosed in square brackets. Now we need to convert this to a tuple; before that, let me print the list with print(my_list). Now let me create another variable called my_tuple. To convert a list, or another data type, to a tuple, you need to use the keyword tuple: write tuple and, in the parentheses, mention the list name, which is my_list. This converts my_list to a tuple and stores it in the variable my_tuple. Now let's print it with print(my_tuple). As you can see, it has printed our list first, and then the list converted to a tuple; you can distinguish the list by its square brackets. That is how you convert a different data type to a tuple. Similarly, we printed the individual elements of a list by mentioning their index values; you can use indexing to print the elements of a tuple as well. Say we want to print the first element of my_tuple: we give the index 0. Now let's print my_tuple with the slice [0:2]; this prints the first two elements, which are 3 and 4. As you can see, this is how you access specific elements using their index values. As I told you earlier, tuples are immutable, which means unchangeable: once they are created, their elements cannot be changed. We can try this: in the case of lists we used the method append to add a value, but we cannot do that with a tuple. Say my_tuple.append(6); this throws an error, because tuple values cannot be changed. As you can see, 'tuple' object has no
attribute 'append', so we cannot change the elements of a tuple using append or any such function, because tuples are immutable. We can also find the number of elements present in a tuple using the same length function: print(len(my_tuple)). As you can see, my_tuple has four elements; that is how you print the number of elements using len. So much for the important properties of tuples. Now let's discuss another important data type, set. Sets are also mutable data types. We have seen that list elements should be enclosed in square brackets and tuple elements in parentheses, or round brackets; set elements should be enclosed in curly brackets. Let me create a set, my_set, putting the elements in curly brackets, say 1, 2, 3, 4 and 5; I'll print my_set and check the data type as well with type(my_set). As you can see, the elements are printed in curly brackets, and we can find the data type of the set, printed here using the type function. Now, we cannot access an element present in a set using an index. We have seen that in the case of lists and tuples we can print specific elements using their index, but sets do not support indexing. Let's try it: say we want to print the first element of my_set, so I'll mention the index 0. If I run this, we get an error, because sets do not support indexing; as you can see, it says 'set' object does not support indexing. There is no index value associated with the elements of a set, and this is one main feature that distinguishes sets from lists and tuples. Now, you can convert a list into a set; let's see how we can do that. Let's create a list, list5; lists should be enclosed in
square brackets, with the elements 4, 5, 6 and 7. These are the elements of the list, and now I need to convert it to a set. What I'll do is create the set and store it in the variable x: you can use the keyword set to convert a data type to a set. We are going to convert list5 into a set, so mention list5 here inside the parentheses, and then print x. What happens here is that we have a list, the list is converted into a set using the set function, it is stored in the variable x, and then we print x. As you can see, we got a set, because we have curly brackets, whereas the list has square brackets. That is how you convert a list or a tuple into a set. One important feature of sets compared to lists and tuples is that a set does not allow duplicate values, whereas lists and tuples do allow duplicate, or repeated, values. Let's check this: say set3 is equal to, mentioning the elements in curly brackets, 1, 2, 3, 4, 5, and again 1, 2 and 3. Now let's print it and see what happens: the repeated values automatically get deleted. As you can see, the repeated values 1, 2 and 3 are not printed twice, because a set does not allow any repeated values. That covers some important features of sets, and now we are going to discuss the last data type, which is dictionary. Dictionaries are special data types, because they are quite different from the other data types, list, tuple and set. The peculiarity of dictionaries is that they contain key-value pairs: each element has a key and a value associated with that key. Let's see how that works. Let me create a dictionary, my_dictionary; dictionary values are also enclosed in curly brackets. First let's create the key, name: this is the first key, and you need to
put a colon there, and let's say the name of the person is David. This is the first element: name is the key and David is the value. Now let's create another element, age: age is the second key, and the age of David is 30, so this is the second element. Now let's create the third element, the country of David: again a colon, and let's put India. So this dictionary has three elements, and each element has a key and a value associated with it; in this third element, country is the key and the value is India. Let's try to print my_dictionary, and also check the data type by putting type(my_dictionary). As you can see, we have printed the elements of our dictionary, the elements are enclosed in curly brackets, and we got the data type as dict, which represents a dictionary. Now, in the case of lists and tuples we have seen that we can access specific elements using their index, but in the case of a dictionary we access a value using its key. If I want to print David, I need to mention the key, which is name; if I want to print the age, I mention age; and if I want to print India, I mention country. Let's see how we can do it: print my_dictionary and, in square brackets, mention the key whose value you want to print. First I want to print the name, then let's print the age, and lastly the country. This prints their respective values; as you can see, this is how you access the values inside a dictionary using their keys. In a lot of cases dictionaries are very useful for us. Another important property of dictionaries is that they do not allow duplicate keys: lists and tuples allow duplicate values, whereas a set or a
dictionary does not allow duplicate, or repeated, entries. Let me create a dictionary, dictionary2: I'll copy this key and value, paste it here into dictionary2, and paste it again, so you can see we have a repeated key-value pair, a duplicate. Now let's try to print dictionary2 and see what happens. As you can see, the duplicate entry is removed; hence sets and dictionaries do not allow duplicate or repeated entries. That is all about the special data types in Python: list, tuple, set and dictionary. The difference from the other basic data types is that these data types can store multiple values in them; they are very similar to arrays in C programming. If you are new to Python, do check out my previous videos. In this video we are going to discuss one of the main topics in Python, which is operators. Basically there are six kinds of operators in Python: arithmetic operators, assignment operators, comparison operators, logical operators, identity operators and membership operators. Now let's discuss each one of these separately. First of all, arithmetic operators. Arithmetic operators are the basic mathematical operations that we can perform on integers, floating points and other data types. Let's discuss this; it is one of the most basic kinds of operators. Let's create a variable, num1, with num1 = 20, and another variable, num2, which is 10. So we have two numbers: the value of the first number is 20 and the value of the second is 10. Now let's see the different arithmetic operators. First is the addition operator, which is nothing but the plus symbol: let's create a variable called sum and add the two numbers, num1 + num2. This plus
sign we are using is called the addition operator. Similarly there are other operators of the arithmetic type. You can leave a space around the operator or remove it; both are the same, it is not a significant point. Let me print the sum, and this prints the sum of the numbers. Then there is the subtraction operator; these are nothing but the basic math operations. Let's create a variable diff, which represents difference: num1 - num2. This minus symbol is the subtraction operator; print the difference and it will be shown here. The third arithmetic operator is multiplication: let's say prod, where prod means product, so prod = num1 * num2. This asterisk symbol represents the multiplication operator; let me print the product. The fourth operator is division, of course: let's write quot, which represents quotient. For division we use the forward slash, so quot = num1 / num2, and let's print the quotient. Apart from these basic four operators there are also other arithmetic operators. First, the exponent: let's create a variable exp, and say we want to find 20 to the power of 10. You write that as num1 ** num2 — the multiplication symbol twice — which means num1 to the power of num2. Again, you can write it with or without the space. So
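As a quick sketch, the four basic arithmetic operators covered so far look like this in code. The values match the walkthrough; I use the name total instead of sum only to avoid shadowing python's built-in sum function:

```python
# the two numbers from the walkthrough
num1 = 20
num2 = 10

total = num1 + num2   # addition operator (+)
diff = num1 - num2    # subtraction operator (-)
prod = num1 * num2    # multiplication operator (*)
quot = num1 / num2    # division operator (/); always gives a float

print("sum =", total)   # sum = 30
print("diff =", diff)   # diff = 10
print("prod =", prod)   # prod = 200
print("quot =", quot)   # quot = 2.0
```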
this will compute 20 to the power of 10. Let's print the exponent, exp. Then there is the modulus operation: let's create a variable mod, equal to num1 % num2. The modulus operator is represented by the percentage symbol, and it gives the remainder of the division: num1 is divided by num2 and, instead of the quotient, the remainder is returned. If you want the quotient you use the forward slash; if you want the remainder you use the percentage symbol. Let me print the remainder. So these are the basic arithmetic operators in python: the addition operator, and then we have seen the subtraction, multiplication, division, exponent and modulus operators. The second kind of operator we need to discuss is the assignment operator. As the name suggests, it assigns values to a variable. Let's say there is a variable a and its value is 5; I am basically assigning the value 5 to this variable, and the equal-to sign is the assignment operator. Let's print a; of course it prints 5. There are other assignment operators as well. Let's say again a = 5, and then a += 5. What this means is similar to telling the python interpreter a = a + 5; instead of writing a = a + 5 we can write a += 5, which adds 5 to a. If we print a now it gives 10, because a was already 5 and we added 5 to it. Similarly, there
are other things we can do with it: b -= 2, which is similar to b = b - 2. First we need to define b, of course, so let's say b = 5, and print b. These are some useful assignment operators. We have seen two of them; there are others as well, so let me mention them in a text cell here: the first, as we have seen, is +=, then -=, and the next you can do with the multiplication symbol, *=. You can try all of these with different variables; they all work the same way, so do practice them. Another one is the exponent, **=, and then we also have /= for division and %= for modulus. I have just shown you two basic assignment operators, so do practice all the others. Now we have discussed the arithmetic and assignment operators; next let's discuss the comparison operators. Comparison operators basically compare two variables or two objects. Let me create a variable a = 5, and b = 10. We need to compare these two variables, so let's see what symbols are used to compare two variables or objects. Comparison operators return a boolean value — boolean means the output is either True or False. These are the only two things we expect: it won't print any numeric value, only these boolean values. So when I mention two equal-to signs, a == b, it checks whether a is the same as b; this is nothing but the equal-to comparison operator. Now let me do another thing: here
I'm mentioning an exclamation mark and an equal to, a != b, which means a is not equal to b. So in the first case it checks whether a is the same as b — it is not, so it outputs False — and here we are checking whether a is not equal to b; a is indeed not equal to b, so it outputs True. This is nothing but the not-equal-to operator. Next, print a > b: here a is less than b, so it outputs False. This is the greater-than symbol, of course, and similarly we have the other common symbols, such as a < b; all these conditions will be checked. There are another two comparison operators, a <= b (less than or equal to) and a >= b (greater than or equal to). I'll run this; each condition is checked and True or False is printed. The first condition checks whether a is equal to b — it is not, hence we get False — and next we check whether a is not equal to b, which is the correct condition, so it gives True. Similarly all the other conditions are checked. So these are the comparison operators, these symbols between a and b. Now we come to the fourth kind, logical operators. There are three logical operators: and, or and not. They are similar to the logic gates you would have studied in higher secondary. Let's say a = 10. Logical operators also give boolean values; they print either True or False, not numerical values. So let's print a > 20 and a > 5: we have two conditions here, joined with the logical operator and. What happens is that both conditions will be checked: it will
check whether a is greater than 20 — a is 10, so this condition is False because a is not greater than 20, it is less than 20 — and next the condition a > 5 is checked; of course a is greater than 5. So out of the two conditions only the second is correct and the first is wrong. The and operator gives True only if both conditions are True; in this case only one condition is True, so we get False as the output. Now I'll use the or operator: I'm copying the same thing but replacing and with or. The or operator similarly checks the two conditions, but it outputs True if either one of them is True — it's obvious from the names: and means both must be True, or means either one can be True. And there is another operator called not; it is similar to the NOT gate, it inverts the value. Let's say a > 8, which is True, and a > 5, also True; both conditions are True, but the not operator will invert the output. So if this gives True, going through not it is converted to False. Let's run this: as you can see, with the and operator only one condition is True, hence we get False; in the second line again only one condition is True but we are using or, so we get True; and the third condition outputs True but we are using not, so it is inverted to False. These are the three main logical operators: and, or and not. The fifth kind is the identity operators, which are is and is not. They again compare values and give a boolean output. Let's say x = 5 and y = 5; both are the same, right? So it will compare them. The
operator is nothing but is: x is y checks whether x is the same as y and gives either True or False. It prints True because both of them are the same. If I do the same thing but change the second value to 10, it gives False, because they are no longer the same. That is the is identity operator. We can do the same thing with the is not operator, the second identity operator, which is basically the opposite of is: x is not y is False here because x is the same as y. Again I'll do the same but change the value, and of course x is not y then gives the boolean value True. So is checks whether both values are the same, and is not checks whether the two values are not the same; that is the identity operator. Finally we have the membership operators. A membership operator checks whether a particular value belongs to a sequence of elements, like a list; let me show you in a moment. There are two membership operators, in and not in. I'll put a = 5 and b = 10, and in c I'm creating a list — we know that lists are always enclosed in square brackets — with the elements 1, 2, 3, 4 and 5. Now we are going to use the operators in and not in; these also give boolean values. I'm printing a in c and b in c. As you can see, a has a value of 5, which is present in the list c, while b has a value of 10, which is not present in the list; so the first of these lines gives True whereas the next gives False. This is the membership operator: it basically checks whether the first element is a member of the second object. The not in
function is similar to this, but it is actually the opposite: I'll copy this, and it gives the inverse answer. Let's say a not in c and b not in c — in and not in are keywords, as you can see they are blue in color — and we get the opposite result: I am checking whether a is not in c, but a is actually present in c, hence we get False; and here b is not in c, hence we get True. That is about the membership operators. So we have seen the six basic kinds of operators in python: first the arithmetic operators, the basic mathematical operators; then the various assignment operators; after that the comparison operators, the equal-to, not-equal-to, greater-than and less-than symbols; then the logical operators and, or and not; then the identity operators is and is not; and finally the membership operators in and not in. That is all about operators in python. Hello everyone, this is Siddharthan; this is the sixth video in the python basics module of our machine learning course. In this video we are going to see what is meant by the if-else statement and how we can use it in python. The if statement is one of the basic things we learn when getting started with programming. The use of the if-else statement is that in some cases we need to run only a specific part of the code and skip another part; for that purpose we use the if-else statement. In this video I'll show you how to use it in python and what its syntax is. First let's create a simple if-else statement — there are more complicated ways of using it, so let's start from the most basic. Let's create a variable a with a value of 30, and let's say b has a value of 15. Now,
given two numbers, we need to find which is the greatest. For this purpose we can use an if-else condition: if a > b, we print that a is the greatest number; else, we print that b is the greatest number. Basically we are checking the condition — if a is greater than b we say a is the greatest, else b is the greatest. It may seem very simple, and this is just for demonstration; when we use if statements later there will be a lot of complicated places, but this basic knowledge is important for doing those complicated things. I'll run this: it prints that a is the greatest number, because a is 30. Now we can also do this in a different way, where a = int(input("Enter the first number")) and b = int(input("Enter the second number")), and then copy the same if-else code. Basically, in the earlier case we gave the values for a and b in the code itself, but here we ask the user: the input keyword gets the value from the user, and that value is converted to an integer. Let's run this: the user is asked for two numbers, and the code tells us whether the first or the second number is greater. Say the first number is 150 — I'm pressing enter — and the second number is 100; it obviously prints that the first number is the greatest. Here it says "a is the greatest number" because that is the string we used, but you can just put "first number is the greatest" and "second number is the greatest" instead. So this is how you can get values from the user and print which is the greatest number.
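The input-based comparison above can be sketched as follows. The helper name greatest_message is my own illustrative choice, not from the video, and the input() lines are commented out so the sketch runs without typing anything:

```python
def greatest_message(first, second):
    # compare the two numbers, as in the if-else example
    # (equal values fall into the else branch)
    if first > second:
        return "first number is the greatest"
    else:
        return "second number is the greatest"

# in the video the values come from the user:
#   a = int(input("Enter the first number"))   # input() gives a string,
#   b = int(input("Enter the second number"))  # int() makes it an integer
a, b = 150, 100
print(greatest_message(a, b))   # first number is the greatest
```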
Let me enter 16 here and then 20: as you can see, the second number is the greatest. So that is a simple way of using the if statement. There is another thing called the elif condition — elif is short for else-if. Let's say we have three numbers this time: a = 15, b = 25 and c = 30. In the previous case we had just two numbers and wanted to check two conditions: if a > b, that statement is printed and the control won't go into the else; if the condition is proven wrong, the else statement is printed. There we had just two conditions, but when we have three or more conditions we can use elif statements. One important point to note: for an if-else pair, only one of the two will run — in no case will both statements be printed; either the if condition is true or the else is taken, never both. Similarly, in this case we have three numbers, and we need to compare them and print which is the greatest. For this purpose we can use the if-elif-else statement. If a > b and a > c — a is greater than b and it is also greater than c — then I want to print that a is the greatest number. Now I want two other conditions, so I should use elif; the keyword is e-l-i-f. If b > a and b > c, I want to print that b is the greatest number.
Finally we have the else statement: if both of these conditions are false, that means c is the greatest number, so I'll mention that in the else branch — "c is the greatest number". There is no need to mention a condition for else: if both the earlier conditions failed, there is only one possibility left, that c is the greatest, so else takes no condition. As I told you earlier, in this triplet only one condition can be true: either a is greater than b and c, or b is greater than a and c, or c is greater than a and b. In no case can two or three of the conditions be correct, so only one statement will be printed. I'll run this, and it should say that c is the greatest number — and it does. This is how you can use the if, elif and else statements. There is another variation of the if-else statement, called the nested if statement. Nested if means using an if statement inside an if statement. Let's say we again have three values: a = 20, b = 40 and c = 60. I want to check which is the greatest number, so first I check whether a > b; if a is greater than b, I should then check whether a is greater than c as well. If a is greater than b and also greater than c, a is the greatest number, so here I can print that a is the greatest number. If that inner condition is false — a is greater than b but not greater than c — then c is the greatest number. I hope you are getting what I am telling: in this
case c will be the greatest number, so we print "c is the greatest number". So that is for this inner if-else pair, and there is another if outside; we are using one if-else statement inside an if condition, hence this is called a nested if. Another thing to note here: if a is not greater than b — that is, if b is greater than a — this outer condition won't be satisfied and these inner statements won't be printed; the interpreter's control won't go into this block. In that case the else branch runs — else here means b is greater than a — and in that branch we again check whether b is greater than c as well. If b is greater than c too, we print that b is the greatest number. Then we include another else, which means that in this condition c is the greatest number. So what we are basically doing is: first we check whether a > b; if so, we check whether a > c as well. If both conditions are satisfied, a is the greatest number and that particular statement is printed; or else, if a is greater than b but not greater than c, c is the greatest among the three numbers, and in that condition we print that c is the greatest number. Then we have the outer else branch, which is carried out when the first if is false — that is, when a is not greater than b, meaning b is greater than a. The control comes here and checks these conditions: we know b is greater than a, and now it checks whether b is also greater than c. If b > c, this line is printed, or else this line
will be printed, which is "c is the greatest number". Let's run this: we get that c is the greatest number, because it has a value of 60. So this is a simple example of a nested if statement, where we use an if statement inside an if statement. You can see that between the outer if line and its else line only one branch can be taken, and between the inner if and its else only one condition will be satisfied. You would also have noticed the indentation: when you put a colon at the end of the if statement line, the next line is indented — we call this space indentation. This is very important in the case of if statements, for loops and while loops: the indentation means that a particular line of code comes under that if condition, or under that else condition. Whenever you use the colon there will be an indentation in the next line; in Colab it is created automatically, but in some basic python consoles you need to give that indentation, that space, yourself. So these are some variations of if statements: we have seen a simple if-else condition, then how to use the input feature to get numbers from the user and print the greatest, then how to use the elif condition when we have more than two conditions, and finally the nested if statement, where we use an if statement inside an if statement. That is all about the if-else statement. In this video we are going to discuss loops in python. There are basically two important loops in python: the for loop and the while loop. We will see where loops are used and what these loops do. Before starting the video, if you want to learn data science, you can check out my data science course with python; I have given the link for my course in
the description of this video. Getting started with today's video, first let's understand the for loop. I'll give you an example of where it is used. First let me create a variable laptop1, and in this variable we want the user to type an input: laptop1 = input("Enter the price of the laptop"). If I run this, it asks the user for input, so we can give some price — say the price of the laptop is 20000. This is how you get input from the user using the input function. When you get input from a user, python considers it a string — strings are nothing but text — so we need to convert it to an integer: even though the user typed a number, the input function treats it as a string, so we wrap it with the int keyword, which converts the string into an integer. I'll run this again, and now it is considered an integer. Now, let's say we want to get the price of five laptops. For that, we could type this line five times: laptop1 — I've pasted it five times — and change the variable names to laptop2, laptop3, laptop4 and laptop5. When I run this, it asks for the prices one by one: say 20000 for the first laptop, 30000 for the second, then 40000, 50000, and 60000 for the fifth. With this data you can add up the prices and tell the user the total price of the five laptops. So this is how we can get five inputs, but look at the code — this is not an efficient way to write it: I have just repeated the same line of code again and again, except for the change in
the variable name. In exactly such cases we use loops: a loop is used to repeat a certain action again and again. Let's see how we can use a loop to do this same action — we want to ask the user five times for the price of a laptop, but in a concise, short way. For this purpose of reducing the size of the code while repeating the same action, we can use a for loop: for i in range(5): laptop_price = int(input("Enter the price of the laptop")). I'll explain the syntax in a minute. This is the syntax of the for loop: we use the keyword for, and i is a variable. range(5) is nothing but 0, 1, 2, 3 and 4 — in python, indexing starts from 0, as I told you in previous videos, so 5 means five numbers starting from 0. This counts five times, and i takes the value of each of these five numbers: the first time the loop runs i takes the value 0, then 1, 2, 3, 4, so the loop runs five times in total. If I run this — ah, "not defined"; I made a spelling mistake, it should be int. I'll run it again, and now we can enter the prices the same way as before: 20000, 30000, 40000, 50000 and 60000. You can see that the previous piece of code did this action in five lines, while here we did it in two lines; that is the advantage and use of loops — we can do the same thing again and again, it reduces the size of our code, and it is a very efficient way to do it. So this is the syntax, for i in range(...); instead of i you can use j or
anything like that — it is just a variable name — but this for ... in range syntax is important. What happens is: when the loop runs for the first time, i takes the value 0 — range(5) gives the five values 0, 1, 2, 3, 4, which basically count the number of times the loop runs. So for i = 0 the code runs once and the first prompt is printed; once we give the price of the laptop, it goes back to the start, i takes the value 1, the loop body runs again and the second prompt is printed; when I give the value for that one, it again goes to the top of the loop, i takes the value 2, and this continues through the first five values — that is what mentioning 5 does. If I put something like 7 here and run it, it asks us for 7 values: say the prices are 3000, 4000, and so on for all seven prompts. As you can see, it printed the prompt seven times; that is the use of range. When you give 7, i takes the values 0, 1, 2, 3, 4, 5 and 6 — 7 itself is excluded, because the counting starts from 0. So that is the use of loops and how you can use them. There is another way of using a for loop. I hope every one of you is familiar with lists; let me create a list named numbers containing 50, 100, 150 and 200 — this list contains four values. Now, what I want to do is print these individual elements. For that, you can put the print function and mention the name of the list, which
is numbers, and inside that the index values in square brackets. I'll copy this, because I want to print all four numbers, with indices 0, 1, 2 and 3 — indexing starts from 0, as I told you earlier, so the index of 50 is 0, of 100 is 1, of 150 is 2, and of 200 is 3. Let me run this; it prints the individual elements. Instead of doing this, we can use a for loop, as in the previous case, to print these individual numbers. This is the list we have, and now we write: for i in numbers: print(i) — for and in are the two important keywords of the for loop. As you can see, the list contains four elements, and what happens when we run this for loop is: the first time the loop runs, i takes the first value, 50, and goes inside the loop; inside we have one statement, print(i), so the first time, the value of i is 50.
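Both ways of printing the list — by index, and directly with for ... in — can be sketched together. The running total at the end is my own small addition to show a typical use of this form:

```python
numbers = [50, 100, 150, 200]

# way 1: index-based, like the four separate print() calls in the video
for idx in range(len(numbers)):
    print(numbers[idx])

# way 2: i takes each element of the list in turn
total = 0
for i in numbers:
    print(i)
    total += i

print("total =", total)   # total = 500
```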
Once that iteration is completed, i takes the second value, and it prints 100, then 150 and 200. That is how loops work. Now let's discuss the second important loop, the while loop — we have discussed the for loop, and now we will discuss the while loop. What is the difference between them? In the earlier examples we knew we wanted to run those lines seven times, or five times — we knew the count in advance. But there will be cases in our program where we are not sure how many times we want to run a particular statement; in those cases we can use while loops, whereas with a for loop we mention the number of times we want to run an action. So that is the difference: use a while loop when you are not sure how many times an action has to be executed. This is how you create a while loop: you need to initialize a value for i — it can be any variable — so I'll use i = 0, and then: while i < 10: print(i). Let me explain what this code does. I'll put the syntax of the while loop here: while condition:. We use the while keyword followed by a condition; if the condition is true, the statement under it is carried out. In this case we have the while keyword and the condition i < 10; if this condition is satisfied, the program goes into the statement. You can see the indentation here — as with the for loop, it means that this particular statement is inside the loop. Basically, what we are doing
is first of all we are initiating i is equal to 0 and we are checking the condition i is less than 10 so as long as i is less than 10 this loop will be carried out again and again okay so first of all i will take the value of 0 now it will check the condition so i is less than 10 so we have the value of i s 0 so 0 is of course less than 10 so the condition is true and now this will be carried out so i will be printed then we are incrementing i with 1. so this basically means so what this basically means is i is equal to i plus 1 so the short form for writing this particular line is i plus is equal to 1 so this these two are the same things okay so just delete here so what happens once i is printed so it will be added with one so now the value of i will be one right and now it will again check the condition now one is again less than ten so it will print uh i which is one now and again it will be incremented so this process will continue as long as ie value is less than 10 so let's print this as you can see here so 0 1 2 3 so this will print i up to 9 because so if we have made the conditional size less than or equal to 10 we will get a 10 right so that's why it has printed up to 9 only okay so this is how you can use while loop when you are not sure how many times the loop should run okay but in the case of for loop we will mention the number of times you want to run a particular action again and again okay so this is one case where you can use while loop okay so as i have told you if this condition is true only this while loop will be carried out so let's see what happens if the condition is not true so i is equal to 5 while i is less than 3 print i is equal to one so i'll run this so nothing will happen as you can see here it just executed but we i is not printed because this condition is not true so you can see here i is equal to 5 so here the condition is i should be less than 3 but i is of course greater than 3 right so this condition returns a false value and 
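The two cases described above can be sketched in one cell (a minimal example; the second variable name, j, is mine, since the video reuses i):

```python
# A while loop repeats its body as long as the condition is true.
i = 0
while i < 10:
    print(i)   # prints 0, 1, 2, ..., 9
    i += 1     # shorthand for i = i + 1

# If the condition is false from the start, the body never runs:
j = 5
while j < 3:
    print(j)   # never executed, because 5 < 3 is False
```

Once i reaches 10 the condition fails and the first loop stops, which is why 10 itself is never printed.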
So when the condition is true, the statements inside the loop are executed; when it is false, they are not, and the whole while loop is skipped with no output. That is the while loop. The important thing to remember about for versus while: in a for loop we know the number of times the code has to be repeated, and we use a while loop when we are not sure how many times the code should run. I have kept these examples very simple and basic so that the idea is clear; when we do various projects and other programs in Python, far more complex loops will be used, but this is for your basic understanding, and I hope you understood the basic function of for and while loops.

In this video I am going to explain one of the important concepts in programming: functions. We are going to see what is meant by a function and how we can implement functions in Python. As you can see, I have put a short description here: a function is a block of code that can be reused in a program. Say we have a thousand lines of code in Python, and inside it there is a particular block, a hundred lines long, that we need to use again and again. Instead of writing those hundred lines repeatedly, we can create a function; by creating a function you can reuse the code, and you don't need to write it all out again. You just mention a single name, and with that name you call the entire hundred lines of code. I'll explain this with an example.
For the example I'll write a factorial program; you can understand functions better with a concrete case, so I'll show how to create a function for the factorial of a number. First, what is a factorial? The factorial of a number is the product of all the positive integers less than or equal to the given number. We would have learned this in our early classes at school. Let me give an example: say we want to find the factorial of 5. Going by the definition above, it is the product of all the positive integers less than or equal to 5, which is 5 × 4 × 3 × 2 × 1.
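As a quick check of the definition, the standard library's math.factorial computes the same thing:

```python
import math

# factorial of 5 = 5 * 4 * 3 * 2 * 1
product = 5 * 4 * 3 * 2 * 1
print(product)             # 120
print(math.factorial(5))   # the standard library agrees: 120
```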
The product of all these numbers, which are less than or equal to 5, is the factorial of 5, and the value is 120. So the factorial of 5 is 120; you find the factorial of a number by multiplying it with all the positive integers up to and including it. Now let's see how we can find the factorial of a number in Python. I'll create a variable called number and write: number = int(input('Enter a number to find its factorial: ')). The input function asks for input from the user; the user gives a number and we find the factorial of that particular number. We need to mention int because the input function treats the value given by the user as a string, and strings are nothing but text, so we need to convert that text to an integer.

Next I'll create another variable called factorial and give it the initial value 1: factorial = 1. As you would know, the factorial of 1 is 1, and the factorial of 0 is also 1; an important point to note here is that the factorial of 0 is not 0 but 1. Now, if the number given by the user is exactly 0, written as if number == 0 (you must put two equals signs, because that means "is exactly equal to"), I want to print that the factorial of 0 is 1. If the user gives some input other than 0, we make an else condition and use a for loop to find the factorial: for i in range(1, number + 1), then factorial = factorial * i. Let me explain what this does.

Look at the step we worked out above: for the factorial of 5 we need to multiply 5 by 4, 3, 2 and 1, and that is exactly what this for loop does. Say the number is 5; the range is range(1, number + 1), which is range(1, 6), meaning all the integers from 1 up to 6. The important point to note is that the first value is included but the second value is excluded. So range(1, 6) includes the values 1, 2, 3, 4 and 5; the 1 is included and the 6 is excluded. Those are exactly the numbers we need to multiply together to get the factorial, which is what we are doing in this for loop. When the loop runs for the first time, i takes the first value in the range, which is 1; we have already initialised factorial to 1, so 1 × 1 is 1. The loop completes that statement and goes back to the top, i takes the second value, 2, and factorial becomes 1 × 2. As the loop continues to run, the third pass gives 1 × 2 × 3, the fourth 1 × 2 × 3 × 4, and finally 1 × 2 × 3 × 4 × 5, and the product of all those numbers is nothing but the factorial of 5.
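A sketch of this factorial program, with the input() call replaced by a fixed number so the snippet runs on its own:

```python
# In the video the number comes from the user; here it is fixed so the
# sketch runs on its own.
number = 5   # stands in for: number = int(input('Enter a number to find its factorial: '))

factorial = 1          # factorial of 0 (and of 1) is 1
if number == 0:
    print('The factorial of 0 is 1')
else:
    # range(1, number + 1) yields 1, 2, ..., number
    # (the start value is included, the stop value is excluded)
    for i in range(1, number + 1):
        factorial = factorial * i
    print('The factorial of', number, 'is', factorial)
```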
So that is what we have done in this for loop. Now I just need to print the factorial value. The print should be at the same indentation level as the for statement, because it comes under the else condition, not inside the loop. By this for loop we find the factorial of the given number, so: print('The factorial of', number, 'is', factorial); I am substituting the number given by the user and the value we have found. Let's run it: the user gives the value 5, and as you can see, the factorial of 5 is 120. Let me run it again with 10; the factorial of 10 is this particular number, and you can check that it is the exact factorial of 10.

Now suppose we want the factorial of some other numbers as well. We cannot keep writing all those lines of code each time, and this is exactly what functions are for: a function is a block of code that can be reused. So let me show how to put this factorial program into a function. To create a function you use the keyword def, which means define; we are defining a function. Let's name our function factorial_value. We are going to do the same thing we did above, but inside this function, and as its parameter I'll
mention the input value we are going to receive; I'll just call the parameter num. Notice the indentation here as well: everything belonging to the function is indented under the def line. We do all the things we did before: factorial = 1, and if the number is 0 the factorial is 1, but instead of printing we will return the value, so the function returns a result. Since the parameter is named num, that is the name we must use inside the function: if the value given is 0, the function returns factorial, which we have already set to 1. Then comes the else condition, with the same for i in range loop; I am copying the same piece of code, just putting it inside the function called factorial_value, and it returns the factorial value.

Let's run this cell; we have successfully created our function factorial_value. Now if you call factorial_value with a particular number, it gives the factorial of that number. So I'll print factorial_value(5). The first time I ran it I made a mistake: inside the function I had referred to number instead of num, so it took the wrong variable; the parameter is num, so that is what must be used inside the function. After fixing that and running again, it gives the correct value. Notice that we don't need to write that whole piece of code again, or even re-run it; we just call the function, factorial_value, with some number. Say we want the factorial of 10: factorial_value(10) prints the factorial value of 10.
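Putting the pieces together, the function version looks like this (a sketch following the steps above):

```python
def factorial_value(num):
    # Returns the factorial of num instead of printing it.
    factorial = 1          # factorial of 0 is 1
    if num == 0:
        return factorial
    else:
        # multiply factorial by 1, 2, ..., num
        for i in range(1, num + 1):
            factorial = factorial * i
        return factorial

print(factorial_value(5))    # 120
print(factorial_value(10))   # 3628800
print(factorial_value(6))    # 720
```

Once defined, the whole block of code is reused with a single call such as factorial_value(6).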
As you can see, instead of writing those eleven or so lines of code again, I have put them behind a single word, and that word is now a function. Inside the function I have mentioned all the statements that find the factorial of a number, and when I call the function with a particular value, it gives us the factorial of that value; you can find the factorial of any number this way, factorial_value(6) for instance. This is the use of a function: we have reused this particular block of code by defining it as factorial_value. When you mention factorial_value with a number inside the parentheses, the function is called, the input we have given goes into the parameter, and the block of code is carried out. The advantage of a function is that even a hundred lines of code can be collapsed into a single function name. I hope you have understood functions and the advantage of using them in Python.

Now, NumPy. First of all, the full form of NumPy is Numerical Python. NumPy is basically used for the numerical operations and other numerical work we need in our projects and in our domain. For example, in machine learning we encounter large datasets, datasets containing lakhs and even millions of data points and numbers, and the NumPy library is used to process those numerical values efficiently. NumPy arrays have two main advantages over lists and tuples. Lists and tuples are built-in data types in Python; they store more than one value, whereas an integer or a float can store only one value. So lists and tuples are nothing but collections of values.
NumPy arrays are just like lists and tuples, but more advanced. The first advantage of a NumPy array over a list is that it allows many mathematical operations to be performed on it; we cannot perform as many operations on a list as we can on a NumPy array. The other important point is that operations on a NumPy array are much faster than the same operations on a list. Those are the main advantages. You can also look at the documentation: just search for "numpy documentation" and you will find the numpy.org site, where you can find explanations of the various functions, what NumPy is, what NumPy arrays are, and so on, in case you have any doubt while working on this.

Now let's get started. To use NumPy you first need to import the library. You could just write import numpy, but I am going to shorten numpy to np, so I'll write import numpy as np; this imports the NumPy library under the abbreviation np. Run the cell by pressing Shift+Enter. This environment is Google Colaboratory, where you can run Python code; if you are new to Google Colaboratory, check out my Google Colaboratory basics video (index 2.1), where I explain how to use it and its other features. So here we have successfully imported NumPy.

As I told you earlier, NumPy is a Python library. Libraries are nothing but pre-made functions and pre-made classes stored in a Python file, and we can access those ready-made functions for our own programming.
Say, for example, a particular function runs to a hundred lines of code. We don't have to write the entire hundred lines ourselves; instead, that code can be stored in a module or library behind a single name, and using the library we simply call the function. That is the use of libraries in Python: instead of recreating the code, you just call it through a function from a library. Here we have the NumPy library, and I have imported numpy as np.

As I have told you, one main advantage of NumPy arrays is that operations on them are faster, and I'll show you how we can verify that. I'll create a text cell, "List vs NumPy: time taken". What I'm going to do is perform a simple task on a list and on a NumPy array separately and measure the time taken for that operation in each case. For this I need a timer: time is another library, and from it we import process_time, so from time import process_time; process_time is used to measure the time required for a particular process. Run that cell.

Now, the time taken by a list. I'll create a Python list, declared as python_list; a list should be enclosed in square brackets, and I am going to use a for loop inside the brackets to assign values to it: python_list = [i for i in range(10000)]. What I'm basically doing is giving this list the values from 0 up to 10,000, that is, 0 to 9,999, which is what range(10000) expresses. Next I mark the start time:
start_time = process_time(); we are using process_time to measure the time taken by this particular process. Now I take python_list again and add the value 5 to every element. We have 10,000 values, from 0 to 9,999, and I am going to add 5 to each value in the list with this line: python_list = [i + 5 for i in python_list]. It is a for loop similar to the one we used before; the difference is that I take each value from python_list and add 5 to it. After that I mark the end time, end_time = process_time(), and print the amount of time required by this process, which is end_time minus start_time.

So what I'm basically doing is: first create a list holding the values 0 to 9,999, then record a start time and an end time, with the process in between being the addition of 5 to all the values of the list. The process_time function gives the number of seconds at each point, and taking the difference prints the number of seconds this process took to complete. Let's run it: you can see the amount of time required, in seconds, which here is around 1.7 milliseconds. That is the time taken by a list to complete this process. Now let's create a NumPy array, do the same process, and see how much time the array takes. I'll declare the variable np_array:
np_array = np.array([i for i in range(10000)]). You can see I imported the NumPy library as np, so I am just calling that library here, and the np.array function is used to create arrays; I am using the same expression as in the previous cell, because I want this array to hold the values from 0 to 9,999 as well. Then I create the start time, copying the code from above, and now we add 5 to all the values in the NumPy array: np_array += 5. This is just the short form of np_array = np_array + 5; the two lines are the same, and what I am doing is adding the value 5 to every element of the array. It is the very same process we did before; the only difference is that there we added 5 to all the elements of a list, and here we are adding 5 to all the elements of a NumPy array. Then I mention the end time, again copying the end_time and start_time lines. So the same process takes place, 5 is added to every value in this array, and the amount of time taken for the process to complete is calculated.

You can see the difference: the time taken by the NumPy array to complete is very much less than the time taken by the list. That is the significance of NumPy arrays. You might say this time difference is not much, but here we are dealing with only 10,000 values; there are cases where we deal with millions of data points or millions of numbers, and in those cases the time difference is quite significant.
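The whole comparison, as one runnable cell (the exact timings will vary from machine to machine, but the array version is typically much faster):

```python
import numpy as np
from time import process_time

# Add 5 to every element of a plain Python list of 10,000 values.
python_list = [i for i in range(10000)]
start_time = process_time()
python_list = [i + 5 for i in python_list]
end_time = process_time()
list_time = end_time - start_time
print('list time :', list_time)

# The same operation on a NumPy array.
np_array = np.array([i for i in range(10000)])
start_time = process_time()
np_array += 5            # short form of np_array = np_array + 5
end_time = process_time()
array_time = end_time - start_time
print('array time:', array_time)
```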
So that is the advantage I mentioned: with NumPy arrays we can do operations much faster than with a list. Now let's get into NumPy arrays themselves: how we can create them and how we can perform several operations and functions on them. I'll create a text cell, "NumPy arrays".

Let me first show how to initiate a list, for comparison. I am declaring a list named list1, with the values 1, 2, 3, 4 and 5; as you know, a list should be enclosed in square brackets. I want to print the list, print(list1), and also check its data type, type(list1), which gives us the data type of that particular object. (I mistyped the name on the first run; it is list1.) As you can see, we have printed the list and found the data type of the object to be list. Now let's do the same for a NumPy array: I'll declare np_array = np.array([1, 2, 3, 4, 5]). You need to pay attention to the parentheses and square brackets here: you mention the parentheses of np.array first, and inside them you create a list and put the elements in square brackets. The values are the same, 1, 2, 3, 4 and 5. Now I'll print np_array and also check the data type once, type(np_array). This is an example of a NumPy array, and you can see that in a list the elements are separated by commas, but in the NumPy case the elements are not separated by commas. The data type comes out as numpy.ndarray, where nd represents n-dimensional array.
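Side by side, the two cells above look like this:

```python
import numpy as np

list1 = [1, 2, 3, 4, 5]
print(list1)           # elements separated by commas: [1, 2, 3, 4, 5]
print(type(list1))     # <class 'list'>

np_array = np.array([1, 2, 3, 4, 5])
print(np_array)        # no commas: [1 2 3 4 5]
print(type(np_array))  # <class 'numpy.ndarray'>
```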
Arrays are similar to matrices; we would have studied vectors and matrices in mathematics, and an array is similar to a matrix. Let's see how we can create these kinds of arrays, including ones with more dimensions. Creating a one-dimensional array is the same as what we did above: I'll name the array a, so a = np.array([1, 2, 3, 4]), and print a; we have successfully printed the NumPy array. Now we can check the shape of this array: the shape attribute gives us the number of rows and columns of the NumPy array. You can see we get only one value here, because this array is one-dimensional; the 4 represents that we have four columns.

Now let's create a two-dimensional array. I'll declare b = np.array([[1, 2, 3, 4], [5, 6, 7, 8]]). Let me print b: as you can see, this is similar to a 2 × 4 matrix, containing two rows and four columns. This is how you create arrays with multiple dimensions; above we created a one-dimensional array containing one row, and here an array with two rows and four columns. Now check the shape of b: the first value represents the number of rows and the second value the number of columns, so we have two rows and four columns for the array b, whereas for the one-dimensional array a there was just the single number 4, the column count.
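The shapes discussed above, in code:

```python
import numpy as np

a = np.array([1, 2, 3, 4])                  # one-dimensional
print(a.shape)                              # (4,)  -- just the column count

b = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])  # two rows, four columns
print(b.shape)                              # (2, 4) -- (rows, columns)
```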
Now let's create another array, c. As you can see, all the values so far have been integers, so let's see how we can put in values with floating points; floating-point numbers are nothing but decimal values. c = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=float); here dtype represents the data type, and we set it to float. Now let's print c: we have floating-point values, printed as 1., 2. and so on, which is the same as 1.0, 2.0, etc. By mentioning the data type float you can create an array with floating-point values; if you don't mention any data type, the default is integer, and we get an array of integer values.

Now let's discuss initial placeholders in NumPy arrays. In some cases we want to initiate arrays with certain values: for example, in several cases we may need to initialise an array in which all the values are 0, and in other cases we need all the initial values to be 1, and so on. That is what is meant by an initial placeholder: the initial values present in that particular NumPy array. Let's create a NumPy array of zeros. I'll name the array x: x = np.zeros((4, 5)). zeros is the function used to create an array in which all the values are zero, and inside the parentheses you need to mention the shape of your array, here (4, 5), meaning we want an array of four rows and five columns with every value zero. Note that I have used two pairs of parentheses here: inside the function's parentheses I mention the shape of the array I want. Now let me print x: as you can see, we have created an array of four rows and five columns with all the values zero.
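The dtype option and the zeros placeholder, together:

```python
import numpy as np

# dtype=float gives floating-point values; without it these literals
# would produce an integer array.
c = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=float)
print(c)       # values print as 1., 2., ... i.e. 1.0, 2.0, ...

# An initial placeholder: 4 rows and 5 columns, all zeros.
# Note the inner parentheses: the shape is passed as a single tuple.
x = np.zeros((4, 5))
print(x)
```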
So you use the np.zeros function and mention inside it the dimensions of the array you need. Now let's create a NumPy array of ones. I'll call the array y; np.ones is the function used to create an array with every value 1, and I'll give the shape as (3, 3). Let's print y: as you can see, we have created a 3 × 3 array with all the values 1. Those are examples of initiating an array with the values 0 and 1.

Now let's see how we can create an array of one particular value. Say the array is z: z = np.full((5, 4), 5). The full function helps us create an array filled with a specific value. First you mention the shape of the array you want, here (5, 4), meaning five rows and four columns, and next you mention the value you want to fill it with, say 5. Now let me print z: as you can see, we get a matrix of five rows and four columns with every value 5. That is how you create an array of a specific value.

Next, let's see how we can create an identity matrix. We would have studied the identity matrix in basic mathematics: all the diagonal values are 1 and the other values are 0. The identity matrix is also used in various cases in our programming. I'll name the array a, and for creating an identity matrix you use the function np.eye, mentioning the size of your matrix in it. Note that here you should not mention separate numbers of rows and columns, because an identity matrix has
the same number of rows and columns. We cannot have a 5 × 4 identity matrix, or a 4 × 5 one; in the case of an identity matrix the number of rows and columns must be equal, as in 4 × 4 or 5 × 5. So say we want a 4 × 4 identity matrix: a = np.eye(4). Let's print a: as you can see, we get an identity matrix in which all the diagonal values are 1 and the remaining values are nothing but 0. You can just change the value and see; np.eye(5) gives a 5 × 5 matrix, still an identity matrix. That is how we can create identity matrices using the np.eye function.

So far we have given pre-decided values: arrays of all zeros, then all ones, then one particular value, and so on. Now let's see how we can create a NumPy array with random values. Say the array is b, and we use the function np.random.random, mentioning in it the shape of the array we want: b = np.random.random((3, 4)), a three-row, four-column array of random values. There is one important thing to note here; let me run it first. We get these random values in our NumPy array, but all the values lie between 0 and 1. If we run it again we get some other values, not the same ones, but they will again all lie between 0 and 1.
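The placeholder creators covered above, in one cell:

```python
import numpy as np

y = np.ones((3, 3))           # 3 x 3 array of ones
z = np.full((5, 4), 5)        # 5 rows, 4 columns, every value 5
a = np.eye(4)                 # 4 x 4 identity matrix (always square)
b = np.random.random((3, 4))  # 3 x 4 array of random values in [0, 1)

print(y)
print(z)
print(a)
print(b)
```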
Now let's see how we can create an array with random values where we need random integers, within a specific range. I'll create the NumPy array c, so c = np.random.randint. We used the random.random function to create NumPy arrays with values between 0 and 1; now you can use this function np.random.randint, which means random integers. You need to mention the starting and ending points of your values, the range in which you want the values to be; let's say it's 10 to 100. Next we mention the shape we want, say a three rows and five columns array. So basically we will get a 3x5 array, and all the values will be in the range 10 to 100. Now let's print this. As you can see, in the previous case we got decimal values between 0 and 1, and now we have got all the values between 10 and 100. So this is how you can create a random-value array by specifying its starting and ending points. If you run this again you will get different values, but the values will stay in this range of 10 to 100, and you can change the shape to get an array of different dimensions. Next, let's see how we can create an array of evenly spaced values. So d is equal to, and for this you can use the function np.linspace. In np.linspace you mention the starting point and the ending point, the range in which you want the values; let's say I want the values between 10 and 30, and I want five values in this particular array. So I am mentioning the starting point, the ending point, and the number of values I need.
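A sketch of the randint call just described. One detail worth knowing: np.random.randint treats the low end as inclusive and the high end as exclusive, so "between 10 and 100" here means 10 up to 99:

```python
import numpy as np

# 3 rows x 5 columns of random integers; low (10) inclusive, high (100) exclusive
c = np.random.randint(10, 100, (3, 5))
print(c)
print(c.shape)   # (3, 5)
```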
Let's run this and see. We will get five values in this range of 10 to 30, and all five values will be evenly spaced. As you can see here, we got five values and they are evenly spaced. Or you can just put six, meaning I want six evenly spaced values lying between 10 and 30. So here we are getting an array of evenly spaced values by specifying the number of values required. There is another method of creating evenly spaced values, where instead we mention the step: an array of evenly spaced values, specifying the step. I'll explain what is meant by this step. Let's create the array e; for this you can use the function arange, np.arange. I want the values between 10 and 30, and I want the step value to be 5. The thing to note here is that we are not saying we need five values; we are saying the step should be five, so if the values start from ten with a step of five, the following values will be fifteen, twenty, twenty-five, and so on. Let's run this and print e. As you can see, we get a step of five, and since we are not specifying the number of values, here we have only four values; we are specifying the step value, meaning we want the values to jump in units of five. That is what we get using this np.arange function, whereas in the case of linspace you mention the number of values you want. For example, if we mention 5 in linspace, both cells produce five values, but linspace gives us five evenly spaced values within the range, while arange gives us evenly spaced values determined by the range and the step.
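The two evenly-spaced constructors can be put side by side; linspace fixes the count, arange fixes the step:

```python
import numpy as np

# linspace: start, stop, number of values; the stop value is INCLUDED by default
d = np.linspace(10, 30, 5)
print(d)   # [10. 15. 20. 25. 30.]

# arange: start, stop, step; the stop value is EXCLUDED
e = np.arange(10, 30, 5)
print(e)   # [10 15 20 25]
```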
So this is how you can create evenly spaced arrays. Now let's see how we can convert a list to a NumPy array. Let's declare the list as list2; the list should be enclosed in square brackets, and let's say the values are 10, 20, 30, 40, and 50. Now let's create a NumPy array as np_array, and we can use the function np.asarray. This np.asarray converts a particular data type into a NumPy array; here we are converting list2, which is a plain Python list, into a NumPy array. Now let's print np_array and also check the data type with type(). Let's run this. As you can see, we have converted this list to a NumPy array; there are no commas in the printed output, so we have successfully converted list2 into a NumPy array, and we have got the data type as numpy.ndarray. So this is how you can convert a list to a NumPy array. You can also convert a tuple to a NumPy array in the same way; a tuple is enclosed in parentheses, whereas a list is enclosed in square brackets. That is how you can convert a different data type into a NumPy array. Now let's see how we can analyze a particular array. Analyzing is nothing but getting various pieces of information about the array, just inspecting it. Let's create an array c, so c = np.random.randint, because I want random integer values; I want the values between, say, 10 and 90, and I want a 5x5 array, so a 5x5 array with values ranging between 10 and 90. I'll also print this c. We got a 5x5 array with values between 10 and 90, so let's do
some analysis on it. You can find the array's dimensions using the shape attribute, which we have already seen: mention the array name, which in this case is c, so c.shape. This gives us the shape of the array, which is nothing but the number of rows and columns. Now let's check the number of dimensions it has; for that you can use c.ndim, and I am using c because that is the name of this particular array. This ndim attribute gives the dimension count, and as you can see here we have two dimensions, because we have rows and columns, so it is a two-dimensional array. If you have just a single row, it is a one-dimensional array; as we have rows and columns in this case, it is a 2D array. You can also check the number of elements present in the array: let's print c.size, and this size attribute gives the number of elements. As you can see, we have 25 values in total, because we have five rows and five columns. So you can find the number of elements in an array using this size attribute. Now let's check the data type of the values in the array: print c.dtype, where dtype stands for data type. We know that all the values are integers because we created the array with random integer values, and we get int64, which means 64-bit integer values. So this is how you can find what data type is present in a particular NumPy array. That is about analyzing or inspecting an array.
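The conversion and inspection steps above, gathered into one sketch:

```python
import numpy as np

# Converting a list (and a tuple) to a NumPy array
list2 = [10, 20, 30, 40, 50]
np_array = np.asarray(list2)
print(np_array, type(np_array))   # numpy.ndarray, printed without commas
print(np.asarray((1, 2, 3)))      # a tuple converts the same way

# Inspecting an array
c = np.random.randint(10, 90, (5, 5))
print(c.shape)   # (5, 5) -- rows and columns
print(c.ndim)    # 2      -- a 2-D array
print(c.size)    # 25     -- total number of elements
print(c.dtype)   # an integer dtype such as int64
```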
Now let's see some mathematical operations that can be performed on a NumPy array. I just want to give you an example with lists first. I'll create a list as list1, which is equal to 1, 2, 3, 4, and 5, and another list as list2, which is equal to 6, 7, 8, 9, and 10. Now I'll just add both of these lists and print the result: print list1 + list2. What do you expect this to give? Let's run and see. We tried to add the two lists, but element-wise addition is not what happened; for example, the result does not start with the values 7 and 9, which we would get by adding the elements pairwise. In the case of lists, when you use this plus symbol between two lists, it concatenates them; concatenating means just joining the two lists. So this plus sign concatenates or joins lists, and we cannot have element-wise addition with lists. But we can do this with NumPy arrays, so let's see how. Let's create a NumPy array a = np.random.randint, because I want random integer values, with the values between 0 and 10. Now let's mention the shape we want; say I want a 3x3 array, three rows and three columns, with all the values between 0 and 10. And I'll create another array named b; so we have two arrays, a and b, and in this case I want the values between 10 and 20. So the first array contains values between 0 and 10 and the second array contains values between 10 and 20, and the shape is the same, 3x3. Let's run this and print both of these arrays, print a and print b. As you can see here, we got arrays with
random values: the first array has values between 0 and 10 and the second array has values between 10 and 20. Now we can run some mathematical operations on them. Let's print a + b. In the case of lists, adding two lists concatenates or joins them, but in the case of NumPy arrays, when you add two arrays you get element-wise addition. For example, if the first element of the first array is 7 and the first element of the second array is 15, those two values will be added together; that is element-wise addition. Let's also do a - b, and let's do all the basic mathematical operations, a * b and a / b. You can run this, and it gives us element-wise addition, subtraction, multiplication, and division; we get four NumPy arrays, and all these element-wise operations are performed: first a and b are added element-wise, then subtracted, then multiplied, then divided. So that is one way of doing it, just combining the arrays with operators. We can do this in another way as well: mention np.add, and in it mention the two arrays which you want to add, say our two NumPy arrays a and b. I'll just copy the cell above and make new arrays here, so we have fresh values for a and b; let me print them, print a and print b. Instead of writing a + b, you can use this np.add function, which adds two arrays, so I have mentioned a and b. Then print np.subtract, which finds the difference between the two arrays element-wise, again with a and b; then np.multiply(a, b), and finally np.divide(a, b).
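The contrast between list '+' and array arithmetic, plus the equivalent function forms, in one sketch:

```python
import numpy as np

# With plain lists, '+' concatenates rather than adding element-wise
print([1, 2, 3] + [6, 7, 8])      # [1, 2, 3, 6, 7, 8]

a = np.random.randint(0, 10, (3, 3))
b = np.random.randint(10, 20, (3, 3))

# With NumPy arrays, operators work element-wise
print(a + b)
print(a - b)
print(a * b)
print(a / b)

# Equivalent function forms
print(np.add(a, b))
print(np.subtract(a, b))
print(np.multiply(a, b))
print(np.divide(a, b))
```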
Let's run this. It is similar to the previous code, just another way of doing it using the np.add and np.subtract functions, and again we got the four basic mathematical operations performed on NumPy arrays. So this is how you can perform some basic mathematical operations. Now let's do some array manipulation; I'll just create a text cell here, array manipulation. Let's create an array, and I'll name it array itself: np.random.randint, with values between 0 and 10, and I want the shape of the array to be 2 comma 3, so it will be a two rows, three columns matrix. I'll print this array and also print its shape, print array.shape. This is like a 2x3 matrix, where we have two rows and three columns. Now we can create the transpose of this matrix. Let's name the transpose trans, and for that you can use the function np.transpose; inside it, mention the array whose transpose you want, so np.transpose(array). Let's print this trans, and let's print the shape of the transposed array as well, trans.shape. Transposing means this: if the matrix is 2x3, it is converted into a 3x2 matrix; all the rows become columns and the columns become rows. So if a row has the values 1, 2, 8 and the next has 0, 7, 2, those rows become the columns of the result. Let's run and see. As you can see here, the array is transposed, so this is how you can create an array and the transpose of that particular matrix. There is another way of finding this transpose; I'll just copy the cell above, so it will give us new values, and as you can see the values are now different. So there is another method of
finding the transpose. I'll create another transpose, trans2, which is equal to the array name followed by .T, so here array.T. When you use this .T attribute, it finds the transpose of the array, and the result is stored in trans2. Now let's print trans2, and let's print its shape as well, trans2.shape. So now we can see that this particular matrix has been transposed using this .T attribute. Those are the two ways of finding the transpose of a matrix. Now let's see one more thing, the last function we are going to cover: reshaping an array. Let's create an array a, which is equal to np.random.randint, with values between 0 and 10, and I want the shape of the array to be 2 comma 3, so it is a two rows, three columns array with values between 0 and 10. Let's print a and also check its shape with a.shape. As you can see here, these are the random values we got, and it is a 2x3 array. Now you can reshape this array: let's create another array b, where b is equal to a.reshape(3, 2).
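The two transpose forms and the reshape call can be sketched together:

```python
import numpy as np

array = np.random.randint(0, 10, (2, 3))
print(array.shape)              # (2, 3)

trans = np.transpose(array)     # function form
trans2 = array.T                # attribute form -- same result
print(trans.shape)              # (3, 2)
print((trans == trans2).all())  # True

# Reshape: the same 6 elements rearranged into 3 rows x 2 columns
b = array.reshape(3, 2)
print(b.shape)                  # (3, 2)
```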
So in this case we have a 2x3 array, and I want to convert it into a 3x2 array, which will have three rows and two columns. This particular array, which you can see here, will be converted into a 3x2 array and stored in b. Let's print b and also print the shape of b. I'll run this, and now you can see how we can reshape our array. This reshape is one of the most important functions we will be using with arrays. So these are some of the most important functions and properties you need to know about NumPy arrays. You can make a note of them or save this code; they may seem very simple to you, and they actually are, but they will be used many times when we work on machine learning projects or any other projects, and this is why the NumPy array is so important for us. The main thing to note is that NumPy arrays are much faster than the built-in Python data types like list and tuple, and we can perform several mathematical and other operations on them. There are other functions beyond these, but these are the important ones we will encounter regularly. I hope you have understood everything we have done here. Next: the pandas library in Python. This is the third module in our hands-on machine learning course with Python. In the first two modules we discussed the machine learning basics and the Python basics required for machine learning. In this third module we are going to discuss some of the most important libraries required for machine learning. In the previous video we covered the tutorial on the NumPy library, and in this video we are going to look at the pandas library.
As you can see here, I have made some notes. The main use of the pandas library is for data processing and analysis. In machine learning we deal with lots of data, and there can be cases where we deal with even millions of data points, so we need a suitable tool to perform various operations and functions on that data. The pandas library is one of the main tools we will use in machine learning for data processing and for analyzing datasets. Pandas contains an important object type called the DataFrame: a pandas DataFrame is a two-dimensional table, a data structure with labeled axes. These are very structured, two-dimensional tables having rows and columns, and there is a name for each column. So that is an overview of the pandas library. Pandas basically has two main objects: one is this DataFrame, and the other is called the pandas Series, but in machine learning we will be using the DataFrame predominantly, so I will explain most of the important functions we will be using on the DataFrame. I will explain all the functions using two example datasets. I have already uploaded one dataset file here, diabetes.csv; this is the diabetes prediction dataset, and we have already made a project video on predicting whether a patient is diabetic using this dataset, so you can check that video as well if you haven't seen it. We will also use another dataset, the Boston house price prediction dataset. Those are the two datasets we will use, and we will do all the coding in Google Colaboratory. If you are new to this channel or to Google Colaboratory, you can check my video on Google Colaboratory basics, in which I explain how you can access Google Colaboratory through Google
Chrome and what the various features of Google Colaboratory are. Now let's get started with this pandas tutorial. First we need to import the library. Libraries are nothing but pre-made code: the code is already written and stored in these modules or libraries, and we can use them for their specific functions. For that you use the import statement, which imports the library. I want to import pandas under a short name, because I don't want to type out pandas everywhere in the code, so let's import pandas as pd. This is the general convention we use in machine learning and other Python code: we import pandas as pd, and if we are working with the NumPy library we import numpy as np. So I am importing pandas as pd here; you can press Shift+Enter to run this code, and it will import the pandas library. Now let's create a pandas DataFrame. By the way, you can join my Telegram group; I'll give the link in the description of this video, and there I will notify you when I post new videos. Getting into this, we need to create a pandas DataFrame, so first let me also import the Boston house price dataset. We have already uploaded the CSV file, and now we will import the Boston dataset from sklearn. Importing the Boston data: it is present in the sklearn library, and for that you can use: from sklearn.datasets import load_boston. I'll declare the variable boston_dataset, and boston_dataset = load_boston() will load the Boston dataset into this boston_dataset variable. I'll run this, and you can check the data type of this particular object with type(boston_dataset). As you can see here, it is an sklearn.utils.Bunch; a Bunch is like a dictionary object.
It contains a lot of data, so let's try to print and see it: print(boston_dataset). This is the dataset we have imported. I hope you know that a dictionary contains keys and values; here one key is the word data, and these are the values under it. Then there is the target: this target is nothing but the house price, and these values are in thousands of dollars, so twenty-four thousand dollars, twenty-one point six thousand dollars, and so on. The data represents various values like the age of the house, the tax to be paid, the crime rate, and other kinds of information, and the target variable contains the price of each particular house. You can also see the feature names here; these are the different features or columns, such as the crime rate, zone number, etc. We have already done a prediction project with this dataset, so you can check that video if you want more information on it. So now we have successfully imported this Boston house dataset from sklearn. As you can see, there are a lot of raw numbers here, and this type of output is not very suitable for analysis; this is where pandas comes into play, because pandas helps us load this dataset into a more structured table. Let's see how we can do that. Pandas DataFrame: I'll declare the variable boston_df, where df means DataFrame, and it is equal to the pd.DataFrame function; pd is nothing but pandas, since we imported pandas as pd. In it, mention the data you want to include: I want to include the data from this boston_dataset object, but I don't want to include the target, the price, right now; I just want all the feature data in my DataFrame. So you need to mention pd.DataFrame
with the dataset, which is boston_dataset.data, and also mention the column names, columns=boston_dataset.feature_names. Let me explain this. We are creating a pandas DataFrame, and inside it we give the data we want and the name of each column. boston_dataset.data is nothing but the values we printed before, and the column names are nothing but these feature names, the crime rate, zone, etc. So I am loading all this data into my DataFrame, boston_df, and using the feature names as column names. Let's run this. Now let's see a sample of this dataset: boston_df.head(). When you use this head function on a pandas DataFrame, it prints the first five rows of that DataFrame; as you can see here, we have printed the first five rows, and we have all these columns, crime rate, zone, indus, and so on. This is how you can see a sample of the dataset. Now let's check the shape of this DataFrame with boston_df.shape, which tells us the number of rows and columns in this particular DataFrame; the first value represents rows and the second value represents columns. In total we have 506 data points, 506 rows for the different houses, and 13 columns. So this is how you can check the shape of a DataFrame, and by using this DataFrame function, this is how we can load a dataset into a pandas DataFrame. Now I am going to show you how you can import a dataset from a CSV file into a pandas DataFrame.
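A sketch of the same pd.DataFrame(data, columns=...) pattern. Note that load_boston was deprecated and then removed from scikit-learn (version 1.2 onward), so on newer installations the import above will fail; the small stand-in array and the three column names below are hypothetical values, but the construction is identical:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for boston_dataset.data and boston_dataset.feature_names
data = np.array([[0.006, 18.0, 2.31],
                 [0.027,  0.0, 7.07],
                 [0.027,  0.0, 7.07]])
feature_names = ["CRIM", "ZN", "INDUS"]

df = pd.DataFrame(data, columns=feature_names)
print(df.head())    # first five rows (here: all three)
print(df.shape)     # (3, 3) -- rows, columns
```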
As I showed you earlier, I have already uploaded this diabetes dataset, which is nothing but diabetes.csv. CSV stands for comma-separated values; I'll show you how this data actually looks by opening it in Notepad. In this file all the values are separated by commas; this is the diabetes dataset, and you can see the column names here, where each of these is one column name and the rest are the data. The problem is that we cannot do any analysis with the data in this form, which is why we need a structured table, and this is where we will use the DataFrame. Now I'll show you how to import this CSV file into a pandas DataFrame. Let's declare the DataFrame as diabetes_df, which means diabetes DataFrame; the previous dataset we worked on was the Boston DataFrame. Just copy the path of this diabetes.csv, and for this you need to use the function pd.read_csv. This read_csv function reads the CSV file and stores all its values in a DataFrame, whereas in the previous case we used the DataFrame function because we already had the data in hand. So use pd.read_csv and, inside quotes, mention the path of the file, which is nothing but diabetes.csv. Now let's run this, and then you can check the data type of this particular object, diabetes_df. Okay, it says not defined; there was a small spelling mistake. Now, as you can see, it is a pandas.core.frame.DataFrame, so it is a DataFrame object. You can check the same with boston_df as well: as you can see, it is a DataFrame, whereas before it was an sklearn Bunch data type. So this is how you can load a CSV file into a pandas DataFrame. You can similarly see the head of this DataFrame with diabetes_df.head().
This will print the first five rows. I'll run this; these are the first five rows, with various columns such as pregnancies. This dataset basically contains values for women: we have the pregnancy values, blood glucose values, blood pressure values, skin thickness, insulin, BMI, the diabetes pedigree function, age, and finally this outcome column. The outcome is nothing but a label: 1 represents that the person is diabetic, and if the label is 0 it means the person is non-diabetic, so we have two labels here, 1 and 0. That is a sample of this dataset. You can also find the shape of this dataset using diabetes_df.shape: we have a total of 768 data points, 768 rows, and 9 columns. You can also read an Excel file; I'll just make a text cell here. Here we have read a CSV file, and you can do the same with an Excel file, loading the data from an Excel file into a pandas DataFrame; I'll just put it in text, and you can try this with some Excel file. It is very similar: you just use the function pd.read_excel, and in the quotes you mention the path of the file. This will read the Excel file and store all its values in a pandas DataFrame. Now let's discuss how we can export a DataFrame to a CSV file. We have discussed how to load the contents of a CSV file into a pandas DataFrame using this read_csv function; now we are going to discuss how to write a DataFrame out to a CSV file, which is the reverse process. What I am going to do is write all the contents of this Boston DataFrame to a CSV file.
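pd.read_csv works the same whether you hand it a file path or any file-like object, so a tiny stand-in for diabetes.csv (the column names match the dataset, but the numbers below are made up) makes the call runnable without the upload:

```python
import io
import pandas as pd

# Hypothetical miniature version of diabetes.csv
csv_text = "Glucose,BloodPressure,Outcome\n148,72,1\n85,66,0\n183,64,1\n"

# With the real file this would be pd.read_csv("diabetes.csv")
diabetes_df = pd.read_csv(io.StringIO(csv_text))
print(type(diabetes_df))     # pandas.core.frame.DataFrame
print(diabetes_df.head())
print(diabetes_df.shape)     # (3, 3)
```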
So mention the DataFrame name, which here is boston_df, then boston_df.to_csv; this to_csv is the function that writes all this data to a comma-separated value file, and in it you mention the file name; I want the file name to be boston.csv. I'll run this. You can see that we don't have any file called boston.csv here yet, and once we run this particular code it will create a CSV file containing all the values from this DataFrame. Here you can see that we now have this boston.csv file, so this is how you can convert a DataFrame to a CSV file. I'll just download this CSV file and open it to show you; you can do this for an Excel file also, loading the contents of a DataFrame into an Excel file as well. Just a second, it's taking some time. Okay, we have this boston.csv file now, and you can open it in Excel or Notepad to see how it looks. So we have successfully converted this Boston data from an sklearn Bunch to a DataFrame, and now from a DataFrame to this CSV file. You can also do this by exporting all the values from a DataFrame to an Excel file: exporting the pandas DataFrame to an Excel file would be boston_df.to_excel, which stores all the values in an Excel file, and in the parentheses you mention the file name. It's the same procedure. So now let's see how we can create a DataFrame with random values. For this I need the NumPy library as well, so I'll import the NumPy library; I'll just add the import in the first line.
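The export step just shown round-trips cleanly; here is a sketch with a small throwaway frame (the file name example.csv is hypothetical, and index=False is an optional extra that keeps the row index out of the file):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
df.to_csv("example.csv", index=False)   # writes a comma-separated file

# Reading the file back recovers the same table
df2 = pd.read_csv("example.csv")
print(df2)
```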
I'll run this, and it will import the NumPy library as np. We are going to create a DataFrame with random values, so let's declare the name of the DataFrame as random_df. This random_df is equal to the same function we used before, pd.DataFrame, and inside it you mention np.random.rand, where np is nothing but the NumPy library; in the parentheses you mention the number of rows and columns you want. I'll say I want 20 rows and 10 columns, so this will create a DataFrame containing 20 rows and 10 columns of random values. I'll run this, and now you can see a sample of this DataFrame with random_df.head(). We got the first five rows of the entire DataFrame; in total we have 20 rows and 10 columns, so we got these 10 columns, and these are nothing but random values. The main thing to note here is that when you use this random.rand function, the values you get will be in the range of 0 to 1, so all the values will be between 0 and 1.
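The random DataFrame above, as a compact sketch:

```python
import numpy as np
import pandas as pd

# 20 rows x 10 columns of random floats in [0.0, 1.0)
random_df = pd.DataFrame(np.random.rand(20, 10))
print(random_df.head())    # first 5 of the 20 rows
print(random_df.shape)     # (20, 10)
```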
so this is how you can create a pandas data frame with random values between 0 and 1, and if you want a different number of rows or columns you can change it here. you can check the data frame shape with random_df.shape, which gives us the shape (20, 10), the values we used here. so that is how you create a random data frame. now let's see how we can do some inspecting on a data frame; we are going to inspect certain features or properties of a data frame. first, finding the number of rows and columns, which we have already seen, is done with the shape attribute. for this i am going to work on the boston data frame, which we have loaded as boston_df, so i'm going to use that particular data frame. mention the data frame name and then .shape, and this returns the number of rows and the number of columns of that particular data set; we have seen that this boston data frame contains 506 rows and 13 columns. now let's see how we can print the first five rows, which i have also shown you already: just mention the data frame name, boston_df.head(), and this prints the first five rows. as you can see, this is helpful for understanding the range of the values in a data frame and what the different values look like. you can also print the last five rows of the data frame by mentioning the data frame name, boston_df.tail(). so the head function prints the first five rows whereas the tail function prints the last five rows, and you can see here it printed the last five rows of the data frame. now let's see how we can get some
information about the data set. the info() function gives us certain information about the data frame: mention the data frame name, boston_df.info(), and run it. it gives information such as the number of entries, and entries are nothing but rows, so 0 to 505; in python indexing starts from 0 and not 1, so it shows 0 to 505, which is 506 values in total. then we have 13 columns, and these are all the different columns, again indexed starting from zero. for each column it shows 506 non-null values, and non-null basically means there are no missing values. if a value is not available, if a value is missing, it will show up here: for example, say 10 values were missing in this CRIM column, the crime rate; in that case it would show 496 non-null values, because 10 values are missing. info() also gives the data type of the values, and here the values are float64, 64-bit floating point, and floating point values are nothing but decimal values, as you can see. it also gives the memory usage of the data frame in KB or MB, whatever it is. so that is how you get certain information about a data frame. you can also find the number of missing values in each column: this info function gives us the number of real values that are present, the non-null count, but you can find the number of missing values directly using boston_df.isnull().sum(), which gives us the number of missing values in each column, though as we have seen, no values are missing in this particular boston data frame. let's print this.
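to make these inspection steps concrete, here's a short sketch on a stand-in frame with the same shape (506 rows) but only two hypothetical columns instead of the real thirteen:

```python
import io
import numpy as np
import pandas as pd

# stand-in for the boston frame: 506 rows, two of the 13 columns
rng = np.random.default_rng(42)
boston_df = pd.DataFrame({'CRIM': rng.random(506),
                          'ZN': rng.random(506)})

print(boston_df.shape)   # (506, 2): 506 rows, 2 columns
print(boston_df.head())  # first five rows
print(boston_df.tail())  # last five rows

# info() reports the index range, per-column non-null counts,
# the dtypes and the memory usage; capture the text here
buf = io.StringIO()
boston_df.info(buf=buf)
info_text = buf.getvalue()
print(info_text)
```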
the number of missing values is 0 in all the columns, which is a very good thing, because in several data sets there will be a lot of missing values, and in those cases we need to replace those missing values with some other values. we will discuss that in the data pre-processing module, but for now just understand that we can use isnull().sum() to find the number of missing values in a typical data frame. now i want to show you some more functions, and those functions can be explained better on the diabetes data frame, so i'll take diabetes_df and print its head again, which prints the first five rows as we know. what we are going to do next is count the number of specific values. in this case you can see that the labels are nothing but 1 and 0, and i have told you earlier that 1 represents that the person is diabetic and 0 represents that the person is non-diabetic. so now we are going to count the values based on this particular label: counting the values based on the labels. mention diabetes_df and use the value_counts function, and in it mention the 'Outcome' column, because i want to count the number of zero and number of one values. let's run this: in total we have about 768 values, and out of those, 500 values have a value of 0 and 268 values have a value of 1, which means this entire data set has non-diabetic values for 500 data points and diabetic values for 268 data points. this is how you can count specific values in a column; you can use value_counts to see what the different values are and how many of each are present. for example, if we mention the 'Age' column here instead, it will tell us how many people are at age 50 or 31 or 32 and so on. so this value_counts function is very helpful for counting values based on their labels, and in this case the labels are nothing but the outcome.
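the missing-value check and the label count can be sketched together; this stand-in frame keeps the real class balance from the diabetes data (500 zeros, 268 ones) and adds one artificial missing value so isnull() has something to find:

```python
import numpy as np
import pandas as pd

# stand-in for the diabetes frame: the real class balance
# (500 non-diabetic, 268 diabetic) plus one artificial missing value
diabetes_df = pd.DataFrame({
    'Outcome': [0] * 500 + [1] * 268,
    'Glucose': [100.0] * 767 + [np.nan],
})

missing = diabetes_df.isnull().sum()            # missing values per column
counts = diabetes_df['Outcome'].value_counts()  # rows per label

print(missing)
print(counts)
```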
now let's see how we can group the data based on these labels; we are going to group the values and look at their mean. mention the data frame name, diabetes_df, and use the groupby function, and in groupby mention the column name 'Outcome', then .mean(). let's see what we get... okay, there was something wrong here, i just needed to add the parentheses. now as you can see, we got the two label values, zero and one, and here we have grouped the values based on this outcome with their mean values. let's not look at the pregnancies column, let's look at the glucose value: for all the labels of zero, for all the non-diabetic people, the mean glucose value is about 109, but the mean glucose value for the people who are diabetic is about 141, and you can also see a similar change in the mean value between diabetic and non-diabetic people in the insulin column. this is very helpful for understanding the different mean values for each of these labels, what the mean values are for people with diabetes and without diabetes. so that is how you can group the values and take their mean. now let's see how we can get some statistical measures about the data set. these measures are very helpful for understanding things like the mean of all the columns and the standard deviation of all the columns. for this i am going to use the boston data frame again, so i'll use that here. first we are going to get the count, the number of values in each column.
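the groupby step above can be sketched on a toy frame; the group means here are chosen to echo the real ones from the video (about 110 vs 141 for glucose), but the five rows themselves are made up:

```python
import pandas as pd

# toy diabetes frame: label column plus one feature
diabetes_df = pd.DataFrame({
    'Outcome': [0, 0, 0, 1, 1],
    'Glucose': [100, 110, 120, 140, 142],
})

# group rows by the label, then take the column-wise mean per group
group_means = diabetes_df.groupby('Outcome').mean()
print(group_means)
```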
the count function, boston_df.count(), tells us the number of values we have in each column, and as you can see, each column has 506 values; it also shows the data type of the result. now let's get the mean value, column-wise: boston_df.mean() gives us the mean value for each column, the mean for this CRIM column is about 3.6, for ZN it is about 11.3, for INDUS about 11.1, and so on. next, the standard deviation, also column-wise: boston_df.std() gives the standard deviation for each column. and the minimum value: the min function gives us the minimum value in each column, boston_df.min(), and similarly we can get the maximum value for each column with boston_df.max(). that is how you get the statistical measures about the data one by one, but there is another method: instead of doing all this separately, we can get all the statistical measures in one go. for that you can use the describe function, boston_df.describe(), and this gives us the count, the number of values, the mean value for each column, the standard deviation, the minimum value, the percentile values, the 25th, 50th and 75th percentiles, and the maximum values. this percentile means that 25 percent of the values are less than about 0.08 in this CRIM column, and 50 percent of the values are less than about 0.25. that is what is meant by percentiles, and they are different from percentages. so this is how you can get all the statistical measures like mean, standard deviation and so on using this describe function, and it is one of the main things we will use, because it gives us an idea of the range and the mean of the data frame, which is very useful for exploratory data analysis.
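here's a minimal sketch of those statistical calls on a four-row stand-in frame (the column names are borrowed from the boston data, the values are made up):

```python
import pandas as pd

boston_df = pd.DataFrame({'CRIM': [0.01, 0.03, 0.05, 0.07],
                          'ZN': [0.0, 12.5, 25.0, 50.0]})

print(boston_df.count())  # values per column
print(boston_df.mean())   # column-wise mean
print(boston_df.std())    # column-wise standard deviation
print(boston_df.min())    # column-wise minimum
print(boston_df.max())    # column-wise maximum

# describe() bundles all of it: count, mean, std, min,
# the 25th / 50th / 75th percentiles, and max per column
stats = boston_df.describe()
print(stats)
```

the describe() result is itself a data frame, so you can index into it with .loc to pull out a single statistic.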
now let's see how we can manipulate a data frame. in this part we are going to see how to drop a row, how to drop a column, and how to add a new column to a data frame. first, adding a column to our data frame. as you have seen before, we imported this boston data from sklearn, so we imported all the data, but we haven't imported the target values. this target is nothing but the price of houses, the prices of houses in boston in thousands of dollars. now i'm going to add all these values to my pandas data frame, this boston data frame: you can see that we don't have this last column, the price, so i'm going to create a column named price and store all these values in it. the price values are stored under the name target, so you need to mention boston_dataset.target; before, we used boston_dataset.data to get all the feature data, and now i'm going to use .target to get all the price values. the main thing to note here is that the number of values should match the number of rows. so on boston_df mention a square bracket, which creates a column, and i want to create a column named 'PRICE'; this 'PRICE' is equal to boston_dataset.target, where boston_dataset is the data set we imported from sklearn. now let's look at the head of the data frame, boston_df.head(), and you can see there is another column called price, the last column. this data set is basically used to predict the price of a house given all the other data.
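the column-adding step looks like this in miniature; the three target values here stand in for boston_dataset.target, and as in the video, the list length has to match the number of rows:

```python
import pandas as pd

boston_df = pd.DataFrame({'CRIM': [0.006, 0.027, 0.027]})

# stand-in for boston_dataset.target: one price per row, in $1000s;
# the list length must equal the number of rows in the frame
target = [24.0, 21.6, 34.7]

boston_df['PRICE'] = target   # square-bracket assignment adds a column
print(boston_df.head())
```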
if you look at the previous view of this data frame, it did not have this price column, so we have added a new column. the main thing to note here is that the number of values should match: we have 506 data points in total, so this price column also contains 506 data points, and if the number of values were different we could not do this. that is how you can add a column, and now let's see how we can remove a row and how we can remove a column. removing a particular row: mention the data frame name, boston_df.drop(), and this drop function is used to drop a particular row or a column. let's see how we can drop a row first. here you need to mention the index, and the index is nothing but this number, 0, 1, 2, 3, 4, like a serial number. i want to remove the first row, which is at index zero, so you mention the index as 0, and you also need to mention the axis as 0. if you want to remove a row you give axis=0, and if you want to remove a particular column you give axis=1. let's run this and see: now you can see that the zeroth row, which is the first row, is removed, and the data frame starts from index one, which is the second row basically. so that is how you remove a particular row by mentioning its index number. now let's see how we can drop a column: mention the data frame name, boston_df.drop(), and in it mention the name of the column. let's say we want to remove the 'ZN' column, the zone value, so boston_df.drop(), and in the columns parameter you mention the name of the column, 'ZN', which you can see here. as i told you earlier, if you want to remove a column you need to give the axis value as 1, whereas to remove a
row you give the axis value as 0. so this will remove the zone column... okay, once again, it should be columns, not column. now you can see there is no 'ZN' column anymore: we have removed that one column, and we get a data frame without it. now let me show you another thing, how you can locate particular rows or particular columns. let's say we want to locate this particular row, the third row, which is given by the index two; i'm going to show you how you can print that row using its index value. locating a row using the index value: mention the data frame name, boston_df.iloc, and this iloc indexer is used to locate a particular row or column. let's say i want to print the third row, whose index is 2. i'll run this, and it gives us all the values in that row at index 2, which contains values starting from 0.027 and 7.07, so all those values get printed. now, one thing to note here: in this data frame we removed a particular row earlier, but that change is not saved, the removed row comes back, because drop does not modify the data frame permanently. if you want to remove it permanently, you can create another data frame, say boston_df2, and store the result there, so that row will be removed and we get a new data frame without it. i just wanted to remove these rows and columns temporarily, hence i used this method. it's the same with the column as well: we get the 'ZN' column again because we haven't permanently deleted it, and if you want to delete it permanently you can store the result in a different data frame. now let me show you how you can print specific columns, locating a particular column.
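here's a sketch of the drop behaviour, including the point about the change not being permanent unless you keep the returned frame (the three-row frame is a stand-in):

```python
import pandas as pd

boston_df = pd.DataFrame({'CRIM': [0.1, 0.2, 0.3],
                          'ZN': [18.0, 0.0, 0.0],
                          'PRICE': [24.0, 21.6, 34.7]})

# axis=0 drops a row by index label, axis=1 drops a column by name;
# drop() returns a new frame, so the original is untouched unless
# you reassign the result (or pass inplace=True)
without_row0 = boston_df.drop(0, axis=0)
without_zn = boston_df.drop('ZN', axis=1)

print(without_row0.index.tolist())  # remaining row labels
print(without_zn.columns.tolist())  # remaining columns
```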
i'm going to print different columns: mention the data frame name, boston_df.iloc, and in the square brackets put a colon, a comma and then 0. if you do this you get the first column. i'll make a comment about what is meant by this particular line: it prints the first column of the data frame. you can do this again, changing the index, including using minus values. as i told you earlier, in python indexing starts from zero, so the index of this first column, CRIM, is zero, 'ZN' is 1, then 2, 3 and it goes on, and i am printing each column specifically by mentioning its index number. if you want to print only a particular column, you just mention its index after the colon and the comma. so this line gives us the values of the second column, this one gives all the values of the third column, and when you use minus one it gives all the values of the last column, and here the last column is nothing but the price. let's try to print this: it gives all the values, and since we cannot show all the values in this output it shows three dots in between. we printed the first column, which is the crime rate, and you can see the crime values starting from 0.006, and you can also see the name of that particular column, so by using this index value i am printing all the values in that column including the column name. similarly i printed all the columns, and finally i printed the last column using the index minus one, which is nothing but the price. so that is how you can locate a particular row or a particular column. the final thing we are going to discuss here in pandas is correlation.
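the iloc lookups can be sketched as follows on a small stand-in frame; row selection takes a single positional index, column selection takes `:` for "all rows" plus a column position:

```python
import pandas as pd

boston_df = pd.DataFrame({'CRIM': [0.006, 0.027, 0.027, 0.032],
                          'ZN': [18.0, 0.0, 0.0, 0.0],
                          'PRICE': [24.0, 21.6, 34.7, 33.4]})

row_2 = boston_df.iloc[2]          # the row at positional index 2
first_col = boston_df.iloc[:, 0]   # every row, first column
last_col = boston_df.iloc[:, -1]   # every row, last column

print(row_2)
print(first_col.name, last_col.name)
```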
i'll explain what is meant by correlation, but basically there are two types: positive correlation and negative correlation. you can see the data frame here, we have 13 feature columns in total, and we also call these columns variables, so including the price we have 14 columns, or 14 variables. correlation is used to find the relationship between these various columns. for example, consider this crime rate and the price: we can say that the crime rate and price are negatively correlated, because if the crime rate increases in a city, the price of houses in that particular area will tend to decrease, so one variable decreases when the other variable increases, and this is known as negative correlation. like this we also have positive correlation, which is the case where one value increases when the other value increases. consider the number of rooms in a house: if the number of rooms in a house increases, the price of that particular house also increases, so the number of rooms and the price are positively correlated, whereas crime and price are negatively correlated. that is what correlation is, and you can find the correlation of a data frame by using the corr function: mention the data frame name, boston_df.corr(), and this function gives us the correlation values. a negative correlation value means the two columns are negatively correlated. you can see all the column names along both the rows and the columns here, so every column is compared with every other column. consider this first row, where the crime column is matched with this price column, and you can see a negative value there.
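here's a small sketch of the corr() call; the five rows are made up so that CRIM moves against PRICE (negative correlation) while RM moves with it (positive correlation), mirroring the relationships described above:

```python
import pandas as pd

# toy frame where CRIM moves against PRICE while RM moves with it
boston_df = pd.DataFrame({
    'CRIM':  [0.1, 0.5, 1.0, 2.0, 4.0],
    'RM':    [6.5, 6.2, 6.0, 5.7, 5.0],
    'PRICE': [34.0, 28.0, 24.0, 19.0, 12.0],
})

corr_matrix = boston_df.corr()
print(corr_matrix)
```

the diagonal of the matrix is always 1, since every column is perfectly correlated with itself.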
a negative value means they are negatively correlated: if the crime value increases, the price value decreases, with a correlation of about -0.39. you can also see positive correlations here: we have a positive correlation of 0.36 between this 'ZN' column and the price, so as the zone value increases the price also increases, they are positively correlated. and you can see RM, which is the number of rooms: if the number of rooms increases the price also increases, so it is positively correlated. this correlation value is very important for us, because it tells us which columns or which features are important for analysis, which columns are related to each other and which columns are related to the price of that particular house, and this helps us understand more about the data and about those features. we can also visualize this as a heat map, which we will discuss in a different video. so that is all about the pandas data frame: we have seen how you can create a data frame, how you can inspect a data frame, such as checking its shape and printing the first five rows, how you can manipulate the data in a data frame by removing a particular row or adding a particular column, and finally how to find the correlation between the columns of a data frame. i hope you understood all the things we have covered in this video. hello everyone, i am siddharthan, welcome to my youtube channel. in this video i would like to give you a detailed tutorial on the matplotlib library in python. as you can see here, i have mentioned that matplotlib is very useful for making plots and graphs. often in machine learning and data science we deal with immense amounts of data, and it is not possible to derive meaning from that data by just looking at the raw values, but when we plot the data in
plots and graphs, it gives us important insights from the data. this is where the matplotlib library comes into the picture, and in this video i'll explain in detail the various functions and plots we can make with matplotlib. before getting into the video, a quick introduction about my channel: this is my youtube channel, and here i'm making a hands-on machine learning course with python. you can see the introduction video here, where i have mentioned the course curriculum, and you can see the videos i am going to post in the future. i will be posting three videos per week: two videos on monday evening and wednesday evening, which will be in the course order, and on friday i will be posting one machine learning project every week. you can download the course curriculum from here, it contains all the details, and you can also join my telegram group, where i post regular updates about my videos; i'll give the link for my telegram group in this video. you can go to the playlist here; currently we are in the third module of my machine learning course. in the first module i explained all the machine learning basics like supervised learning, unsupervised learning, deep learning and so on, and in the second module i made videos on python basics, so you will find videos on all the basic things you need to know in python programming. in the module we are currently working on, i have already posted two videos, a numpy tutorial and a pandas tutorial, and then we have machine learning project videos, which i will post every friday as i said before. so subscribe to my channel, stay connected and follow this course. now let's get into today's video. this environment is called google colaboratory, and if you are new to google
colaboratory and don't know how to use it, you can check out my google colaboratory basics video; it is in the second module, video 2.1, where i explained how you can access google colaboratory and how you can use it. it basically runs python programs. now let's get into matplotlib. first of all, let's see how we can import the matplotlib library. if you put a hash in your code, it means you are writing a comment about your code, and it's always important to write comments about what you're trying to do, because it helps a third person reading your code understand it. so now we are going to import the library: importing matplotlib library, import matplotlib.pyplot as plt. pyplot is nothing but python plot, and instead of typing matplotlib.pyplot every time i want to use it in a short form, so i import it as plt, which is the general convention in python. now you can run this particular cell by pressing shift plus enter; it runs the code and automatically goes to the next cell. as i told you earlier, matplotlib is useful for making plots, but we need data for plotting. when we are working on machine learning projects we will have a data set to plot, but for now let's take some values, and for that we need the numpy library. i have already made a tutorial on numpy, so you can check that one as well if you are new to numpy. import numpy to get data for our plots: import numpy as np, and the general convention for importing numpy is np. now we are going to get some data,
let's say x = np.linspace(0, 10, 100). this linspace function is in the numpy library, which as you can see i have imported as np, and what it does is give equally spaced values between 0 and 10; the third argument is how many values it takes, so we will get 100 evenly spaced values that lie between 0 and 10. that's why i'm using this linspace function, so we will have 100 values in x. now i'll create y, and y is the sine of x, which we have read about in trigonometry: np.sin(x) takes x as the angles and finds the sine value for all those angles, so np.sin is the function that gives us the sine of a particular angle. let's also take z, and z is the cosine value of x, np.cos(x). so x is 100 evenly spaced values between 0 and 10, y is the sine of those 100 values, and z is the cosine of those 100 values. let's run this with shift plus enter. now let's print x, y and z so i can show you what these values look like: we get evenly spaced floating point values, 100 values in total, and these are the values of x. now let's print y: this gives the sine value of all those values, and you can similarly print z, which is nothing but the cosine values; as you can see, it is np.cos, the cosine value. now let's try to plot this. i'll make a text cell here: plotting the data. as you would have guessed by now, if we plot x and y we will get a sine curve, and if we plot x and z we will get a cosine wave. so let's do that, we are going
to build a sine wave. as we have seen, we imported matplotlib.pyplot as plt, and this plt is what we are going to use to make plots. plt.figure() creates an empty figure, and in that figure i want to plot the two values x and y, where x is the values between 0 and 10 and y is the sine of all those values. this will plot the graph of x and y, and then you need to mention plt.show(), which displays our plot. i'll run this... okay, there is an error: we should not use plt.figure(x, y), it's plt.plot(x, y) and then plt.show(). as you can see, we got a sine wave, because x is all these values between 0 and 10 and y is the sine of those values. if we don't mention what kind of plot we want, it gives us a line plot, so all the points are plotted and joined by a line, and we get a sine wave because y is nothing but the sine of x. now let's similarly build a cosine wave: it's the same plt.plot function, the only difference is that now we plot x and z, because z is the cosine of x, then again plt.show(), and as you can see, now we got a cosine wave. so that is how you can get some values for x and y or z and plot those values using the matplotlib library. now, if you look at this graph it is not complete: it does not have any title, it doesn't tell us what this x-axis is, and it does not tell what this y-axis is. we can add an x label, a y label and a title to our graph, so let me explain how you can do that.
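the sine and cosine plots above can be sketched like this; the Agg backend line is an addition so the sketch also runs on a machine without a display (in colab you would not need it):

```python
import matplotlib
matplotlib.use('Agg')   # headless backend so no display window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)   # 100 evenly spaced angles from 0 to 10
y = np.sin(x)
z = np.cos(x)

fig, ax = plt.subplots()
ax.plot(x, y)   # sine wave; with no format string you get a line plot
ax.plot(x, z)   # cosine wave on the same axes
# plt.show() would display the figure in an interactive session
```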
adding title, x-axis and y-axis labels: let's again plot the sine wave with plt.plot(x, y). the plt.xlabel function gives a label to the x axis, and i want the x label to be 'angle', because x is nothing but the angle values whose sine we are finding. plt.ylabel is the function to name our y axis, so let the y label be 'sine value', and plt.title is just to give a title to our graph, so let the title be 'sine wave'. then it's the same plt.show(). let's run this: we have plotted the sine wave, but in this case the x label tells us it is the angle, the y axis is nothing but the sine value, and the title of this plot is 'sine wave'. you can similarly change the title, x label and y label to anything you want, and that is how you give names for the x-axis, the y-axis and the title of the plot. now that we have successfully plotted a sine wave and a cosine wave, let me tell you how you can plot a parabola. for the parabola, take x = np.linspace(-10, 10, 20): linspace is the function that gives us numbers in a particular range, in this case i want the numbers between -10 and +10, and i want 20 values, so this gives us 20 equally spaced values between -10 and +10. then let's say y = x**2, which means x power 2, or x squared, and this will give us a parabola curve. let's plot this: plt.plot(x, y) and plt.show(). i'll run this, and as you can see, it gives us a parabola; y = x squared is the equation for it. similarly you can give an x label, y label and title to this graph when you are practicing. this is a line plot, and as i told you earlier, if you don't mention what kind of marker you want, it just takes the default, a line. now let me show you how to plot the same parabola with dots and other symbols.
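the labelling step can be sketched as follows; this uses the object-oriented set_xlabel / set_ylabel / set_title methods, which are the same idea as the plt.xlabel / plt.ylabel / plt.title calls in the video:

```python
import matplotlib
matplotlib.use('Agg')   # headless backend for running without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))
ax.set_xlabel('angle')       # plt.xlabel('angle') in the pyplot style
ax.set_ylabel('sine value')  # plt.ylabel(...)
ax.set_title('sine wave')    # plt.title(...)
```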
let the x and y values be the same as before, and you can also print them to check. now i am going to plot with different symbols: plt.plot(x, y, 'r+'), so i want to plot x and y in red color using the plus symbol, then plt.show(). as you can see, it plots the parabola, but with the plus symbol as i mentioned, where 'r' represents red color, so it plots red colored plus marks. you can use other symbols as well: for example, plt.plot(x, y, 'g.') means it plots with the dot symbol, and the dots will be green in color; as you can see, now the values are plotted as green dots. that is how you plot graphs with different symbols and colors. let's do one more: plt.plot(x, y, 'rx') plots the values with the x mark, where 'x' means the x symbol and 'r' means red color. there are similarly various symbols and colors, and you can refer to the matplotlib documentation: just search for matplotlib documentation in google, and it takes you to the official site, where you will find the explanation for all the functions, and you can see what different colors and symbols you can use in these plots. now i'll tell you how to plot multiple graphs, or multiple lines, in a single plot. let me take x = np.linspace(-5, 5, 50), so values from -5 to +5, and in this case i want 50 equally spaced values between -5 and +5. previously we computed a separate variable y for each plot, but you can also do it in a simpler way and pass the expression directly.
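the format-string variants can be sketched as follows; the string combines a colour letter and a marker, exactly as described above:

```python
import matplotlib
matplotlib.use('Agg')   # headless backend for running without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 20)   # 20 evenly spaced values
y = x ** 2                     # the parabola y = x squared

fig, ax = plt.subplots()
# the format string is colour + marker: 'r+' red plus marks,
# 'g.' green dots, 'rx' red x marks
line, = ax.plot(x, y, 'r+')
```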
plt.plot: in the first plot I want x and the sine value of x. Instead of creating a separate y variable and assigning np.sin to it, I can pass np.sin(x) directly, so plt.plot(x, np.sin(x)) will plot x against sin(x). I want the line style to be "g-": the "g" represents green and the hyphen means a normal solid line. That gives us the sine wave. In the same figure I want to plot another thing as well: plt.plot(x, np.cos(x)), and in this case I want a red dotted line, so by mentioning two hyphens, "r--", we get a dashed line. Then plt.show(). Let's see what we get. As you can see, we have two graphs in this single plot: a solid green line made with the np.sin function and a red dashed line made with the np.cos function. This is how you can plot multiple lines in a single figure. Now let me explain another type of plot, the bar plot. Bar plots are also very important in data science and machine learning, because they give us several insights; you can also watch my project videos, where I have used bar graphs in different projects. So now we are going to look at the bar plot. Let's create a variable called figure: figure = plt.figure(). This creates an empty figure, and in that empty figure we will do several things. Next, mention ax, where ax represents the axes: figure.add_axes([0, 0, 1, 1]). What this does is enclose our plot in a rectangle. The first two zeros are the coordinates, so 0 and 0 means the origin, and the 1 and 1 represent the width and height of the rectangle. It basically specifies the area in which the plot will be drawn.
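The two-line figure described above, as one runnable cell:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 50)     # 50 equally spaced values
plt.plot(x, np.sin(x), "g-")   # solid green sine curve
plt.plot(x, np.cos(x), "r--")  # dashed red cosine curve
plt.show()
```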
earlier we used the plt.plot function directly, so we didn't need to mention any axes, but here we are creating an empty figure and adding axes to it, which is why we call this add_axes function. As I told you, the 0 and 0 are the origin, and the 1 and 1 represent the width and height of the plot area we want. Now I'll create a variable called languages. Let's take five languages: say English, let the second language be French, then Spanish, Latin, and finally German. So we took five languages. Let's say we have a group of people, and in that group, people speak these five languages; let's say there are about 100 people speaking English, and some number speak French, Spanish, Latin, and German. Let's give some numbers for the people speaking each language. I'll create another variable, people, which is basically a list; a list is enclosed in square brackets. Let's say there are 100 people speaking English, 50 people speaking French, 150 people speaking Spanish, 40 people speaking Latin, and, just picking some random numbers, 70 people speaking German. As you can see, I have enclosed the language values in quotes, because you need to include strings, or text, in quotes, but you don't need to enclose numbers, or integers, in quotes. So we have created two lists: one is languages and the other one is people, with 100 people speaking English, 50 people speaking French, and so on. Now let's plot this in a bar graph and see what we get: ax.bar, and I want to plot languages against people. I'll give the x label as "LANGUAGES", all in caps.
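The figure, axes, and bar steps above can be collected into one sketch. Note that on an Axes object the labels go through set_xlabel and set_ylabel rather than plt.xlabel:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

fig = plt.figure()                    # empty figure
ax = fig.add_axes([0, 0, 1, 1])       # [left, bottom, width, height] fractions
languages = ["English", "French", "Spanish", "Latin", "German"]
people = [100, 50, 150, 40, 70]       # made-up counts, as in the video
ax.bar(languages, people)
ax.set_xlabel("LANGUAGES")            # Axes objects use set_xlabel / set_ylabel
ax.set_ylabel("Number of people")
plt.show()
```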
And I want the y label to be "Number of people". Now we can display the graph using plt.show(). I'll run this, and it puts all these values into a bar graph. This tells us which language is spoken the most, which is second, which is third, and so on. When you have 100 or more data points, it is very useful to plot the values in this kind of bar plot: it tells us how many categories there are and how many people fall into each category, so it helps us visualize the magnitude of each value. For Spanish it is huge, of course. In many cases in data science and machine learning we use bar graphs like this to understand the data; this is known as data visualization, and among the several plots and analyses used in data visualization, this is one of the important ones. Now let's discuss another important plot, which is the pie chart. A pie chart is very useful for seeing the distribution of the data across the entire dataset. So now we are going to build a pie chart. Let's create a figure as figure1: figure1 = plt.figure(). It's the same procedure as before, and I'll create the axes as figure1.add_axes with the same values, [0, 0, 1, 1]; these are commonly used values, but you can change them and see what happens, so I'll just copy and paste them here. Now let's see how we can plot a pie chart. On the ax we created with add_axes, we are going to build the pie chart using the function ax.pie. This creates a pie chart, and you mention what you want inside it: I want to build a pie chart containing the number of people, so pass people, and I want the labels to be the languages, so labels=languages. And there is another parameter
here, autopct. This tells the plot how many decimal places we need; let's say we want one digit after the decimal point, so the format "%1.1f%%" gives us one decimal value after the point. That is the syntax for it. Then plt.show(). Let's run this. Okay, we did something wrong here: ax.pie(people, labels=languages, autopct=...), so what is missing? It's nothing serious; we just need to include the "f" for the floating-point values in the format string. Let's run it now, and we get a pie chart containing the various values: all the language names appear as labels (French, English, German, Latin, Spanish), and it shows the percentage of people for each language. We can see that Spanish is the most spoken language, with about 36 percent of the data. So this is how you can build a pie chart using matplotlib. Now let's build a scatter plot; make a text cell here saying "scatter plot". Let's take x as np.linspace with values from 0 to 10 and 30 values, let y be the sine value of x, so np.sin(x), and let z be np.cos(x). Now let's build the scatter plot. Declare figure2 = plt.figure(), the same procedure, then mention the axes: ax = figure2.add_axes([0, 0, 1, 1]); I just need to put the values in brackets here. It's the same procedure up to this point, and now we use ax.scatter, the function that helps us build a scatter plot. Pass x and y, and let the color be green, so this plots a scatter plot of x against y in green.
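Collecting the pie-chart steps from above into one runnable cell (same made-up languages/people lists as in the bar plot):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

languages = ["English", "French", "Spanish", "Latin", "German"]
people = [100, 50, 150, 40, 70]

figure1 = plt.figure()
ax = figure1.add_axes([0, 0, 1, 1])
# autopct="%1.1f%%" prints each slice's share with one digit after the decimal
ax.pie(people, labels=languages, autopct="%1.1f%%")
plt.show()
```

With these numbers the Spanish slice works out to 150/410, about 36.6 percent, which matches the value read off the chart in the video.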
We also plot another series, ax.scatter with x and z, and let's put color="b", then plt.show(). So z will be plotted in blue and y will be plotted in green. Let's run this. As you can see, it has drawn both scatter plots in a single graph. This is a scatter plot, where the data points are scattered: there is no line joining the points. Scatter plots are very useful in clustering applications. In this figure I have plotted two series: all the values of y are plotted in green and all the values of z are plotted in blue. That is how you can build a scatter plot. Now, finally, let's see how we can build a 3D plot; we are going to build a 3D scatter plot. Create the figure as figure3 = plt.figure(), then the axes with plt.axes, and here we need to mention something extra. Before, we specified the axes as a rectangle, right? A rectangle is a 2D shape, and now we want a 3D form, so in plt.axes we need to mention projection="3d". This creates a 3D plot. Let's take some values: I want a spiral-shaped scatter plot, so I'll just use values that produce a spiral, z = 20 * np.random.random(100). The random.random function in numpy gives us random values; you mention the number of values you want, and I want 100 values, hence I am mentioning 100. This gives us random values in z. Then let's take x as np.sin(z) and y as np.cos(z). Now we mention the plot we want, a scatter plot, and there is one more parameter we are going to use; you can see in its description that it takes an array or list of colors. In ax.scatter we pass x, y, and z.
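A sketch of both scatter figures, the 2D sine/cosine one and the 3D spiral (the random seed behavior means the spiral's exact points differ run to run):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

# 2D scatter: sine values in green, cosine values in blue
x = np.linspace(0, 10, 30)
y = np.sin(x)
z = np.cos(x)
figure2 = plt.figure()
ax = figure2.add_axes([0, 0, 1, 1])
ax.scatter(x, y, color="g")
ax.scatter(x, z, color="b")
plt.show()

# 3D spiral scatter: 100 random heights, each point colored by its z value
figure3 = plt.figure()
ax3 = plt.axes(projection="3d")
z3 = 20 * np.random.random(100)
ax3.scatter(np.sin(z3), np.cos(z3), z3, c=z3, cmap="Blues")
plt.show()
```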
These are the x values, y values, and z values; since this is a 3D plot, we are using three axes, x, y, and z. Then for c I'll mention z, and for cmap I want the color map to be "Blues", so it gives us the plot in blue shades; you need to mention "Blues". Then plt.show(). Let's run this, and it gives us a 3D scatter plot: as you can see, we got a spiral-shaped scatter plot in a 3D projection. This is how you can create 3D plots; similarly, you can create line plots in 3D and other types of graphs as well. That's all for this video; I hope you have understood the things we covered, so I'll just give you a quick recap of what we have seen. We saw that we import matplotlib's pyplot as plt, which is the general convention; the matplotlib library is used for making plots and graphs, and we imported the numpy library for creating some data for our plots (while working on data science and machine learning projects, we will plot the data in our dataset instead). Then, using the linspace function, we got some evenly spaced values between 0 and 10, found their sine and cosine values, and plotted the sine wave and cosine wave using plt.plot. We saw how to give an x label, y label, and title to our plots, how to build a parabola curve, how to plot graphs with different symbols and colors, and how to plot multiple graphs in a single figure. We also covered bar plots, then discussed how to build a pie chart, then a scatter plot with multiple series, and finally the important 3D scatter plot. These are some of the important plots we have in Python. In the next video let's discuss Seaborn, which is also another library for making plots. Hello everyone,
this is Siddharthan, welcome to my YouTube channel. In this video I would like to give you a detailed hands-on tutorial on the Seaborn library in Python. In the previous video we discussed the matplotlib library in Python, and Seaborn is another library that is very useful for data visualization purposes. In machine learning we often deal with thousands or even lakhs of data points, and it is impossible to derive meaning or insights by just looking at the raw dataset; but when we plot the data in a suitable graph or plot, it helps reveal the meaning we are looking for. This is where data visualization comes into play, and this is where Seaborn is going to help us. So that is the topic we are going to see today. Before getting started with today's video, I would like to give you a quick introduction to my YouTube channel. On my channel I am making a hands-on machine learning course with Python; if you go to my channel, you can see the machine learning course curriculum, in which I have explained all the modules and the videos I will be posting in the future, and you can also download the course curriculum file from there. You can also check the playlist: we have already completed two modules in our machine learning course. The first module is on machine learning basics, the second module is on Python basics, and currently we are discussing the third module, which is important Python libraries for machine learning. In this module we have already covered numpy basics, pandas basics, and matplotlib basics, and this video is on Seaborn. You can also check the machine learning project videos; I upload one machine learning project video every Friday evening, so you can stay subscribed and watch those if you want. The remaining modules will be uploaded every week. With that being said, we can start today's video. This environment is called Google Colaboratory, and in
Google Colaboratory you can run Python programs. If you are new to Google Colaboratory, you can watch my Google Colaboratory basics video; it is the first video in the "Python basics for machine learning" module, and in it I have explained how you can access Google Colaboratory and how you can use the different features available there. So if you are new to Colaboratory, just check that video. With that being said, let's get started with today's video. The first step is to import the libraries, so I'll make a text cell here: "importing the libraries". First let's import the Seaborn library: import seaborn as sns. I want to import Seaborn in a short form, and sns is the general convention we use for importing the Seaborn library, just as we use np for numpy. Now let's also import matplotlib: import matplotlib.pyplot as plt. Sometimes we make a matplotlib plot and draw some Seaborn plots on top of it, which is why we need the matplotlib library as well. I'll also import numpy and pandas, since we may need them in our code: import numpy as np and import pandas as pd. Let's run this; you can press shift + enter to run a cell and go to the next one. We have successfully imported our libraries, and the main focus will be on Seaborn. Now what we are going to do is import some datasets. The Seaborn library has some toy datasets, some example datasets that we can use to understand the different plots available in Seaborn, and I'm going to show you how you can access these toy datasets. I'll just make a note here that Seaborn has some built-in datasets; these are some basic machine learning datasets. The first one is the tips dataset, which records restaurant bills and the tips people gave.
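The four imports described above, with their conventional aliases, look like this as one cell:

```python
import seaborn as sns            # sns is the conventional alias for seaborn
import matplotlib.pyplot as plt  # matplotlib underlies seaborn's figures
import numpy as np
import pandas as pd
```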
This dataset gives the total bill a person paid in a hotel, the tip they gave, and whether they ate lunch or dinner, those kinds of things; it basically tells how much tip a person gives based on their total bill and other features. Let's load this dataset. I'll declare a variable tips: tips = sns.load_dataset, where this load_dataset function fetches the dataset we want, and in the brackets we mention the name of the dataset; here I want the tips dataset. If you want a description or more details about the functions you are using, just search for the Seaborn documentation; it will take you to the official Seaborn documentation, where you can learn what each particular function does. So in Google, just search "seaborn documentation". Now we are importing the tips dataset into the tips variable, so I'll run this. It is imported in the form of a pandas DataFrame, and for a pandas DataFrame we can print the first five rows using the head function: tips.head(). This gives us the first five rows of the dataset. As you can see, we have seven columns in total: total_bill; tip; sex, the gender of the person; smoker, whether the person is a smoker or not; the day; the time, whether it is dinner or lunch; and size, the size of the group, whether two people or three people are coming. This column gives the total bill and this one the tip they are paying, in dollars. The purpose of this dataset is to analyze the various things, such as the total bill paid, whether the person is male or female, whether they are a smoker or not, whether they had dinner or lunch, and the group size, and using those, predict the tip they are going to give.
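sns.load_dataset("tips") fetches the example data over the network. If you want to experiment offline, a tiny hand-made frame with the same seven columns behaves the same way for head(); the values below are made up:

```python
import pandas as pd

# made-up rows in the shape of seaborn's tips dataset
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip":        [1.01, 1.66, 3.50],
    "sex":        ["Female", "Male", "Male"],
    "smoker":     ["No", "No", "No"],
    "day":        ["Sun", "Sun", "Sun"],
    "time":       ["Dinner", "Dinner", "Dinner"],
    "size":       [2, 3, 3],
})
print(tips.head())  # first five rows (here, all three)
```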
There are some logical patterns here: if a person runs up a large bill, they are of course going to pay a bigger tip, right? Those are basic logical relationships, and we can visualize them by plotting the data; it is not easy to infer those things by just looking at the raw dataset. So now what we will do is plot, that is, visualize, this tips dataset. We use the function sns.relplot, which will give us two plots, and I'll show you what the parameters are now. The first parameter is data, where we mention the data we are going to plot; we are plotting this tips data, which I imported into the variable called tips, so mention it here: data=tips. Next we need to mention the x axis we want. I want the x axis to be the total bill paid, so let the x axis be total bill, enclosed in quotes; here we mention the column name we want on our x axis, which is total_bill. On the y axis I want the tip column, so mention y="tip". I want the col parameter to be "time": basically what happens is it gives us two plots, where the first plot is for all the records where the person had dinner, and the other plot is for everyone who had lunch. That is what the col parameter does. I want hue to be "smoker", and the style will also be "smoker"; these are the parameters by which the plot will be differentiated (there was a small mistake in the quotes here). So style="smoker", and then size="size". Basically, I mentioned the important columns I need for my plot: on the x axis the total bill, on the y axis the tip, and in col the time.
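The full relplot call with all five column mappings looks like this; I use a made-up stand-in frame so the sketch runs without downloading the real tips data:

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

# stand-in rows for the tips dataset (made up; the real one comes from
# sns.load_dataset("tips"), which needs a network connection)
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip":        [1.01, 1.66, 3.50, 3.31],
    "smoker":     ["No", "Yes", "No", "Yes"],
    "time":       ["Dinner", "Lunch", "Dinner", "Lunch"],
    "size":       [2, 3, 3, 2],
})
g = sns.relplot(
    data=tips,
    x="total_bill", y="tip",
    col="time",      # one panel per meal time (Dinner / Lunch)
    hue="smoker",    # color by smoker status
    style="smoker",  # marker shape by smoker status
    size="size",     # marker size by party size
)
plt.show()
```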
In style, I want whether the person is a smoker or not, so the markers will be differentiated by smoking status; hue works in a similar way; and the last one is the size. Let's see what plot we are getting. Let's run this. Okay, there is a mistake here; sorry, it should be relplot, meaning a relational plot. Now you can see: we get plots in which, if the person is a smoker, the points are drawn as blue dots, and if the person is not a smoker, those data points are plotted as orange x marks. And this represents the size: if only one person came for the lunch or dinner, the size of the dot is small, and if five or six people came as a group, those points are drawn as larger dot marks. That is what this shows, and this is the advantage of Seaborn over matplotlib. In matplotlib you need to state explicitly that you want the smokers differentiated as blue dots and the other people marked with an x mark, and that you want different marker sizes for different group sizes; you need to mention all of that manually when using the matplotlib library. But when you are using Seaborn, you don't need to do that: Seaborn automatically finds those differences and plots accordingly. You can do another thing as well: you can set a default theme for your plots. For that we use sns.set_theme(). Just watch the plot here: when I rerun the plot after running sns.set_theme(), the theme is changed, it becomes more spacious, and we also get some grid lines. As you can see, it is now more spread out, more convenient to read, and we also have a grid. This is the default theme we have set using the sns.set_theme
function. You don't need to run it every time: once you run it in your code, every plot you make in Seaborn afterwards will use this theme. With that being said, let's take another example dataset and see what other plots we have in Seaborn. This time I am going to load the iris dataset. The iris dataset is another important dataset in machine learning when you are starting to learn; these are some classic datasets. I'll create another variable, iris: iris = sns.load_dataset("iris"), the same function we used for the tips dataset, and in the quotes you mention "iris". Let me run this, and let's see the first five rows of this iris DataFrame: iris.head(). I'll run this, and now you can see the first five rows. We have sepal_length, sepal_width, petal_length, and petal_width; these are measurements of the iris flower, and we have the species column here. There are three species in iris in total: Iris setosa, Iris versicolor, and Iris virginica. The idea here is to predict which species a particular iris flower belongs to based on its sepal length, sepal width, petal length, and petal width. That is the problem statement for this particular dataset. Now let's try to plot this dataset. We can understand this data best using a scatter plot, so I'll add a text cell here, "scatter plot". Let's see: you can make a scatter plot in Seaborn using the sns.scatterplot function. For x, let's take the sepal length, and for y, let's take the petal length (we could also use petal width). So let the x axis be sepal_length and the y axis petal_length. And hue will be "species".
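The sns.scatterplot call with hue can be sketched like this; a few made-up rows stand in for the iris data, since load_dataset needs a connection:

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

# made-up rows shaped like seaborn's iris dataset
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})
ax = sns.scatterplot(x="sepal_length", y="petal_length",
                     hue="species", data=iris)  # one color per species
plt.show()
```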
So the plot will be differentiated by species: setosa will be in one color, the second species in another color, and the third species in yet another; that is what hue does. Now you need to mention the data you want to plot, which here is the iris data, so mention data=iris. Now let's run this. You can see the plot: all the Iris setosa points have been plotted in blue, Iris versicolor in orange, and virginica in green, and you can see a clear separation in the data; only these few points overlap between species. This is how you can separate and plot the iris flowers based on their petal length and sepal length; it helps us differentiate the data points, or the flowers, based on their features. You can also make another scatter plot with different parameters. Before, we used sepal length and petal length; now let's use sepal length and petal width and see whether we get a better plot or not. I would say this plot is somewhat better than the previous one, because there are fewer overlapping points here. So this is how you can plot based on different features using a scatter plot; that is the value of a scatter plot, where you can cluster the data points based on their parameters. Having seen the scatter plot, let's look at some other types of plots, and for these I'm going to use another interesting dataset: the Titanic dataset. "Loading the titanic dataset". Let me explain what this dataset is after importing it. I'll declare a variable titanic = sns.load_dataset("titanic"), and then titanic.head(). You can see here: this is basically whether a person survived the Titanic crash or not, where 0 means they
haven't survived and 1 means they have survived. This column represents the class of their ticket, whether first class, second class, or third class, and then their gender, age, and such things. The idea behind this dataset is to predict whether a person survived the Titanic based on these features: whether they are male or female, what their ticket class is, what their age is, and so on. For example, in that kind of crisis, when a ship is sinking, priority is given to women and children, right? Those factors are taken into account, and we would try to build our machine learning system to predict whether a person survived or not based on these features. That is the problem statement of this dataset, and now we are going to make some plots on it and analyze it. First, let's make a count plot on this Titanic dataset: "count plot". This helps give us the number of people, or number of data points, we have for each category. You can make a count plot with sns.countplot, and for the x axis let me put "class"; you can see the classes here, third class, first class, second class. And I want to use the titanic dataset as the data. Now let's run this. Okay, a "no attribute" error; what did I do wrong here? sns.countplot(x="class", data=...): the data argument shouldn't be in quotes, because it's a variable, as you can see here; only the strings should be in quotes. Now you can see: the number of people in third class is the largest, and the numbers in first class and second class are smaller. This gives us the counts of people. We can also check the number of data points using the function titanic.shape, which gives the number of rows and columns: in total we have 891 rows, that is, 891 people.
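The countplot call, including the data-as-a-variable fix, can be sketched with a few made-up Titanic-shaped rows:

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

# a few made-up rows standing in for the titanic dataset
titanic = pd.DataFrame({
    "class":    ["Third", "First", "Third", "Second", "Third"],
    "survived": [0, 1, 0, 1, 0],
    "sex":      ["male", "female", "male", "female", "male"],
})
ax = sns.countplot(x="class", data=titanic)  # data is a variable: no quotes
plt.show()
print(titanic.shape)  # (rows, columns)
```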
Of those 891 people, almost 500 are in third class, around 200 in first class, and almost 180 in second class; that is the insight we are getting. We can make another important count plot here to see how many people survived: sns.countplot(x="survived", data=titanic). You can see the survived column: 0 means the person did not survive and 1 means the person survived. You can see here that the number of people who didn't survive is larger, and the number who survived the sinking is somewhat smaller: out of the 891 records, almost 550 to 600 people did not survive, and roughly 300 people survived the crash. This helps us visualize the data, because it is not possible to just look at the raw dataset and find what percentage of people survived; this is where data visualization is very important. Now let's see another important type of plot, the bar chart, so I'll make a text cell, "bar chart". To make a bar chart you use the function sns.barplot. For x, let's plot based on gender, so first let x be "sex"; in y, let's put "survived"; and let's classify it by ticket class, whether first, second, or third class, so hue="class"; finally, mention the data, which is titanic. Now let's run this. Now you can see here we have a bar chart of whether a person survived or not; it tells us the proportion of people who survived. The insight we get is that the proportion of females who survived the sinking is much higher than the proportion of males, which clearly tells us that priority for survival was given to females: when the disaster happened, many more females survived.
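The barplot call described above, sketched on the same kind of made-up stand-in frame; note that with a 0/1 column on y, each bar's height is the mean, i.e. the survival rate:

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

# made-up rows standing in for the titanic dataset
titanic = pd.DataFrame({
    "sex":      ["male", "female", "male", "female", "male", "female"],
    "survived": [0, 1, 0, 1, 1, 0],
    "class":    ["Third", "First", "Third", "Second", "First", "Third"],
})
# bar height = mean of "survived" per sex, split into one bar per class
ax = sns.barplot(x="sex", y="survived", hue="class", data=titanic)
plt.show()
```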
Far more women were chosen to survive the crash than men; that is the insight we are getting. The colors here come from the hue parameter, which represents the class: blue represents first class, orange represents second class, and green represents third class. This is how you can plot multiple parameters in one chart. For example, within "male" we have three bars, the first class, second class, and third class values, and we have the same for "female". Here one parameter is survived, the second parameter is sex, and within that we used another parameter, class; this is how you plot for multiple parameters, and this is where the bar chart is really useful. Now I'm going to take another interesting dataset for the two important plots we are going to see next: the house price dataset. This dataset is not available in Seaborn, so I am going to import it from the sklearn library; sklearn (scikit-learn) is another machine learning library. We have already made a project on house price prediction, so you can check that video in my machine learning project playlist. Now I'm going to import the house price data from that library; this is basically the Boston house price data: from sklearn.datasets import load_boston. These are house prices from the Boston area in the US. I'll create a variable house_boston: house_boston = load_boston(). This loads the Boston dataset into the house_boston variable, but it won't be a pandas DataFrame, and we want the data to be in a pandas DataFrame, so I'm going to store it in one: house = pd.DataFrame, where we have already imported pandas as pd.
You can see in the first cell that we imported pandas as pd, and that is what we are using here. So house = pd.DataFrame, and inside that we mention house_boston.data, and for columns I pass the feature names, house_boston.feature_names; I'll explain in a minute what these parameters are, let me complete it first. Then I'll create the price column: house["price"] = house_boston.target. Let's run this. "house_dataset is not defined": where am I mentioning house_dataset? Instead of house_dataset I need to put house_boston, because I named the variable house_boston, right? So house_boston.target. Okay, so now let's print the first five rows of this house DataFrame. I imported the data into the variable called house, so house.head() gives us the first five rows of the dataset. You can see here that we have different parameters, such as the crime rate in that particular Boston neighborhood, the zone, the indus column, and such things. The rm column represents the number of rooms, there is a column related to the age of the residents, and finally we have this price column. The problem statement of this dataset is that we need to predict the price of a house from the several other parameters, such as crime rate, number of rooms, zone, etc. So what have I done here? The object that load_boston returns holds its values in numpy array format, and I want to load those numpy arrays into a DataFrame; that is why I used the DataFrame function to convert this numpy data into a DataFrame. In that object, all the numerical values are stored in arrays.
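The bunch-to-DataFrame pattern above can be sketched without scikit-learn at all. Note that load_boston was removed in scikit-learn 1.2, so a small stand-in object (with made-up numbers) is used here; the DataFrame-building steps are identical:

```python
import numpy as np
import pandas as pd

# stand-in for the object load_boston() used to return (load_boston was
# removed in scikit-learn 1.2, but the DataFrame-building pattern is the same)
class Bunch:
    data = np.array([[0.006, 6.575],
                     [0.027, 6.421]])   # made-up feature rows
    feature_names = ["CRIM", "RM"]      # crime rate, number of rooms
    target = np.array([24.0, 21.6])     # made-up house prices

house_boston = Bunch()
house = pd.DataFrame(house_boston.data, columns=house_boston.feature_names)
house["price"] = house_boston.target    # target values become the price column
print(house.head())
```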
stored in a particular key called "data", and "feature_names" contains all the column names. So what I did is import all the data first and then add the column names using feature_names, and then I imported the price. Let me just add a code cell here and show you how the dataset looks, so it is easier for you to understand: print house_boston. This is a scikit-learn object holding NumPy arrays. You can see here there is an array called "data"; this particular array contains all the values, and that is what I used here to import the data. You can also see "feature_names", which contains all the column names; that is what I imported using the columns parameter. And this data array doesn't contain the price of the houses; the prices are in a separate array called "target". That's why I created a column called price and put all the price values into that column. So that is what I have done here, and now we have seen the first five rows of the dataset: we have the different parameters, and last we have the price column. Next we are going to see two important plots in seaborn using this dataset: a distribution plot and a correlation matrix. First, let's see the distribution plot. A distribution plot shows us the range of the values, in other words, in which range most of the data points lie. Let's create one: sns.distplot(), and in that let me mention the column whose distribution I want, which is house['price']. Let's run this.
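The DataFrame assembly described above can be sketched as below. Since load_boston has been removed from recent scikit-learn releases, this sketch mocks the same structure (data, feature_names, target) with a few made-up rows, so the pattern runs anywhere:

```python
import numpy as np
import pandas as pd

# mock of the structure load_boston returned: .data, .feature_names, .target
# (the numbers here are made up for illustration)
data = np.array([[0.006, 6.575, 65.2],
                 [0.027, 6.421, 78.9]])
feature_names = ["CRIM", "RM", "AGE"]
target = np.array([24.0, 21.6])

# same two steps as in the video: wrap the array with column names,
# then attach the target array as a "price" column
house = pd.DataFrame(data, columns=feature_names)
house["price"] = target
print(house.head())
```

The same two-step pattern works for any scikit-learn dataset that exposes data, feature_names, and target.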
You can see here that most of the values are in the range of around 20, and 20 here means $20,000. There are fewer values in this lower range and fewer in this upper range, and the most values in this middle range. This gives us the distribution of the values, and it is one of the important plots we need, because it tells us the distribution of the price: what the magnitude of the price is and which range has the maximum number of values. Maybe it's 15 to 25; that particular range has the most data points. That is the importance of the distribution plot; it also gives us a probability curve, and you can see the maximum values are around the 20 range. So this is the importance of the distribution plot: it gives us the range in which the data is distributed. Now let's see another very important plot, the correlation matrix. There are two types of correlation: positive correlation and negative correlation. Positive correlation means, say we take two variables, RM (the number of rooms) and price: we know that if there are more rooms in a house, the price will definitely be higher, so these two variables are positively correlated. If two variables are positively correlated, then when one value increases, the other value also increases. When can we say two values are negatively correlated? Consider the crime rate and the price: if the crime rate is higher, the price will reduce, right? So if one value increases, the other value decreases, and that is where we say they are negatively correlated. Now let's see how we can plot this correlation.
For plotting it we use a heatmap. For constructing the heatmap I am also using the matplotlib library. Before that we need to find the correlation values, so I'll create a variable: correlation = house.corr(). This gives us the correlation values, so let me run this. Now we can plot our heatmap. If you remember, we imported matplotlib.pyplot as plt. We are going to create a basic matplotlib figure and, on top of that, a seaborn heatmap; I'll explain what a heatmap is in a minute. So plt.figure(), and we need to mention the figure size: let it be (10, 10). You can give any size you want, but 10 by 10 is convenient in this case. sns.heatmap() is the function that gives us this plot, and we pass it the correlation, because that is what we want to plot. Now we need to mention some parameters: cbar=True; I'll first fill in all these parameters and then explain what each of them means. square=True, fmt='.1f', annot=True, annot_kws={'size': 8}, and cmap='Blues'. Now let's run this. Okay, invalid syntax: it should be annot=True, and there should be a comma here. Now we get a correlation matrix, shown as a heatmap, so let me explain what this means. We have all the columns on the left-hand side; you can see the column names there, listed vertically, and we also have all the columns listed horizontally. This basically plots the values of each column against every other column, so if you take the first row, it is the crime column
against all the other columns, and similarly for all the columns. You can see the color map here, this color bar: if the color is light, the value is negative, meaning negatively correlated, and if the color is dark, it is positively correlated. If you take the crime column, as I told you earlier, crime and price will be negatively correlated, because if the crime rate increases, the price decreases; you can see the value is -0.4, so it is negatively correlated. But if you take RM, the number of rooms, it is positively correlated, because it's 0.7; if the number of rooms is higher, the price increases, and this is called positive correlation. These values were calculated using the house.corr() function, and this is how you construct a heatmap to find which columns are positively correlated with the target, which is price, and which columns are negatively correlated. This correlation is what we are plotting, which is why we pass it to the heatmap. cbar controls whether we want the color bar or not; square=True means I want all the cells to be squares; fmt sets the number of decimal places, and I have mentioned one digit after the decimal point, so it shows only one decimal place (if you put '.2f' it will show two); annot=True turns on the annotations; annot_kws sets the annotation size; and cmap='Blues' makes the map blue. This is how you can create a correlation matrix, and it is very important because it tells us which columns are important for our prediction and which are not. That's it for this video; I hope you have understood all the different plots. Hello everyone, this is Siddharthan, welcome to my YouTube channel. In this channel I'm making a hands-on machine learning course with Python.
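The heatmap call described above, written out as runnable code. A tiny made-up frame stands in for the housing data here (the assumed columns are CRIM, RM, and price), so the correlation signs come out as discussed, negative for crime versus price and positive for rooms versus price:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up stand-in for the housing data: as crime rises, rooms and price fall
house = pd.DataFrame({
    "CRIM":  [0.1, 0.5, 0.9, 1.4, 2.0],
    "RM":    [6.8, 6.4, 6.1, 5.7, 5.2],
    "price": [31.0, 27.5, 24.0, 20.5, 17.0],
})

# pairwise correlation of every column against every other column
correlation = house.corr()

plt.figure(figsize=(10, 10))
sns.heatmap(correlation, cbar=True, square=True, fmt=".1f",
            annot=True, annot_kws={"size": 8}, cmap="Blues")
plt.savefig("heatmap.png")
```

Each cell of the resulting grid is the correlation of one column with another, annotated to one decimal place.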
We have already completed three modules, and this is the fourth module, which is on data collection and pre-processing. In this video I'm going to explain where you can get the data for your machine learning projects and how you can get it. Let me give you the agenda of today's video. First we will understand the importance of data in machine learning: why we need data for our machine learning projects. Once that is clear, I will explain where you can get this data; there are some important websites for it, and I will give you information about them. Then I will give you a hands-on demonstration of how to collect the data: I will be downloading the data and showing you. That is what we are going to cover in today's video. Before getting started, I would like to give you a quick introduction to my YouTube channel. When you go to my channel you will see the course curriculum video; in that video I have explained all the modules and videos I am going to make in this channel, and you can also download the course curriculum file from there. You can head to the playlists, where you can see the modules we have completed. The first module was on machine learning basics, all the basic things you need to know about machine learning; the second module was on Python basics for machine learning, where all the Python concepts are covered; and in the third module we have seen some important machine learning libraries in Python, such as NumPy, pandas, matplotlib, seaborn, etc. You can also
find the machine learning project videos; we have about eight project videos as of now, and we will make more in the future. I will be uploading three videos per week: two on Wednesday evening and Monday evening, and one on Friday evening. With that being said, let's get into today's video. First, let's understand the importance of data in machine learning. Consider this example: we have images of cats and dogs, and we want a system to look at an image and recognize whether it represents a dog or a cat. That is the problem statement we have. What we can do is feed images of dogs and cats to our machine learning model; this can be any model, a neural network, a support vector machine, or anything else. What happens is that this model will find patterns in those images. For example, if the eyes are small, it may represent a cat, or if the size of the animal in the image is small, it may represent a cat. All such features, all those patterns, will be recognized by our machine learning model, and they will help our model predict whether an image represents a dog or a cat. This is how we use data in machine learning. And we are not talking about just ten or twenty images; we often deal with thousands or even lakhs of data points, say a lakh of dog images and another lakh of cat images. That is the magnitude of data we may need. This is a very simple example of where you can use machine learning; there are also more advanced applications, for example in healthcare, where you can detect whether a person has cancer or diabetes by going
through their medical records, their scan images, and such kinds of data. This is where data is very helpful for machine learning. Now that we have understood the importance of data for machine learning, let's see where we can collect it. The most important website where you can collect data is Kaggle. Kaggle also hosts competitions for data science and machine learning; you can participate in those competitions, and if you submit your code, it will be evaluated for that competition, and they also have prize money. Kaggle is a very important site for collecting data. The second site where you can get data is the UCI Machine Learning Repository; there too you can collect the data you want. These two websites have a huge number of datasets you can use. This is the logo of the UCI Machine Learning Repository. There is another tool called Google Dataset Search; it is very similar to Google Search, but it is exclusively for searching datasets. I will give you a demonstration of how to use these. On these three sites you can find ready-made data. You can also create a dataset yourself: for example, if you want to build a face recognition system that detects your face, you would take several hundred images of yourself and train your machine learning model on them. That's how you can create your own data for a machine learning project. In this video we will see demonstrations of how to get data from Kaggle, the UCI Machine Learning Repository, and Google Dataset Search, so now let's get into it; do subscribe and stay tuned for more videos. You can search for "Google Dataset Search" in Google; the site is datasetsearch.research.google.com, so you can go there. I'm
going to get one of the basic datasets, used in one of the basic projects we deal with in machine learning: the Boston house price data. It's just a demonstration, so I'll search "boston house price". As you can see here, it gives us the sites which contain this Boston house price data; these are the sites that host this dataset. First we have the Kaggle site, so you can click "Explore at Kaggle" to go to Kaggle, and from there you can download your dataset. As you can see, this one is a competition, so you need to sign up for Kaggle, and once you sign up, you need to accept the competition rules. Once you accept, you can see the data here: Data, Task, Code, etc. You can go to Data and find the explanation of this data: what the various columns in the dataset are and how many rows and columns there are. This gives the description of the dataset we are going to download. These are all the column names we have in this data; in total we have 11 columns, and we have a sample here. You can go ahead and download the data from here; you can see the download option here, and you can also download it from over there. I'll just choose this and download it from here. It is a CSV file, which stands for comma-separated values; there can be many types of dataset files. This dataset will be downloaded, and it is a very small one. So this is how you can search for a dataset with Google Dataset Search, and we have seen how it redirected us to the Kaggle site. I just opened the folder, and I'll put the file on my desktop. I'll also show you how you can upload this dataset to Google Colaboratory and do some processing. So this is how you can get a dataset from Dataset Search and Kaggle. You can also search for Kaggle directly:
search for Kaggle in Google, and this will take you to the site, where you can see the datasets and the competitions they are hosting. If you have good knowledge of machine learning, you can also participate in competitions: go to the Compete option, which will show the various competitions you can participate in; do check out the competitions and the datasets they provide. That is about Kaggle; next is the UCI Machine Learning Repository. Here I have searched "UCI machine learning repository iris dataset"; you can see the first result here, so I'll go there. This iris classification is another important and basic machine learning project. Basically it contains a dataset with three species of iris; you can see in the description that they are setosa, versicolor, and virginica. You can see the species names here, and you have the sepal length, sepal width, petal length, and petal width of each flower. Based on these four parameters we need to predict which species the iris belongs to; there are three species. To download this data you need to go to the Data Folder. Just search the name of the dataset you want plus "UCI machine learning repository" in Google; it will take you to the website, and then you go to the Data Folder. You can see some files here; iris.data is the dataset we want, and it will be downloaded. I have already downloaded this dataset; you can see it here, iris.data. We need to change the extension to .csv, so I'll just rename it to .csv; CSV stands for comma-separated values. Let's open it in Notepad to see what this dataset looks like. As you can see, all the values are separated by commas, and we have the names of the species. We have the
setosa species and its parameters, the sepal length and width and petal length and width, then the second species versicolor and the third species virginica. So this is how you can download a dataset from the UCI Machine Learning Repository. We have seen how to search datasets with Google Dataset Search, how to download a dataset from Kaggle, and how to get a dataset from the UCI Machine Learning Repository. Now I'm going to show you how you can upload a dataset to your Google Colaboratory environment and use it. This environment is called Google Colaboratory; if you are new to it, you can check out my video on Google Colaboratory basics, which is in the Python basics module I showed you before. You can see the Files option here; you need to click Connect first, which connects us to a Google backend server on which we can run our Python programs. Go to the Files option, and there you can click Upload ("Upload to session storage"), or just right-click and choose Upload. I'll upload our house price dataset; it is on my desktop, housing.csv. This is a CSV file, as we know, and I'm going to show you how to import it. I need the pandas library for importing this CSV file, so: import pandas as pd. Pandas has a function called read_csv, and this function helps us read a CSV file and store it in a pandas DataFrame. I'll make a comment here: "loading the dataset to a pandas data frame". I'll just name the DataFrame "dataset", so dataset = pd.read_csv(), and inside the quotes you need to mention the path of the file. You can go to the options next to the file and copy the path from there. Once
you copy that, just paste it here, and now you can run this; you can press Shift+Enter to run the cell. This loads our CSV file into a DataFrame. You can use the .head() function, dataset.head(), to print the first five rows of our dataset. As you can see, we have several rows in our dataset. So this is how you can load data from a CSV file into a DataFrame. That's it for this video; I hope you have understood how to get data from Kaggle or the UCI Machine Learning Repository and load it into a pandas DataFrame. You can also check out my machine learning project videos, in which I have explained how you can predict the price of houses based on several parameters; do check out that video as well. Subscribe and stay tuned, thank you so much. Hello everyone, this is Siddharthan. In this video I am going to explain how you can import datasets directly from Kaggle into your Google Colaboratory, and for this we will be using the Kaggle API. The full form of API is Application Programming Interface. APIs are software intermediaries that allow two applications to talk to each other to carry out some function; in this case, the function is to transfer the data from Kaggle to our Google Colaboratory. So what is the use of this API, and why do we use it to get the dataset? Let's try to answer that question. In machine learning we may deal with very large datasets; they can be 5 GB, 10 GB, or even hundreds of GB. We cannot download such a dataset and then upload it again to our Google Colaboratory environment, and in those cases APIs are really helpful, because they let us fetch these datasets quickly and use them in our Colaboratory environment. That is what we are going to see in this video. Before getting into it, here is a quick introduction to my YouTube channel.
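The read_csv pattern from both demos can be sketched as below. For a headerless file such as iris.data you also pass the column names yourself; to keep the sketch self-contained, a few sample rows are inlined with io.StringIO instead of a real file path:

```python
import io
import pandas as pd

# a few rows in the same comma-separated format as the UCI iris.data file
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# iris.data has no header row, so supply the column names explicitly;
# for a file that already has headers (like housing.csv),
# plain pd.read_csv(path) is enough
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.read_csv(sample, header=None, names=columns)
print(iris.head())  # first five rows, just like dataset.head() in the video
```

In Colab you would replace the StringIO object with the file path copied from the Files panel.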
In my channel I am making a hands-on machine learning course with Python. You can see the course curriculum video when you go to my channel, and you can also head to the playlists page, which contains all the modules I have covered. The first module we have seen is machine learning basics; in the second module we covered all the Python basics needed for machine learning; and in the third module we discussed some important machine learning libraries such as NumPy, pandas, matplotlib, seaborn, etc. Currently we are in the fourth module, which is data collection and pre-processing, and as you can see here, I have given this video the number 4.2. We also have about eight machine learning projects, and we will work on more projects in the future. That is about my channel; with that being said, let's continue with today's video. This environment is called Google Colaboratory, and it lets us run Python programs. If you are new to Google Colaboratory or not aware of it, you can go to the playlists in my channel; in the first video of the Python basics module, "Google Colaboratory for Python", I have explained how you can access Google Colaboratory and its various features. Now, the first step to import the dataset is to install the Kaggle library. I'll just make a comment here: "installing the kaggle library". For this we need to do pip install, and we precede the command with an exclamation mark; this is a system command, and all system commands should be preceded by an exclamation mark in Google Colaboratory. So: !pip install kaggle. Note that the "k" is a small letter, not caps. Let's run this; you can press Shift
plus Enter to run the cell and go to the next one. It says "requirement already satisfied", which means the Colaboratory environment already has the Kaggle library installed. Now we need to upload our kaggle.json file, so I'll make a text cell here: "upload your kaggle.json file". This JSON file contains our account details, and it provides the authorization from this Google Colaboratory notebook to our Kaggle account. One other important thing to note here: the email ID you are using for Google Colaboratory should be the same one you signed up with on Kaggle; only then will the JSON file work and the API load. You can just go to Google and search for kaggle.com; this is the first page we encounter. I am going to show you how to import a big dataset, and for this I'll search for earthquake prediction, which is a very interesting project in machine learning. Search for "earthquake prediction" and go to the Competitions option; you will get the LANL Earthquake Prediction competition, and this is the data we are going to get. When you first open it, it won't show "Late Submission"; there will be an option called "Join Competition". Before downloading the data you need to accept some rules and conditions and then join the competition; only then can you get the dataset. You should also sign up for Kaggle, so I hope you are clear on that: once you sign up, you need to join this competition, and once you join, it will show the submission button. Now what we need to do is go to our account: just click here, it will give various options, and go to the Account option. Here we will download our API token: if you scroll down you can see the API section, and we need to create a new API token. This will download a
kaggle.json file, and we need to upload that file to our Google Colaboratory. I'll also explain the dataset we are going to import. The order in which I'm doing this is very important: first you need to accept the competition rules and join the competition, and only then download the JSON file. If you do it the other way around, it won't work, so follow the same order I'm using. You can also download other datasets this way. Our JSON file has downloaded, so now we can upload it: go to the Files option and click Upload, or just right-click and choose Upload, and upload the kaggle.json file. Once it is uploaded, we need to configure the path of this JSON file so it can be read. I'll give the link for this Google Colaboratory notebook in the description of this video, so you can open it from there and just copy the code snippet and run it. The comment is "configuring the path of kaggle.json file". You need to precede these commands with an exclamation mark because they are system commands: !mkdir -p ~/.kaggle (mkdir means make directory, so we are creating a folder called .kaggle), then !cp kaggle.json ~/.kaggle/, and !chmod 600 ~/.kaggle/kaggle.json. This is basically configuring the path: we need to locate our kaggle.json file, and this is the code snippet for it. We are just making a directory called .kaggle and placing the kaggle.json file in it. Let's run this; just press Shift+Enter. Now we can import the dataset. I'll just make a text cell here: we are going to import it through the API, so "API to fetch the dataset from kaggle". Now go to the earthquake prediction page.
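For reference, the three shell commands above can also be written in plain Python. This is a sketch, not the Kaggle tool's own setup code: the kaggle.json filename is the token downloaded from the account page, and target_dir defaults to ~/.kaggle, which is where the Kaggle library looks for it:

```python
import os
import shutil
import stat
from pathlib import Path

def configure_kaggle_json(json_path="kaggle.json", target_dir=None):
    """Copy kaggle.json into ~/.kaggle and restrict its permissions."""
    target_dir = Path(target_dir) if target_dir else Path.home() / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    target = target_dir / "kaggle.json"
    shutil.copy(json_path, target)                  # cp kaggle.json ~/.kaggle/
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # chmod 600 (owner read/write only)
    return target
```

The chmod step matters: the Kaggle client warns (or refuses) if the credentials file is readable by other users.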
Go to the Data tab, where you will find the API command; you can see the line here, which says to use the Kaggle API to download the dataset. We need to copy this API command: just click this option and it will copy the command. I'll copy it and paste it here, but we need to precede it with an exclamation mark. Now let's run this; it will fetch the whole dataset. You can see the speed at which it's downloading the content, almost 100 MB per second, and you don't need to worry: your mobile data or Wi-Fi data won't be consumed for this, since the transfer happens on the server side, so it will finish really quickly. It will take just one or two minutes, or even less, to get the dataset, and you can also see the timer here in seconds: about 45 to 50, and it's done. Basically it contains all these files: we have the test folder, which holds the test data; once you have made the model, you need to test it on that data. This is basically a Kaggle competition in which people submit their code, and their code is evaluated. Our dataset is downloaded, and you can see the files here. These are the test data points, and the important file we need is train.csv. This is a CSV file, and it is a very large one: you can see it is about 2 GB, and it contains lakhs of data points. You can just go here and view train.csv to see a sample of the data. We have two columns, acoustic_data and time_to_failure; I'll explain these columns in a minute. The important thing to note here is that this is a zip file, a compressed file, so we need to extract it to get train.csv. Let's see how we can do that: we are going to extract the compressed file, so I'll make a comment here, "extracting the compressed dataset".
We are going to use the zipfile library: from zipfile import ZipFile. I'll store the dataset path in a variable, so I'm creating a variable called dataset and storing my dataset path in it; I'll just copy the path of train.csv.zip ("Copy path") and paste it here in quotes. Now we can extract it using with; the with keyword is used to open a file. So: with ZipFile(dataset, 'r') as zip, where 'r' means read. Now we need to use the function zip.extractall(); this extractall function will extract the compressed file, and once it has completed, we print that the dataset is extracted. We can run this, and it will take maybe two or three minutes, so let's run it and wait; in the meantime I'll explain the dataset. In this project, earthquakes are created at laboratory scale; you can see the overview, which is basically about forecasting earthquakes, and the dataset contains acoustic_data and time_to_failure. I'll go to train.csv, and you can expand it to see what those columns mean. acoustic_data represents the seismic signal, which reflects the strength of an earthquake, and time_to_failure is the time in seconds until the next laboratory earthquake occurs. So for a given acoustic_data value, this is the time in seconds until the next earthquake. Basically we will train our machine learning model or neural network with this data, so that when you give it the acoustic data, it gives the time to failure: how much time remains until the next earthquake occurs.
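The extraction step above, sketched end to end. So that it runs anywhere, a tiny stand-in archive is created first; in the notebook, dataset would instead hold the path of train.csv.zip copied from the file browser:

```python
import os
from zipfile import ZipFile

# create a tiny stand-in for the downloaded archive (illustration only)
with ZipFile("train.csv.zip", "w") as zf:
    zf.writestr("train.csv", "acoustic_data,time_to_failure\n12,1.4691\n")

dataset = "train.csv.zip"  # in Colab: the path copied from the Files panel
with ZipFile(dataset, "r") as zip_file:
    zip_file.extractall()  # unpacks every member into the working directory
print("The dataset is extracted")
```

extractall also accepts a target directory argument if you want the files somewhere other than the working directory.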
So that is this dataset. This is a very interesting and quite complex machine learning project, and you can also work on it. You can go to the Code section and see some samples of the code that has been submitted for the competition; I suggest you practice that code and see how everything works. You can see the dataset size: it's almost 10 GB, and we have imported it in no time; it took about 31 seconds. This is how you can fetch large datasets through the Kaggle API. It is still extracting; once the file is extracted, this line will be printed, and after that you can just load it into a pandas DataFrame, do some processing, feed it to your machine learning algorithm, and make predictions. I hope you have understood how we can import datasets from Kaggle; this will be very helpful in your machine learning journey. That's it for this video; do practice this code and let me know if you run into any errors. Thanks for watching. Hello everyone, this is Siddharthan, welcome to my YouTube channel. In this video I'm going to explain how to handle missing values in a dataset when it comes to machine learning and data science. There are mainly two methods to handle missing values: imputation and dropping. These are the things we are going to cover in today's video. We will write the Python code in Google Colaboratory, an environment in which you can run Python programs. We will be taking a sample dataset for our processing: I am taking a placement dataset, which contains several parameters, such as where the person studied their higher secondary and what percentage they scored there, what field they studied in their undergraduate degree, and such
kind of things and based on all those features uh we need to predict what is the salary they may get okay so this is about the this placement data set okay so this data set contains some missing values and i'm going to explain you how you can handle those missing values so before starting with today's video i would like to give you a quick introduction about my channel so in youtube i am making a ants on machine learning course with python so you can see this course curriculum video in this i have explained about all the modules on the videos which we are going to discuss in my channel and you can also download the course curriculum file from here so it is given in the description of this curriculum video okay so you can also get towards this playlist page so in this machine learning course i have already completed three modules the first module was on machine learning basics the second module was on python basics for machine learning and the third modules is on some important libraries which we need for machine learning okay so in this module i have explained about the modules uh the more libraries such as numpy pandas matplotlib c bonds etc okay so i've also made some machine learning project videos so do check out this videos okay so currently we are in the fourth module about handling missing values okay so this fourth module is on data collection and processing and this is the third video in this fourth module okay so if you are new to google collaboratory you can just go to this second module which is python basics for machine learning and here in this 2.1 video i have explained about google collaboratory basics on how to how you can get access to google collaboratory and what are the various features in google collaboratory okay so with that being said let's start with today's videos so as i have told you earlier there are two methods to handle missing values imputation and dropping so where does this missing values come okay so i hope you know that in 
machine learning or data science we use data sets so and this data set is used to train our machine learning model and once our machine learning model is trained with this data set it can make new predictions and that data set may contain a lot of missing values and we cannot feed this data set with missing values to our machine learning model so we need to replace all those machine learning you know all those missing values okay so that is what we are going to see in this videos okay so this is the sample data set we have so i'll give the link for this data set in the description of my video and so let's uh get started i'll just make a text here as importing the library so we may need some basic libraries for this so importing the libraries so let's import pandas as pd so pandas library is used for you know making pandas data frame so pandas data frame are nothing but structured table okay so we also need matplotlib matplotlib.pipe. so these are just the general convention the short form for uh importing the libraries so we will just plot the values and see which method we can use so that's the reason for importing this map problem it's also import c bond c bond is also another data visualization library which is used to make some plots and graphs okay so let's import c bond as sns so i'll run this you can press shift plus enter to run this cell and go to the next one okay so now we have this csv file we need to load the csv file to a pandas data frame so just go here so you can see this options button here so go there and you can copy the path from there okay so once you have downloaded the data set you can just uh when you are in this google collaboratory you need to connect your system from here and after that just go to this files option and you can upload the data set from here so you can give this upload option or just right click so you will find this upload option and then upload this placement dataset which you will you know find the link in the video 
description so i have copied the path of this file now let's load it to our pandas data frame so loading the data set to a pandas data frame okay so i'll create a variable called as dataset and we are going to use the function pd.readcsv so as you know i have imported pandas pd so dataset is equal to pd dot read csv so this v8 csv will read the csv file and load the content into our data frame so we have already copied the path so i'll paste the past you know the path here inside the codes so let's run this okay so this will create a data frame you can see the first five rows of this data frame using the yet function so mention the data frame name which is in this case it's data set because we are loading the data frame in this you know variable so data set dot eight so this will give us the first five rows so as you can see here we have several columns so first column is on the serial number then it's the gender whether the person is male or female then this this represents the secondary school percentage and this is the secondary school board whether they have studied in central board or state board such kind of things then we have the ir secondary percentage board and what is the you know stream they have studied whether it's commerce science etc so we have then we have this degree and such kind of things and we have the work experience whether the person has work experience or not so such kind of things and finally we have this salary and we also have whether the person is placed or not and also we have salary okay so the idea behind this data set is to you know using all these features you need to predict the salary of the person but that is not what we are interested so we are interested in you know finding a good method to replace the missing value so you can see this thing here so it represents n a m so n a n represents not a number okay so these are missing values and we need to replace this missing values before feeding to our machine learning model okay 
so now let's see how many rows and columns are there in our dataset so basically we are going to see how many data points we have so you can check that by mentioning the data frame name which is dataset dot shape so this gives us the number of rows and columns okay so totally we have 215 rows and 15 columns now let's see how many missing values are there in you know each column so mention the data frame name which is dataset dot is null so this is null function gives us the number of missing values and we are going to find it for the all columns so as you can see here there are no missing values in the serial number column general column so we have only missing values in this salary column okay so there are about 67 values missing in this 215 values okay so there are totally 215 rows and in those 215 rows about 67 uh you know salary values are missing okay so there are two methods as i have told you earlier one is imputation and the second method is dropping dropping is nothing but just dropping or deleting all the rows which has missing values okay but this is not a efficient way to do this you know when you have a very large dataset say you have uh you know 20 000 data points 30 000 data points or you know lacks of data points in those cases you can just drop the missing values so it won't be a very big factor but when you have a very small data set like 200 300 or within thousand dropping the missing values is not an ideal method so that's where we will use imputation okay so imputation is nothing but using some proper statistical values and replacing these uh missing values with those statistical values so those statistical measures are nothing but mean median or mode okay so these three terms mean median mode are also called as central tendencies so central tendencies okay so first one is median sorry the first one is mean so i'll explain you about what is meant by this mean median mode as well so second is median third is mode okay so i hope you know what 
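Put together, the loading-and-inspection steps look roughly like this; because the placement CSV isn't bundled here, a tiny inline stand-in with assumed column names replaces the copied Colab path.

```python
import io
import pandas as pd

# Inline stand-in for the placement CSV (in Colab the real path is copied
# from the Files panel; these column names are assumed for illustration)
csv_text = """sl_no,gender,ssc_p,hsc_b,status,salary
1,M,67.0,Others,Placed,270000
2,M,79.3,Central,Placed,200000
3,F,65.0,Central,Not Placed,
"""
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())            # first five rows of the DataFrame
print(dataset.shape)             # (number of rows, number of columns)
print(dataset.isnull().sum())    # missing-value count per column
```

The empty salary field in the third row is parsed as NaN, which is exactly what `isnull().sum()` counts.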
The mean is nothing but the average of all the values. Say a dataset contains the values 1, 2, 3, 4, and 5: the mean is the sum divided by the count, so 1+2+3+4+5 = 15, and 15/5 = 3 — the mean of this dataset is 3. Next, the median: arrange the values in ascending order (here they already are) and take the middle value; that middle value is the median. With an even number of data points — say six values, giving two middle values, 3 and 4 — we take the average of those two, here 3.5, as the median. The mode is the value with the highest frequency: say the dataset is 1, 2, 3, 3, 5, 6, 3 — the number 3 appears three times, so the mode is 3, the value repeated the most. So the mean is the average, the median is the central value, and the mode is the most frequent value; these are the important central tendencies in statistics. Rather than dropping the rows that have missing values — which, as noted, is often not a good method — we can impute the missing values with any of these central tendencies.
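The three central tendencies from the examples above can be checked with Python's built-in statistics module:

```python
from statistics import mean, median, mode

# Mean: the average of all the values
assert mean([1, 2, 3, 4, 5]) == 3          # 15 / 5

# Median: the middle value after sorting; with an even count,
# the average of the two middle values
assert median([1, 2, 3, 4, 5]) == 3
assert median([1, 2, 3, 4, 5, 6]) == 3.5   # (3 + 4) / 2

# Mode: the most frequent value
assert mode([1, 2, 3, 3, 5, 6, 3]) == 3    # 3 appears three times
```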
There are, however, specific situations for using the mean, the median, or the mode — we cannot just use the mean for every missing value in every dataset — so let's understand when to use each by first plotting the data. For that we need to plot the salary column and see how its values are distributed; we only care about that one column because it alone contains missing values. So now we are going to analyze the distribution of data in the salary column. I'll create two variables, figure and axis — this is how we set up plots in Matplotlib and Seaborn. plt is the Matplotlib library (we imported matplotlib.pyplot as plt), and we call plt.subplots(), mentioning the figure size we want — I'll use figsize=(8, 8); you can give any values, it's just the dimensions of the plot. Now we use Seaborn, imported as sns, to create a distribution plot of the salary column of our DataFrame: sns.distplot(dataset.salary). Let's run this and look at the distribution. The horizontal axis is the salary: this tick basically represents 2 lakhs, this one 4 lakhs, and so on. You can see that most of the values lie around 2.5 lakhs, but there are one or two values above 6 lakhs — one around 6.5 lakhs and one around 9 lakhs. So the data is concentrated in one region, with a few outliers. When you have outliers like these and the data bunched to one side, the distribution is called skewed, and in such cases we cannot use the mean to replace missing values, because outliers inflate the mean. A quick example: in a college, ten people are placed — eight at an average salary of about 3 lakhs per annum and two at about 10 lakhs per annum. The mean of all ten salaries won't represent the data well, because those two outliers pull the overall mean up. So with a skewed distribution we use the median or the mode as the replacement for missing values; when the values are approximately normally distributed across the whole range, we can use the mean instead. In our case the distribution is skewed to one side, so we will replace the missing values with the median (or the mode). Now let me show you how to do that with the median. I'll make a text cell, 'Replace the missing values with median value'. We take the median of the salary column alone — the median being the middle value once the values are arranged in ascending order.
Mention the DataFrame — here named dataset — and the column, salary, and call .fillna() on it; the fillna function fills all the missing values with whatever you pass in. We pass the median of the salary column, dataset['salary'].median() — the values of that column are arranged in ascending order and the middle one is picked. It has to be the median of this particular salary column, not of any other column, which is why I mention it here. So the line is: dataset['salary'].fillna(dataset['salary'].median(), inplace=True). With inplace=True, running this replaces all the missing values with the median. It ran successfully, so let's check the number of missing values again — earlier it showed about 67. I'll copy the isnull cell, and since we have replaced the missing values, let's see whether any remain. Great — the number of missing values in every column is now zero. That is how you take the median of a particular column and replace its missing values, and that's the line of code for it. In some cases, when the distribution is normal and the values are spread evenly, we replace with the mean instead, so I'll show that too: 'Filling missing values with mean value'. Previously we filled with the median; this is about filling with the mean.
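The median-imputation step on a toy salary column. The video uses fillna(..., inplace=True); the assignment form below does the same thing and is the style newer pandas versions recommend.

```python
import numpy as np
import pandas as pd

# Toy salary column with missing entries (stand-in for the placement data)
dataset = pd.DataFrame({"salary": [200000.0, 250000.0, np.nan, 300000.0, np.nan]})

# Fill the NaNs with the median of the same column
dataset["salary"] = dataset["salary"].fillna(dataset["salary"].median())

print(dataset["salary"].tolist())   # NaNs replaced by 250000.0, the median
print(dataset.isnull().sum())       # every count is now zero
```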
I'll just copy the previous line — it's actually the same thing; the only change is median to mean, because now we are filling with the mean value. When you run it, the missing values in the salary column will be replaced with the mean. I'll precede the line with a hash to make it a comment, because I don't actually want to fill with the mean here — this is just to show you how; if you want to run it, remove the hash. Filling with the mode works the same way: just replace mean with mode. Very simple. So that is imputation — replacing missing values with the statistical central tendencies mean, median, or mode. Now let's see how to drop the values. Dropping is not really recommended here, because our dataset has only 215 rows, and dropping the rows with missing values would delete 67 of them. I won't encourage dropping in cases like this, but you can do it when you have a very large dataset and only one column with missing values. In this particular dataset, when you are practicing machine learning predictions, don't drop the values — I'll just demonstrate how it's done. 'Dropping method': I'll create another variable called salary_dataset, because I don't want to affect the DataFrame we have already filled.
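The mean and mode variants, on a toy frame with assumed column names. One detail worth noting: mode() returns a Series (there can be ties), so we take its first entry — this also makes mode imputation handy for categorical columns like the board here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [200000.0, 250000.0, np.nan, 300000.0],
                   "board": ["Central", "State", None, "Central"]})

# Mean imputation -- appropriate when the distribution is roughly normal
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Mode imputation -- mode() returns a Series, so take its first entry
df["board"] = df["board"].fillna(df["board"].mode()[0])

print(df["salary"].iloc[2])   # (200000 + 250000 + 300000) / 3 = 250000.0
print(df["board"].iloc[2])    # "Central", the most frequent board
```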
In this new variable we are going to drop the missing values. I'll call pd.read_csv again — I hope you remember we used this read_csv function — with the same path, so we are reading the same file but storing it in another instance, a separate variable called salary_dataset. I'll run this. First let's check the number of rows and columns: salary_dataset.shape gives 215 rows and 15 columns. Then the missing values: salary_dataset.isnull().sum() tells us there are 67. Now say we want to drop all the rows containing missing values — 'Drop missing values'. I'll assign back to the same variable: salary_dataset = salary_dataset.dropna(how='any'). dropna drops the missing values ('na' stands for 'not available'), and how='any' means a row is removed if it contains any missing value, so every row with a missing value goes. Let's run it and check the missing-value counts again — now there are none. If you check the shape of this DataFrame, we get 148 rows; previously we had 215, so the 67 rows with missing values have been removed. That is how you drop missing values from a dataset. So that is all about missing values: we use two methods, imputation and dropping.
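The dropping method on a toy frame (values made up to mimic the placement data): four rows, two with missing salaries, so dropna removes exactly those two.

```python
import numpy as np
import pandas as pd

salary_dataset = pd.DataFrame({
    "sl_no": [1, 2, 3, 4],
    "salary": [270000.0, np.nan, 200000.0, np.nan],
})
print(salary_dataset.shape)                 # (4, 2) before dropping

# Remove every row that contains at least one missing value
salary_dataset = salary_dataset.dropna(how="any")

print(salary_dataset.shape)                 # (2, 2): the two NaN rows are gone
print(salary_dataset.isnull().sum())        # no missing values remain
```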
We use imputation in most cases, and we drop values only when the dataset has a huge number of data points and we can afford to lose some. I hope you have understood handling missing values; I'll see you in the next video. Thanks for watching.

Hello everyone, I am Siddharthan. In this video we are going to discuss one of the important data pre-processing techniques: data standardization. Data standardization is the process of standardizing the data to a common format or common range. What does this mean? Say we have a dataset with 10 columns of data, and each column may hold values in a different range: one column between 100 and 200, another between 1,000 and 2,000, another in the tens or twenties. We need to standardize this data to a common range before feeding it to our machine learning algorithm, because it's easier to analyze and process standardized data than unstandardized data. That is what we are going to discuss today, taking a sample dataset and standardizing it.

Before getting started, if you are new to this channel, a quick introduction: on my YouTube channel I'm making a hands-on machine learning course with Python. On my channel you will find the machine learning course curriculum video, where I explain all the videos and modules I will cover; you can download the course curriculum file from that video's description. In the playlist you can see the modules completed so far — three of them: the first on machine learning basics (all the concepts you need to know), the second on the Python basics required for machine learning, and the third with tutorials on several machine learning libraries in Python such as NumPy, pandas, Matplotlib, and Seaborn. The module we are currently in is data collection and pre-processing, and this is its fourth video. We also have machine learning project videos — about nine so far — so do check those out, subscribe, and stay tuned for more. In the About section you can get the link for my hands-on data science course with Python, along with my LinkedIn ID, Telegram group, and Facebook group — join those to get notified when I post new videos.

With that said, let's get started. We are doing this programming in Google Colaboratory, the environment where we run our Python programs. There will be a Connect option, and you need to connect your runtime. If you are new to Google Colaboratory, check out my Google Colaboratory basics video, where I explain how to access it and its features. First we need to import some libraries. I'll import numpy as np — the general convention for importing NumPy, which is useful for creating NumPy arrays — and pandas as pd, which is useful for creating DataFrames, the structured tables that help us do analysis more easily. We also need something from sklearn.preprocessing: from sklearn.preprocessing import StandardScaler — this StandardScaler is the function we will use to standardize the entire dataset. Next, from sklearn.model_selection import train_test_split, which helps us split our data into training data and testing data. And one more thing: import sklearn.datasets, from which we will take a sample dataset. I'll run this — press Shift+Enter to run the cell and move on.

Now we load the dataset — I'll make a comment, 'Loading the dataset' — into a variable: dataset = sklearn.datasets.load_breast_cancer(). This dataset is used to predict whether a person has breast cancer by analyzing several medical parameters; running the line stores an instance of the breast cancer data in the variable dataset. Let's print it: print(dataset). You can see a field called data, which contains all the values — the parameters we will analyze to predict whether a person has breast cancer. There is also a target variable: in this dataset 0 represents a malignant tumor and 1 a benign one. Benign is the early stage; malignant is the advanced stage, where treatment becomes critical. That is what we are going to predict — whether the cancer is benign or malignant.

That is the dataset, and we also have the feature names — you can see them here: mean smoothness, mean compactness, mean concavity, and so on. These are parameters of the cells: the cancer cells are analyzed, and these values come from medical scans of those cells. The dataset contains an array called feature_names holding all the column names. Let's see how to bring this data into a pandas DataFrame — 'Loading the data to a pandas DataFrame'. Right now it's just raw numbers, which are hard to analyze; loading them into a structured table, a DataFrame, makes the analysis easier. I'll create the DataFrame as df: df = pd.DataFrame(dataset.data, columns=dataset.feature_names). Inside pd.DataFrame we mention the data to load — dataset.data, all the numerical values — and, as I said, the column names live in feature_names, so we pass dataset.feature_names as the columns. Let's run it and look at the first five rows with df.head() — the head function prints the first five rows of the DataFrame.
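The loading steps can be sketched as follows — load_breast_cancer and the DataFrame construction are the scikit-learn and pandas calls used in the notebook.

```python
import pandas as pd
import sklearn.datasets

# Load the built-in breast cancer dataset from scikit-learn
dataset = sklearn.datasets.load_breast_cancer()

# dataset.data holds the numeric values, dataset.feature_names the column names
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

print(df.head())                 # first five rows, without the target column
print(df.shape)                  # 569 rows, 30 feature columns
```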
One important point to note here: we did not include the target variable — the one that says whether the cancer is benign or malignant. We will include it at a later point; for now we have just loaded the feature data. You can see the column names, the feature names we passed in, and notice that each column has values in a different range: the first column has values between 10 and 20, the second likewise, whereas the third column has values above 100 and the mean area is above 1,000, while some columns have values below one. This is where we need to apply data standardization to enable better processing and analysis — we standardize the data before feeding it to our machine learning algorithm. Let's also check the dimensions: df.shape tells us we have 569 rows and 30 columns. We cannot manually standardize all this data, hence the StandardScaler function we imported from sklearn. Now I'm going to put all the feature data in one variable and the target — whether the cancer is malignant or benign — in another. I'll create X and store the DataFrame in it, and create Y for the target values, which you get with dataset.target: so X = df and Y = dataset.target. Let's run this: X is the features and Y is the target. In other words, these are the features — the 30 columns — and by analyzing all of them we need to predict the target, whether the cancer is benign or malignant; that is the end result of this project. Now let's print them: print(X) shows all 30 feature columns, and print(Y) shows all the targets or labels — the values are either 0 or 1. We don't need to standardize Y, since it has just those two values, but we do need to standardize X to bring it into a common range.

Next, we split the data into training data and test data before standardizing it. I'll make a text cell, 'Splitting the data into training data and test data', and create four variables — X_train, X_test, Y_train, and Y_test — using the train_test_split function we imported from sklearn.model_selection. In it we mention X and Y (X the features, Y the target), plus the test_size — how much data we want as test data. I'll use 0.2, which means 20 percent; generally we take 10 to 20 percent of the data as test data. I'll also set random_state, which just makes the split reproducible: with random_state=3 your data will be split exactly the way mine is, while random_state=2 would split it differently — it's just an identifier for splitting the data a specific way. We get four arrays: X_train is the training data and Y_train holds the corresponding targets, either 0 or 1.
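The split can be sketched like this; the 455/114 row counts follow from taking 20 percent of 569 rows as the test set.

```python
import sklearn.datasets
from sklearn.model_selection import train_test_split

dataset = sklearn.datasets.load_breast_cancer()
X = dataset.data        # the 30 feature columns
Y = dataset.target      # labels: 0 (malignant) or 1 (benign)

# 20% of the 569 rows go to the test set; random_state=3 makes the
# split reproducible, so you get the same rows in each set as I do
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3)

print(X.shape, X_train.shape, X_test.shape)   # (569, 30) (455, 30) (114, 30)
```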
X_test is the test data's features, and Y_test is the corresponding target for X_test. We could also standardize the data before splitting it into training and test data, but if the data has outliers, that can be a problem. Outliers are abnormal values: consider the mean compactness column, whose values are below one — if one row had a value around 100 or even 1,000, that value would be an outlier, a wrong value. If there are outliers and we standardize before splitting, it can cause problems, so in most cases it's better to split into training and test sets before standardizing. In this particular dataset it doesn't matter much, because the data doesn't contain many outliers, so we could split before or after standardizing. I'll run this — it splits the data — and print the shapes: X.shape, X_train.shape, and X_test.shape. Our original data contains 569 data points; 80 percent of them, 455 rows, go to X_train, the training data, and 114 go to the test data.

Now that we have split the data, we can standardize it. I'll make a text cell, 'Standardize the data', and print the standard deviation of the entire dataset: dataset.data.std(). If all the values were in the same range, we would expect a standard deviation close to 1, so let's run it and see. The standard deviation is about 228, which means the data is not in a common range — the columns vary a lot. That is what we need to tackle, and for it we use the StandardScaler function imported from sklearn.preprocessing. I'll create a variable called scaler and load StandardScaler into it: scaler = StandardScaler(). You can search for sklearn.preprocessing.StandardScaler to find the scikit-learn documentation and read how the standardization works. What happens is: each data point has the mean subtracted from it, and the result is divided by the standard deviation — in the documentation's formula z = (x - u) / s, u is the mean of the training samples and s is their standard deviation. Applying this formula to every data point gives standardized data, but it doesn't affect the nature of the data — it only changes the value range, so a machine learning task run on it gives the same result. Now we fit the StandardScaler to our data: call the fit function with X_train inside — scaler.fit(X_train) — since we are standardizing the training data. Run this, and the StandardScaler analyzes how the data in X_train is distributed. Next we transform the data based on this scaler: I'll create an array, X_train_standardized, which is equal to scaler.transform(X_train).
now we need to use scalar dot transform okay so before we used scalar dot fit and now we need to transform the data based on this uh you know scaling so scalar dot transform extreme okay so let's run this now you can print it print exchange standardized and see that the values are in you know they are not very different so we have value in one point four zero zero point six one and minus point eight way so these values are very close to each other but you can see previously uh the values of x okay so we have about seventeen twenty and in the next column we have you know point uh 0.46 and we also add about thousands and 2000 etc so now you can see that the values are in a similar range okay so this is how you need to standardize the exchange but there is another step here you need to standardize the x test also okay so in this case we don't uh fit the data again so we cannot fit the scalar again with x test no so we shouldn't do that we should just transform the data based on this standard scalar okay because this test data contains just 114 examples and it it is not a you know good representation of our entire data set so we need to standardize our data based on this extreme fit okay so now let's create a variable as x test standardized and standardized is equal to scalar dot transform so we shouldn't fit it so we should transform it and inside it mention x test so we want to transform this extras data so let's run this now this data will be transformed so as i told you earlier the standard standard deviation should be this value okay so if the value is uh you know in this range it means that the values are in not in the same range so the value should be one for uh or at least closer to one to you know understand that the data are in the same range so i'll print the standard deviation of x train standardized dot std so std will give us the standard deviation so let's run this and the standard deviation value is 1.0 which is which means uh the data all the data are 
in a similar range, but it doesn't affect the nature of our dataset; that's what we understand from here. You can also print the standard deviation value for X_test_standardized: X_test_standardized.std(). The value is not 1 but it is close to it, 0.86, because we transformed the test data based on the X_train fit; that's why the value is 0.86 rather than exactly 1, but it is very much better than 228. So this is how you standardize your data using the StandardScaler function. I hope you have understood the things covered in this video; that's it about data standardization. Thank you so much.

Hello everyone, this is Siddharthan. Currently we are discussing the fourth module in our hands-on machine learning course with Python. This fourth module is about data collection and data pre-processing, and this is the fifth video in the pre-processing module, which is on label encoding. That's what we are going to discuss in today's video: we will take two datasets and perform label encoding on both of them. That is the objective of this video.

If you are new to this channel: in this channel I'm making a hands-on machine learning course with Python. You can go to the playlists section of my channel, and on the first page you can see the machine learning course curriculum video, in which I have explained all the videos and modules that I am going to post on this channel. In the description of this video you will find the curriculum file, so you can download it and go through it. You can also see the first four modules in the playlists section: module one is on machine learning basics, the second module is on Python basics for machine learning, the third module is on important Python libraries like NumPy, pandas, Matplotlib, Seaborn and so on, and the fourth module is data collection and pre-processing; the video we are discussing comes under this module. I have also posted about 10 machine learning projects. I will be posting three videos per week: two videos following this course order, posted on Monday evening and Wednesday evening, and one machine learning project video every Friday. You can also check the "machine learning course with Python" playlist, where all the videos are incorporated, so do check it out.

We will be doing all the coding in Python, in this Google Colaboratory environment, where you can run Python programs. If you haven't heard about it, you can go to the Python basics playlist, where the first video covers how to access Google Colaboratory and all the features we have in it. With that being said, we can start with today's video.

Label encoding: what is meant by label encoding? Label encoding is about converting the labels into numeric form. Consider a classification machine learning problem. Classification problems are those where we predict whether a data point belongs to one class or another; for example, predicting whether a person is diabetic or non-diabetic is a classification problem, in which we predict that a data point belongs to one of those two classes. If the data says the person is "diabetic" or "non-diabetic" as text, it is not easy to use those values directly, so what we do is convert these text values, diabetic and non-diabetic, into numerical values, either zero or one. That is label encoding, and in this video I will explain how you can do that.

For this I have taken two datasets. The first dataset, data.csv, is a breast cancer dataset, used to predict whether a person's breast cancer is in the benign stage or the malignant stage; benign means the starting stage of cancer, and malignant is the more progressive, advanced stage. And this Iris dataset is used to predict which of the three species an iris flower belongs to. So in the breast cancer classification we have two labels, benign and malignant, and in the Iris dataset we have three species of iris, and we will be encoding those labels.

You can find the connect option here to connect to Google Colaboratory, and then you can go to the files option, where you can use "upload to session storage", or just right-click to see the upload option. I'll give the links to these dataset files in the description of this video.

First we are going to import the dependencies, so I'll make a text cell here about importing the dependencies. We need two libraries. First, import pandas as pd: the pandas library is very helpful for making a DataFrame, and DataFrames are nothing but structured tables. The dataset files we have are CSV files (CSV means comma-separated values), and it is easy to analyze and process the data when we have it in a DataFrame. Then we need the LabelEncoder function, which we import with: from sklearn.preprocessing import LabelEncoder. This is the LabelEncoder we will be using. You can press Shift plus Enter to run the cell and go to the next one.

First we are going to encode the breast cancer data, so let's make a text cell: "label encoding of breast cancer dataset". The first step is to load the CSV file into a pandas DataFrame. You can go to the options here and copy the path of this data.csv file. Now we are going to load the dataset: "loading the data from csv file to pandas data frame". Okay, so I'll
create a variable, or rather let's name this DataFrame cancer_data: cancer_data = pd.read_csv(), and we mention the path inside the quotes. I have already copied the path, so let's run this; it will load the contents of the CSV file into a DataFrame.

Now let's print the first five rows of the DataFrame; it tells us what different columns, or features, we have in the dataset. cancer_data.head(): this head function prints the first five rows. As we can see in the first five rows, we have an id for each data point and we have this diagnosis column; the diagnosis is the label we have. In this diagnosis column we have two values, either M or B. M represents the malignant stage, the more advanced stage of cancer, and B represents the benign stage, the starting stage of cancer. If we can predict that a person's cancer is in the benign stage, the probability is that the person can be saved, because we can give treatment early on. That is the objective of this dataset, but in this video our goal is just to encode this diagnosis column. We also have other features like radius_mean; these are parameters of the cells. In this dataset the data is collected through a procedure called fine needle aspiration, through which the cancer cells are obtained so that tests can be done on them, and all these values are obtained from that, so these are the cancer cells' data.

Now we are going to see how many data points there are for malignant cases and benign cases: "finding the count of different labels". We need to mention the DataFrame name, which here is cancer_data, and the column name, which here is diagnosis, in quotes: cancer_data['diagnosis'].value_counts(). This value_counts function tells us the number of malignant cases and the number of benign cases we have. I'll run this: for benign cases we have 357 data points, and for malignant cases we have 212 data points.

Now we are going to convert this benign stage and malignant stage into corresponding labels with numerical values. First we need to load the LabelEncoder function: "load the label encoder function". As you will remember, we imported this LabelEncoder function from the sklearn.preprocessing module. Let's create a variable called label_encode and store the function in it: label_encode = LabelEncoder(). We need to mention the parentheses here to say that we want to load one instance of this LabelEncoder into this particular variable. I'll press Shift plus Enter.

Now I'm going to encode all the labels to either 0 or 1 and store them in a separate variable; let's call it labels. So labels = label_encode.fit_transform(), and as I told you earlier, we are going to transform all the values of this diagnosis column, so inside the parentheses we mention cancer_data['diagnosis']. This will take all the labels from the diagnosis column and convert all the values of M and B. Let's run this; we get a new array that contains all these encoded labels.

Now we are going to append, or join, these label values as another column in this DataFrame: "appending the labels to the data frame". So cancer_data: let's create a column named target, equal to labels, that is, cancer_data['target'] = labels. Let's run this; it will add a new column to our DataFrame. Now let's again print the first five rows: cancer_data.head(). Let's run this, and now you can see that at the end we have a new column called target; this is our target variable. It says 1 where the diagnosis is M, so all the values with M are transformed to 1. Now we can drop this diagnosis column and then do our predictions using these features and the target we have made. So in this case, 0 represents benign and 1 represents malignant cases.

We can also check the counts with value_counts; we have already seen how to count the labels we have. So cancer_data, and we mention the target column, which we are going to count: cancer_data['target'].value_counts(). Let's run this, and we can see that the label 0 has 357 values and the label 1 has 212 values. You can just compare it with the previous result: benign cases 357, malignant cases 212. Now you may get a doubt about which particular label gets the value 0 and which gets the value 1, because we haven't mentioned that malignant cases should get the value 1 and benign should get the value 0.
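The steps just walked through can be sketched end to end. Note that this is a minimal sketch: since the real data.csv isn't bundled here, a tiny hand-made DataFrame stands in for it, keeping the same diagnosis column values of M and B.

```python
# Sketch of the label-encoding steps above, using a small stand-in
# DataFrame in place of the real data.csv breast cancer file.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cancer_data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "diagnosis": ["M", "B", "B", "M", "B"],
    "radius_mean": [17.99, 13.54, 13.08, 20.57, 12.45],
})

print(cancer_data["diagnosis"].value_counts())   # counts of B and M

label_encode = LabelEncoder()
labels = label_encode.fit_transform(cancer_data["diagnosis"])

cancer_data["target"] = labels                   # append encoded labels as a new column
print(cancer_data[["diagnosis", "target"]])      # B -> 0, M -> 1
print(cancer_data["target"].value_counts())
```

Running this shows each M row getting target 1 and each B row getting target 0, matching the counts seen with value_counts on the original column.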
So what happens is: when you use this LabelEncoder function, the labels present in that particular column are arranged alphabetically. Here the two labels we have are M and B, and of course we know that B comes first in alphabetical order, so B is given the value 0, and as M comes after B, it is given the value 1. This is how we can do label encoding on this breast cancer data. The further processing would be to use this target and these features to train a machine learning algorithm to make predictions on whether a person's cancer is in the benign stage or the malignant stage by going through these features. We have already made several classification projects on our channel, so you can refer to those project videos and make your own project on this breast cancer data based on those methods.

The next thing we are going to do is encode the Iris dataset. As I told you earlier, this is the iris flower dataset, and it contains three labels, whereas the breast cancer dataset contains only two. It's a similar procedure to what we have done, so I'll just copy the path from here and copy the code from above, changing the names. Now we are going to encode the labels: "label encoding of iris data". Let's paste the code here, paste the path, and name the DataFrame iris_data. Let's copy the path and paste it there, and run this; it will load the CSV data into a pandas DataFrame. The next steps are similar, so I'll use the head function, iris_data.head(), which gives the first five rows of the DataFrame. As you can see here, we have a total of five columns, and this species column will act as the labels.

Now we are going to transform these labels into numerical values, 0, 1 and 2, for the three labels. Let's use the value_counts function here: iris_data, and the column name here is species (you can see the column name as species), so let's mention it in quotes: iris_data['species'].value_counts(). Okay, something has happened here; we forgot to mention the parentheses. Now it works: we have 50 data points for the Iris-versicolor species, 50 data points for Iris-virginica and 50 for Iris-setosa, so we have 150 data points in total, and now we are going to encode these labels.

It is a similar step, so let's load the LabelEncoder again: "loading the label encoder". Let's create a variable called label_encoder_2; the first LabelEncoder we used was for the breast cancer data, and this one is for the Iris data. Let's store the encoder function in it: label_encoder_2 = LabelEncoder(). I'll run this. Now let's create the labels and store them in a variable called iris_labels: iris_labels = label_encoder_2.fit_transform(iris_data['species']), because we are going to transform this species column; you can see the column name here. Let's run this.

Now we need to append it to our original DataFrame: iris_data, and let's create a column called target and store these iris_labels in it: iris_data['target'] = iris_labels. Let's run this, and let's print the first five rows of the DataFrame with iris_data.head(). Okay, something has happened here: we shouldn't use label_encoder_2.fit alone; we need fit_transform, so we get an error here. It should be label_encoder_2.fit_transform. Let's run this again, and now you can see we get the target as 0. Now we can check with the value_counts function: iris_data['target'].value_counts(), okay.
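A quick way to double-check which number stands for which species is to look at the fitted encoder itself: it exposes classes_ (the labels in the alphabetical order used for 0, 1, 2) and inverse_transform to map numbers back to names. The species values below are a small stand-in for the real iris.csv column.

```python
# Verifying LabelEncoder's alphabetical mapping on stand-in species values.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

species = pd.Series(["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"])

label_encoder_2 = LabelEncoder()
iris_labels = label_encoder_2.fit_transform(species)

print(iris_labels)                               # [2 0 1 0]
print(label_encoder_2.classes_)                  # alphabetical: setosa, versicolor, virginica
print(label_encoder_2.inverse_transform([0, 1, 2]))
```

So classes_[0] is the species encoded as 0, classes_[1] the one encoded as 1, and so on; there is no need to guess from the counts.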
So now we can see that we have three labels: 0, 1 and 2. As I told you earlier, the labels are arranged in alphabetical order. Here the first species that comes alphabetically is setosa, then comes versicolor, then virginica, so the label 0 is given for setosa, 1 for versicolor and 2 for virginica. I'll just mention it here in a text cell: for Iris-setosa the label will be 0, for Iris-versicolor the label will be 1, and for Iris-virginica the label will be 2. So this is how you can take the labels from a DataFrame and convert them to numerical values using the LabelEncoder function that is present in sklearn.preprocessing. I hope you have understood how you can convert the labels into numerical values. That's it for this video, and I'll see you in the next video. Thanks.

Hello everyone, this is Siddharthan. Currently we are discussing the fourth module in our machine learning course. The fourth module is about data collection and data pre-processing, and this is the sixth video in the data pre-processing module. In this video we are going to discuss the train test split function. In case you are watching my videos for the first time: in this channel I'm making a hands-on machine learning course with Python, and you can check out the playlists in my channel to start learning the course from the beginning.

This train test split function is one of the important steps in data pre-processing, and we will do this train test split in every machine learning project video. Before I explain what is meant by this train test split function, I want to explain the general workflow we will follow in a machine learning project. This is how a machine learning workflow will look: the first step in any machine learning project is to get the data we want, and this data is chosen based on our problem statement.
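Since the coming discussion walks through these workflow stages one by one, here is a compact preview of the whole pipeline in scikit-learn code. The built-in breast cancer dataset and the LogisticRegression model are stand-ins chosen just for illustration; they are not the exact datasets or models used later in this course.

```python
# Preview of the workflow: get data -> split -> preprocess -> train -> evaluate.
# The built-in dataset and LogisticRegression are illustrative stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)            # 1. get the data (569 samples)

X_train, X_test, y_train, y_test = train_test_split(  # 2. split before scaling
    X, y, test_size=0.2, random_state=2)              #    -> 455 train, 114 test

scaler = StandardScaler().fit(X_train)                # 3. fit scaler on train only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)             # 4. train the model
model.fit(X_train_std, y_train)

preds = model.predict(X_test_std)                     # 5. evaluate on unseen test data
print(accuracy_score(y_test, preds))
```

Note that the split happens before the scaler is fitted, and the scaler is fitted only on the training data, exactly as taught in the standardization video above.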
So let's say that we want to predict whether a person is diabetic or not. In that case we want medical data for several persons, both diabetic and non-diabetic, and we use this data to train our machine learning model and do some predictions. So the first step is to collect the appropriate data. Once we have the data, we cannot feed it directly to our machine learning algorithm; we need to process the data, and this is where the data pre-processing steps come in. In data pre-processing we do a lot of things to the data: for example, if the data has some missing values, we need to handle those missing values through some method.

Once we do the pre-processing, we need to analyze the data, which gives us some meaningful insights out of the data. For example, a dataset may contain 10 columns or even 20 columns; we call these columns features, and we need to find which features are important for the prediction, and things like that. This is where we use data analysis: we make some plots and analyses to see which features are important. Once we analyze the data, the next step is to split the original data into training data and testing data; this step is known as the train test split, which is what we are going to see in this video. Once we split our original data into training data and test data, we feed the training data to our machine learning model. There are several machine learning models, and this training data will be used to train our model: the model will find the patterns and learn from this training data. Once it has learned from the training data, our model will be evaluated, and this evaluation is based on the test data. Evaluation is about finding how the model is performing, what the accuracy score of the model is, and such kinds of things. So the takeaway is that we use the training data for training the model, and we use the test data for evaluation.

Now let's see what this train test split is. When we have the original dataset, we take 80 or 90 percent of the data as training data and 10 or 20 percent as testing data, which is used for evaluation. But what is the need for this evaluation? Let's try to understand it with an analogy. Say a person is studying for a math exam and is preparing by practicing the questions given in a textbook; those questions become the training data. In the exam, the examiner will ask questions that may be outside that book, because only if the questions are asked from outside the book can we evaluate that the person has studied well; asking the same questions given in the textbook may not be a correct metric to analyze his performance. It is the same here: we cannot test our model on the training data, because our model has already learned from, and already seen, the training data, but it never saw the test data. That's why we need this testing data to evaluate our model: how it's performing, what the accuracy score is, and other metrics.

With that being said, let's get into the coding part of how to do this train test split. I will be doing this in Google Colaboratory. Before starting with the video, I'll just show you a quick intro to my channel. This is my YouTube channel, in which I am posting my machine learning videos. Once you go to my channel, you can see the video explaining all the modules and videos that I will be covering in this channel. In the description of the video you can see the curriculum file, so you can download it and go through it. I have also mentioned some important machine learning books that you can read, and these are also given
in the description of all the videos. You can go to the playlists section to check out the modules. As you can see here, the first module is on machine learning basics, the second module is on the Python basics required for machine learning, then the important libraries such as NumPy, pandas, Matplotlib and Seaborn, and the fourth module, which we are discussing right now, is data collection and pre-processing. I also have several machine learning project videos. I will be posting three videos per week: two videos on Monday evening and Wednesday evening, which will follow this course order, and every Friday I will be posting one machine learning project video. You can also check the "machine learning course with Python" playlist, in which I have incorporated all the videos. In case you are new to Google Colaboratory, you can go to the second module, Python basics for machine learning, where the first video is about how you can access Google Colaboratory and use the different features present in it.

Now let's get into this section, where we run our Python programs. I have taken an example project; this is about diabetes prediction. We have already done this machine learning project on our channel, so you can go to the machine learning projects playlist to see the full code there. Here we have the whole code, and I will stop this video at splitting the data into training data and testing data. Before that, I'll just give you a quick recap of what we have done here. I have already uploaded the dataset file: this is diabetes.csv, which contains the medical data, and I'll give the link to this dataset file in the description of this video.

First, we are importing the libraries. We need some important libraries such as NumPy, pandas and sklearn. We use the StandardScaler function to standardize all the data, and this is the train_test_split function, which we are going to see; it will automatically split our data into training data and test data. In this case I will be using a support vector machine model for training, and then we will compute the accuracy score. I won't be explaining the entire thing, because we have already done that in the diabetes prediction project video; I'll just explain the train test split part.

After that, we have loaded this dataset into a pandas DataFrame using the read_csv function, and we have seen what different columns are present in it. These are some data analysis steps, finding some statistical measures, and we found that there are two labels in our Outcome column. You can see the Outcome column here, either 1 or 0: in this case, 0 represents non-diabetic patients and 1 represents diabetic patients. You can see the mean for each case: for non-diabetic persons, this row is the average value of each column, and the other one is for diabetic people. Then we are splitting our data into features and target. The features are all the columns except this Outcome column; the Outcome column is the target, and we take it separately, while all the other columns act as the features. You can see here that we have stored all the features in X and the target in y. We have printed them: X doesn't contain the Outcome column, and y contains only the Outcome column. Then we have applied the StandardScaler function. Data standardization is used to bring all the values into a common range: you can see here that some values are in the range of hundreds, some are in the 20s and 30s, and some are around 0.6, and such kinds of things, so we use data standardization to have all the values in the same range. So we have standardized the data, and the next

step is where we will split our data into training data and test data. If you want to know more about this particular project, please check out that diabetes video, as I'm not explaining much here. Now I'll just create a text cell: "splitting the data into training data and testing data". Now we can use the train_test_split function that we have imported from sklearn.model_selection; you can see here that we imported it at the top, and that's what we are going to use now.

Before that, we need to create four arrays: X_train, X_test, y_train and y_test. I'll explain what is meant by these four arrays in a moment; before that, let me complete this line of code. So train_test_split, and inside the train_test_split function we need to mention the parameters. Here we need to mention X and y; we know that X is the features and y is the outcomes, and we need to split this X and y. Then let us mention the test size: test_size=0.2, where 0.2 means I want to take 20 percent of the entire data as test data. As I told you earlier during the presentation, we take either 10 or 20 percent of the data as the test data, right? So in this case I'll take 20 percent of the data as the test data. Then I'll mention another parameter, which is random_state: let me put random_state=2. You can give any integer value for this random state. The reason for it is that if you want to split the data the same way my data is getting split, you need to give the same value here: if you give the value 2, your data will be split in the same way mine is going to be split, and if you mention 3, your data will be split in a different manner. So these are the parameters we need to mention: X, y, test_size and random_state, and now we have these
four arrays so extreme is nothing but this features okay so these features will be uh splitted into extreme features and x test features so 80 percentage of the data points in this features will go into this extreme and twenty percentage will go into this x test and the corresponding labels of this x train will be uh you know going into this y train array okay so this uh corresponding y will be splitted into y train and y test so y 10 contains the corresponding labels for x strain and y test contains the corresponding labels for x test okay so let's run this okay so now you can go ahead and print the shape of x so x is nothing but our original dataset shape and xtrain dot shape and xtest dot shape okay so let's run this so we have in the original data set we had about 768 data points and eight columns so in rx train 80 percentage of data which is 614 values goes into the exchange and uh the testing data we have 20 percentage of the value which is 154 data points so this is how you can split your original data into training data and this data okay so if you want to know what we can do after this on how we can predict that predict that a person has diabetes or not so you can check out that diabetes prediction video in our machine learning project playlist okay so i hope you have understood about a train test split function so that's it for this video i'll see you on the next video thank you hello everyone this is siddharthan in this video we are going to discuss about how to handle imbalanced dataset okay so an imbalanced dataset is something which contains unequal class distribution okay so let's understand this with an example so let's say that we have a diabetes dataset so this diabetes dataset contains data points of patients who have diabetes and those who doesn't have diabetes okay so if that data set is imbalanced it will contain more data points for diabetic patients and the number of data points for non-diabetic patients will be very less say for example the 
number of data points for diabetic patients could be a thousand, while for non-diabetic patients there might be only a hundred. With a thousand points in one class and a hundred in the other, this is an example of an imbalanced dataset. We cannot feed such a dataset to our machine learning model as it is, because it will bias the predictions badly. Before training on this data we need to process it to remove the imbalance, and that is what we are going to discuss in today's video.

Before getting started: if you are new to this channel, hi — in this channel I'm making a hands-on machine learning course with Python, and I post three videos per week. Two videos, on Monday and Wednesday evening, follow the machine learning course order, and every Friday I post a machine learning project video. You can go to the playlist section in my channel to start learning the course from the beginning. This environment is called Google Colaboratory, and we will be writing all our Python code in it. If you are new to Google Colaboratory, go to the second module in my playlist section, called Python Basics for Machine Learning; its first video covers Google Colaboratory basics, and its index is 2.1. The index of this video is 4.7, which means it is the seventh video in our fourth module, and the fourth module is all about data collection and pre-processing — so the seventh video is about handling imbalanced data.

Now let's get started. You can see I have already uploaded a dataset here; it is fairly large, about 143 MB. I'll give the link for this dataset in the description of this video so you can upload it yourself. This is an imbalanced
dataset, which contains far more data points for one class than the other. We will load it into a pandas DataFrame and see how we can handle the imbalance. The first step is to import some libraries, so I'll make a text cell here that says "importing the dependencies". Let's import the two basic libraries, NumPy and pandas: import numpy as np, following the general convention, and import pandas as pd. You can press Shift+Enter to run the cell and go to the next one. Now we need to load the data in this CSV file into our pandas DataFrame. To upload the file, first connect your runtime — there will be a Connect option — then go to the Files panel, click the upload button, and upload the dataset file. Once it is uploaded, you can open its options menu and copy its path. Now we are going to load the dataset, so I'll make a text cell that says "loading the dataset into a pandas DataFrame". Let's name the DataFrame credit_card_data, which is equal to pd.read_csv — we have a CSV file, so we use the read_csv function, and inside the quotes we mention the path we just copied. Let's run this. The dataset here is a credit card dataset: it contains both legitimate transactions and fraudulent transactions, and it is used to predict whether a transaction is legit or fraudulent. We have already made a project video on this, so you can go to the machine learning projects playlist to see how to process this dataset and make the prediction; the idea of this particular video is only to handle the imbalance. If you want to learn about the
prediction itself, you can watch that video after completing this one. So we have successfully loaded the data into a pandas DataFrame called credit_card_data. Now we can print the first five rows of the DataFrame: mention the DataFrame name, credit_card_data, followed by .head() — this head function prints the first five rows. Looking at them, first we have the Time column: the first transaction is taken as second zero, and the column gives the elapsed seconds for all the other transactions. Then we have several feature columns, and finally the Amount column, which is in US dollars, and the Class column: 0 represents a legit transaction, which is nothing but a normal transaction, and the other class is 1, which represents a fraudulent transaction. So this is a typical classification problem. In total we have about 28 feature columns, and these features have been processed with a principal component analysis transformation, because the raw transaction details cannot be released — they may contain personal information. That's why the features appear as numbers that don't make direct sense to us, but these columns are still really important for our predictions. You can also see the Amount column and the Class column here. Similarly, you can print the last five rows of the DataFrame: just mention credit_card_data.tail(). Let's run this — the last five rows are also legit transactions. One more thing: this dataset doesn't contain any missing values. You can run isnull().sum() to check, and it confirms there are none. Our next step is to analyze the
distribution of the two classes: legit transactions, represented by 0, and fraudulent transactions, represented by 1. Let's do that now — we are going to determine the distribution of the two classes. For that, mention the DataFrame name, credit_card_data, then the Class column, and use the value_counts function: credit_card_data['Class'].value_counts(). This gives us the distribution of the data points. Let's run it. You can see that for label 0, the legit transactions, we have about 284,000 data points, while for fraudulent transactions we have only 492. This means more than 99 percent of the data is in the legit class and less than one percent is fraudulent — a typical example of an imbalanced dataset, where one class has far more data points than the other. This is why we call it an imbalanced dataset, so I'll make a text cell here noting that this is a highly imbalanced dataset. I'll also add another text cell mentioning that 0 represents legit transactions — meaning the transactions are legal — and 1 represents fraudulent transactions. Let's run that. Now we are going to separate the legit transactions and the fraudulent transactions, so I'll make a text cell that says "separating the legit and fraudulent transactions". I'll create two variables: the first one is legit and the second is fraud. For legit, we write credit_card_data and, in square brackets, a condition on credit_card_data.Class — Class is the column that contains the labels, 0 and 1 — and we know that a legit transaction has
the label 0. So legit = credit_card_data[credit_card_data.Class == 0]: this takes all the rows whose Class value is 0, so all of those roughly 284,000 legit data points get stored in the legit variable. We just need to do the same thing with one small change for fraud: fraud = credit_card_data[credit_card_data.Class == 1], since 1 represents fraudulent transactions. This separates the two classes and loads them into the legit variable and the fraud variable. Let's run this. Now you can inspect both by printing their shapes: legit.shape and fraud.shape. You can see we have about 284,000 data points in legit and 492 data points in fraud. Now what we are going to do is implement a technique called undersampling — a very important sampling method for handling imbalanced data. I'm going to keep the 492 fraudulent transactions, and out of the roughly 284,000 legit data points I'm going to take just the same number, so I will be taking 492 legit transactions. How will I take them? I'll draw a random sample from those 284,000 data points. This is what is called undersampling. What we are doing here is building a sample dataset with a similar distribution of legit and fraudulent transactions — that is our goal. The total number of fraudulent transactions we have is 492, so I'll also mention that here: number of fraudulent transactions — it is 492.
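The whole undersampling recipe described above can be sketched as a compact, self-contained example on a toy DataFrame. The column names and class sizes below are made up for illustration (the real credit card dataset has roughly 284,000 legit rows and 492 fraud rows):

```python
import numpy as np
import pandas as pd

# Toy imbalanced dataset standing in for the credit card data:
# 1000 rows of class 0 ("legit") and 50 rows of class 1 ("fraud").
# The Amount feature is randomly generated for illustration.
rng = np.random.default_rng(seed=42)
data = pd.DataFrame({
    "Amount": rng.random(1050) * 100,
    "Class":  [0] * 1000 + [1] * 50,
})

# Separate the two classes with boolean indexing on the label column
legit = data[data.Class == 0]
fraud = data[data.Class == 1]

# Under-sampling: draw a random sample of the majority class
# with the same number of rows as the minority class
legit_sample = legit.sample(n=len(fraud), random_state=1)

# Stack the two frames one on top of the other (axis=0 means row-wise)
new_dataset = pd.concat([legit_sample, fraud], axis=0)

print(new_dataset["Class"].value_counts())  # both classes now have 50 rows
```

Passing `random_state` to `sample` just makes the random draw reproducible; the video's `legit.sample(n=492)` works the same way without it.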
So now let's see how we can take a random sample from those 284,000 legit values. I'll create another variable, legit_sample — this is the random sample containing 492 legit transactions. It's not that you must take exactly the same number as the other class; anything close is fine, say 500, 550, or 400 values. But since the other class has 492, I'll also take 492 for legit_sample. As you saw, we loaded all the legit transaction data points into the legit variable — and I'll mention here that this is a pandas DataFrame, not an array. So legit.sample — this sample function returns a random sample, and in the parentheses we mention one parameter, n, which is the number of data points we want. When I put n=492, it picks 492 random rows from the legit DataFrame and stores them in the legit_sample variable. Let's run this and print legit_sample.shape — you can see we now have 492 data points. This is how you get a random sample so the two classes have an equal number of data points. Now we need to concatenate the two DataFrames: the legit sample and the original fraud rows. So I'll make a text cell that says "concatenating the two DataFrames". I'll create a new DataFrame called new_dataset, which is equal to pd.concat — pd stands for pandas, because we imported pandas as pd — and this concat function joins two DataFrames. We pass two things here: one is legit_sample, which contains the sampled legit data points, and the second one is
the fraud data points. If you remember, we separated the fraudulent transactions, labeled 1, into the fraud variable, so we mention it here. We concatenate these two DataFrames, and there is one more parameter to mention: axis. axis=0 means we want to stack the DataFrames one on top of the other, so the first 492 rows will be the legit transactions and the fraudulent transactions will be appended after them, row-wise. If we instead mentioned axis=1, the fraudulent transactions would be added column-wise, after the Class column — but we want to add rows, not columns, which is why we use axis=0: axis=0 means concatenate row-wise. Now we can run this, and then call new_dataset.head() to print the first five rows. Let's see the result: you can see the data is shuffled — the first column is the index, and whereas previously the rows were in order, the legit transactions have now been randomly sampled, giving these first five rows. Now let's print the last five rows with new_dataset.tail(). In the last five rows the Class is 1, so the fraudulent rows sit at the end of the DataFrame, whereas previously we had only legit transactions there. This is how you can handle an imbalanced dataset. Earlier we looked at the class distribution using the value_counts function; let's do the same for the new dataset: new_dataset['Class'].value_counts(). When you run this, you can see that class 1 has 492 data points and
class 0 has 492 data points as well. Previously we had about 284,000 data points for class 0 and only about 492 for class 1; now we have an evenly distributed dataset across the two classes, and when you use this dataset for prediction with machine learning, you will get better results. So this is how you can handle an imbalanced dataset in Python. I hope you have understood everything we covered in this video. I'll include the links for the dataset and this colab file in the description. If you want to go further with prediction on this dataset, go to the machine learning projects playlist and check out the credit card fraud detection project — in that project video I explain how to predict whether a transaction is legit or fraudulent. That's it from my side; I'll see you in the next video. Thank you.

Hello everyone, this is Siddharthan. We are currently discussing the fourth module in our hands-on machine learning course with Python, which is all about data collection and preprocessing. This is the eighth video in our data pre-processing module, and it is about feature extraction from text data. In this video I will explain what exactly feature extraction means and how we can implement feature extraction of text data using a TF-IDF vectorizer; I will also show you how to implement this in Python, so there is a hands-on part as well. First of all, let's understand feature extraction: the mapping from textual data to real-valued vectors is called feature extraction. Before going into it, I'd like to recap one basic thing about machine learning: basically, we feed our machine learning model a lot of data,
and the model finds patterns in that data and learns from it, as a result of which it can make new predictions. That is how a basic machine learning algorithm, or model, works. But when the data is in the form of text, it is hard for a computer or a machine to understand, whereas it can easily understand numerical data. So we have to convert the text data to numerical data, and this is where feature extraction comes into play: we convert the textual data into feature vectors, which are nothing but numerical representations of the text. This is called feature extraction, and once the text is converted into numbers, it is compatible with a machine learning model. Now we need to discuss a few terms used in feature extraction. The first is bag of words: as you can see here, a bag of words is the list of unique words in the text corpus (corpus means the collection of text). Say we have a paragraph — the feature extraction algorithm removes the repeated words and creates a list of all the unique words present in that text corpus. Then we use a TF-IDF vectorizer; TF-IDF stands for term frequency–inverse document frequency. In the next slide I'll explain exactly what term frequency and inverse document frequency mean, but first let's understand what the TF-IDF vectorizer does. As I've mentioned here, it counts the number of times each word appears in a document: we take the list of all the words in the paragraph or document, and count how many times each word is repeated. You may wonder: how does counting the number of
words help us convert text into numerical data? I'll give you an example. Let's say we are building a machine learning model that can predict whether a mail is spam or a normal mail. We have all encountered spam mail in our daily life, and we can say that spam mails contain words like "offers", "free", "discounts", and so on, while a normal mail — one sent by family members or colleagues — doesn't have that kind of vocabulary. When you count these words, the counts can tell the machine learning algorithm that a particular label is associated with particular words: here the labels would be spam mail and normal mail, and the counts tell our model that spam mails have words like "free", "discount", "offers", etc. This is how counting words helps the model understand what is present in the dataset. Another important thing to note: this vectorizer does not understand the context of the paragraph — it only counts the number of times each word is repeated. There are other methods that do capture context, which we will discuss in the natural language processing part of the course; but in classical machine learning we frequently go with this vectorizer approach. Now that I've told you what the vectorizer does, let's understand term frequency and inverse document frequency. TF stands for term frequency, and you can see the formula here: the number of times a term t appears in a document, divided by the total number of terms present in the document. So, for example, let's say there is a word such as "offer" in that
particular document. The vectorizer counts how many times the word "offer" is repeated in the document and divides that by the total number of words in the document — that is the term frequency, and it tells us which words are prominent. Then there is IDF, which stands for inverse document frequency, and its formula is log(N/n), where capital N is the total number of documents and small n is the number of documents in which the term t appears — you can refer to the formula. Basically, what it tells us is that the IDF value of a rare word is high, whereas the IDF of a frequent word is low. The reason we have this inverse document frequency is that there are words like articles — "the", "a", and so on — that are repeated many times, and we don't want to give significant weight to them. With inverse document frequency, a word that appears in many documents gets a small value, which tells the machine learning model that the word is not significant. The two values are then multiplied together: this product is the TF-IDF value, each term has its own TF-IDF value, and these values are nothing but our feature vectors — the numerical representation. As you can see, we get numerical values from these formulas, and this is how we convert text data into numerical data. I hope you have understood what we discussed here; now I'll show you how to implement it in Python. I'll go to my Google Colaboratory, where I have the code ready. We have taken this fake news prediction dataset; it contains news articles of two types — one is fake news
and the other is real news. You can see the details of the dataset here. We have already made a machine learning project on fake news prediction on this channel, so you can go to the playlist section in my channel to watch the complete video; here I'm just going to explain how to convert the text data to numerical data, so I'll give you a short introduction to what we are doing. We have imported the libraries and already uploaded the dataset — I'll give the link for the dataset in the description of this video, so you can download it from there — and we do some preprocessing; if you want to understand that part, watch the fake news prediction project video after completing this one. Let me come to the last part: what we are doing is taking the author names and the news titles and putting them together as one piece of content. As you can see, these are the columns we have — title, author, text (what is present in that news article), and the label, where 1 represents fake news and 0 represents real news. We are going to combine the title and author columns, and by analyzing these two pieces of text we are going to make our predictions. As you can see here, that's what we did: we stored all the author names and the corresponding news titles in X. Now we need to convert this text into feature vectors, and this is where we use our TF-IDF vectorizer. You can see I have imported the TfidfVectorizer function from sklearn.feature_extraction.text — this is the module that contains the TfidfVectorizer function. Now let's implement it; I'll make a text cell here that says "TF-IDF: term frequency and inverse document frequency". What we are going to do is
convert this textual data into feature vectors. We need to load the TfidfVectorizer into a variable — as you have seen before, we imported the TfidfVectorizer function from sklearn, so I'll just mention it here: TfidfVectorizer with parentheses, assigned to a variable called vectorizer, and run this. Now we need to fit and transform this vectorizer on our data. As you can see, we need to convert this X data, which contains all the content — content being the title of the news and the author of the news — so we mention vectorizer.fit and, inside it, X. Then we create the variable again — the same variable X, which contains all these values — and transform all the text data: vectorizer.transform(X). This transformation does what we discussed before: it counts how often each word is repeated and assigns the corresponding TF-IDF value to each individual word — that is what the fit and transform functions do. It converts all this textual data to numerical data, and the result is stored back in the variable X. Let's run this. You can see that earlier I printed X and we had the text data; now let's print X again after vectorizing, and we have a bunch of numbers. This is how you convert textual data into numerical data, and now we can feed this numerical data to our machine learning model to predict whether a piece of news is real or fake. If you want to know more about this particular project, go to the playlist section in my channel. And if you are new to this channel, I'd like to give you a quick introduction: in my channel
I'm making a hands-on machine learning course with Python. I post videos on Monday, Wednesday, and Friday: the Monday and Wednesday videos follow the machine learning course order, and every Friday I post a machine learning project video. So that is about my channel. I hope you have understood everything covered in this video; I'll see you in the next one. Bye.

Hello everyone, this is Siddharthan. We are currently in the fourth module, on data collection and data pre-processing, of our hands-on machine learning course with Python, and this is the ninth video, where we will take a numerical dataset as our use case and perform some data pre-processing steps on it before feeding it to a machine learning model. These are the previous videos in our data collection and data preprocessing module, where we discussed where to collect data for machine learning, how to collect it through Kaggle APIs and other methods, and data preprocessing techniques such as handling missing values, data standardization, label encoding, how to split a dataset into training data and test data, how to handle imbalanced datasets, and how to extract features from text data. In this particular video we will apply some of these data pre-processing techniques to process this dataset. I'll give the links for this playlist and my other playlists in the description of this video. In case you are new to my channel: in this channel I'm making a hands-on machine learning course with Python, posting three videos per week — two videos on Monday and Wednesday evening that follow this course order, and one detailed machine learning project video every Friday. I'll give the links for all the playlists in the description of my videos. This is
Google Colaboratory. If you are new to Google Colaboratory, you can go to the second playlist in my channel, on Python basics for machine learning; its first video covers Google Colaboratory basics, how to use it, and its other features. There is a Connect option here to connect your runtime, and a Files panel where you can upload your dataset. I have already uploaded mine: we will be taking the diabetes dataset, which contains numerical data, and doing some data pre-processing on it. Once you download the dataset you can use "Upload to session storage", or just right-click to find the upload option. I'll give the link for this dataset file in the description of my video. The first step in any Python program is to import the dependencies, so I'll make a text cell here that says "importing the dependencies" — dependencies are nothing but the libraries and functions we need for our project. Let's import some basic libraries: import numpy as np, and let's also import pandas as pd. NumPy is used for making NumPy arrays, and pandas for building pandas DataFrames. As you can see, our dataset is in a CSV file — CSV means comma-separated values — and it is hard to analyze and process data in that form, so we will feed it to a pandas DataFrame; DataFrames are nothing but structured tables, which makes analysis and processing much easier. Then, from sklearn.preprocessing, we need to import StandardScaler — I'll explain later why StandardScaler is used — and from sklearn.model_selection we need to import train_test_split. When it comes to
machine learning, we split our original dataset into training data and test data: we train our machine learning model with the training data and evaluate the model's performance using the test data, and this train_test_split function is what splits the data that way. You can press Shift+Enter to run this cell and go to the next one. The next step is data collection and pre-processing. First we need to load our data from the CSV file into a pandas DataFrame, so I'll make a text cell that says "loading the data from a CSV file into a pandas DataFrame". Let's create a variable, diabetes_data, which is equal to pd.read_csv — you can see in the cell above that we imported pandas in the short form pd — and inside quotes we specify the path of our dataset: go to the Files panel, open the options menu for the file, click "Copy path", and paste it here. Now let's run this; it loads the contents of the CSV file into a pandas DataFrame. Now we can print the first five rows of the DataFrame: mention the DataFrame name, diabetes_data, followed by .head() — this head function gives us the first five rows, which helps us understand what columns we have. As you can see, these are the columns in our dataset. This dataset is about female patients, and it tells us whether a patient has diabetes or not, along with these parameters: the number of pregnancies the patient has undergone, their blood
glucose level, their blood pressure, skin thickness, blood insulin level, BMI (body mass index), the diabetes pedigree function — a score based on the patient's family history of diabetes — and age. And we have the Outcome column, with values 1 and 0: 0 represents that a person is non-diabetic, whereas 1 represents that the person is diabetic. This outcome becomes our target variable. Basically, we will train our machine learning model with this data, and once trained, when we give it all the features from pregnancies through age, it will tell us whether that person is diabetic or not by giving the label 1 or 0. So that is how this dataset looks; let's see the processing necessary before feeding it to a machine learning model. Now let's find out how many rows and columns this dataset has: mention diabetes_data.shape — this shape attribute gives us the number of rows and columns, and note that we shouldn't add parentheses here. In total we have 768 data points and nine columns — you can count the columns yourself — so 768 means we have data for 768 different people. That is the shape; it tells us the length of our dataset and how many features we have. There are also some other measures we can get about this data: for example, diabetes_data.describe() gives us some statistical measures. (When I first ran it I got a "diabetes_dataset is not defined" error — I had made a mistake in the DataFrame name; it is diabetes_data.) So it gives us the statistical measures: what is
the count what is the mean of each of these columns what is the standard deviation minimum value percentile values maximum values etc okay so these parts come under data analysis and we won't be dealing much with data analysis in this particular video so our goal in this video is to know how to process this data okay so the next step is to split this data set into features and target so these eight columns are features and this particular column is called as target so our column of interest so we are using our model to predict this column right this outcome column so this becomes our target column and the others are feature columns okay so now we need to separate these features from this target value okay so i'll just make a text here as separating features and target okay so i'll create two variables x and y so i will be storing all the values except outcome in x and i'll be storing the outcome value in y okay so for this we need to mention the data frame name which is diabetes data and we need to use the function drop so diabetes data dot drop so this will drop either columns or rows so we need to drop a column here and the column which we want to drop is this outcome column so mention it inside these quotes so outcome okay and we also need to mention axis so axis is equal to 1 so when we are dropping a column we need to mention axis is equal to 1 and if we are dropping a row we need to mention axis is equal to 0 so one for column and zero for row now y so we need to store the outcome variable in y so diabetes data and inside it let's mention outcome okay so let's run this now you can print x and y separately okay so let's print x okay so now you can see here it doesn't have this outcome column and now you can print y so when i print y we can see that it has the outcome column label so let me put a text here okay
so let us put this as zero represents that that specific person is non-diabetic so you can find this data set and the details of this data set in kaggle so just search as diabetes data set kaggle in google so it will show you the website so if the label is zero that means that person is non-diabetic okay and for label one we need to mention that the person is diabetic okay so we have successfully split our features and our target okay so the next important thing to do is so you can see the data so the value of pregnancies is in the range of one six eight okay so it's within this one to ten range whereas the glucose level is in the hundreds okay and the blood pressure is in the 60s 40s and 70s and skin thickness is around the 20s and 30s and the insulin values you know there are a few zeroes here and we have values like 94 and 168 bmi is around the 20s and 30s and this particular diabetes pedigree function is in decimals and this age is in a different range right so what we are going to do is we are going to standardize this data so data standardization is all about transforming all these values into a common range okay so for this purpose we will be using the standard scaler function so when we run this standard scaler function on this data it will transform all the values to a common range okay but it retains the meaning the data has to tell okay so this is for our machine learning model to understand this data better and to make better predictions okay so now we are going to standardize our data so the next part is about data standardization and here i need to tell one important thing so what happens is a few people suggest to standardize the data before splitting the data into training and testing data whereas some people suggest to
standardize the data after splitting it into training data and testing data so here it's not about features and target so it is about splitting it into training data and testing data so some people would like to standardize the data before splitting it into training and testing data while some others prefer the other way around okay so in my opinion what i do is i standardize the data before splitting it into training and testing data because it then retains the original range of the data set okay so if we standardize after splitting then it can lose some information so it becomes a problem at that point okay so both cases have their own pros and cons so this is my method of doing it okay so now we are going to standardize our data so i'll just create a variable as scaler so scaler is equal to standard scaler and let's run this so this will load this standard scaler function into this scaler variable now we are going to do scaler dot fit transform so fit transform x so we are going to transform all these features right so we don't need to change this y because it's already in either one or zero so this is the label so we need not change this but we need to change these x values so what i am going to do is i am going to fit all these values in this standard scaler function and transform them so once we do this all this data will be in a similar range okay so let's run this okay so now you can see the data here so what's the range it is in so all of them are in decimals it's in the range of minus one to plus one okay so they will be in a common range so you can also print this data set as print standardized data okay so sorry so we need to mention a variable for this also so what i'll do is i'll store this transformed data into standardized data okay so what i am doing is i am
transforming all of this x and i'm storing it in this variable called as standardized data okay so let's see okay so now we can print this standardized data and we can see here all the data are in a common range now okay so now we can proceed with our further processing so now what we will do is i'll again store all of this standardized data in x and the target in y okay so let's put x is equal to standardized data okay so let's run this now again you can print x and y and see so x will be our features but it is standardized and now you can print your y so we have already stored all the values of this label in y okay so the outcome column in y okay so now what we will be doing before feeding this to the machine learning algorithm is splitting the data into training data and testing data and for this we will be using this train test split function okay so this step is about splitting the data set into training data and testing data okay so for this we need to create four arrays so this particular step is common for all the projects which we do so we need to create four arrays as x train x test y train and y test so now we are going to use the train test split function so i'll just explain you what is meant by this after running this so we are going to split x and y and let's mention test size how much data we want as testing data so let's put test size is equal to 0.2 and random state so random state is equal to 2 okay so what we are doing is we will be splitting our data into training data and testing data where x train is the training data features and the corresponding labels the corresponding outcomes for our training data features are stored in y train okay so this particular x so all these features will be split into x train and x test okay and the corresponding labels for x train so the corresponding y values will be stored in y train and the corresponding labels for this x test will be stored in y test okay so
that's why we need four arrays and here we are using the train test split function which we have imported from sklearn.model selection and in the parameters we need to mention x and y where x and y are those features and targets which we have already separated and the second parameter is about test size how much data we want as test size so here i have mentioned 0.2 so 0.2 means 20 percent of the data so generally we take 10 to 20 percent of the data as our test data and the remaining 80 percent of the data is training data and then we have this random state is equal to 2 okay so random state means if you mention random state is equal to 3 then your data will be split in a different manner and when i mention 2 the data will be split in a particular manner so if in case let's say that you are practicing this code and you want your code to split the data in the same way that my code is splitting then we both need to give the same random state number so this is for this okay so it is like a serial number or index for similar splitting of data so let's run this so this will create four arrays and you can print the shape of each of these x dot shape so x is nothing but the original data before splitting and x train so x train is the training data features and x test is the testing data features dot shape okay so let's find how many data points are there for each of these cases okay so now you can see here the total data points are 768 out of this 80 percent of data which is 614 data points are stored in this x train whereas 20 percent of data which is 154 are stored in x test okay so this is how you can split your data into training data and testing data so this is about data pre-processing of a numerical data set so this is a pretty easy data set a simple data set so we don't have any other big work here so in some cases if we have some missing values we need to handle those missing values and other kind
of things so i have mentioned all of these procedures in this particular playlist okay so i'll give the link for this playlist in the description of this video and after this we can feed this training data to our machine learning model so that it can learn from it so i have already made a project video on this diabetes prediction in my channel so you can go to the machine learning projects playlist in my channel where you can find this particular project okay so that's it about this data and about this video and i hope you have understood how we need to do these data pre-processing techniques on a data set okay so that's it for this video and in the next video i'll explain you how we can do these data pre-processing steps on text data okay so this text data is kind of more complex and complicated when compared to numerical data because we need to convert all that text data into numerical values and there is a lot of processing that we need to do so it is an interesting video so that video will be posted coming wednesday okay so that's it from my side i'll see you in the next video thank you hello everyone this is siddharthan this is the last video in our data collection and data pre-processing module in the previous video we have discussed what are the data pre-processing steps that we need to do on a numerical data set and in this video we will be taking a textual data set and see what are the data pre-processing steps that we need to do before feeding it to a machine learning model for predictions okay so in case you are watching my videos for the first time this is siddharthan and in this channel i'm making a hands-on machine learning course with python so you can check out the details in the description of this video i'll give the details about my machine learning course and my machine learning projects so i will be uploading one machine learning project video every friday okay so you can find all the details and the link for the playlist in the
description of this video okay so with that being said let's get started with today's video and for this video we will be taking this fake news prediction data set okay so i have already posted a video on this fake news prediction and if you have watched that video you don't need to watch this video because it's basically the same procedure okay so in case you haven't watched it or you want a better understanding of all the data pre-processing that we do on that text data you can continue watching this video okay so now the first step is to upload our data set into this google colab environment okay so you can search for google colab so if you are not aware of this so when you search for google colab you will find this colab.research.google.com so there you can do your python programs okay so it is a cloud-based python environment so there will be this connect option so you can connect your system from here so you can find this files option here so there you can go to this upload option to upload your data set file so now the data set which we are going to use is around a hundred mb in size and if we upload this data set it can take some time so what we need to do is we can upload this data set in our google drive okay so then you can go to this mount drive option okay so when you give this mount drive your google drive will be linked to your google colaboratory account and the important thing is both these accounts so this account in your google drive and in google colab should be the same account okay so in some cases we may be dealing with data sets that are about 1 gb 2 gb or even 10 gb in size so in those cases we cannot upload our dataset every time we open it in our google colab okay so in such cases we upload it to our google drive and process it from google colab so that's what we are going to do now i have already uploaded my fake news data set in my drive so i'll give the link for this dataset file in the
description of my video so you can also find it in kaggle so i'll give this mount drive so while mounting okay so connect to google drive so now it will ask for your account so you need to authorize it so once you authorize your account it will mount your drive okay so this is our drive so when you click this arrow here it will give all the subfolders in it so i'll go to my drive and there you can see all the folders i have so i have stored this data set in this data sets folder so you can see here my drive data sets and fake news data set so i'll go here and this is my data set train dot csv so this is a csv file so from here we can access this dataset instead of uploading it every time in our google colab and uploading to google drive actually takes very less time when compared to google colab okay so the first step is to import the dependencies so i'll make a text here as importing the dependencies okay so the pre-processing of text data is more challenging and interesting than numerical data because there are a lot of steps which we need to do here because computers don't understand text well so we need to convert this text into some meaningful numbers some numerical data okay so that is the processing which we will be doing for this text data so let me import some basic libraries so i'll import numpy as np so we generally import numpy and other libraries in their short forms so instead of using numpy we can use it in a short form as np so that's the general convention everyone uses in python okay so now let's import pandas as pd then i'll import another library called as re so re means regular expression so you can just search in google about the regular expression library in python so you will find the documentation so it is very useful for scanning and going through the text in documents so this is called as the regular expression library numpy library is used for making arrays and
pandas is for making data frames so data frames are nothing but structured tables so as we know the data set is now in a csv file so csv means comma separated values and it is not easy to analyze the data from a csv file okay so to give it a better structure we will feed it to a pandas data frame so we have imported three libraries and now we need from nltk so from nltk dot corpus import stopwords so from this particular library i'm importing a function so import stopwords so i'll explain you what is meant by these stop words and other things when we encounter them in this particular video so corpus means some text content so a corpus can be a paragraph or it can be a document which contains all those words okay so it's basically a collection of words and nltk means natural language toolkit so this natural language toolkit library contains several functions and important methods that we use for our text processing and now let's import from nltk dot stem dot porter import porter stemmer so then from sklearn dot feature extraction dot text import tf idf vectorizer so we have already made a video on what is meant by vectorization what is meant by feature extraction and tf idf vectorizer and i'll also explain you in this video what these things are and from sklearn dot model selection import train test split okay so these are the dependencies we need for this particular processing and now we need to download these stop words so let's also import nltk separately import nltk so this nltk is the library and we are importing this entire library in this particular line whereas in this line from this library there is a separate module called as corpus and that contains a function called as stopwords so we can also import specific functions we need from a library whereas in this particular case we are importing the entire library so that's the difference between importing
this nltk and like this okay so that's the difference now we need to download nltk dot download stopwords so let's run this okay so to run a particular cell and go to the next one you can press shift plus enter so this will download all the stop words so i think it's already downloaded now we can print these stop words printing the stop words so i'll show you what these stop words are print stopwords dot words english so now we can see the list of stop words so these words are nothing but i me my myself we our ours etc so we have a lot of stop words so stop words are those words which can be repeated a lot of times in a paragraph or in a document okay but these words don't convey much meaning so these words are stop words so when we do this data pre-processing we always encounter a huge amount of data and these words don't convey much meaning so we need to remove these words from our data set so for this purpose we need to download these stop words to identify these words in our data set so that's the purpose of downloading these stop words and the next step is now we can go to the data pre-processing step and make a text here as data pre-processing so the first step is to load our data set into a pandas data frame so as i have told you earlier we cannot access it easily from a csv file so i am going to load the data to a pandas data frame so let me create a variable called as news data and i'm going to use the function pd.read csv so this read csv function is present in the pandas library and this read csv function will load the dataset from a csv file to a pandas data frame so you can go to this files and you can see this options here so from there you can copy the path of this file and paste it in the quotes now you can run this and this will create a new data frame which contains this entire dataset so now let me print the first five
rows of the data set here so i'll make a comment here as first five rows of the data set so it is always a good practice to mention what you are doing in a line of code by these comments so if someone else sees your code then it helps them to understand what you are doing in a line of code so that's the importance of commenting so i'll mention this data frame name as news data dot head so this head function will print the first five rows of the data frame so we have this id column title column author column text column and label column okay so this is the title of the news and here we have the author name of that particular news and the entire text of the news and finally we have this label here so if the label is 0 that means the news is a real news and if the label is 1 then the news is a fake news and the idea behind this data set is to train a machine learning model to let it understand which news are fake and which news are real and these labels help the model to understand it so i'll just make a text here as zero means real news and one means fake news okay so now let's see how many total data points we have so news data dot shape so this will tell us the number of rows and columns present in the data set so totally we have twenty thousand eight hundred rows and five columns so these are the columns right id title author text and label and totally we have twenty thousand eight hundred rows so twenty thousand eight hundred rows means we have that many different news articles so we also call these data points so each particular row represents a data point or a separate news article okay so totally we have these many data points so it is a pretty large data set but it is not the largest we may also deal with hundreds of thousands of data points and even more than that so here we have 20 000 data points which is very good so if your data set is large then you can make better predictions with your model okay
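as a quick sketch of these inspection steps — head and shape — here is a minimal runnable version, using a tiny hand-made stand-in data frame (hypothetical rows) in place of the real kaggle train.csv, which isn't bundled here:

```python
import pandas as pd

# tiny stand-in for the kaggle fake-news train.csv (hypothetical rows,
# same five columns: id, title, author, text, label)
news_data = pd.DataFrame({
    'id': [0, 1, 2],
    'title': ['headline one', None, 'headline three'],
    'author': ['author a', 'author b', None],
    'text': ['full article text', 'more text', 'even more text'],
    'label': [1, 0, 1],  # 0 = real news, 1 = fake news
})

print(news_data.head())   # first five rows of the data frame
print(news_data.shape)    # (number of rows, number of columns)
```

with the real data set loaded via pd.read_csv the shape would come out as (20800, 5) instead of this toy (3, 5).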
so the more the data the better the performance of the model is so now let's see if this dataset contains any missing values so checking for missing values news data dot is null dot sum so this function will give us the number of missing values in each column so in this id column we don't have any missing values in the title column we have about 558 missing values in author it's about 1957 and in text it is 39 okay so what this is so there can be a few news articles where the author can be anonymous or the curator of this data set may not have found the author name so in those cases we have these null values or missing values okay so in this particular case we can replace all the missing values with a null string so if a value is missing it will be represented as nan and nan means not a number okay so we need to replace this nan with a null string so in the case of a numerical data set we can impute it with the mean value or mode value but we cannot find the mean value for this text dataset right so in this case we will replace all the missing values with a null string so let me put it here we are replacing the missing values with null string and news data which is equal to news data dot fillna and let's mention quotes so here fillna means fill the not available values and what we are doing is so here i have mentioned quotes so in these quotes if i mention some word let's say i mention a word like tree then what happens is all the missing values will be replaced with this word tree and when i don't mention any word inside these quotes then it is called as a null string so it is a string but it doesn't contain any text or value so i want to replace all the missing values all the nan values with this null string so if you just feed this data set which contains these nan values to our machine learning model it can throw some error so it doesn't understand those nan values so now we need to replace them with this null string so
let's run this okay so this will replace all of them now what we are going to do is in this particular prediction we are going to take this title column and author column so we won't be taking this text column so we will be analyzing our data set only with these two features so what i am going to do is i am going to combine the values in this author column with their corresponding title column so what happens is if we take this first row we will create a separate column and that column contains the author name plus the title of the news and the second row contains its author name and the corresponding title so now let's merge the author name and news title okay so i'll mention the data frame and in this let's create a new column called as content and in this content let's combine the author name and title of the news so which is equal to news data author so just mention quotes here okay author plus news data and now let's mention title okay so what we are doing is we are creating a new column called as content in this data set in this data frame so we have named this data frame as news data so we are creating a new column and in that column we are taking this author column and we are just giving a space and combining it with the title of that particular news so let's run this and now you can again print the first five rows of the data frame now let's see what this content column looks like okay so now there is this new column called as content and it contains the author name first and the title of the news in each of the rows okay so now we are going to use this content column for further processing and now we will do this text processing on this particular column and not the other ones okay so now let us separate the two important columns so here we need this content column and this label column okay so we don't need any other column because we have already combined these two so i am not going to make the prediction based on this text
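these two steps — filling the missing values with a null string and merging author and title into a content column — can be sketched like this, again on a tiny hypothetical data frame standing in for the real news data set:

```python
import pandas as pd

# hypothetical rows standing in for the real news data frame
news_data = pd.DataFrame({
    'title': ['some news title', None],
    'author': ['author a', None],
    'label': [1, 0],
})

# replacing the missing values with a null string
news_data = news_data.fillna('')

# merging the author name and news title into a new 'content' column
news_data['content'] = news_data['author'] + ' ' + news_data['title']
print(news_data['content'].tolist())
```

note that a row whose author and title were both missing ends up as just a space, which the later stemming step removes anyway.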
if you want to make your prediction based on this text you can add this text column with this author and title column but it will take a long time because this data is very huge so that's why i'm taking this author and title alone so now let's separate this content and label so separating feature and target here the target is this label so the target is nothing but what we are predicting whether it's real or fake so this is called as target and here the feature is nothing but this content column because our model is going to understand these features and it's going to predict the target as label one or zero okay so now let's put two variables as x and y and in x let me mention news data dot drop columns so i'm going to drop a column here label so let us drop this particular label column and now we need to mention axis so axis is equal to one so here we are dropping a column so we are mentioning axis is equal to one in case you are dropping a row you will be mentioning axis is equal to 0 so now we are going to store this label column in y so news data label okay so now you can print x and y so this x doesn't contain labels and when you print this y it will contain all the labels and now we can do our processing on this x so y doesn't need any processing because it already is in the form of numerical values in the form of labels now we are going to do an important step here so this process is called as stemming okay so stemming is nothing but i'll mention what is meant by stemming here stemming is the process of reducing a word to its root word okay so what is meant by this root word let's say that there are words like enjoyable enjoyment enjoying and enjoyed so these words are there and the root word for all of these words is nothing but enjoy so this enjoy can be represented in different forms so we don't want all those different forms so what we do is if we have those kinds of words each will be reduced to
its root word because root words are small and the processing takes place faster and the processing is easier when you reduce a word to its root word so let me put it as root word instead of keyword and for this purpose we are going to use the porter stemmer function here so we are going to do this stemming using this porter stemmer and we have imported this porter stemmer from nltk.stem okay so let's create a variable called as port stem and in this port stem i'll load this porter stemmer function okay let's run this so this will load this porter stemmer function to this port stem variable so it is just loading one instance of this function to a new variable so let's run this okay so now let's create a function to do this stemming procedure so let's create this function and name this as stemming and in this stemming we need to give this content column right so let's give this parameter name as content okay so now we need to do some operations in this particular function and let's name a variable as stemmed content so what we are going to do is stem all of this text and put it in a variable called as stemmed content and this stemmed content is equal to re dot sub so you can see here this regular expression as i have told you this is useful for going through the text in a particular paragraph or a document okay so now we need to mention this re dot sub so you can see the purpose of this particular function so now let me mention a to z and capital a to z so i'll explain you what i am doing once i complete this line of code okay so i want to go through all the content so all the text in this particular content column and take all the words from a to z so if the character is in this particular range it will take it so the purpose of this is i want to remove all the
punctuations like comma full stop and other things so that's why we are mentioning that i want all the content which is in the form of words so that's why we are mentioning lowercase letters or capital letters so it's either lower case or upper case so the other things don't represent a word so i want all the characters that are not punctuations and not numbers so now the next step is to convert all the words to lowercase letters so all the uppercase letters will be converted to lowercase letters and stemmed content so it's the same stemmed content variable so all these letters will be changed to lowercase form so again mention this stemmed content dot lower so this lower function is a string function in python so now we need to separate all the words to do further processing and for that we need to split all these words so stemmed content is equal to stemmed content dot split so this will split all the words and now stemmed content so now let's do this stemming procedure stemmed content is equal to you can see here we have imported this porter stemmer function in this port stem variable so i'll mention it here port stem and in brackets let's mention word and now i'm going to create a for loop i'll explain you in a minute what we are doing in this particular line of code for word in stemmed content if not word in stopwords dot words english okay so what we are doing is i'm going to apply this stemming function to all the words present in this content column so we are taking all the words we are excluding all the punctuation marks and other things we are just taking all the words and we are converting all of them to lowercase letters and then we are splitting each of the words and now we are going to apply this stem function to those words if that word is not present in these stop words so you can see here we have
downloaded these stop words right so i will be taking each of the words in this content column and if it is a stop word we don't want that so we want all the words that are not present in the stop words so basically we are removing all the stop words from this column and when we take all of those words we are applying stemming so we are converting those words to their root words so what happens is let's say that you consider this first line so it contains this we so this we is an example of a stop word so this particular word won't be considered and it will be removed from all the rows so all the stop words will be removed and now we can have words like truth and as i have told you earlier let's say that there is a word called as enjoyment so that word will be taken and it will be converted to its root word so that's what we are doing so we are taking all the words and if that word is not present in the stop words we are applying this stemming function so that's what we are doing in this particular function called as stemming and finally let's return this so whenever you are working on a function so this def is used for define so we are defining this particular function and once all of the procedures in this function are carried out we need to return the stemmed content so return stemmed content okay so now we have created a function but we didn't apply this function to our data set yet so we haven't applied it to this data set yet now we need to apply this to our data set so let's run this function again so it won't do anything so now we need to apply this stemming function to our content column okay so let's mention this news data content which is equal to news data content dot apply stemming okay so what we are doing is we are taking this particular content column in our data frame and on that we are applying this stemming function so let's run this so it might take some time okay so this
object is not callable. So I think we have made a mistake here; let me see what it is. Okay, it's nothing but this: here we need the function port_stem.stem, because this port_stem.stem function is what actually does the stemming procedure. So let's run this again. Okay, so this is running, and it will stem all the words to their corresponding root words; this might take some time because this dataset is very huge, so it may take a minute or two. Once we apply this stemming function we have the important text data, and once we have that important text data we can convert it to numerical values. What we do is use a method called feature extraction, and with that we will convert all this text to feature vectors. Okay, so let me pause this video until it has run. Okay, so it took about three to four minutes. Now I'm going to print news_dataset['content'], and let's see whether the stemming has happened here. You can compare this stemmed content to the content which we originally had. All these words have been split, and the stemming has been applied to reduce each word to its corresponding root word. So now we can separate the data into the corresponding features and target. I'll create two variables, x and y; in x I am going to take news_dataset['content'].values, and let's take all the labels in y, so news_dataset['label'].values. These are the two things which we need here: x is the features and y is the corresponding labels. You can print x and y separately to see what they are. So this is x, and let's print y as well. Okay, so y is nothing but either one or zero. Now let's print the shape of y, so y.shape; we shouldn't put parentheses here, since shape is an attribute. y.shape gives (20800,).
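The text pre-processing described in this section, stemming with stopword removal followed by the TF-IDF conversion that comes next, can be sketched like this. It is a minimal stand-in, assuming NLTK and scikit-learn are installed: the stopword set here is a small hand-coded sample rather than the full `stopwords.words('english')` list, and the two-row data frame stands in for the real news dataset.

```python
import re
import pandas as pd
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# small hand-coded sample standing in for nltk's English stopword list
STOPWORDS = {"he", "was", "the", "we", "a", "an", "is", "of"}

port_stem = PorterStemmer()

def stemming(content):
    # keep only letters, lowercase, and split into words
    words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
    # drop stopwords and reduce the remaining words to their root form
    stemmed = [port_stem.stem(word) for word in words if word not in STOPWORDS]
    # join back into a single string so TfidfVectorizer can consume it
    return ' '.join(stemmed)

# two-row stand-in for the real news dataset
news_dataset = pd.DataFrame({'content': ["He was enjoying the truth",
                                         "We discussed the facts"]})
news_dataset['content'] = news_dataset['content'].apply(stemming)
print(news_dataset['content'].tolist())

# convert the stemmed text to numerical feature vectors
x = news_dataset['content'].values
vectorizer = TfidfVectorizer()
vectorizer.fit(x)                    # learn vocabulary and IDF weights
x_vectors = vectorizer.transform(x)  # text -> sparse feature vectors
print(x_vectors.shape)               # (documents, vocabulary size)
```

Note that `stemming` returns one joined string per row, which is exactly the fix discussed below for the "'list' object has no attribute 'lower'" error.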
Now we can apply the feature extraction, so we are going to convert all these words to their corresponding feature vectors: converting the textual data to feature vectors. If you remember, we have imported this TfidfVectorizer, and we are going to use it: vectorizer is equal to TfidfVectorizer(), so we load this TfidfVectorizer function into the vectorizer variable, and then vectorizer.fit, so we are fitting all the data in this content column to this particular vectorizer. Then we are going to transform x; so first we need to fit on x and then we need to transform it, vectorizer.transform(x). We fit it so that it understands what these words are, and then the vectorizer can transform all the words to their corresponding feature vectors. TF represents term frequency and IDF represents inverse document frequency: it finds the words which are repeated a lot of times and assigns importance values to them, and by this it understands whether a word is meaningful or not. Let's say we are predicting whether a mail is a normal mail or a spam mail. Spam mails contain words like offer, free, discounts, etc., so when you apply this kind of TF-IDF vectorizer to those datasets it can identify that those words are repeated in spam mails, and this will help our model understand what makes a spam mail and which mails are normal mails. So let's run this. Okay: "'list' object has no attribute 'lower'". What happened is I have missed one line of code here. Once we apply the stemming function we need to join all those words, so all the words in a particular line are combined together. Before, as we saw, the output was all these words separated by commas and enclosed in a list; it shouldn't be like that. We need to join all the words into one line, so we need to use the ' '.join() function on the stemmed content. So I have
included this particular line of code and rerun everything here. So this is the mistake that we made in that particular line. Now we can run this vectorizer cell and see whether it is working or not. Okay, I think it has worked properly now. Let's print x and see whether it has converted all this text data to feature vectors. Now we can see it contains a lot of numbers; these are the feature vectors for all those corresponding words. When we printed x before, we had all these words, right, and now you can see x has been converted to numbers: each of these words has its corresponding numerical values. So this is how you can apply these vectorizer functions. The next step would be to split your dataset into training data and testing data, and once you have done that you can feed it to your machine learning model. So this is about data pre-processing of text data. This is the general procedure which we follow; a few things may change depending on the dataset, but it is the general procedure. If you want to know more about how to split this data into training and test data and how to do further predictions, you can go to the machine learning projects playlist in my channel; there the fifth project is about fake news prediction, and there you will find how to feed this data to a machine learning model and make predictions. So I hope you have understood all the contents that we have discussed in this video. I'll see you in the next video, thanks. Rock versus mine prediction using sonar data. Okay, so we will do all the programming in Python, and you don't need to install any Python software, because we will be using Google Colaboratory. Google Colaboratory is a cloud-based system where you can write your Python scripts, and you just need Google Chrome for it; I'll show you how you can access it later in this video. First of all, let's try to
understand more about this use case. Consider that there is a submarine. There is a war going on between two countries, and a submarine of one country is going underwater towards the other country, and the enemy country has planted some mines in the ocean. Mines are nothing but explosives that explode when some object comes in contact with them, right. There can also be rocks in the ocean, so the submarine needs to predict whether it is crossing a mine or a rock. So our job is to make a system that can predict whether the object beneath the submarine is a mine or a rock. How this is done is, the submarine uses a sonar that sends sound signals and receives the signals that bounce back. This signal is then processed to detect whether the object is a mine or just a rock in the ocean. Okay, so let's try to understand how we are going to do this; first of all, let's see the workflow for this project. First we need to collect the sonar data. How is this data collected? In a laboratory setup, an experiment can be done where a sonar is used to send signals and receive the signals bounced back from a metal cylinder and from some rocks, because mines will be made of metal, right. So we collect this data, which is nothing but the sonar data obtained from a rock and from a metal cylinder, and we feed this sonar data to our machine learning model; then our machine learning model can predict whether the object is made of metal or is just a rock. This is the principle we are going to use in our prediction. So first we need to collect the data. Once we have the data we will pre-process it, because we cannot use the data directly; there are various steps in that which we will do in this video. Then we need to analyze the data, to understand more about it. Once we process the data we will split it into training
and test data. So why are we splitting into training and test data? Let's say there are 100 examples; with 100 instances of data, we will train our machine learning model with 90 instances, and we will test our machine learning model, that is, evaluate it, with the other 10 data points. So out of 100 data points we may use 90 for training and another 10 or 20 for testing the model. Once we split our data into training and test data, we feed the training data to our machine learning model. In this use case we are going to use a logistic regression model, because logistic regression works really well for binary classification problems, and this is a binary classification problem: we are going to predict whether the object is a rock or a mine. So we will be using a logistic regression model, which is a supervised learning algorithm. Once we train it with the training data we will get a trained logistic regression model. This model has learned from the data what a metal cylinder looks like and what a rock looks like in the sonar readings, so it can recognize them based on the sonar data. Now when we give it new data, it can predict whether the object is just a rock or a mine. This is the workflow we will be following in Python for this use case. Okay, so now let's go into the coding part. Before that I'll show you the data file. This is the data, sonar data.csv, so it is a CSV file. You can find this data on Kaggle and other data sites like the UCI machine learning repository, and I will be giving the link for this file in the description of this video. Let's look at this dataset. As you can see here there are a lot of numbers and a lot of columns. So how many instances are there? There will be almost 200 examples; as you can see here, there are two hundred and
eight examples, that means 208 data points. And in the last column we have something that reads R or M: R represents the values for rock and M represents the values for mines. As I told earlier, these values are obtained in a laboratory setup where the sonar data is collected for a metal cylinder and for a rock. As you can see there are several features; features represent the columns of this dataset. We feed this dataset to our machine learning model, and with the help of this sonar data it can learn what a rock can look like and what a mine can look like. So let's see how we can make this Python script; I'll close this. As I told earlier, we will be doing our programming in Google Colaboratory, so search for Google Colab. This is the website for it: colab.research.google.com. You just need to choose New Notebook here. Google Colaboratory is linked to your Google Drive account, so if you have any Colaboratory files they will show up here; I'm going to use a new notebook. These are nothing but Python notebooks; you may have noticed this in Jupyter notebooks, it has the extension .ipynb, so it is just like Jupyter notebooks. As you can see, this .ipynb file is known as a Python notebook. I'll change the file name to Rock versus Mine Prediction. Now you can see this Connect button here; go ahead and connect, and what happens is our runtime gets connected to a Google server. It is completely free, Google Colaboratory is completely free, and you will be allocated a system with a very good storage size and very good RAM: as you can see here we have 12 GB of RAM and 107 GB of storage, which is really good, better than most of our systems. So we will be doing all our programming in Google Colaboratory. Okay, so as you can see here, this is called a cell, and we will write our Python scripts in these cells. As you can see, you can use this Code
option to create another cell, and if you choose Text you can write some comments or a description about your code. I will tell you about the features of Google Colab as we use it for different purposes. As you can see here, this is where we will upload our data files. I have already shown the sonar data file to you, so how do you upload it? You can click this folder option here, and then either choose this icon, which is to upload a particular file, or right-click here and click Upload. I'll upload the sonar data. It is a very small data file, and you can find it on Kaggle or the UCI machine learning repository. Okay, so as I told you, our agenda here is to predict whether the object is a mine or a rock using the sonar data. First of all we need to import the libraries, that is, import the dependencies, because we will require several functions and libraries for this. I'll write a comment here; as I told you earlier, a text cell is for writing a description or a comment about your code, so just type "Importing the dependencies". You can press Enter, or you can press Shift plus Enter to complete it and go to the next cell. Likewise, once you write a Python script you can click here to run your code, or press Shift plus Enter to run the code and go to the next cell. So first we need to import some libraries. We will require numpy, so import numpy as np, and we also need pandas, so import pandas as pd. Numpy is basically for arrays, and pandas is for several data processing steps; we will see about that later. Now we need a train test split. We have seen earlier that we need to split our data into training data and test data, so we require a function for that: from sklearn.model_selection import train_test_split. This function is used to split our data into training and test data, so we don't need to do it manually. Okay, so then
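Collected into one cell, the imports dictated in this part of the video (including the sklearn ones that are spelled out and corrected just after) look like this:

```python
import numpy as np                 # numpy arrays
import pandas as pd                # data frames for loading and processing the CSV
from sklearn.model_selection import train_test_split   # splitting into train/test sets
from sklearn.linear_model import LogisticRegression    # the model for this use case
from sklearn.metrics import accuracy_score             # evaluating the model

print("dependencies imported")
```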
we need our logistic regression model. sklearn is a very good Python library for machine learning algorithms and other functions, and we will encounter it in various places. There is a small mistake here: it should be sklearn.model_selection, so from sklearn.model_selection import train_test_split. Now we need to import our logistic regression model, from sklearn.linear_model; this is how you import logistic regression: import LogisticRegression. And we need the function accuracy_score: from sklearn.metrics import accuracy_score. This is used to find the accuracy of our model. So these are the libraries and functions and their meaning: first we import numpy, which is used for creating numpy arrays; then pandas, which is used for loading our data and numbers into a nice table, and these tables are called data frames, which we will discuss at a later point. Then we have train_test_split, which we import from the sklearn library; then we have imported the LogisticRegression model; then we have imported the function accuracy_score. You can press Shift plus Enter to run this cell and go to the next cell; if there is any printed output it will show here. Okay, now let's do the data collection and processing steps. I'll just put a text cell here: "Data collection and data processing". We have already uploaded the data; there are several other methods to upload data into Google Colaboratory, for example straight to Colaboratory using some APIs such as the Kaggle API, but we'll discuss that in some other project video. As you can see here we have the sonar data file, and now we need to import this sonar data into a pandas data frame. I'll make a comment; in Python, anything prefixed by a hash is a comment: "loading the dataset to a pandas data frame". I'll create a variable called sonar_data; I will be loading this data into a data frame, and I have named this data frame sonar
data. As you can see here, I have imported pandas as pd; this is just like an abbreviation, so I'll be using it: pd.read_csv. As I told you earlier, we have the data file as a CSV file, so we need to use the function read_csv. Now we need to mention the name and the location of the file. You can do that by going here and clicking here; you will see this option called Copy path. This copies the path and name of the file, and we have to enclose it in quotes and put it inside the brackets. Then, as you can see here, we don't have a header for this file; a header means names for the columns, right. Since there is no header row, we need to tell pandas that there is no header, so header=None. I'll press Shift plus Enter and this loads our data into a pandas data frame. Now let's have a quick look at our dataset; I'll just type sonar_data.head(). What this function does is display the first five rows of our dataset. I'll run this, and as you can see we have the first five rows of our dataset, and there are several columns. The last column label you see is 60, but because in Python the indexing starts from zero, in total we have 61 columns: 60 feature columns, and in the last column we have R or M. I have shown you, right, it's either rock or mine, so it is a categorical value. So this is the use of the head function: it prints the first five rows of the dataset. Now what we will do is see how many rows and columns there are, the number of rows and columns. And if you don't understand any function, you can just search Google for it. Let's say you want to know what this read_csv function does: you can go to the pandas documentation, pandas.read_csv. So this
is the pandas official documentation page. You can go here and see the use of this particular function. As you can see, it reads a comma-separated values (CSV) file into a DataFrame, and it supports optionally iterating or breaking the file into chunks. You can do this for any function in order to learn what that function actually does. These are the parameters; we don't need all of them, and in our case we have used only two: the path of the file, and header=None to indicate that there is no header. So if you have any doubt about a function you can search for it in Google like this. Okay, so now we need to find the number of rows and columns. We can find it using the attribute called shape: sonar_data.shape. This gives us how many rows and columns there are. In total we have 208 rows and 61 columns. The last, sixty-first column tells us whether it is a rock or a mine, and there are in total 208 rows, which means 208 instances or examples of data; the other 60 columns represent the features. For example, the zeroth instance is the values for one rock: there are 60 feature values for this one rock, and it is labeled as R. Like this there are 208 instances. Now what we will do is get some statistical description of this data: sonar_data.describe(). This gives the mean, standard deviation and other measures of our data. Sorry, I just made a small typo here; sonar_data.describe(). As you can see it gives the count; count represents the number of values we have for the zeroth column, and like this all the way up to the 59th column. So for each column it gives us the count, the mean, the standard deviation, the minimum value, the 25th percentile, the 50th percentile, the 75th percentile, and the maximum value.
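The loading and exploration steps in this part can be sketched on a miniature in-memory CSV. The four rows and four columns here are made up and stand in for the real sonar data.csv, which has 208 rows and 61 columns:

```python
import io
import pandas as pd

# four-row stand-in for "sonar data.csv" (the real file: 208 rows, 61 columns)
csv_text = ("0.02,0.37,0.04,R\n"
            "0.45,0.52,0.84,M\n"
            "0.26,0.58,0.11,R\n"
            "0.49,0.19,0.63,M\n")

# header=None because the file has no row of column names;
# pandas then auto-labels the columns 0, 1, 2, ...
sonar_data = pd.read_csv(io.StringIO(csv_text), header=None)

print(sonar_data.head())       # first five rows
print(sonar_data.shape)        # (number of rows, number of columns)
print(sonar_data.describe())   # count, mean, std, min, percentiles, max per column
print(sonar_data[3].value_counts())   # class balance of the label column
```

With the real file you would pass its path from Colab's "Copy path" option instead of the `io.StringIO` buffer, and the label column index would be 60 rather than 3.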
Percentile means, for example, that the 25th percentile of 0.0133 for the first column says 25 percent of the values are less than 0.0133, and the 50th percentile says 50 percent of the values are less than 0.022. That is what a percentile is. For some use cases it is really important to find the mean and standard deviation; it gives us a better understanding of the data. Hence you can use this describe function to get some statistical measures. I'll just make a comment here: "describe gives statistical measures of the data". Now let's find how many examples there are for rock and how many for mine. We can do that with sonar_data[60].value_counts(). This value_counts function gives us how many rock and how many mine examples there are. I have to include one more thing: we need to put a 60 here, and this 60 is nothing but the column index. As you can see, the rock and mine labels are in the 60th column, so I'm specifying 60 with this value_counts function. Why am I using sonar_data? Because I have loaded the data frame into a variable called sonar_data, so I call the function on it. Now let's see how many rock and mine examples there are. As you can see, there are in total 111 examples for mine and 97 examples for rock, so it is almost equal, which is good. If we have much more data for one class, say a thousand examples for mine and only 500 examples for rock, then our predictions won't be very good. If we have an almost equal number of examples for both categories, our predictions will be better, we will get a good accuracy score, and our model will perform well. Here it is almost equal. Strictly speaking, in total there are only 208 instances, and that is not a lot for a machine learning model; we may require even thousands and several
thousands of examples for making a better model, but we are just looking at an example, so we are okay with this. You just need to note one thing: the more the data, the more accurate your model is. So I'll just note here that M represents mine and R represents rock. Now let's group this data based on mine and rock: sonar_data.groupby(60).mean(). I'll explain what the use of this is. As you can see, we got the mean value of every column for mine and for rock: for a mine the mean value of the 0th column is 0.034, but for a rock it is 0.022. There is quite a difference between these two, and this difference is really important for us, because using exactly this we are going to predict whether the object is a mine or a rock. So we just found the mean of each of the 60 feature columns: the mean values for mine are these, for rock are these, and there is quite a difference between them, which is really important. Now let's separate the data and the labels. Here, by the data I mean these numerical values; they are the features, and the last column is the label, so we need to separate them. I'll just make a comment here: "separating data and label". This is a supervised learning problem where we train our machine learning model with data and labels; in unsupervised learning we don't use labels. Here we have labels, which are nothing but rock and mine. Let's put all the data in the variable x: sonar_data, and I am going to drop the last, 60th column, so sonar_data.drop(columns=60, axis=1). I am dropping the 60th column, and if I am dropping a column I need to specify the axis as 1; if I were dropping a row I would specify axis as 0. And let's put the label in the variable y: sonar_data, and we need square brackets here, [60]. So what I'm basically doing is storing all the values in x except the
last column, so I'm dropping the 60th column, and I'm storing the sixtieth column in y. Let's print and see this: print(x) and print(y). As you can see, now there are only the 60 feature columns, indexed from 0 to 59, and the label is in the variable called y. So we have successfully split the data and the labels. Now what we will do is split this data into training and test data. Let's include a text cell here: "Training and test data". As we have already imported the function train_test_split, we will be using it to split our data. We need to give some variable names here: X_train — you can give any names, but for convenience I'm giving these — X_train, X_test, y_train and y_test. This order should be followed: first the training data and the test data, then the labels of the training data and the labels of the test data, equal to train_test_split, and we have to include x and y here, because we are going to split this x and y into training and test data: x, y. There are several parameters here, and I will explain them: x, y, test_size — let's have the test size as 0.1 — and stratify=y, and random_state equal to, let's say, 1. Okay, so now let's understand these parameters. We are going to split our data into X_train, X_test, y_train and y_test: X_train is nothing but the training data, X_test is the testing data, y_train is the labels of the training data, and y_test is the labels of the test data. We use the function train_test_split, and in the parameters we have included x and y, because we are going to split them into training and test data. Then we have the parameter test_size: if we give 0.1, it means we want 10 percent of the data to be test data. Say for example we had about 200 examples; what happens is,
10 percent of 200 is 20, so we would have about 20 test data points. That is the use of this test_size; you can use 0.1 or 0.2 based on the amount of data you have. Here we will take just 10 percent of our data as test data. Then stratify=y. Why are we using this stratify? Because we need to split the data based on rock and mine: we want an almost equal proportion of rocks in the test and training data, and likewise for mines. Hence we include stratify, so our data will be split preserving the proportion of these two labels. And then we have random_state. random_state is to split the data in a particular order: for example, if you write the same code and include 1, your data will be split in exactly the same way as mine is. If I put 2 here, my data will be split in some other specific way, and if you include 2 in your code, yours will be split in that same way too. It is basically to reproduce the code as it is; I'll use 1. Okay, so now we can split our data. Let's see how many training and test data points there are: print(x.shape) — that is the original data without splitting into train and test — and then X_train.shape and X_test.shape. I'll run this. As you can see, in the original x we have 208 examples, in the training data we have 187 instances, and in the test data we have 21 instances. So we have 21 test data points and 187 training data points. Now we need to train our machine learning model with this X_train, with this training data. Let's see how we can train our model: "Model training". We will be using a logistic regression model, so I'll create a variable called model. As you can see, we have already imported the LogisticRegression function, so model = LogisticRegression(); this will load the LogisticRegression function into the variable model.
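The rest of the workflow described in this project — split, train, evaluate, and predict a single instance — can be sketched end to end on synthetic stand-in data. The labels here follow a made-up rule on feature 0, so the exact accuracies will not match the video's 83% and 76%, but the shapes mirror the sonar dataset and the calls are the same ones used in the video:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
x = rng.random((208, 60))               # 208 instances, 60 features, like the sonar data
y = np.where(x[:, 0] > 0.5, "M", "R")   # synthetic rule standing in for the real labels

# 10% test split, stratified so both splits keep the class proportion, reproducible seed
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, stratify=y, random_state=1)
print(x.shape, x_train.shape, x_test.shape)   # (208, 60) (187, 60) (21, 60)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)             # train on the training data only

# accuracy on seen (training) vs unseen (test) data
training_data_accuracy = accuracy_score(model.predict(x_train), y_train)
test_data_accuracy = accuracy_score(model.predict(x_test), y_test)
print(training_data_accuracy, test_data_accuracy)

# predictive system for one new instance of 60 readings
input_data = tuple(x_test[0])                   # stand-in for a row copied from the CSV
input_data_as_numpy_array = np.asarray(input_data)
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)  # one row, 60 features
prediction = model.predict(input_data_reshaped)
if prediction[0] == "R":
    print("The object is a rock")
else:
    print("The object is a mine")
```

Note how 10 percent of 208 rounds up to a 21-row test set, matching the shapes printed in the video.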
Now we are going to train the logistic regression model with the training data. For that we use the function model.fit; here we need to include the training data and the training labels, which are X_train and y_train. Let's first see what this X_train and y_train are, so that you can understand: I'm printing X_train and also y_train. X_train is the training data and y_train is the training labels. As you can see we have the data here: in total there are 187 examples and 60 columns, and this is the label. As you can see it is in a kind of random order, because we have used the train test split. Now we are going to feed this training data and these training labels to our logistic regression model; that's why I have included model.fit(X_train, y_train). When I run this, our model will be trained. If you have a lot of data and you are using a complex model like a neural network it will take a lot of time, but here we have a simple example and simple data, so it doesn't take much time. So our model is trained, and these are some parameters of our model. Now let's see how our model is predicting; let's check the accuracy score of our model, so we are going to evaluate it: "Model evaluation". We have imported the function accuracy_score for this, and we will use it to find the accuracy of our model. "Accuracy on training data": first let's find the accuracy on the training data. What's happening here is that we use the same data the machine learning model has learned from, and we ask the model to predict those examples; then we will use the test data. The significance is that the model has already seen the training data, but it hasn't seen the test data. It is just like preparing for an exam: let's say you are working through all the example problems
in a mathematics book for your exam. Those example problems are nothing but the training data. In the exam a new problem will be asked and you need to solve it, but you have never seen that question, right; that is nothing but the test data. So we need to check our model's accuracy on the training data and its accuracy on the test data. In most cases the accuracy on the training data will be higher, because the model has already seen it, and most of the time the accuracy on the test data will be lower. If the accuracy of your model is somewhere greater than 70 percent, it is actually good, and it also depends on the amount of data you use. As I have told earlier, if you use many data points and a better model you will get a good accuracy score, and if you use quite a small amount of data, as in this case where we have only about 200 data points, your accuracy can be low. But the idea of this video is for you to understand how to implement these steps, so the accuracy is not that important; what we should note is that any accuracy greater than 70 percent is good. So now let's predict the training data: X_train_prediction = model.predict(X_train). "Training data accuracy": we will store the accuracy value in the variable training_data_accuracy, so training_data_accuracy = accuracy_score(X_train_prediction, y_train). X_train_prediction is the prediction our model makes based on its learning, so we need to include X_train_prediction along with all the correct values, which is y_train. What happens here is we compare them and get the accuracy score: X_train_prediction is the prediction of our model and y_train is the real answer for these instances. Similarly, we also have the test data, right: X_test, and y_test is the labels of that test data. So later, what happens is, we are going to
compare the prediction of our model with the original labels of those data points, and from that we will get the accuracy score. Let's get the accuracy score for the training data: I'll print the accuracy score, "Accuracy on training data", and copy this; it will hold the accuracy value, training_data_accuracy. As you can see, we got almost 83.4 percent accuracy, which is actually quite good for this much data. Now let's find the accuracy score on the test data. It will be the same code except for some changes; we just need to include the test data here. "Accuracy on test data": X_test_prediction, test_data_accuracy. The model has never seen this data. model.predict(X_test), and y_test. So now we are using our model to predict the test data, and this prediction will be compared to the original labels, which is y_test, giving test_data_accuracy. Now we need to print this accuracy score: "Accuracy on test data", test_data_accuracy. We got 76 percent as the accuracy score, which is really good; it means that out of 100 objects it can correctly predict about 76, whether each is a rock or a mine. So we got an accuracy score of 83 percent on the training data and 76 percent on the test data, so our model is performing fine. Now, what we are going to do is this: we have a trained logistic regression model, and we are going to make a predictive system that can predict whether the object is a rock or a mine using the sonar data. Let's see how we can make this predictive system: "Making a predictive system". We need to give the input data, so I'm making a variable called input_data. Inside these parentheses we need to include the sonar readings. We have seen this sonar data, right; we will take some examples of rock and mine and check whether our model predicts rock and mine correctly. That is the use of this code snippet. So once I
Once I complete the script, I'll copy some values in and we'll see whether it predicts correctly. Once we have the input data, we have to convert it to a numpy array, because processing on a numpy array is faster and easier; we are basically changing the data type from a Python list to a numpy array. I'll make a comment: '# changing the input data to a numpy array'. We will use the function np.asarray for this: input_data_as_numpy_array = np.asarray(input_data). If you remember, we imported the numpy library as np, so I'm using np instead of numpy; basically we are converting this list into a numpy array. Let's take an example: I'll open the data in Notepad and pick a random row. As you can see, this example is a rock, so if we feed this data to our machine learning model it should predict that it is a rock. I'll copy it and paste it into input_data; we have 60 features in total. Now we need to reshape the data, because we are predicting for only one instance; otherwise our model can be confused by the number of data points. I'll add a comment: '# reshape the numpy array as we are predicting for one instance'. So input_data_reshaped = input_data_as_numpy_array.reshape(1, -1); the 1, -1 means there is one instance and we are going to predict the label for that one instance, which is why we reshape. Once we reshape it, we make the prediction: I will create a variable called prediction and store the model's output in it; the model.predict function is used for this. We stored our trained logistic regression model in a variable called model, so I call model.predict(input_data_reshaped), where input_data_reshaped contains the features of our one example. Basically, model.predict returns either 'R' or 'M', telling us whether the object is a rock or a mine. Now let's print this prediction and run it; it should predict that the object is a rock, because we copied the data for a rock, right? I'll run this, and yes, it predicted correctly that the object is a rock. Now let's add an if condition: if we get 'R' as the prediction it should say the object is a rock, and if we get 'M', a mine. As you can see, the 'R' is returned inside a list-like array, so the condition is if prediction[0] == 'R'; I'm using index 0 because it refers to the first element of that array, which is our single prediction; if the result were not in an array, I wouldn't need the index. If the first element is 'R' we print that the object is a rock; else, when it is 'M', we print that the object is a mine. Let's run this: the first element is 'R', so it tells us the object is a rock, and we know we copied the data of a rock. Now let's check whether our model predicts a mine correctly; let's take some random row for a mine.
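The predictive-system steps above can be sketched like this (a minimal sketch: the model is trained on a tiny made-up dataset with 'R'/'M' labels, so the feature values are placeholders, not real sonar readings):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny made-up training set: 4 examples, 3 features, labels 'R' (rock) / 'M' (mine)
X = np.array([[0.1, 0.2, 0.3],
              [0.2, 0.1, 0.4],
              [0.8, 0.9, 0.7],
              [0.9, 0.8, 0.6]])
Y = np.array(['R', 'R', 'M', 'M'])

model = LogisticRegression()
model.fit(X, Y)

input_data = (0.15, 0.18, 0.35)                      # one new instance
input_data_as_numpy_array = np.asarray(input_data)   # tuple/list -> numpy array
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)  # 1 row, n columns

prediction = model.predict(input_data_reshaped)      # returns an array like ['R']
if prediction[0] == 'R':
    print('The object is a Rock')
else:
    print('The object is a Mine')
```

With the real sonar model the only differences are the 60-value input tuple and the trained classifier; the asarray → reshape(1, -1) → predict → prediction[0] pattern is identical.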
I'll take this value, which represents a mine; you can see there is an 'M'. I'll copy this value and let's see whether the model predicts it correctly; if our model is working correctly, it should output 'M', which means the object is a mine. I'll just replace the input data and run this; now it should say the object is a mine. As you can see, it is predicting correctly that the object is a mine. So this is how our predictive system works. I hope you have understood the whole process; let me give you a short recap of what we did. First we imported the dependencies: numpy is used for making arrays, pandas for making dataframes, and from scikit-learn we use the train_test_split function to split our data into training and test data; in this case we are using a logistic regression model, so we imported the LogisticRegression function from sklearn.linear_model, and we imported accuracy_score from sklearn.metrics to find the accuracy of our model. Then came data collection and processing: we imported the sonar.csv file into our Google Colab environment and loaded it into a pandas dataframe with the pd.read_csv function, storing that dataframe in a variable called sonar_data; here we give the path of the file, and since this file has no header we mention header=None. With the head function we printed the first five rows of the dataframe, and we found that the last column is a categorical column which says whether the object is a rock or a mine: 'R' represents rock and 'M' represents mine. Then we determined the number of rows and columns: we have 208 rows, which represent 208 data points, and 61 columns, which are 60 features plus one label (rock or mine). Then we used the describe function, which gives us the count (the number of values), the mean, the standard deviation, and other statistical values. Then we counted how many mine examples and how many rock examples there are and found they are almost equal. Then we grouped the data based on mine and rock and found their mean values, and there is quite a noticeable difference; as you can see, the mean values of rock and mine differ for each column. Then we split the data into features and labels, feeding all the features to the variable X and all the labels to the variable Y, and split X and Y into training and test data; the training data is used to train the model, and the model is evaluated with the help of the test data. Then we loaded our logistic regression model into the variable model, and with the function model.fit the model was trained; it is just like a graph, with the features on the x-axis and the labels on the y-axis, and the model learns the relationship between them. Once the model was trained with model.fit, we found the accuracy score: first on the training data, around 83 to 84 percent, and then on the test data, around 76 percent. Then we made a predictive system which, given the features, can predict whether the object is a rock or a mine. So these are all the steps we followed in this use case, and I hope you are clear with them. In the next project we will predict whether a person has diabetes or not, using one of the important machine learning algorithms, the support vector machine. We will do our coding in Python, but first let's try to understand the problem statement and the support vector machine
algorithm, and then we will get into the coding part. First of all, let me explain the support vector machine model. This is one of the important supervised learning algorithms. In supervised learning we feed the data to our machine learning model, and the model learns from the data and its respective labels; here the labels are the most important thing. In this case we train our model with several pieces of medical information, such as the blood glucose level and the insulin level of patients, along with whether each person has diabetes or not; that acts as the label, diabetic or non-diabetic. Once we feed this data to our support vector machine model, it tries to plot the data in a graph and find a hyperplane; in this image you can see a hyperplane, and it separates the two groups of data. Once we feed new data to the model, it tries to place that data point in one of the two groups, and by that it can predict whether the person is diabetic or non-diabetic. In this case we use several medical measurements, such as the BMI of the patient, their blood glucose level, their insulin level, and so on. Now let's understand the workflow for this project. First of all we need the diabetes data, and we will train our model with the data and its respective labels. Before feeding it to the model there are several steps in between: we need to pre-process the data, analyzing it first, and because the raw data won't be very suitable to feed to a machine learning model, we need to standardize it. There are a lot of attributes here, a lot of medical measurements, and we want all of them to lie in the same range, so we standardize the data; all of this is done in the data pre-processing step. Once we pre-process the data, we split it into training and testing data: we train our machine learning model with the training data and then find its accuracy score with the help of the test data, which tells us how well the model is performing. Once we split the data, we feed it to a support vector machine model; we'll be using a classifier, which will classify whether the patient is diabetic or non-diabetic. Once we have trained it, we have a trained support vector machine classifier, and when we give it new data it can predict whether the patient is diabetic or non-diabetic. That is the workflow we will follow for this project. Now let's get into the coding part. We will be doing it in Google Colaboratory; on our channel I have already made a video on how to access Google Colaboratory from Google Chrome, the index of that video is 2.1, and you can search for it there. First we need to import the libraries, so I'll make a text cell here to keep things clear: 'Importing the dependencies'. First let's import numpy: import numpy as np. Next pandas: import pandas as pd. We will need to standardize the data, so for that we need a standardization function: from sklearn.preprocessing import StandardScaler; this StandardScaler will be used to standardize the data to a common range. Then from sklearn.model_selection import train_test_split, which we will use to split our data into training data and test data. Then from sklearn import svm; SVM stands for support vector machine. Finally we import the accuracy score: from sklearn.metrics import accuracy_score. So numpy is used to make numpy arrays, which are very helpful for fast processing.
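Collected together, the import cell described above would look something like this (a sketch of the dependency imports):

```python
import numpy as np                                    # arrays for fast numeric processing
import pandas as pd                                   # dataframes for tabular data
from sklearn.preprocessing import StandardScaler      # rescale features to a common range
from sklearn.model_selection import train_test_split  # split into training and test sets
from sklearn import svm                               # support vector machine models
from sklearn.metrics import accuracy_score            # compare predictions with labels
```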
Pandas is mainly used for creating dataframes, which put our data into a nice structured table; StandardScaler standardizes our data; train_test_split splits our data into training and test data; and of course we have imported the support vector machine model and the accuracy_score function. I'll run this by pressing Shift+Enter, which runs the cell and automatically moves to the next one. Now we do the data collection part: 'Data collection and analysis'. I'll upload the data file, diabetes.csv, here; I'll give the link for this file in the description of this video. This dataset is called the Pima Indians Diabetes dataset, so I'll mention that here; you can also find it in the UCI machine learning repository and on Kaggle, so if you search the internet you will find it. Basically it contains the information of patients who have diabetes and those who don't, and it contains only the data of females; there is no data for males. It has various parameters, such as the number of pregnancies they have gone through, the blood glucose level, the insulin level, etc. Now let's load the data in this CSV file into a pandas dataframe; I'll make a comment: 'loading the diabetes dataset to a pandas dataframe'. I'll create a variable called diabetes_dataset and store the dataframe in it: diabetes_dataset = pd.read_csv(...). This loads the CSV, a comma-separated values file, into a pandas dataframe. I need to copy the path of the file: go to the folders panel, choose 'Copy path' from the options, paste it here, and run the cell. You can also run it by pressing the run symbol. In some cases you may not be sure what a particular function does; say you don't know what pd.read_csv does. You can get the explanation for a function by typing it with a question mark, pd.read_csv?, and running the cell; you will get a help section showing the function and the parameters you can pass. You don't need to include all those parameters, just the main ones, such as the location of the file. You can read what the function is used for: it reads a comma-separated values file into a dataframe, and you can see the additional information and the parameters there. This feature is very helpful for understanding what a particular function does: just mention the function and add a question mark with it. It is one of the important features of Google Colaboratory. Now let's print the first five rows of this dataframe using the head function; mention the dataframe name, diabetes_dataset, with a comment, 'printing the first five rows of the dataset', and run the cell. This gives the first five rows of our dataframe. As I told you, this dataset contains only data for females: you can see the number of pregnancies, the blood glucose level, blood pressure, and skin thickness. The skin thickness is measured at the triceps and basically tells whether fat is stored in that area. Then we have the serum insulin level, and the body mass index, which is calculated by dividing the weight by the height squared. Then we have the diabetes pedigree function, which is
basically a number indicating a hereditary diabetes factor. Then we have age, and finally the outcome. The outcome column is the label, where 1 represents that the patient is diabetic and 0 represents that the patient is non-diabetic; all these label values will be either 1 or 0, and we need to develop a system that can classify the data into 1 or 0. Now let's see the number of rows and columns of this dataset. I hope every one of you knows this, but I'm mentioning it for new users who are just starting to learn Python: we use the hash symbol (#) in Python to write a comment. If I remove it, Python will treat the text as part of the code; if I include it, Python knows we are writing a comment, some information about what the code does. So you can describe what you are going to do by preceding the line with a hash. We are going to get the number of rows and columns of this dataframe: '# number of rows and columns in this dataset', then diabetes_dataset.shape. As you can see, we have 768 rows and 9 columns. 768 rows means we have 768 examples, so the data is taken from 768 people, and the 9 columns represent the features, also called parameters or attributes, the medical measurements we will use for our prediction; leave out the last column, because that is the label, and the remaining eight are the features we need. Now let's get some statistical measures of this dataframe: '# getting the statistical measures of the data'. You can use the describe function, which gives various statistical measures such as the mean of the data, the standard deviation, the percentiles, and so on, for all the columns we have. First, count tells the number of data points, and mean gives the mean value of each column; for example, we find that the mean glucose value across this data is about 120. Then we have the standard deviation, the minimum value, the 25th percentile, the 50th percentile, the 75th percentile, and the maximum value. Percentile basically means, for example, that the 25th percentile of glucose is 99, so 25 percent of the glucose values are less than 99; for blood pressure, 25 percent of the values are less than 62, and the 50th percentile means 50 percent of the blood pressure values are less than 72. That is why it is called a percentile, and it is different from a percentage. Now let's see how many diabetic and non-diabetic examples there are: diabetes_dataset['Outcome'].value_counts(). This value_counts function takes the Outcome column and checks how many examples there are for the label 1 and for the label 0. I'll run this: we have 500 values with the label 0, the non-diabetic cases, and 268 values with the label 1, the people who are diabetic, so the proportion of non-diabetic cases is higher in this dataset. This is a pretty small dataset; in total we have 768 examples, whereas in machine learning and deep learning projects we generally use thousands or even lakhs of data points. It depends on the dataset available to us, and the idea of this use case is for you to understand what approach to take when solving a machine learning project, so for that purpose this data is sufficient. Now let me make a text cell here.
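The exploration steps above (shape, describe, value_counts, groupby) can be sketched like this; a tiny made-up dataframe stands in for diabetes.csv, so the printed numbers will not match the video:

```python
import pandas as pd

# tiny made-up stand-in for the Pima diabetes dataframe (hypothetical values)
diabetes_dataset = pd.DataFrame({
    'Glucose':       [148, 85, 183, 89, 137, 116],
    'BloodPressure': [72, 66, 64, 66, 40, 74],
    'Outcome':       [1, 0, 1, 0, 1, 0],
})

print(diabetes_dataset.shape)                     # (rows, columns)
print(diabetes_dataset.describe())                # count, mean, std, percentiles, ...
print(diabetes_dataset['Outcome'].value_counts()) # number of examples per label
print(diabetes_dataset.groupby('Outcome').mean()) # per-label mean of every column
```

On the real file these are the exact same calls; only the dataframe comes from pd.read_csv('diabetes.csv') instead.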
We know that the label 0 represents non-diabetic people and the label 1 represents people with diabetes, so: 0 --> non-diabetic, 1 --> diabetic. Now we can get the mean of every column for labels 0 and 1: diabetes_dataset.groupby('Outcome').mean(). This gives the mean values for both cases, calculated over the whole dataset; for example, we find that the mean glucose value for the non-diabetic people is about 110, while the mean glucose value for the diabetic people is about 141. It is obvious that diabetic people have a higher glucose level in their blood, right? Then we have the blood pressure values and so on. This difference is very important for us, because this is what the machine learning algorithm will pick up: if a person's values lie in one range it can predict they are non-diabetic, and if the values lie in another range, say a glucose level of 140 or 150, it can predict the person has diabetes. You can also look at age: it is obvious that older people are more susceptible to diabetes. These are some important insights we can get from the data, and it's a good practice to always group the dataset based on its labels. Now let's separate the data and the labels: we want all the columns except the label in one place and the label separately. I'll make a comment, '# separating the data and labels', create a variable called X, and store all the data except the labels in it: X = diabetes_dataset.drop(columns='Outcome', axis=1), and Y = diabetes_dataset['Outcome']. Basically, we are taking the dataframe we loaded the dataset into, diabetes_dataset, and dropping the particular column Outcome, mentioning axis=1. When you use this drop function you need to specify axis=1 if you are dropping a column and axis=0 if you are dropping a row. So X gets all the values except the label, and all the labels are stored in the variable Y: I am taking the diabetes_dataset, selecting only the Outcome column, and storing it in Y. I'll run this, and it separates the dataset. Now let's print X and Y. Printing X, you can see we have all the data except Outcome, the label; in the previous dataframe the outcome and the data were together, and now the data is separated into X. Printing Y, we see the labels as 1 or 0, where 1 represents diabetic patients and 0 represents non-diabetic patients. In total we have 768 rows; the last index shown is 767 because in Python indexing starts from zero, but there are 768 rows. Now we are going to standardize the data, one of the important steps in data pre-processing: 'Data standardization'. Why are we doing this? We have the number of pregnancies, the glucose value, the blood pressure value, BMI, and so on: pregnancies are in the range of one, two, or three; glucose is around 100 to 150; blood pressure is around 60 to 70; BMI is in the 25 to 30 range. With this much difference in the ranges of the values, it will be difficult for our machine learning model to make good predictions.
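The feature/label separation can be sketched like this (a minimal sketch on a tiny made-up dataframe):

```python
import pandas as pd

# tiny made-up stand-in for the diabetes dataframe (hypothetical values)
diabetes_dataset = pd.DataFrame({
    'Glucose': [148, 85, 183, 89],
    'BMI':     [33.6, 26.6, 23.3, 28.1],
    'Outcome': [1, 0, 1, 0],
})

# axis=1 -> dropping a column; axis=0 would drop a row
X = diabetes_dataset.drop(columns='Outcome', axis=1)  # all features, label removed
Y = diabetes_dataset['Outcome']                       # labels only

print(X.columns.tolist())  # ['Glucose', 'BMI']
print(Y.tolist())          # [1, 0, 1, 0]
```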
So in most cases what we do is standardize the data to a common range, which helps our machine learning model make better predictions, and that is what we will do here. If you remember, we imported a function called StandardScaler, and we are going to use it for this purpose. Let me create a variable called scaler and load the StandardScaler: scaler = StandardScaler(). You need to include the parentheses; this creates one instance of the StandardScaler class. I'll run this. Now we fit the data to this scaler: scaler.fit(X). Then we transform the data, creating another variable: standardized_data = scaler.transform(X). Basically, we are fitting all this inconsistent data with our StandardScaler and, based on that standardization, transforming all the data to a common range. Instead of calling fit and transform separately, you can also use scaler.fit_transform(X), which fits the data and does the transformation in a single step; both approaches give the same result, so you can do either. This fits and transforms the data so we get it in the same range; I'll run it. Now let's print the standardized data: as you can see, all the values are now on a similar scale (standardized to roughly zero mean and unit variance), which will help our model make better predictions because all the values are in a similar range. Now we can simplify things by assigning this standardized data back to the variable X, and Y is again the labels.
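The standardization step can be sketched like this (a minimal sketch on a tiny made-up feature matrix with very different column ranges):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# tiny made-up feature matrix: pregnancies-like, glucose-like, blood-pressure-like columns
X = np.array([[1, 100, 70],
              [3, 150, 60],
              [2, 120, 80],
              [5, 140, 65]], dtype=float)

scaler = StandardScaler()
scaler.fit(X)                            # learn each column's mean and std
standardized_data = scaler.transform(X)  # rescale to zero mean, unit variance

# equivalently in one step: standardized_data = scaler.fit_transform(X)
print(standardized_data.mean(axis=0))    # ~0 for every column
print(standardized_data.std(axis=0))     # ~1 for every column
```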
Y is diabetes_dataset['Outcome']; actually we have already done that, so there is no need to repeat it, I'm just doing it so that you remember, and it doesn't change anything. We are taking all this standardized data and feeding it to the variable X, and we are taking the labels, the Outcome column, and storing them in Y; we will use this X and Y to train our machine learning model. So basically X represents the data and Y represents the labels. Let's run this, and let me print X and Y: now we have all the data in X and all the labels in Y. Next we need to split our data into training data and test data: 'Train test split'. We mention four variables here: X_train, X_test, Y_train, Y_test. Let me first write the function and then I'll explain what it does. We use the train_test_split function, which we imported from the module sklearn.model_selection; the important arguments are X and Y, where X is the data and Y the labels, and test_size, which I'll give as test_size=0.2, then stratify=Y and random_state=2; I'll explain all these parameters in a minute. What do these four variables mean? We have X_train and X_test: the X data will be split into two arrays, the first X_train and the second X_test. We will train our machine learning model with the X_train data, and once the model is trained we will evaluate it with the test data. You can compare it with this example: consider a person studying for a maths exam; he solves several questions from a particular book, and the questions he practices are like the training data; in the exam he will get new or different questions that he hasn't solved before, and those questions are the test data. That is the same situation here: we train our model with the training data and test it with the test data. The idea is that the machine learning model should not see the test data; we want it to make predictions on unknown data, and we will measure its accuracy there. That is the reason for having X_train and X_test. Y_train contains the labels for the X_train data, and Y_test contains the labels, 1 or 0 for diabetic or non-diabetic, for the X_test data. So the train_test_split function gives four outputs, stored as X_train, X_test, Y_train, and Y_test. Then we pass the arguments: X and Y, the entire dataset and the entire set of labels, from which X_train, X_test, Y_train, and Y_test are split. Then test_size=0.2: 0.2 represents 20 percent of the data, and this parameter says how much of the data you want as test data, so here I keep 20 percent as test data. You could also give 0.1, which would mean 90 percent of the data is training data and 10 percent is test data. Then we stratify based on Y: Y has the values 1 or 0, and we want the split to preserve that proportion, so we pass stratify=Y; if we don't mention it, there is a chance that all the diabetic cases go to X_train and all the non-diabetic cases go to X_test. That is why we use stratify: each split keeps a similar proportion of diabetic and non-diabetic cases, as in the original dataset. Then we have random_state: let's say you are writing this same code and want to split the data exactly the way I did; in that case you mention the number 2. If I mention the number 1, the splitting will be different; the number is like an index or serial number for a particular split, so if I use 2 and you also use 2, your data will be split in exactly the same way as mine. This is basically for reproducing a result. I'll run this, and it splits our data. Now we can check the shapes of the original dataset X, X_train, and X_test: print X.shape, X_train.shape, and X_test.shape. Running this, you can see there are 768 examples in our original dataset in total; out of those, 614 will be used as training data and 154 will be our test data. That is a good proportion, with 20 percent of the data as test data. Now we are going to train the model, so let me make a text cell: 'Training the model'. I'll create a variable called classifier: classifier = svm.SVC(kernel='linear'). SVC stands for support vector classifier, and we pass the parameter kernel='linear' because we are going to use a linear model.
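The splitting step can be sketched like this (a minimal sketch: synthetic data stands in for the 768-row diabetes set, but the class balance and the parameters match the ones discussed above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in: 768 examples, 8 features, binary labels
rng = np.random.default_rng(2)
X = rng.random((768, 8))
Y = np.array([0] * 500 + [1] * 268)  # same 500:268 class balance as the Pima dataset

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,   # 20% of the rows become test data
    stratify=Y,      # keep the 500:268 label proportion in both splits
    random_state=2)  # fixed seed so the split is reproducible

print(X.shape, X_train.shape, X_test.shape)  # (768, 8) (614, 8) (154, 8)
```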
are going to use a linear model. Okay, so we are using a support vector machine classifier, and let me run this — this will load the SVM model into this variable, classifier. Now we will fit our training data to this classifier: training the support vector machine classifier. So mention this variable, which is classifier, dot fit — this is the training part. We have to mention the training data here; the training data is X_train, and the labels for this training data are Y_train, so we need to mention both of them: X_train comma Y_train. This basically represents the training data and the labels for that training data. This is a small dataset, so we don't need much time for training; if you are using thousands or lakhs of training examples, it will take a long time to do the training. So this has trained our machine learning model, and now we can evaluate it. Basically, evaluation is to check how often our model predicts correctly: model evaluation. We will be finding the accuracy score, so first let's find the accuracy score on the training data. We will try to predict the labels for all of this training data — we won't give the machine learning model the labels — and then compare our model's predictions to the original labels, which are Y_train, and compute the accuracy score.

So: accuracy score on the training data. Let's make a variable called X_train_prediction. X_train_prediction is equal to classifier dot — it's not accuracy yet, we need to make the predictions first — classifier.predict(X_train). This will predict the labels for all of X_train, and we store all these labels in this X_train_prediction variable. Now we need to find the training data accuracy: training_data_accuracy is equal to — we have imported the function accuracy_score here, so we are going to use that — accuracy_score of X_train_prediction and the original labels, which is Y_train. What we are doing is using our trained machine learning model: once we fit the model here, it is trained, and this trained model is stored in the variable classifier. We use that model to predict the labels for X_train, store all those predictions in X_train_prediction, and compare them with the original labels, Y_train. This gives us the accuracy score of our model. Let's print this accuracy score: print, accuracy score of the training data. If the accuracy score is above 75 percent, it's pretty good, and in this case we are using a fairly small amount of data, so there is a chance we may get a somewhat low score. If your accuracy score is greater than 75 percent, that's pretty good, because you can then use other optimization techniques to increase it further. Now let's see whether our accuracy score is above 75 percent or below it. Accuracy score on the training data — let's print it here: training data accuracy. As you can see, our accuracy score is 78.6, which is almost 79, and that is pretty good: it means that out of a hundred predictions, our model gets 79 correct. Now we need to find the accuracy score on the test data. This is the important step, because the model has already seen the training data — we trained the model with it — and it doesn't make sense to evaluate our model only on that. We need to use the model to predict some unseen data; that tells us how well our model is really performing. It's similar to an exam, where the student is exposed to questions that they have
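The training-and-evaluation step described above can be sketched as a minimal, self-contained example. The tiny X_train / Y_train arrays here are made-up stand-ins for the standardized diabetes features, not the actual dataset:

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score

# tiny made-up stand-in for the standardized diabetes features
X_train = np.array([[0.10, 0.2], [0.90, 0.8], [0.20, 0.9],
                    [0.80, 0.1], [0.15, 0.5], [0.85, 0.5]])
Y_train = np.array([0, 1, 0, 1, 0, 1])

classifier = svm.SVC(kernel='linear')   # linear support vector machine
classifier.fit(X_train, Y_train)        # the training step

# predict the labels for the data the model was trained on ...
X_train_prediction = classifier.predict(X_train)
# ... and compare them against the original labels
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy score of the training data :', training_data_accuracy)
```

On the real 768-row dataset the same three calls (fit, predict, accuracy_score) apply unchanged; only the data differs.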
not practiced. Okay, now let's find the accuracy score on the test data. Let me copy this and change it for the test data: X_test_prediction, and this will be test_data_accuracy — we need to change all of these to use X_test and Y_test. You can see here that we split the data into X_train, X_test, Y_train and Y_test; we trained our model with X_train, and now we test it with X_test. That is what I am doing here. Let's compute the accuracy score on this data and print it again, this time for the test data: accuracy score of the test data. I'll print this value here: test data accuracy. The accuracy score is 77, which is again pretty good for this small amount of data. So we are getting an accuracy score of 78 on the training data and 77 on the test data, and that is good evidence that the model has not overtrained. Overtraining means the model learns the training data so closely that it cannot perform well on the test data; in that case the training accuracy will be very high and the test accuracy will be very low. This concept is known as overfitting. We will cover all of those theories and concepts in a later module of this course, and we will also look in detail at all those models — support vector machine, logistic regression and so on — in later videos. So we have found the accuracy score for the training and test data; now we need to make a predictive system that can predict whether a person has diabetes or not, given all these values. We have all these features: pregnancies, glucose, blood pressure, skin thickness, insulin, BMI and diabetes pedigree function. Once
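The train-vs-test comparison used above to check for overfitting can be sketched end to end. Since the actual CSV isn't available here, this sketch substitutes a synthetic dataset from sklearn's make_classification — an assumption, not the real diabetes data:

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for the diabetes data: 200 rows, 8 feature columns
X, Y = make_classification(n_samples=200, n_features=8, random_state=2)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2)

classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)

training_data_accuracy = accuracy_score(Y_train, classifier.predict(X_train))
test_data_accuracy = accuracy_score(Y_test, classifier.predict(X_test))
print('train:', training_data_accuracy, 'test:', test_data_accuracy)
# train and test scores close to each other suggest the model has not overfit
```

A large gap between the two scores (high train, low test) is the overfitting signature described in the transcript.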
we give all these values, our machine learning model has to predict whether that person has diabetes or not. Now we are going to build that system: making a predictive system. Input data — in this array called input_data we need to give all that medical information: the glucose level, insulin and so on. This will be the input data; we will fill it in later. Then we have to change this input data into a numpy array. For example, let's get some data — I have the dataset here; let me open it in Notepad. We need to give these medical values, and the model has to predict whether the label is 0 or 1; 1 represents diabetic patients and 0 represents non-diabetic patients. Let's select a random example — I'll pick this row. As you can see, the label for this particular data point is 0: the last value represents the outcome, and here the outcome is 0, which means the person is non-diabetic. We are not including that outcome value; we take all the other values, feed them to our machine learning model, and it needs to predict that the outcome is 0, meaning the patient is non-diabetic. I'll copy these values and paste them into this input_data. Now I'll change this input data to a numpy array — this is basically a list, a list data type, and we are going to change it to a numpy array because processing a numpy array is easier and more efficient. Changing the input data to a numpy array: we have already imported numpy as np. You can follow this same procedure for different projects where you want to make a predictive system; it's just like a blueprint, and it remains almost the same across different predictive systems. So: input_data_as_numpy_array = np.as
array. This asarray function will convert the list to an array: np.asarray, and we need to mention the list name, which is input_data. I'm pasting it here, and this will convert the data to a numpy array. Now we need to reshape this data — reshape the array, as we are predicting for one instance. What is the reason for this reshaping? Our model is trained on 768 examples, with eight feature columns in the training data, but in this case we are giving just one data point. If we don't reshape the array, the model expects data shaped like the 768 training examples, but we are giving just one, and this confuses the model. Hence we reshape the array, and this tells the model that we only need the prediction for a single data point. So we create a numpy array named input_data_reshaped from the numpy array — we cannot easily do this reshape on a plain list, which is why we use a numpy array; reshaping is much easier there. reshape is the function we are going to use, and it belongs to the numpy array: reshape(1, -1). This parameter tells the model that we are not giving 768 examples — we are predicting the label for only one instance. So this will reshape it. Now there is one more important thing: we cannot feed these values as they are. Why? Because while training our model we standardized the data — we didn't use the raw data as such. We have to do the same procedure here, because if we give the raw data as it is, our model cannot make correct predictions. We need to standardize this data in the same manner as we standardized our training data. We have already fitted the scaler to X, which is the training data, and we need to use the
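The asarray-then-reshape step above can be sketched in a few lines. The eight values follow the dataset's column order but are illustrative here, not taken from a verified row of the file:

```python
import numpy as np

# eight feature values in the dataset's column order (illustrative values)
input_data = (5, 166, 72, 19, 175, 25.8, 0.587, 51)

input_data_as_numpy_array = np.asarray(input_data)
print(input_data_as_numpy_array.shape)   # (8,) — a flat array of 8 values

# reshape to one row (one instance) with 8 feature columns;
# -1 lets numpy work out the column count on its own
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)
print(input_data_reshaped.shape)         # (1, 8) — one instance, 8 features
```

The (1, 8) shape is what tells the model it is being asked about exactly one instance rather than a whole batch.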
same scaler object here. You don't have to fit it again; you just need to transform the new data with that scaler. I'll make a comment: standardize the input data. Let me make a variable called std_data: std_data is equal to — we have already created the scaler object — scaler.transform(input_data_reshaped). We transform this data and store it in std_data, and now we can feed this standardized data to make the prediction. Let me print std_data as well. Now let's make the prediction: prediction is equal to — we have trained the model and stored it in the variable called classifier, which holds the trained support vector machine model, so we need to use that variable — classifier, and for predicting we use the function predict: classifier.predict(std_data). So we standardize the data and feed the standardized data to our machine learning model. Note that we haven't included the label here, so the model has to predict the label on its own. prediction = classifier.predict(std_data), and I'll also print the prediction. Now the model should predict that the person is non-diabetic, because we took the example of a non-diabetic case — you can see the label is 0 — so the model should give the label as 0, right? Let's run this — okay, I got an error because I made a spelling mistake in asarray; let me fix it and run it again. These first two lines represent our standardized data: this is the data we are giving, standardized by our StandardScaler, and you can see it is almost all in the same range. This value is fed to our trained machine learning model, stored in the variable classifier, and we predict the label for this standardized data — and we got the prediction from the model as 0. You can check that we took the values for a 0, non-diabetic case, and our model has predicted the label correctly. So this is how you can use machine learning to predict whether a person is diabetic or non-diabetic.

Let's do one more thing: if this value is 0, we print that the person is non-diabetic, and if the label is 1, we say that the person is diabetic. We can use a simple if statement for that. The prediction value is stored in the variable called prediction, and it is stored as an array. So: if prediction[0] == 0, print 'The person is not diabetic'; else — the other case, where the value will be 1 — print 'The person is diabetic'. Now, why am I indexing with prediction and 0? Basically, predict doesn't return an integer; it returns an array, and this array has only one element. That's why I mention the 0: if I write the array name and index 0, it means I want the first value. The first value here is 0, so if the first value in prediction is 0, we print that the person is not diabetic; otherwise we print that the person is diabetic. The important point is that the output from our machine learning model, classifier, is an array and not an integer — it doesn't give you a bare 0 or 1; it gives 0 or 1 inside an array. You can tell it's an array because the output is surrounded by
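The complete predictive system walked through above — transform with the already-fitted scaler, predict, then branch on prediction[0] — can be sketched self-contained. The two-feature toy data here is a made-up stand-in for the diabetes features, so the sketch can run without the real file:

```python
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler

# toy training data (2 hypothetical features): label 1 when the first
# feature is large, label 0 when it is small
X = np.array([[10.0, 1.0], [12.0, 2.0], [1.0, 1.5],
              [2.0, 0.5], [11.0, 1.5], [1.5, 2.5]])
Y = np.array([1, 1, 0, 0, 1, 0])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)          # fit the scaler on training data only

classifier = svm.SVC(kernel='linear')
classifier.fit(X_std, Y)

input_data = (1.8, 1.0)                  # a new, unseen instance
input_data_reshaped = np.asarray(input_data).reshape(1, -1)
std_data = scaler.transform(input_data_reshaped)  # transform only, never refit
prediction = classifier.predict(std_data)         # a 1-element array, not an int

if prediction[0] == 0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')
```

The key detail matching the transcript: scaler.transform is called on the new input (not fit_transform), and the prediction is indexed with [0] because predict returns an array.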
found the number of rows and columns in this dataset, and some statistical measures like the mean, standard deviation, percentiles and so on. Then we counted the number of values for the diabetic and non-diabetic cases — here the label 0 represents a person who doesn't have diabetes and 1 represents those who do. Then we grouped the data based on the labels 0 and 1, finding the mean values for patients with and without diabetes, and we found that there is a difference in their glucose levels: the glucose level is higher for patients with diabetes, and the age is also higher for people who have diabetes. That's one of the important insights we get from this data. Then we separated the data and the labels, and one of the most important steps we did was standardizing all the data, because the features were in different ranges — we standardized them to a common range. Once we standardized the data, we split it into training data and test data, taking 20 percent of our data as test data. Then we loaded our model, a support vector machine classifier, into the variable classifier and trained it with X_train and its respective labels, Y_train. Once we trained the model, we found the accuracy score on both the training data and the test data, and once we had a good accuracy score, we made a predictive system that can predict whether a person has diabetes or not from the given input data. In this, we reshaped the array to tell the machine learning model that we are only predicting the output of a single instance, and then we printed the prediction and wrote a simple if statement: if the label is 1, we print that the person is diabetic, and if the label is 0, the person is non-diabetic. Next, we are going to
discuss how we can build a spam mail prediction system using machine learning with Python. This is one of the most important and interesting applications of machine learning, as spam is something we come across in day-to-day life, right? In this video, let's try to understand how we can use machine learning effectively to predict which mails are spam mails and which mails are non-spam mails. That is the end goal of this particular project. First of all, I will explain the problem statement, then we can discuss the workflow we are going to follow for this project, and then we can move on to the hands-on part, where we will build a machine learning system that can make this prediction. So let's get started.

As I told you, spam mail is something we face in our day-to-day life: in a day you can receive multiple mails, and more than 50 percent of them can be spam. These can be mails saying you have a job offer, or that you can get a discount or a special offer, something like that, and most of the time they won't be true. If you use email apps like Gmail, those apps can find and classify which mails are spam and which are not, and the spam mails end up in the spam folder. We are going to build a similar system using machine learning that can correctly predict which mails are spam mails and which are non-spam mails. So we can classify mails into two types: one is spam mail and the other is ham mail. A spam mail is, as I said before, a mail that claims to give you offers and gifts, and most of the time those claims won't be true; they can also be some kind of promotion. I have given you an example here — this mail says: "Free entry in to a weekly competition to win FA Cup final tickets, 21st May 2005. Text FA to 87121 to receive entry question." You may receive these kinds of mails, right? This is an example of a spam mail, which is most probably a false one — it won't be true most of the time. And there is the other kind of mail, called ham mail. Ham mails are nothing but non-spam mails; they can be the mails sent to you by your family members, your friends or your co-workers, and so on. If you read this message — "Please go ahead with Watts. I just wanted to be sure. Do have a great weekend. Abiola" — we can see this mail was simply sent by a friend. So that's how you can classify mails: one type is the spam mails and the other is the non-spam mails, and the non-spam mails are called ham mails. That is what we are going to do in this project: we look at a mail and predict whether it comes under spam mail or under ham mail. This helps us determine which mails should go to the spam folder and which should come to your inbox — that is the end goal of this project.

Now let's see how we are going to do this. This is the workflow we are going to follow. First, we need to get the mail data — we need data for both spam mails and ham mails, and we will use this data to train our machine learning model. But we cannot do it directly; first we need to process this data, so the second step will be data pre-processing. As you might know, it is easy for a machine or a computer to understand numbers, but very tough for it to understand text and paragraphs. So we will do some processing where we convert this text — mails are text — into more meaningful numbers, and that will be done in the data pre-processing part. After that, we will split our dataset into training data and test data, where we know that the
training data is used to train our machine learning model and the test data is used to evaluate it. Once we split our original dataset into training data and test data, we feed it to our logistic regression model — the training data will be used to train this logistic regression model. In this case we are using logistic regression because logistic regression models are among the best when it comes to binary classification problems. Binary classification means there are two classes and we are trying to classify the data into those two classes; the two classes here are spam mail and ham mail. We will train this logistic regression model with the training data, and once you have done that, you will have a trained logistic regression model. Now, when you give it a new mail, your logistic regression model will predict whether that mail is a spam mail or a ham mail. That is what we are going to do in this video: first we get the mail data, then once we process it, we split it into training data and test data, and once we train this logistic regression model, when you give it a new mail it will predict whether that mail is spam or ham. So this is the procedure we are going to follow, and with that understanding, we can move on to the hands-on part.

I'll open my Google Colaboratory — I have connected my Google Colab runtime here — and first of all we need to upload our dataset to the Google Colaboratory environment. Just a second, I'll close this. The first step is uploading the dataset to this Colab environment. You can go to the Files option; there you will see an option called "Upload to session storage", or you can right-click and choose the upload option, and then upload the mail dataset. This is the dataset that I have; the name of the file is mail_data.csv. CSV
stands for comma-separated values. I will give the link for this dataset file in the video description so you can download it from there, and you can also get this dataset from Kaggle — it is available on Kaggle as well. Once you have uploaded the dataset, we can start the coding part. If you are not sure about Google Colaboratory — if you haven't worked in it before — I'll give you a link in the cards where I have explained what Google Colaboratory is and how you can work with it, so you can watch that video. Now we can get started with our coding. The first part will be importing the dependencies — dependencies are nothing but the libraries and functions that we need. Here I'll create a text cell: importing the dependencies. We need to import some libraries, so first of all I'll import numpy as np — these are some very important libraries that we generally use in machine learning. Second, I'll import pandas: import pandas as pd. The numpy library is used to create numpy arrays; in most cases we need to create arrays, and that's why we need numpy. I am importing numpy in the short form np, which is the general convention, and I'm importing pandas as pd. The pandas library is used to create data frames — as you can see, this dataset is in CSV format, and it is not easy to analyze data straight from a CSV file, so we need to put it into a more structured table, and that is what pandas is for. Pandas helps us create data frames, which help us structure our data well; that's why we are importing it. Next we will import from sklearn — sklearn is another important library used in machine learning and data science applications. From sklearn dot
model_selection, I am going to import the train_test_split function. As I told you before, we need to split our dataset into training data and test data, and for that we need this train_test_split function. Next we are going to import a vectorizer function: from sklearn.feature_extraction.text import TfidfVectorizer. The purpose of this TfidfVectorizer is, as I said earlier, to convert the text data — the data in this case is the mail data — into numerical values. We will convert the text into more meaningful numbers so that our machine learning model can understand it; if you just feed raw text, the model cannot understand it. That's why we are converting this text data into numerical values, and for that we are using the TfidfVectorizer, which converts the text into feature vectors — feature vectors are nothing but numerical values. That's why we are importing this TfidfVectorizer function from sklearn.feature_extraction.text. Now we are going to import our logistic regression function: from sklearn.linear_model import LogisticRegression. I have already made a video on the intuition behind logistic regression and how you can build a logistic regression model from scratch; if you want to see that video, you can check out my YouTube channel. In this case we are going to use a logistic regression model to classify a mail as spam or ham. Then we are going to import accuracy_score from sklearn.metrics. As I told you before, we will split the dataset into training data and test data; the training data will be used to train our logistic regression model, and once we do that,
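Collected in one place, the import cell described above looks like this (a sketch of the dependency list named in the transcript):

```python
import numpy as np                                          # numpy arrays
import pandas as pd                                         # data frames
from sklearn.model_selection import train_test_split        # train/test split
from sklearn.feature_extraction.text import TfidfVectorizer # text -> numbers
from sklearn.linear_model import LogisticRegression         # the classifier
from sklearn.metrics import accuracy_score                  # model evaluation
```
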
we will use the test data to evaluate our model, and that is why we are importing this accuracy_score function — accuracy_score is used to evaluate the model, to find how well it is performing and how many correct predictions it makes. So these are the dependencies and libraries we need, and I am going to run this cell — to run a cell you can press Shift plus Enter, and it will execute the cell and move to the next one. The first part is done, and the next part will be data collection and pre-processing, so I'll make a text cell: data collection and pre-processing. The first step is to load the data from the CSV file — we have this mail_data.csv file, right? We will load the data from this CSV file into a pandas data frame; we have already imported pandas as pd, so that is our next step. I'll make a comment: loading the data from the CSV file to a pandas data frame. I'll name this data frame raw_mail_data: raw_mail_data = pd.read_csv — pd represents pandas, since we imported pandas in the short form pd — and this read_csv function will load the data from the CSV file into a data frame. Inside read_csv we need quotes, and within these quotes you give the location of your dataset file. Once you upload the dataset file, you can click the options menu next to it, find the "Copy path" option, copy the path from there, and paste it inside these quotes. Now I'll run this cell, and it will load the data from my CSV file into this raw_mail_data data frame. You can print raw_mail_data — I'll run this, and you will see your dataset here. The first column is Category, which says whether it is a spam mail or a ham mail, and
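The loading step can be sketched self-contained. In the video the call is pd.read_csv with the path to mail_data.csv; since that file isn't available here, the sketch reads a few sample-style lines from an in-memory string instead — the lines are illustrative, not the dataset's actual contents:

```python
import io
import pandas as pd

# stand-in for pd.read_csv('mail_data.csv'): a few sample-style lines
# kept in memory so this sketch can run without the actual file
csv_text = """Category,Message
ham,Go until jurong point
spam,Free entry in 2 a wkly comp
ham,Ok lar joking wif u oni
"""
raw_mail_data = pd.read_csv(io.StringIO(csv_text))
print(raw_mail_data.head())
print(raw_mail_data.shape)
```

With the real file you would simply pass its path string to pd.read_csv in place of the StringIO object.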
the second column is Message — these are the mails that we have. Now, there is a bit of a problem here: this dataset contains some missing values, or null values, and we need to replace them with null strings. That will be our next part: replacing the null values with a null string. Let's see how we can do that. We have this raw_mail_data, and I am going to take this data frame and replace all the null values with a null string — you can think of null values as missing values. I'll create a new data frame named mail_data: mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), '') — raw_mail_data is the data frame we created before, and pd is pandas. Note that there shouldn't be any space between these quotes; you just put a pair of quotes with nothing inside. This where function applies a condition, and the condition here is: wherever there are null values, replace them with this empty string, or null string — we can call it an empty string because it doesn't contain anything, it is just empty. It's not a double-quote character, by the way — it's an opening quote and a closing quote, so please don't type a double-quote character here; you have to use an opening single quote and a closing single quote. So mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), ''), and this will replace all the null values with the null string. Let's run this — I'll press Shift plus Enter — and now we can print the first five rows of this data frame.
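The null-replacement step above can be shown on a tiny frame with deliberately missing entries (made-up rows, purely for illustration):

```python
import pandas as pd

# small frame with deliberately missing entries (None) in both columns
raw_mail_data = pd.DataFrame({
    'Category': ['ham', 'spam', None],
    'Message': ['hello there', None, 'win a prize'],
})

# where() keeps values wherever pd.notnull() is True and substitutes the
# empty string '' wherever it is False, i.e. at the missing entries
mail_data = raw_mail_data.where(pd.notnull(raw_mail_data), '')
print(mail_data)
```

After this, every missing cell holds '' instead of a null, so later text processing never trips over NaN values.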
This will help us see a sample of the data frame we have: printing the first five rows of the data frame. The data frame is mail_data, so mail_data.head() — this head function prints the first five rows of the data frame. This is the serial number that we have — let's not count the index as a column — so the first column is Category, which says whether the mail is a ham mail or a spam mail, and the second column is Message, the mail itself. The first mail we have is a ham mail, the second mail is also a ham mail, the third mail is a spam mail, and so on. This is how you can look at a sample of your dataset. Now let's check how many rows and columns there are — in other words, how many mails we have in total. The next part will be checking the number of rows and columns; we are basically checking the size of our dataset. As you can see, there are only two columns here, right? One is the Category column and the other is the Message column. I'll print mail_data.shape — when you run this, it gives you the number of rows and columns in your dataset: the first value is the total number of rows and the second value is the total number of columns. If you look at the second value, it is 2, and we know we have only two columns, Category and Message. The first value, the total number of rows, is 5572, which means you have 5572 different mails, and we have the labels for all of them — the labels being whether each mail is a spam mail or a ham mail. This is the dataset we have, and it is not a small dataset; we have a good amount of data here, about
5572 mails, right? So we can move on to the next part. As you can see the labels here, one label is "ham" and the other label is "spam". What we are going to do now is label encoding: we will encode this label into numerical values. We will replace all the "spam" values with 0 and all the "ham" values with 1. This step is called label encoding, where we replace these text values with numbers. In this column we have only two values, one is "ham" and the other is "spam", so "spam" will be encoded as 0 and "ham" as 1. I'll just make a text cell here; it's always a good practice to add text cells and comments describing what you are doing in a particular cell, because if someone reads your code it helps them understand what that cell does, so it's a good practice for you as well to include such text and comments in your code. This part will be "Label Encoding", and I'll write: label spam mail as 0 and ham mail (non-spam mail) as 1. Just remember that ham mails are nothing but non-spam mails. So we have two kinds of mail, one is spam mail and the other is ham mail, and I'm numbering the spam mails as 0 and the ham mails as 1. This is how we generally label the data in a dataset. For example, if you are predicting whether a person has diabetes or not, you might label every person with diabetes as 1 and every person without diabetes as 0, and so on. These are called labels, and this is the label we are encoding: one class as 0 and the other as 1. To encode the labels we write mail_data.loc, and I'll explain what we are doing in this code once I complete it: mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0. What we are doing here is taking the mail_data DataFrame and locating some values: in this DataFrame, take the 'Category' column alone. We don't want the message column; we are not encoding that part, only the first column, the 'Category' column, which is why I mention 'Category'. And if the value in this 'Category' column is 'spam', if the text in the 'Category' column is 'spam', then I want to replace that value with 0. That is what this particular line of code does. I'll copy this line and do the same for the ham mails: I'm going to encode all the ham mails as 1, so you just need to change 'spam' to 'ham' and the number to 1. Let's run this and see whether it works. So the first line of code changes all the 'spam' values to 0 and the second changes all the 'ham' values to 1. I'll add a text cell here: spam will be represented by 0 and ham will be represented by 1. Now we can split the dataset into features and target. For example, I am going to separate this message column and this category column, so I am separating the messages and their labels. This part will be "separating the data as texts and labels". The texts are nothing but the messages, the mails that we have, and the labels are the category, whether it is a spam mail or a ham mail. The reason we do this is that we will feed the data and the labels separately to our machine learning model; it's like giving the x-axis values and the y-axis values.
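The label-encoding step above can be sketched like this. It's a minimal sketch using a small toy DataFrame in place of the full mail CSV; the column names 'Category' and 'Message' match the dataset used in the video:

```python
import pandas as pd

# Toy stand-in for the mail dataset (the real notebook loads the full CSV)
mail_data = pd.DataFrame({
    'Category': ['ham', 'spam', 'ham'],
    'Message': ['See you at lunch', 'Free prize! Click now', 'Call me back'],
})

# Label encoding: spam mail -> 0, ham (non-spam) mail -> 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

print(mail_data['Category'].tolist())  # [1, 0, 1]
```

The boolean mask `mail_data['Category'] == 'spam'` selects only the spam rows, and `.loc[mask, 'Category']` overwrites just that column in those rows.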
So in this case your x-axis values will be the texts, the messages that you have, and your y-axis values will be the labels, whether it is 1 or 0. We basically take the features, the input data, which in this case is the message column, and the output or target column, which is the category column. We generally call this input feature or input column X and the output or target column Y. So I am going to create two variables, X and Y, and store all the messages in X and all the labels in Y. X = mail_data['Message']; the name of this second column is 'Message', so we need to mention it: mail_data and, within quotes, 'Message'. And Y will be the labels, which is nothing but the category column, so Y = mail_data['Category']. This will separate the dataset into X and Y, where X is all the messages and Y is all the categories, the labels. Let's run this, and now we can print X and Y separately. If we print X it shows all the messages, and if we print Y it shows 1s and 0s, where 1 represents ham mails and 0 represents spam mails, which we encoded earlier. That is the next step done. Now we are going to split this X and Y into training data and test data. This is one of the most important steps in all the machine learning projects we work on; as I told you before, the reason is that one set of data will be used to train our model and the other set will be used to evaluate our trained model. This part is the train-test split, or I'll just write it as "splitting the data into training data and test data". If you remember, we have already imported the function we need for this, train_test_split.
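Separating the features and the target comes down to two column selections. A minimal sketch with a toy DataFrame, assuming the labels were already encoded as in the previous step:

```python
import pandas as pd

# Toy stand-in: labels already encoded as 1 (ham) / 0 (spam)
mail_data = pd.DataFrame({
    'Category': [1, 0, 1],
    'Message': ['see you at lunch', 'free prize now', 'call me back'],
})

X = mail_data['Message']   # the input texts (features)
Y = mail_data['Category']  # the output labels (target)

print(X.shape, Y.shape)  # (3,) (3,)
```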
We are going to use this train_test_split function to split our dataset into training data and test data. To do this we need four arrays: the first array is X_train, the second is X_test, the third is Y_train, and the fourth is Y_test. What are these four arrays? We have X and Y, which together are the total dataset, and I'm going to split X into two parts: one part of X will be the training data and the other part will be the test data. We split Y in the same way, so all the corresponding values stay matched: all the training messages, the training mails, go into X_train, the corresponding labels for that training data go into Y_train, the remaining test messages go into X_test, and the corresponding labels for those mails go into Y_test. This is how we generally split a dataset into training and test data. I'll use the function we imported, train_test_split, and inside this function we need to mention some parameters. First X and Y, because X and Y are the dataset we are going to split; train_test_split will split X and Y into these four arrays, where two arrays, X_train and Y_train, are your training data, and the other two, X_test and Y_test, are your test data. There are a few other parameters we need to mention. I'll set test_size=0.2. test_size is nothing but the proportion of data you want in your test set. Say we had a hundred data points in our dataset; generally we take 80 or 90 percent of the data as training data and 10 or 20 percent as test data, so the training data will contain more data points. Here 0.2 represents 20 percent of the data: we have 5572 mails in total, so 80 percent of them will go into the training data, X_train and Y_train, and the remaining 20 percent will go into X_test and Y_test. You need to mention how many data points you want in your test data, and that is your test_size. Finally we have random_state. random_state is not a very important aspect; it is actually a very simple one, and you can give any value for it, so I'll just give 3. The reason for it is that each time you use this train_test_split function, your data can be split in a different way: the first time you split, the data goes one way, and the next time different mails may end up in the training and test data. If you want the data to be split the same way every time, you can mention a random state. Say you are practicing this code and splitting your dataset: if you mention random_state=2 your data will be split in one particular way, but if you use random_state=3 as I have, your data will be split in exactly the same way mine is. This is for reproducibility; it is used to split the data in exactly the same way on every run, and you can give any number for it. So we are creating four arrays, X_train, X_test, Y_train and Y_test, where X_train is the training-data messages, X_test is the test-data messages, Y_train holds the labels for the training data, and Y_test holds the labels for the test data. We are using the train_test_split function with four arguments: X and Y, because those are the total dataset we are splitting; then test_size, the amount of data you want in your test set, where 0.2 means 20 percent of the data goes into your test data and 0.3 would mean 30 percent of the entire dataset; and finally random_state. Let's run this with Shift+Enter. You can also print the shapes: I'll print X.shape first, then X_train.shape and X_test.shape, so we can see how many data points go into X_train and X_test. shape gives the total number of rows and columns you have. X.shape shows 5572 rows, and the second value is empty, which means there is only one column: as you can see, X contains only one column, all these messages; the serial-number index doesn't count as a column. So the original dataset contains 5572 data points; out of those, 80 percent, 4457 data points, go into X_train, and 20 percent, 1115 data points, go into X_test. That is how we split our data into training data and test data. The next part of the code is to convert the text data into numerical values. As I told you before, if you feed raw text to your logistic regression model it doesn't understand anything, so we need to convert all this text data into meaningful numerical values. That next part is called feature extraction.
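The split described above can be sketched end to end like this, with toy lists standing in for the 5572 real mails; the test_size=0.2 and random_state=3 values match the video:

```python
from sklearn.model_selection import train_test_split

# 10 toy messages with alternating 1/0 labels standing in for the real data
X = ['mail %d' % i for i in range(10)]
Y = [i % 2 for i in range(10)]

# test_size=0.2 sends 20% of the rows to the test set; random_state=3
# fixes the shuffle so the same split is reproduced on every run
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=3)

print(len(X_train), len(X_test))  # 8 2
```

Running the same call again with random_state=3 gives the identical split, which is exactly the reproducibility point made above.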
I'll name this section "Feature Extraction". As you might remember, we imported the TfidfVectorizer function in order to convert these text values into numerical values. What we are going to do here is transform the text data into feature vectors that can be used as input to the logistic regression model. We need to convert the text into feature vectors; we know a vector is just a set of numerical values, so we are going to convert the text into numbers that we can feed to our logistic regression model, and those numerical values will act as the input data. I'll create a variable called feature_extraction, and into it I'm going to load the TfidfVectorizer. We need to mention certain parameters here: min_df=1 (I'll explain what this means once I complete it), stop_words='english', and lowercase=True. Those are the three parameters we need. First of all, let's try to understand what this TfidfVectorizer does. I have made a separate video on feature extraction for text data, on what this TF-IDF vectorizer is and how it works; if you want a very detailed explanation I'll give the link to that video in the description, and you can check it after watching this one. But I'll give you a short explanation of what the TfidfVectorizer does. It looks at the data, and if you look at the spam mails, they may all contain words like "free", "offer", "discount" and so on. The TfidfVectorizer goes through all the words in your document (here the document is nothing but the dataset we have), and if a word is repeated several times it gets a particular score. Say a particular word is repeated a thousand times in the entire dataset, then it gets one score; if a word is repeated only a hundred times, it gets a smaller score, and so on. In this way it gives a value, an importance score, to every word present in the dataset. And this is the most important part: this importance score, this weight, is used by our model to find which mails are likely to be spam mails and which are likely to be ham mails. For example, as I told you before, there is a possibility that the spam mails contain words like "free", "offer", "discount", and we have already labeled the spam mails as 0. So what the logistic regression model does is link words like "free" and "discount", through the numerical feature values they get, to the label spam, which is 0. The ham mails get other scores; all their words get their own score values, and those get linked to the target 1. This is how our model can find the difference between spam mails and ham mails, by going through the importance values, the scores, given by the TfidfVectorizer. That is the step we are doing here: we need to convert the text into numerical values, and these numerical values are like importance scores, where a word repeated many times gets one particular score and a word repeated fewer times gets another score. That is how the TfidfVectorizer works, and as I told you before, please watch that video on feature extraction with the TF-IDF vectorizer if you want a more detailed explanation. Now, about the parameters we used. The first parameter is min_df, the minimum document frequency: words that appear in fewer documents than this threshold are ignored. With min_df=1, a word has to appear in at least one mail to be counted; if you set it higher, say min_df=2, words that appear in only a single mail would be dropped, because words that rare won't be that important for our prediction. That is the reason for this min_df parameter. The next parameter is stop_words, and here we pass 'english'. Stop words are words that are repeated many times in a document but don't carry much meaning: words like "is", "was", "are", "did", "the" and so on. These are common words that will be there in almost all the mails, and we don't want them; such words are called stop words, and we want to ignore them. When you give stop_words='english', the vectorizer has a built-in set of such English words that are not important for us, and all of those words will be removed from our documents, from our dataset. That is the second parameter. Finally we have lowercase=True, which converts all the letters to lowercase, which is better for the processing. So those are the three main parameters: min_df filters out words below the minimum document frequency, stop_words removes the common words that don't carry much meaning, and lowercase converts all the letters to lowercase. I am basically loading this TfidfVectorizer into the variable called feature_extraction, so I'm loading one instance of the TfidfVectorizer, and now we need to use this vectorizer function in order to convert the dataset. That is our next step. I'll create an array called X_train_features, and I'm going to convert my X_train. Remember we split the dataset into X_train, X_test, Y_train and Y_test; we don't need to convert Y_train and Y_test because they just contain the values 1 and 0, so we don't need to do anything with them. We only need to convert X_train and X_test, which contain messages like these, and I am going to convert all those messages into numbers and store them in X_train_features. So the messages in X_train will be converted into numerical values, and those will be stored in the array called X_train_features. For this we use feature_extraction: X_train_features = feature_extraction.fit_transform(...). This feature_extraction is nothing but our TfidfVectorizer; as you know, we loaded the TfidfVectorizer into this variable, and now I'm going to use it for the processing. fit_transform will first fit the vectorizer to the data, which here is all the training mails, and once it has been fitted to this data it will transform it. So there are basically two steps happening here: one is fitting all this data to the vectorizer, and after that it transforms all the data into feature vectors, which are nothing but numerical values.
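As a quick sketch of what the vectorizer and its three parameters do, here are two toy sentences (the exact scores and vocabulary on the real mail dataset will of course differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'Free offer, claim your FREE prize now',
    'Meeting moved to Monday, see you at lunch',
]

# min_df=1 keeps any word appearing in at least one document,
# stop_words='english' drops common words like "to", "your" and "at",
# lowercase=True folds "FREE" and "Free" into the same term
vectorizer = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
features = vectorizer.fit_transform(docs)

print(features.shape)                  # (2 documents, n distinct kept words)
print(sorted(vectorizer.vocabulary_))  # the words that survived the filtering
```

Words like "free" and "prize" survive and get TF-IDF weights, while stop words like "your" are dropped before any scoring happens.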
Inside the parentheses you need to mention what we are going to convert, and I'm converting my X_train into features, so we mention X_train there. Similarly we need to convert all the messages in X_test, and that will be our next array, X_test_features. One main thing you need to remember here is that for the test data we write feature_extraction.transform: we do not fit the data again. We fit the vectorizer only on the training data; we don't fit it again for the test data. So there are basically three steps: the first step is fitting the vectorizer to the training data, then using this vectorizer to transform X_train, and then using the same fitted vectorizer to transform X_test. The parentheses in that last line should contain X_test. The main thing is that you shouldn't write feature_extraction.fit_transform in the second case: we don't want to fit our vectorizer to the X_test data, because we don't want our model to have looked at the test set. That is the reason we fit only with X_train, and based on that fit we convert both X_train and X_test into their respective features. I'll also do one small thing here: I'm going to convert the Y_train and Y_test values to integers. We labeled the categories as 1 and 0, right? But sometimes these values are treated as strings; as you can see, the dtype here is object. That is what sometimes happens, and we don't want it, so I want to convert all those values to integers; if we convert all the values to integers, it is easier for the machine to work with them. So I'll take Y_train and Y_test. Y_train contains all the labels for the training data: Y_train = Y_train.astype('int'). I'm taking all the values inside Y_train and converting them to integers; the reason is that otherwise these 1s and 0s won't be treated as integers, they'll be treated as string objects. It's not a big deal. Next we do the same for Y_test: Y_test = Y_test.astype('int'). It's basically the same thing, so let's run this. The first part of this code loads the TfidfVectorizer; once we load it, we convert X_train and X_test into their corresponding feature vectors, and after that we convert Y_train and Y_test into integer values, which are 1s and 0s. Now you can try printing X_train and X_train_features. X_train is the data that has not been transformed, the original text data. Now let's print X_train_features: this contains only numerical values. Let's see how it looks. As you can see, it now contains a lot of numbers. Basically what happens is that if you take the first sentence, each word in it gets a score from the vectorizer function, as I told you before, and it is represented by that score. This is how you convert text data into numerical values; it's not that you can convert them into just any numbers, the numbers have to carry meaning, and that meaning is given by the TF-IDF vectorization. So this is how X_train looks and this is how X_train_features looks. From now on we won't use X_train itself but X_train_features, because those are numbers, and as I have told you, machines understand numbers better. This is the training data we are going to use. I'll just clear this output because it's not very tidy.
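Putting the fit_transform/transform distinction and the label conversion together, here is a minimal sketch with a toy two-mail training set (on the real data the shapes would be 4457 and 1115 rows):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = pd.Series(['free offer win a prize', 'lunch at noon tomorrow'])
X_test = pd.Series(['win a free lunch'])
Y_train = pd.Series(['1', '0'], dtype=object)  # labels stored as string objects
Y_test = pd.Series(['1'], dtype=object)

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english',
                                     lowercase=True)

# fit_transform: learn the vocabulary and IDF weights from the TRAINING
# mails and convert them to feature vectors in one step
X_train_features = feature_extraction.fit_transform(X_train)
# transform only: reuse the already-fitted vectorizer, so the test set
# never influences the fitting
X_test_features = feature_extraction.transform(X_test)

# make sure the 0/1 labels are integers rather than string objects
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')

print(X_train_features.shape, X_test_features.shape)
```

Note that the train and test feature matrices end up with the same number of columns, because the test mails are mapped onto the vocabulary learned from the training mails.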
So we have X_train, which is all the messages in the form of text, and X_train_features, which is basically the same X_train but represented numerically. That is the difference. Now we are in the training stage of our code, so let's train our logistic regression model. This part of the code will be "Training the Machine Learning Model", and in this case I'll also add another text cell that says "Logistic Regression". I'll create a variable called model and load an instance of the logistic regression model into it; if you remember, we imported the LogisticRegression function from sklearn.linear_model. So model = LogisticRegression(), with parentheses, and this loads the logistic regression model into this particular variable. I'll run this, and now you can fit X_train_features and Y_train to this logistic regression model. This part will be "training the logistic regression model with the training data". I'm going to use the variable model, which is nothing but our logistic regression, and call model.fit. As I told you, you need to give two values, something like the x-axis values and the y-axis values: your x-axis values here are X_train_features and your y-axis values are Y_train. X_train_features is nothing but all the training data represented in numerical form, and Y_train contains all the corresponding labels, where 1 represents ham mails and 0 represents spam mails. Let's run this, and once you run it, your logistic regression model will be trained. Now, if you give it a new mail, it can tell you whether that particular mail is a spam mail or a ham mail; this is how it generally works. But before going to the predictive system, we need to evaluate our model: we need to check how many correct predictions our model is making.
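The training step can be compressed into a short sketch. The four toy mails and their labels below are made up for illustration, using the same 0 = spam, 1 = ham encoding as earlier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Four toy mails: 0 = spam, 1 = ham (same encoding as before)
X_train = ['free prize click now', 'win a free offer today',
           'lunch at noon tomorrow', 'text me when you reach home']
Y_train = [0, 0, 1, 1]

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english',
                                     lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)

model = LogisticRegression()
# fit() learns weights linking the TF-IDF word scores to the 0/1 labels
model.fit(X_train_features, Y_train)
print(model.predict(X_train_features))  # the model's predictions on the training mails
```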
That is the next part, and this is called evaluating the model, "Evaluating the Trained Model". First I'm going to predict on the training data, "prediction on training data". What we are doing here is this: we used X_train_features and Y_train to train our model, and now that the model is trained, I'm going to give it only X_train_features and ask it to predict the Y_train values. I'm giving it all the mails and asking my model to predict whether each one is a spam mail or a ham mail; it will basically try to predict whether each value is 1 or 0, and I'm going to check how many values it predicts correctly. I'll store all the values predicted by my model in prediction_on_training_data. So prediction_on_training_data = model.predict(...). As you can see, fit is the function used to fit our logistic regression model to the dataset, which is like training the model; for predicting we use a different function called predict, so model.predict(X_train_features). I'm giving only the X_train_features values, not Y_train; the model will now predict those Y values, all the 1s and 0s, and all those predicted values will be stored in prediction_on_training_data. Now I'm going to compare the values predicted by our model with the true values: the values predicted by the model are stored in that array, and the true values are nothing but Y_train, so we need to compare them. This gives your accuracy on the training data: accuracy_on_training_data = accuracy_score(...). As you may remember, we imported this accuracy_score function from sklearn.metrics, and I'm using that function here. You need to mention two values, the true values and the predicted values: the true values are your Y_train values, and your predicted values are prediction_on_training_data. Let's run this and find the accuracy score value. I'll print "Accuracy on training data:" followed by the value; I'll just copy the variable in, and we get some value here. Let's see what the value is: 0.967. 0.96 basically represents 96 percent, which means that out of 100 predictions, if you use your model to predict 100 different mails, it will give the correct value for about 96 of them. That is your accuracy value: 0.9 means 90 percent, 0.8 means 80 percent, 0.96 means 96 percent, which is a very good accuracy score. If you get an accuracy score of more than 75, 80 or 85 percent, we can say it is a good model, and if you are getting an accuracy score of more than 95 percent, your model is working really well. That is one main thing to remember. Now we need to find the same accuracy score for the test data. As I told you before, we train our logistic regression model with the training data, X_train and Y_train, and then we test it, evaluate it, with X_test and Y_test, so we need to evaluate it with the test data. You might wonder why I computed the accuracy on the training data at all, since I told you before that we test with the test data; there is one main reason why I have done this, which I'll explain in a minute, but before that let's do the same thing with the test data as well. I'll just copy this code and change a few things: I'm going to predict using the test data, so I change "training" to "test", and this will be X_test_features, Y_test and accuracy_on_test_data. I'm basically repeating the same thing; the only difference is that instead of X_train_features I'm using X_test_features, so this is predicting on the test dataset. Everything looks right, so let's run this, and again we'll find the accuracy score and print it; I'll just change the label to "test data" and copy the variable here. Let's see our accuracy score on the test data: it's 96.5 percent, which is not very different from the training-data accuracy. Now, the reason I am finding the accuracy score on both the training data and the test data is that in some cases your model may overfit. Overfitting is a problem that occurs quite often in machine learning. What happens is that your model performs well on the training dataset, so you get a very high accuracy score on your training data, but when you predict on the test data you get a much lower test accuracy score. Say we get an accuracy score of 96 percent on the training data, but only 60 percent on the test data: the difference between training and test is huge, and in that case we say our model is overfitting, which basically means the model has over-learned from the training data. We don't want that; we want a general solution, we don't want our model to over-learn anything from the training data, and that is the reason we are checking the accuracy score on both the training data and the test data.
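The evaluation logic above boils down to comparing true labels against predicted labels with accuracy_score. A sketch with made-up label arrays (not the real model's output) to show how the two scores are computed and compared:

```python
from sklearn.metrics import accuracy_score

# hypothetical true labels and model predictions
Y_train_true = [1, 0, 1, 1, 0]
train_preds  = [1, 0, 1, 1, 0]     # every training prediction correct
Y_test_true  = [1, 0, 1, 0]
test_preds   = [1, 0, 0, 0]        # one mistake out of four test mails

accuracy_on_training_data = accuracy_score(Y_train_true, train_preds)
accuracy_on_test_data = accuracy_score(Y_test_true, test_preds)

print(accuracy_on_training_data)  # 1.0  -> 100% correct
print(accuracy_on_test_data)      # 0.75 -> 75% correct
# a big gap between the two (say 0.96 vs 0.60) would signal overfitting
```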
Let me give you an analogy for overfitting. Say there is a person studying for an exam, let's say a maths exam, and he has practiced all the questions given in one particular book. If all the questions asked in the exam come from that book, he can perform well; but if the examiner asks different questions that are merely related to what he studied, he may not perform well. That is basically what happens in the case of overfitting. If you test the model only with the data it has already studied, which is the training data, it's like asking the person only the problems he has already solved from that particular maths book: ask him those questions and he can answer them, but ask him something outside the book and he cannot. We don't want that kind of case in our machine learning model; it should perform well on mails, on data, that the model has not seen. What this basically means is that if every new mail were similar to the ones in X_train, the model would perform well, but if you give it a new mail it has not seen, an overfitted model may not perform well, and in that case what you will get is a very low accuracy score on your test data. So the one main thing you can do to find out whether your model has overfitted is to check the accuracy scores on the training data and the test data: if the difference is not very large, you can say your model is working well and is not overfitted. As you can see here, the accuracy scores are very similar, so in that case we can say our model does not overfit. That is the reason for the comparison, and we are getting a very good accuracy score of more than 96 percent, which is really good. Now we are in the final part of our code: we are going to build a predictive system. So this predictive
system what it will do is if you type in a new male your trained logistic regression model will predict whether that male is a spammy or man so what it basically does is it will try to predict this uh zero or one value so if you give a new uh you know male it will try to predict whether the label is zero or one so if the label is zero we call it as family if the label is uh one then we can call that as an admin okay so that's what we are going to do in our next part of code and this is building a predictive system okay so i'll create an array as input mail and store the data in a list format so what i'm going to do is i'm going to choose a mail from my data set i'm going to paste it in this particular list so i have my uh mail data set here i'll just open this with notepad okay so you can just copy any mail yet so we can just copy some mail so i'll just uh you know randomly copies okay so i just take this mail so if you can see here this mail is an amal so the first word all the first word in each line represents its category it's a target which is spam or amp and the second part of this line represents the main so i'll copy this particular line so and i am going to feed it to my machine learning system my logistic regression model okay so i have pasted the mail here so now i need to give this to my model so and another main thing is you need to enclose it within a port so i'll enclose it within double quotes here so the reason i am using double quotes is as you can see a single quote here right so if i just use a single quote here what it will consider this it will consider this i as a string so we don't want that so in that case you can just use double quotes so double quotes and uh here there should also be double quotes okay so i'll enclose all the string in this double quotes and now i'm going to feed this to my machine learning system so if my model is working correctly then it should predict that this particular mail is an am mail or it should basically give 
the target value as one, because the target value for a ham mail is one. So it should give the value as one, and that's what we are going to do now. We have input_mail, and as you know, we need to convert this text into numerical values; we need to convert it into a feature vector. To do that, we are going to use the feature extraction object, the same thing we did earlier: we take this particular message and transform it using feature_extraction.transform, which converts it into numerical values. So the second part is converting text to feature vectors. I'll name this input_data_features, so input_data_features = feature_extraction.transform(input_mail). Basically, I'm taking this input mail and passing it through feature_extraction.transform so that it is converted into features, which are nothing but numerical values, and I'm storing all those numerical values in input_data_features. Now we can make our prediction. The next part is making predictions, and I'll store my prediction in a variable called prediction, so prediction = model.predict(input_data_features). As I told you, we use the fit function to train our model and the predict function to predict the label value. And then I'm going to print the prediction value. This is actually a very simple step: I'm just taking a new mail, converting it into numerical values using the feature extraction step, and predicting its label. This model.predict line will give you the value as either one or zero, I'm storing that value in the variable called prediction, and I'm printing it. We know that this mail is a ham mail, so I
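The transform-then-predict step described above can be sketched end to end like this, assuming scikit-learn is installed; a tiny toy corpus stands in for the real mail dataset, and the variable names (feature_extraction, model, input_data_features) mirror the ones used in the video:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: spam -> 0, ham -> 1 (same label encoding as the project)
train_messages = [
    "free entry win cash prize now",       # spam
    "urgent claim your free prize",        # spam
    "are we still meeting for lunch",      # ham
    "see you at home tonight",             # ham
]
train_labels = [0, 0, 1, 1]

feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
x_train_features = feature_extraction.fit_transform(train_messages)

model = LogisticRegression()
model.fit(x_train_features, train_labels)

# Convert the new mail with transform (not fit_transform), so the
# vocabulary learned during training is reused for the new message.
input_mail = ["are we meeting for lunch tonight"]
input_data_features = feature_extraction.transform(input_mail)

prediction = model.predict(input_data_features)
print(prediction)   # an array holding a single 0/1 label
```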
should get the label value as one, because we have labeled all the ham mails as one. So I'll run this, and as you can see, we get the label value as one. We know that one represents a ham mail, and since this mail is indeed ham, we can say that our model has predicted correctly. Okay, now let's include another simple part here: I'm going to create an if-else condition. This prediction value will basically be contained in a list; as you can see, there is a square bracket, and within the square bracket there is a one. When you predict something with your model, the value is returned in a list like this. So what I'm going to do is check this value: if the value is equal to 1, I want to print "Ham mail", and if the value is equal to 0, I want to print "Spam mail". I'll create an if condition here: if prediction[0] == 1. The prediction is nothing but the list you get as your output, and this 0 represents the first element in your prediction list. Let's say we create a list called my_list containing some values such as 1, 2, 3 and so on. If you want to print the first value, you write my_list followed by square brackets and zero; running this prints the first value, which is one. This is how you print the first value in a list: index 0 is the first element, and if you want the second value, you give 1 as the index. Similarly, my prediction contains only one value, so we need to mention the index of that value: prediction[0] means I have a prediction list and I want the first value in that list. That's the reason we are writing: if prediction[0] == 1.
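The indexing walkthrough above can be written out directly:

```python
# Index 0 is the first element of a list, index 1 the second, and so on.
my_list = [1, 2, 3]
print(my_list[0])   # 1  (first value)
print(my_list[1])   # 2  (second value)

# model.predict returns its labels inside an array, so prediction[0]
# picks out the single predicted label in exactly the same way.
prediction = [1]    # stand-in for a model's output
print(prediction[0])   # 1
```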
So this basically means: if the first value in my prediction array is equal to 1, then I want to say that it is a ham mail. The other condition, else, means it is a spam mail, so print "Spam mail"; we need to enclose these strings in quotes. Basically, what we are doing here is checking the value of this prediction: if the value is 1, I'm going to call it ham, and if I get the value 0, which is the else condition, it will print it as spam. Now let's run this: it gives you the label, and it also tells you whether the mail is ham or spam. What you can do is take some spam mail example and paste that line inside this bracket instead of this ham mail; when you do that, it will give you the label as 0 and tell you that it is spam. So that is the predictive system we are making, and that's it for this particular project. I hope you have understood everything covered in this video, and I'll just give you a quick recap of all the things we have done here so that it will be useful for you. The first part is importing the dependencies; the dependencies are nothing but the libraries and functions we need. We have imported NumPy and pandas. We know that NumPy is used to create NumPy arrays; I think we didn't use NumPy anywhere in this particular code, but in most cases you will need NumPy arrays, so it is good practice to import it. I have imported NumPy as np, and then I have imported pandas, which is used to create a DataFrame like this; it is used to put the data into a structured DataFrame,
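The if-else condition described above looks like this, assuming the label encoding used in this project (spam = 0, ham = 1):

```python
# Stand-in for model.predict(...) output: the label arrives inside a list.
prediction = [1]

# Map the numeric label back to a readable category name.
if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')
```

Swapping in a spam message would make the model return `[0]`, and the else branch would print "Spam mail" instead.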
and I am importing it in the short form pd. Then I import the train_test_split function, which is used to split my data into training and test data, where the training data is used to train the model and the test data is used to evaluate it. We use TfidfVectorizer to transform the text into numerical values, we import the LogisticRegression model, and finally accuracy_score in order to find how accurate our model is. That is the first step. Then we upload our dataset to the Colab environment; I'll give you this dataset file, and you can download it from the link in the video description. We take this CSV file and load it into a DataFrame. After that, we replace all the null values with a null string; the next part is checking the first five rows of the DataFrame, and then checking how many rows and columns there are. Then we replace spam with 0 and ham with 1, so we are basically doing label encoding. Then we split X and Y: X is nothing but all your messages or mails, and Y is the category, where the category is spam or ham, represented by 0 and 1, and we print them. Then we split the dataset into training and test data, and the next part is converting the text data into feature vectors, which are numerical values. After that we can feed the data to our logistic regression model, and then we evaluate the model; here we find the accuracy score on both the training and the test data. The final part is building the predictive system, and this system will tell you whether a mail is spam or ham if you put it inside this bracket. So I hope everyone is clear about the things covered in this video.
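The whole recap above can be condensed into one runnable sketch, assuming scikit-learn and pandas are installed; a small toy corpus built in place replaces the real CSV file, so `pd.read_csv` and the exact column contents are substituted for the sake of a self-contained example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load the data into a DataFrame (in the project: pd.read_csv on the file)
raw = pd.DataFrame({
    'Category': ['spam', 'ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'ham'],
    'Message': [
        'win a free prize now', 'see you at lunch',
        'claim your cash reward', 'call me when you are home',
        'free entry in the contest', 'meeting moved to tomorrow',
        'urgent prize waiting for you', 'thanks for the dinner last night',
    ],
})
mail_data = raw.where(pd.notnull(raw), '')   # replace null values with null string

# 2. Label encoding: spam -> 0, ham -> 1
mail_data.loc[mail_data['Category'] == 'spam', 'Category'] = 0
mail_data.loc[mail_data['Category'] == 'ham', 'Category'] = 1

X = mail_data['Message']
Y = mail_data['Category'].astype('int')

# 3. Split into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=3)

# 4. Convert text to feature vectors (fit on train, transform only on test)
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

# 5. Train the logistic regression model and evaluate accuracy
model = LogisticRegression()
model.fit(X_train_features, Y_train)
train_acc = accuracy_score(Y_train, model.predict(X_train_features))
print('accuracy on training data:', train_acc)
```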