Transcript for:
Data Science Full Course: Key Concepts and Algorithms

Undoubtedly, data science is the most revolutionary technology of the era. It's all about deriving useful insights from data in order to solve real-world complex problems. Hi all, I welcome you to this session on the Data Science full course, which contains everything that you need to know in order to master data science. Now, before we get started, let's take a look at the agenda. The first module is an introduction to data science that covers all the basic fundamentals of data science. Followed by this, we have the statistics and probability module, where you'll understand the statistics and math behind data science and machine learning algorithms. The next module is the basics of machine learning, where we'll understand what exactly machine learning is, the different types of machine learning, the different machine learning algorithms, and so on. The next module is the supervised learning algorithms module, where we'll start by understanding the most basic algorithm, which is linear regression. The next module is the logistic regression module, where we will see how logistic regression can be used to solve classification problems. After this, we'll discuss decision trees and we'll see how decision trees can be used to solve complex data-driven problems. The next module is random forest. Here we'll understand how random forest can be used to solve classification problems and regression problems with the help of use cases and examples. The next module we'll be discussing is the k-nearest neighbor module. We will understand how KNN can be used to solve complex classification problems. Followed by this, we'll look at the Naive Bayes module, which is one of the most important algorithms in Gmail spam detection. The next algorithm is support vector machine, where we will understand how SVMs can be used to draw a hyperplane between different classes of data. Finally, we move on to the unsupervised learning module, where we will understand how k-means can be used for clustering.
We'll also see how you can perform market basket analysis by using association rule mining. The next module is reinforcement learning, where we will understand the different concepts of reinforcement learning along with a couple of demonstrations. Followed by this, we'll look at the deep learning module, where we will understand what exactly deep learning is, what neural networks are, the different types of neural networks, and so on. The last module is the data science interview questions module, where we will understand the important concepts of data science along with a few tips in order to ace the interview. Now, before we get started, make sure you subscribe to the Edureka YouTube channel in order to stay updated about the most trending technologies.
Data science is one of the most in-demand technologies right now. Now, this is probably because we're generating data at an unstoppable pace, and obviously we need to process and make sense out of this much data. This is exactly where data science comes in. In today's session, we'll be talking about data science in depth. So let's move ahead and take a look at today's agenda. We're going to begin by discussing the various sources of data and how the evolution of technology and the introduction of IoT and social media have led to the need for data science. Next, we'll discuss how Walmart is using insightful patterns from their database to increase the potential of their business. After that, we will see what exactly data science is. Then we'll move on and discuss who a data scientist is, where we will also discuss the various skill sets needed to become a data scientist. Next, we can move on to see the various data science job roles, such as data analyst, data architect, data engineer, and so on. After this, we will cover the data life cycle, where we will discuss how data is extracted, processed, and finally used as a solution. Once we're done with that,
We'll cover the basics of machine learning, where we'll see what exactly machine learning is and the different types of machine learning. Next, we will move on to the k-means algorithm and we'll discuss a use case of k-means clustering, after which we'll discuss the various steps involved in the k-means algorithm, and then we will finally move on to the hands-on part, where we use the k-means algorithm to cluster movies based on their popularity on social media platforms like Facebook. At the end of today's session, we'll also discuss what a data science certification is and why you should take it up. So guys, there's a lot to cover in today's session. Let's jump into the first topic.
Do you guys remember the times when we had telephones and we had to go to PCO booths in order to make a phone call? Now, those times were very simple because we didn't generate a lot of data. We didn't even store the contacts in our phones or our telephones. We used to memorize phone numbers back then or, you know, we used to have a diary of all our contacts. But these days we have smartphones, which store a lot of data. So there's everything about us in our mobile phones. We have images, we have contacts, we have various apps, we have games. Everything is stored on our mobile phones these days. Similarly, the PCs that we used in the earlier times used to process very little data. All right, there wasn't a lot of data processing needed because technology hadn't evolved that much. So if you guys remember, we used floppy disks back then, and floppy disks were used to store small amounts of data, but later on hard disks were created and those used to store GBs of data. But now if you look around, there's data everywhere around us. All right, we have data stored in the cloud. We have data in each and every appliance at our houses. Similarly, if you look at smart cars these days, they're connected to the internet, they're connected to our mobile phones, and this also generates a lot of data.
What we don't realize is that the evolution of technology has generated a lot of data. All right. Now, initially there was very little data and most of it was even structured. Only a small part of the data was unstructured or semi-structured. And in those days you could use simple BI tools in order to process all of this data and make sense out of it. But now we have way too much data, and in order to process this much data we need more complex algorithms. We need a better process. All right, and this is where data science comes in. Now guys, I'm not going to get into the depth of data science yet. I'm sure all of you have heard of IoT, or Internet of Things. Now, did you guys know that we produce 2.5 quintillion bytes of data each day? And this is only accelerating with the growth of IoT. Now, IoT, or Internet of Things, is just a fancy term that we use for a network of tools or devices that communicate and transfer data through the internet. So various devices are connected to each other through the internet and they communicate with each other, right? Now, the communication happens by exchange of data or by generation of data. Now, these devices include the vehicles we drive. They include our TVs, coffee machines, refrigerators, washing machines, and almost everything else that we use on a daily basis. Now, these interconnected devices produce an unimaginable amount of data. Guys, IoT data is measured in zettabytes, and one zettabyte is equal to a trillion gigabytes. So according to a recent survey by Cisco, it's estimated that by the end of 2019, which is almost here, the IoT will generate more than five hundred zettabytes of data per year. And this number will only increase through time. It's hard to imagine data in that much volume. Imagine processing, analyzing, and managing this much data. It's only going to cause us a migraine. So guys, having to deal with this much data is not something that traditional BI tools can do. Okay.
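To make the scale of these figures concrete, here is a minimal sketch of the unit arithmetic behind them. It assumes decimal (SI) units, where one gigabyte is 10^9 bytes and one zettabyte is 10^21 bytes; the variable names are illustrative only.

```python
# Decimal (SI) byte units; binary units (GiB, ZiB) would give slightly different numbers.
GB = 10**9            # bytes in one gigabyte
ZB = 10**21           # bytes in one zettabyte
QUINTILLION = 10**18

# One zettabyte expressed in gigabytes: 10**12, i.e. a trillion.
gigabytes_per_zettabyte = ZB // GB

# The "2.5 quintillion bytes per day" figure quoted in the session.
daily_bytes = 25 * QUINTILLION // 10
daily_gigabytes = daily_bytes // GB

# At that daily rate, how many days does it take to reach one zettabyte?
days_per_zettabyte = ZB / daily_bytes

print(gigabytes_per_zettabyte, daily_gigabytes, days_per_zettabyte)
```

Under these assumptions, 2.5 quintillion bytes a day works out to 2.5 billion gigabytes a day, and it takes 400 such days to fill a single zettabyte.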
We can no longer rely on traditional data processing methods. That's exactly why we need data science. It's our only hope right now. Now, let's not get into the details here yet. Moving on, let's see how social media is adding on to the generation of data. Now, the fact that we are all in love with social media is actually generating a lot of data for us. Okay. It's certainly one of the fuels for data creation. Now, all these numbers that you see on the screen are generated every minute of the day. Okay, and this number is just going to increase. So for Instagram, it says that approximately 1.7 million pictures are uploaded in a minute, and similarly on Twitter, approximately a hundred and forty-eight thousand tweets are published every minute of the day. So guys, imagine in one hour how much that would be, and then imagine in 24 hours. So guys, this is the amount of data that is generated through social media. It's unimaginable. Imagine processing this much data, analyzing it, and then trying to figure out, you know, the important insights from this much data. Analyzing this much data is going to be very hard with traditional tools or traditional methods. That's why data science was introduced. Data science is a simple process that will just extract the useful information from data. All right, it's just going to process and analyze the entire data and then it's just going to extract what is needed. Now guys, apart from social media and IoT, there are other factors as well which contribute to data generation. These days all our transactions are done online, right? We pay bills online. We shop online. We even buy homes online. These days you can even sell your pets on OLX. Not only that, when we stream music and watch videos on YouTube, all of this is generating a lot of data. Not to forget, we've also brought health care into the internet world.
Now, there are various watches like Fitbit, which basically tracks our heart rate and generates data about our health conditions. Education is also an online thing right now. That's exactly what you are doing right now. So with the emergence of the internet, we now perform all our activities online. Okay, obviously this is helping us, but we are unaware of how much data we are generating. What can be done with all of this data? And what if we could use the data that we generated to our benefit? Well, that's exactly what data science does. Data science is all about extracting the useful insights from data and using them to grow your business.
Now, before we get into the details of data science, let's see how Walmart uses data science to grow their business. So guys, Walmart is the world's biggest retailer, with over 20,000 stores in just 28 countries. Okay. Now, it's currently building the world's biggest private cloud, which will be able to process two point five petabytes of data every hour. Now, the reason behind Walmart's success is how they use customer data to get useful insights about customers' shopping patterns. Now, the data analysts and the data scientists at Walmart know every detail about their customers. They know that if a customer buys Pop-Tarts, they might also buy cookies. How do they know all of this? Like, how do they generate information like this? Now, they use the data that they get from their customers and they analyze it to see what a particular customer is looking for. Now, let's look at a few cases where Walmart actually analyzed the data and figured out the customer needs. So let's consider the Halloween cookie sales example. Now, during Halloween, a sales analyst at Walmart took a look at the data. Okay, and he found out that a specific cookie was popular across all Walmart stores. So every Walmart store was selling these cookies very well, but he found out that there were two stores which were not selling them at all. Okay.
So the situation was immediately investigated and it was found that there was a simple stocking oversight. Okay, because of which the cookies were not put on the shelves for sale. So because this issue was immediately identified, they prevented any further loss of sales. Now, another such example is that through association rule mining, Walmart found out that strawberry Pop-Tart sales increased by seven times before a hurricane. So a data analyst at Walmart identified the association between hurricanes and strawberry Pop-Tarts through data mining. Now guys, don't ask me the relationship between Pop-Tarts and hurricanes, but for some reason, whenever there was a hurricane approaching, people really wanted to eat strawberry Pop-Tarts. So what Walmart did was they placed all the strawberry Pop-Tarts at the checkout before a hurricane would occur. So this way they increased sales of the Pop-Tarts. Now guys, this is an actual thing. I'm not making it up. You can look it up on the internet. Not only that, Walmart is analyzing the data generated by social media to find out all the trending products. So through social media you can find out the likes and dislikes of a person, right? So what Walmart did, and they are quite smart, is they used the data generated by social media to find out what products are trending or what products are liked by customers. Okay, an example of this is that Walmart analyzed social media data to find out that Facebook users were crazy about cake pops. Okay, so Walmart immediately took a decision and they introduced cake pops into the Walmart stores. So guys, the only reason Walmart is so successful is because they don't see the huge amount of data that they get as a burden. Instead, they process this data, analyze it, and then they try to draw useful insights from it. Okay, so they invest a lot of money, a lot of effort, and a lot of time in data analysis. Okay, they spend a lot of time analyzing data in order to find any hidden patterns.
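To show what an association between two products looks like in numbers, here is a minimal sketch of the three standard association rule metrics (support, confidence, and lift). The transactions are made-up toy data, not Walmart's, and the function names are illustrative assumptions rather than any library's API.

```python
# Toy shopping baskets (hypothetical data for illustration only).
transactions = [
    {"pop_tarts", "cookies", "milk"},
    {"pop_tarts", "cookies"},
    {"pop_tarts", "bread"},
    {"cookies", "milk"},
    {"bread", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how often `consequent` appears anyway; above 1 suggests a positive association."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

rule_support = support({"pop_tarts", "cookies"}, transactions)
rule_confidence = confidence({"pop_tarts"}, {"cookies"}, transactions)
rule_lift = lift({"pop_tarts"}, {"cookies"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

In this toy data, two of the three Pop-Tart baskets also contain cookies, so the rule "Pop-Tarts, therefore cookies" has a lift slightly above 1. Real market basket analysis runs the same calculations over millions of baskets and prunes the candidate rules first.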
So as soon as they find a hidden pattern or an association between any two products, they start giving out offers or having discounts or something along that line. So basically Walmart uses data in a very effective manner. They analyze it very well, they process the data very well, and they find out the useful insights that they need in order to get more customers or in order to improve their business. So guys, this was all about how Walmart uses data science.
Now, let's move ahead and look at what data science is. Now guys, data science is all about uncovering findings from data. It's all about surfacing the hidden insights that can help companies make smart business decisions. So all these hidden insights or these hidden patterns can be used to make better decisions in a business. Now, an example of this is also Netflix. So Netflix basically analyzes the movie viewing patterns of users to understand what drives user interest and to see what users want to watch, and then once they find out, they give people what they want. So guys, data actually has a lot of power. You should just know how to process this data and how to extract the useful information from data. Okay. That's what data science is all about. So guys, a big question over here is how do data scientists get useful insights from data? So it all starts with data exploration. Whenever data scientists come across any challenging question or any sort of challenging situation, they become detectives. So they investigate leads and they try to understand the different patterns or the different characteristics of the data. Okay. They try to get all the information that they can from the data and then they use it for the betterment of the organization or the business.
Now, let's look at who a data scientist is. So guys, a data scientist has to be able to view data through a quantitative lens. So guys, knowing math is one of the very important skills of data scientists. Okay.
So mathematics is important because in order to find a solution you're going to build a lot of predictive models, and these predictive models are going to be based on hard math. So you have to be able to understand all the underlying mechanics of these models. Most of the predictive models, most of the algorithms, require mathematics. Now, there's a major misconception that data science is all about statistics. Now, I'm not saying that statistics isn't important. It is very important, but it's not the only type of math that is utilized in data science. There are actually many machine learning algorithms which are based on linear algebra. So guys, overall you need to have a good understanding of math. And apart from that, data scientists utilize technology, so data scientists have to be really good with technology. Okay. So their main work is that they utilize all the technology so that they can analyze these enormous data sets and work with complex algorithms. So all of this requires tools which are much more sophisticated than Excel. So guys, data scientists need to be very efficient with coding languages, and a few of the core languages associated with data science include SQL, Python, R, and SAS. It is also important for a data scientist to be a tactical business consultant. So guys, business problems can only be solved by data scientists, since data scientists work so closely with data that they know everything about the business. If you have a business and you give the entire data set of your business to a data scientist, they'll know each and every aspect of your business. Okay? That's how data scientists work. They get the entire data set, they study the data set, they analyze it, and then they see where things are going wrong or what needs to be done more or what needs to be excluded. So guys, having this business acumen is just as important as having skills in algorithms or being good with math and technology.
So guys, business is also as important as these other fields. Now that you know who a data scientist is, let's look at the skill sets that a data scientist needs. Okay, it always starts with statistics. Statistics will give you the numbers from the data. So a good understanding of statistics is very important for becoming a data scientist. You have to be familiar with statistical tests, distributions, maximum likelihood estimators, and all of that. Apart from that, you should also have a good understanding of probability theory and descriptive statistics. These concepts will help you make better business decisions. So no matter what type of company or role you're interviewing for, you're going to be expected to know how to use the tools of the trade. Okay. This means that you have to know a statistical programming language like R or Python, and you'll also need to know a database querying language like SQL. Now, the main reason why people prefer R and Python is because of the number of packages that these languages have, and these predefined packages have most of the algorithms in them. So you don't have to actually sit down and code the algorithms. Instead, you can just load one of these packages from their libraries and run it. So a programming language is a must. At the minimum, you should know R or Python and a database query language. Now, let's move on to data extraction and processing. So guys, say that you have multiple data sources, like a MySQL database or a Mongo database. Okay. So what you have to do is extract data from such sources, and then in order to analyze and query this data you have to store it in a proper format or a proper structure. Okay, finally, then you can load the data into the data warehouse and you can analyze the data over there. Okay. So this entire process is called extraction and processing. So guys, extraction and processing is all about getting data
from these different data sources and then putting it in a format so that you can analyze it. Now, next is data wrangling and exploration. Now guys, data wrangling is one of the most difficult tasks in data science. This is the most time-consuming task because data wrangling is all about cleaning the data. There are a lot of instances where the data sets have missing values or null values, or they have inconsistent formats or inconsistent values, and you need to understand what to do with such values. This is where data wrangling or data cleaning comes into the picture. Then, after you're done with that, you are going to analyze the data. So guys, after data wrangling and cleaning is done, you're going to start exploring. This is where you try to make sense out of the data. Okay, so you can do this by looking at the different patterns in the data, the different trends, outliers, various unexpected results, and all of that. Next, we have machine learning. So guys, if you're at a large company with huge amounts of data, or if you're working at a company where the product is data driven, like if you're working at Netflix or on Google Maps, then you have to be familiar with machine learning methods, right? You cannot process large amounts of data with traditional methods. So that's why you need machine learning algorithms. So there are a few algorithms, like KNN, which is k-nearest neighbor, there's random forest, there's the k-means algorithm, there are support vector machines. You have to be aware of all of these algorithms, and let me tell you that most of these algorithms can be implemented using R or Python libraries. Okay, you need to have an understanding of machine learning if you have a large amount of data in front of you, which is going to be the case for most people right now because data is being generated at an unstoppable pace. Earlier in the session we discussed how much data is generated.
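The data wrangling step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the rows, the set of "missing" markers, and the choice to fill missing ages with the mean are all assumptions made for the example, not a prescribed method.

```python
# Hypothetical raw records with missing values and inconsistent formats.
raw_rows = [
    {"name": "Alice", "age": "34", "city": "new york"},
    {"name": "Bob", "age": "", "city": "Chicago "},
    {"name": "Cara", "age": "N/A", "city": "CHICAGO"},
    {"name": "Dan", "age": "41", "city": None},
]

# Strings that should be treated as "no value" (an assumption for this example).
MISSING_MARKERS = {"", "n/a", "null", "none"}

def clean_row(row):
    """Normalize whitespace and casing, and map missing markers to None."""
    cleaned = {}
    for key, value in row.items():
        text = "" if value is None else str(value).strip()
        cleaned[key] = None if text.lower() in MISSING_MARKERS else text
    if cleaned["age"] is not None:
        cleaned["age"] = int(cleaned["age"])        # consistent numeric type
    if cleaned["city"] is not None:
        cleaned["city"] = cleaned["city"].title()   # consistent casing
    return cleaned

clean_rows = [clean_row(row) for row in raw_rows]

# One simple way to handle missing ages: fill them with the mean of the known ages.
known_ages = [row["age"] for row in clean_rows if row["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
for row in clean_rows:
    if row["age"] is None:
        row["age"] = mean_age

print(clean_rows)
```

Whether to fill, drop, or flag a missing value depends on the data set and the question being asked. Mean imputation is used here only because it is the easiest option to show.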
So for now, knowing machine learning algorithms and machine learning concepts is a very much required skill if you want to become a data scientist. So if you're sitting for an interview as a data scientist, you will be asked about machine learning algorithms. You will be asked how good you are with these algorithms and how well you can implement them. Next, we have big data processing frameworks. So guys, we know that we've been generating a lot of data, and most of this data can be structured or unstructured as well. So on such data, you cannot use a traditional data processing system. So that's why you need to know frameworks like Hadoop and Spark. Okay. These frameworks can be used to handle big data. Lastly, we have data visualization. So guys, data visualization is one of the most important parts of data analysis. It is always very important to present the data in an understandable and visually appealing format. So data visualization is one of the skills that data scientists have to master. Okay, if you want to communicate the data to the end users in a better way, then data visualization is a must. So guys, there are a lot of tools which can be used for data visualization. Tools like Tableau and Power BI are a few of the most popular visualization tools. So with this, we sum up the entire skill set that is needed to become a data scientist. Apart from this, you should also have a data-driven problem-solving approach. You should also be very creative with data.
So now that we know the skills that are needed to become a data scientist, let's look at the different job roles. Since data science is a very vast field, there are many job roles under data science. So let's take a look at each role. Let's start off with a data scientist. So guys, data scientists have to understand the challenges of a business and they have to offer the best solution using data analysis and data processing.
So for instance, if they are expected to perform predictive analysis, they should also be able to identify trends and patterns that can help the companies in making better decisions. To become a data scientist, you have to be an expert in R, MATLAB, SQL, Python, and other complementary technologies. It can also help if you have a higher degree in mathematics or computer engineering. Next, we have a data analyst. So a data analyst is responsible for a variety of tasks, including visualization and processing of massive amounts of data, and among them, they also have to perform queries on databases. So they should be aware of the different query languages. And guys, one of the most important skills of a data analyst is optimization. This is because they have to create and modify algorithms that can be used to pull information from some of the biggest databases without corrupting the data. So to become a data analyst, you must know technologies such as SQL, R, SAS, and Python. So certification in any of these technologies can boost your job application. You should also have good problem-solving qualities. Next, we have a data architect. So a data architect creates the blueprints for data management so that the databases can be easily integrated, centralized, and protected with the best security measures. Okay. They also ensure that the data engineers have the best tools and systems to work with. So to become a data architect, you have to have expertise in data warehousing, data modeling, and extraction, transformation, and load. Okay. You should also be well versed in Hive, Pig, and Spark. Now, apart from this, there are data engineers. So guys, the main responsibility of a data engineer is to build and test scalable big data ecosystems. Okay, they are also needed to update the existing systems with newer or upgraded versions, and they are also responsible for improving the efficiency of the database.
If you are interested in a career as a data engineer, then technologies that require hands-on experience include Hive, NoSQL, R, Ruby, Java, C++, and MATLAB. It would also help if you can work with popular data APIs and ETL tools. Next, we have a statistician. So as the name suggests, you have to have a sound understanding of statistical theories and data organization. Not only do statisticians extract and offer valuable insights, they also create new methodologies for engineers to apply. Now, if you want to become a statistician, then you have to have a passion for logic. You should also be good with a variety of database systems such as SQL, with data mining, and with various other machine learning technologies. By that I mean you should be good with math, and you should also have a good knowledge of the various database systems such as SQL. Knowing the various machine learning concepts and algorithms is also a must. Next, we have the database administrator. So guys, the job profile of a database administrator is pretty much self-explanatory. They are basically responsible for the proper functioning of all the databases, and they are also responsible for granting or revoking access to its services for the employees of the company. They also have to take care of the database backups and recoveries. So some of the skills that are needed to become a database administrator include database backup and recovery, data security, and data modeling and design. Next, we have the business analyst. Now, the role of a business analyst is a little different from all of the other data science jobs. Now, don't get me wrong. They have a very good understanding of the data-oriented technologies. They know how to handle a lot of data and process it, but they are also very focused on how this data can be linked to actionable business insights. So they mainly focus on business growth. Okay. Now, a business analyst acts like a link between the data engineers and the management executives.
So in order to become a business analyst, you have to have an understanding of business finances, business intelligence, and also IT technologies like data modeling, data visualization tools, et cetera. At last, we have a data and analytics manager. Now, the main responsibility of a data and analytics manager is to oversee the data science operations. Okay, they're responsible for assigning the duties to the team according to their skills and expertise. Now, their strengths should include technologies like SAS, R, and SQL, and of course they should have good management skills. Apart from that, they must have excellent social skills, leadership qualities, and an out-of-the-box thinking attitude. And like I said earlier, you need to have a good understanding of technologies like Python, SAS, R, Java, et cetera. So guys, these were the different job roles in data science. I hope you all found this informative.
Now, let's move ahead and look at the data life cycle. So guys, there are basically six steps in the data life cycle. It starts with a business requirement. Next is data acquisition. After that you would process the data, which is called data processing. Then there is data exploration, modeling, and finally deployment. So guys, before you even start on a data science project, it is important that you understand the problem you're trying to solve. So in this stage, you're just going to focus on identifying the central objectives of the project, and you will do this by identifying the variables that need to be predicted. Next up, we have data acquisition. Okay. So now that you have your objectives identified, it's time for you to start gathering the data. So data mining is the process of gathering your data from different sources. At this stage, some of the questions you can ask yourself are: What data do I need for my project? Where does it live? How can I obtain it?
And what is the most efficient way to store and access all of it? Next up, there is data processing. Now, usually all the data that you collected is a huge mess. Okay. It's not formatted. It's not structured. It's not cleaned. So if you find any data set that is cleaned and packaged well for you, then you've actually won the lottery, because finding the right data takes a lot of time and a lot of effort, and one of the major time-consuming tasks in the data science process is data cleaning. Okay, this requires a lot of time and a lot of effort because you have to go through the entire data set to find out any missing values, or if there are any inconsistent values or corrupted data, and you also find the unnecessary data over here and you remove that data. So this was all about data processing. Next, we have data exploration. So now that you have a sparkling clean set of data, you are finally ready to get started with your analysis. Okay, the data exploration stage is basically the brainstorming of data analysis. So in order to understand the patterns in your data, you can use a histogram. You can just pull up a random subset of data and plot a histogram. You can even create interactive visualizations. This is the point where you dive deep into the data and you try to explore the different models that can be applied to your data. Next up, we have data modeling. So after processing the data, what you're going to do is carry out model training. Okay. Now, model training is basically about finding a model that answers the questions more accurately. So the process of model training involves a lot of steps. So firstly, you'll start by splitting the input data into the training data set and the testing data set.
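The train/test split mentioned here can be sketched as follows. This is a minimal standard-library version; the function name `split_train_test`, the 80/20 ratio, and the fixed seed are assumptions for illustration. In practice a library helper would normally be used.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set: 100 records identified by their index.
data = list(range(100))
train, test = split_train_test(data)
print(len(train), len(test))
```

The model is fitted on `train` only. `test` is held back so that the evaluation measures how the model behaves on data it has not seen.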
Okay, you're going to take the entire data set and you're going to separate it into two parts. One is the training data and one is the testing data. After that, you'll build a model by using the training data set, and once you're done with that, you'll evaluate the training and the test data set. Now, to evaluate the training and testing data, you'll be using a series of machine learning algorithms. After that, you'll find out the model which is the most suitable for your business requirement. So this was mainly data modeling. Okay. This is where you build a model out of your training data set and then you evaluate this model by using the testing data set. Next, we have deployment. So guys, the goal of this stage is to deploy the model into a production or maybe a production-like environment. So this is basically done for final user acceptance, and the users have to validate the performance of the models, and if there are any issues with the model or any issues with the algorithm, then they have to be fixed in this stage. So guys, with this we come to the end of the data life cycle. I hope this was clear.
Statistics and probability are essential because these disciplines form the basic foundation of all machine learning algorithms, deep learning, artificial intelligence, and data science. In fact, mathematics and probability is behind everything around us. From shapes, patterns, and colors to the count of petals in a flower, mathematics is embedded in each and every aspect of our lives. With this in mind, I welcome you all to today's session. So I'm going to go ahead and discuss the agenda for today with you all. Now, we're going to begin the session by understanding what data is. After that, we'll move on and look at the different categories of data, like quantitative and qualitative data. Then we'll discuss what exactly statistics is, the basic terminologies in statistics, and a couple of sampling techniques. Once we're done with that,
We'll then discuss the different types of statistics, which are descriptive and inferential statistics. In the next section we'll mainly focus on descriptive statistics. Here we'll understand the different measures of center, measures of spread, information gain, and entropy. We'll also understand all of these measures with the help of a use case, and finally we'll discuss what exactly a confusion matrix is. Once we've covered the entire descriptive statistics module, we'll discuss the probability module. Here we'll understand what exactly probability is and the different terminologies in probability. We'll also study the different probability distributions. Then we'll discuss the types of probability, which include marginal, joint, and conditional probability. Then we'll move on to a use case with examples that show how the different types of probability work, and to better understand Bayes' theorem we'll look at a small example. Also, I forgot to mention that at the end of the descriptive statistics module I'll be running a small demo in the R language. For those of you who don't know much about R, I'll be explaining every line in depth, but if you want a more in-depth understanding of R, I'll leave a couple of blogs and a couple of videos in the description box. You can definitely check out that content. After we've completed the probability module, we'll discuss the inferential statistics module. We'll start this module by understanding what point estimation is. We'll discuss what a confidence interval is and how you can estimate it. We'll also discuss margin of error, and we'll understand all of these concepts by looking at a small use case. We'll finally end the inferential statistics module by looking at what hypothesis testing is. Hypothesis testing is a very important part of inferential statistics.
So we'll end the session by looking at a use case that discusses how hypothesis testing works, and to sum everything up we'll look at a demo that explains how inferential statistics works. All right, so guys, there's a lot to cover today. So let's move ahead and take a look at our first topic, which is: what is data? Now, this is quite a simple question. If I ask any of you what data is, you'll say that it's a set of numbers or some sort of documents stored on your computer. But data is actually everything. Look around you, there is data everywhere. Each click on your phone generates more data than you know. This generated data provides insights for analysis and helps us make better business decisions. This is why data is so important. To give you a formal definition, data refers to facts and statistics collected together for reference or analysis. This is the definition of data in terms of statistics and probability. As we know, data can be collected, measured, and analyzed, and it can be visualized using statistical models and graphs. Now, data is divided into two major subcategories: qualitative data and quantitative data. Under qualitative data we have nominal and ordinal data, and under quantitative data we have discrete and continuous data. Let's focus on qualitative data first. This type of data deals with characteristics and descriptors that can't be easily measured but can be observed subjectively. Qualitative data is further divided into nominal and ordinal data. Nominal data is any sort of data that doesn't have any order or ranking. An example of nominal data is gender. There is no ranking in gender. There's only male, female, or other, right? There is no one, two, three, four or any sort of ordering in gender. Race is another example of nominal data. Ordinal data, on the other hand, is basically an ordered series of information.
Okay, let's say that you went to a restaurant. Your information is stored in the form of a customer ID, so basically you are represented by a customer ID. Now, you would have rated their service as either good or average. That's what ordinal data looks like, and similarly they'll have a record of the other customers who visit the restaurant along with their ratings. So any data which has some sort of sequence or order to it is known as ordinal data. All right, so guys, this is pretty simple to understand. Now let's move on and look at quantitative data. Quantitative data basically deals with numbers. You can understand that from the word itself: quantitative is about quantity, right? So it deals with numbers, with anything that you can measure objectively. There are two types of quantitative data: discrete and continuous. Discrete data can hold only a finite number of possible values. The number of students in a class is a finite number. You can't have an infinite number of students in a class. Let's say in your fifth grade there were a hundred students in your class. There wasn't an infinite number, there was a definite, finite number of students. That's discrete data. Next, we have continuous data. This type of data can hold an infinite number of possible values. When I say the weight of a person is an example of continuous data, what I mean is that my weight can be 50 kgs, or it can be 50.1 kgs, or 50.001 kgs, or 50.0001, or 50.023, and so on, right? There are an infinite number of possible values. This is what I mean by continuous data, and this is the difference between discrete and continuous data. I'd also like to mention a few other things over here. There are a couple of types of variables as well.
We have discrete variables and continuous variables. A discrete variable holds one of a limited set of values. A common case is the categorical variable, which holds values that represent different categories. Let's say that you have a variable called message, and there are two types of values that this variable can hold: your message can either be a spam message or a non-spam message. That's when you call a variable a discrete or categorical variable, because it holds values that represent different categories of data. Continuous variables are basically variables that can store an infinite number of values. The weight of a person can be denoted as a continuous variable. Let's say there is a variable called weight and it can store an infinite number of possible values. That's why we call it a continuous variable. So guys, a variable is basically anything that can store a value, right? If you associate any sort of data with a variable, then it will become either a discrete variable or a continuous variable. There are also dependent and independent variables. We won't discuss those in depth because that's pretty understandable. I'm sure all of you know what an independent variable and a dependent variable are, right? A dependent variable is any variable whose value depends on some other, independent variable. So guys, that much knowledge I expect all of you to have. All right, so now let's move on and look at our next topic, which is: what is statistics? Coming to the formal definition, statistics is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. Usually when I speak about statistics, people think it is all about analysis, but statistics has other parts to it. Data collection is also a part of statistics, and so are data interpretation and presentation.
All of this comes under statistics. You're going to use statistical methods to visualize data, to collect data, and to interpret data. So this area of mathematics deals with understanding how data can be used to solve complex problems. Now I'll give you a couple of examples of problems that can be solved using statistics. Let's say that your company has created a new drug that may cure cancer. How would you conduct a test to confirm the drug's effectiveness? Even though this sounds like a biology problem, it can be solved with statistics. You'll have to create a test which can confirm the effectiveness of the drug. This is a common problem that can be solved using statistics. Let me give you another example. You and a friend are at a baseball game, and out of the blue he offers you a bet that neither team will hit a home run in that game. Should you take the bet? Here you just work out the probability of whether you'll win or lose. This is another problem that comes under statistics. Let's look at another example. The latest sales data has just come in, and your boss wants you to prepare a report for management on places where the company could improve its business. What should you look for, and what should you not look for? This problem involves a lot of data analysis. You'll have to look at the different variables that are causing your business to go down, or at the variables that are improving performance and thus growing your business. So this involves a lot of data analysis, and the basic idea behind data analysis is to use statistical techniques to figure out the relationship between different variables or different components in your business. Okay, so now let's move on and look at our next topic, which is basic terminologies in statistics. Before you dive deep into statistics, it is important that you understand the basic terminologies used in it.
The two most important terminologies in statistics are population and sample. Throughout the statistics course, or throughout any problem that you're trying to solve with statistics, you will come across these two words. A population is a collection or set of individuals, objects, or events whose properties are to be analyzed. So basically you can refer to the population as the subject that you're trying to analyze. A sample, just like the word suggests, is a subset of the population. You have to make sure that you choose the sample in such a way that it represents the entire population. It shouldn't focus on one part of the population. Instead, it should represent the entire population. That's how your sample should be chosen. A well-chosen sample will contain most of the information about a particular population parameter. Now, you must be wondering how one can choose a sample that best represents the entire population. Sampling is a statistical method that deals with the selection of individual observations within a population. Sampling is performed in order to infer statistical knowledge about a population. If you want to understand the different statistics of a population, like the mean, the median, the mode, the standard deviation, or the variance, then you're going to perform sampling, because it's not reasonable for you to study a large population and find out the mean, median, and everything else directly. So why is sampling performed, you might ask. What is the point of sampling? We could just study the entire population. Now guys, think of a scenario where you're asked to perform a survey about the eating habits of teenagers in the US. At present there are over 42 million teens in the US, and this number is growing as we speak, correct? Is it possible to survey each of these 42 million individuals about their eating habits? Is it possible?
Well, it might be possible, but it would take forever. Obviously it's not reasonable to go around knocking on each door and asking what your teenage son eats and all of that, right? That's why sampling is used. It's a method wherein a sample of the population is studied in order to draw inferences about the entire population. So it's basically a shortcut to studying the entire population. Instead of taking the entire population and working everything out, you're just going to take a part of the population that represents the whole, and you're going to perform all your statistical analysis, your inferential statistics, on that small sample. And that sample basically represents the entire population. All right, so I'm sure I've made it clear to y'all what a sample is and what a population is. Now, there are two main types of sampling techniques: probability sampling and non-probability sampling. In this video we'll only be focusing on probability sampling techniques, because non-probability sampling is not within the scope of this video. We'll only discuss the probability part because we're focusing on statistics and probability, correct? Under probability sampling we have three different types: random sampling, systematic sampling, and stratified sampling. And just to mention the different types of non-probability sampling, we have snowball, quota, judgment, and convenience sampling. Now guys, in this session I'll only be focusing on probability sampling, so let's move on and look at its different types. So what is probability sampling? It is a sampling technique in which samples from a large population are chosen using the theory of probability. So there are three types of probability sampling.
All right, first we have random sampling. In this method each member of the population has an equal chance of being selected in the sample. So each and every individual or object in the population has an equal chance of being a part of the sample. That's what random sampling is all about. You are randomly going to select any individual or any object, and this way each individual has an equal chance of being selected. Correct? Next, we have systematic sampling. In systematic sampling every nth record is chosen from the population to be a part of the sample. Refer to the image that I've shown over here: out of these six groups, every second group is chosen as a sample. So every second record is chosen here, and this is how systematic sampling works. You're selecting every nth record and adding it to your sample. Next, we have stratified sampling. In this technique a stratum is used to form samples from a large population. So what is a stratum? A stratum is basically a subset of the population that shares at least one common characteristic. Let's say that your population has a mix of both male and female, so you can create two strata out of it: one will have only the male subset and the other will have the female subset. This is what a stratum is. It is basically a subset of the population that shares at least one common characteristic, and in our example that is gender. After you've created the strata, you're going to use random sampling on them and choose a final sample. Random sampling means that all of the individuals in each stratum have an equal chance of being selected in the sample. Correct? So guys, these were the three different types of sampling techniques. Now let's move on and look at our next topic, which is the different types of statistics.
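The three probability sampling techniques described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy `people` list, and the sample sizes are all made up here, and the only library used is the standard `random` module.

```python
import random

def simple_random_sample(population, k, rng):
    """Random sampling: every member has an equal chance of being picked."""
    return rng.sample(population, k)

def systematic_sample(population, n):
    """Systematic sampling: take the n-th, 2n-th, 3n-th record, and so on."""
    return population[n - 1::n]

def stratified_sample(population, stratum_of, k_per_stratum, rng):
    """Stratified sampling: group members into strata, then sample each one."""
    strata = {}
    for member in population:
        strata.setdefault(stratum_of(member), []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, k_per_stratum))
    return sample

rng = random.Random(7)  # seeded only so that reruns give the same picks
groups = [1, 2, 3, 4, 5, 6]
people = [("person%d" % i, "male" if i % 2 else "female") for i in range(10)]

random_pick = simple_random_sample(people, 4, rng)
every_second = systematic_sample(groups, 2)
stratified_pick = stratified_sample(people, lambda p: p[1], 2, rng)
print(every_second)  # [2, 4, 6]
```

In the stratified case the gender of each person is the shared characteristic, so the final sample is guaranteed to contain members from both strata, which a single random draw does not promise.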
So after this, we'll be looking at the more advanced concepts of statistics, right? So far we've discussed the basics: what statistics is, the different sampling techniques, and the terminologies in statistics. Now we'll look at the different types of statistics. There are two major types: descriptive statistics and inferential statistics. In today's session we will be discussing both of these in depth. We'll also be looking at a demo, which I'll be running in the R language, to help you understand what exactly descriptive and inferential statistics are. So guys, we're just going to look at the basics, so don't worry if you don't have much knowledge. I'm explaining everything from the basic level. All right, so guys, descriptive statistics is a method used to describe and understand the features of a specific data set by giving a short summary of the data. It is mainly focused on the characteristics of the data, and it also provides a graphical summary of the data. To help you understand what descriptive statistics is, let's suppose that you want to gift all your classmates a t-shirt, so you need to study the average shirt size of a student in the classroom. If you were to use descriptive statistics for this, you would record the shirt size of all students in the class and then find the maximum, minimum, and average shirt size of the class. Okay, coming to inferential statistics. Inferential statistics makes inferences and predictions about a population based on a sample of data taken from that population. In simple words, it generalizes from a sample to a large data set and applies probability to draw a conclusion. It allows you to infer population parameters based on a statistical model using sample data.
So if we consider the same example of finding the average shirt size of students in a class, in inferential statistics we'll take a sample set of the class, which is basically a few people from the entire class. You would already have grouped the class into large, medium, and small. In this method you basically build a statistical model and extend it to the entire population of the class. So guys, that was a brief introduction to descriptive and inferential statistics, and that's the difference between the two. In the next section we will go in depth on descriptive statistics. So let's discuss more about descriptive statistics. Like I mentioned earlier, descriptive statistics is a method used to describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data. There are two important kinds of measures in descriptive statistics. We have measures of central tendency, also known as measures of center, and we have measures of variability, also known as measures of spread. So what are measures of center? They are statistical measures that represent the summary of a data set. The three main measures of center are mean, median, and mode. Coming to measures of variability or spread, we have range, interquartile range, variance, and standard deviation. So now let's discuss each of these measures in a little more depth, starting with the measures of center. I'm sure all of you know what the mean is. The mean is basically the average of all the values in a sample. How do you measure the mean? I hope all of you know how the mean is measured. Say there are 10 numbers and you want to find their mean.
All you have to do is add up all 10 numbers and divide the total by 10. Here 10 represents the number of samples in your data set. Since we have 10 numbers, we divide by 10, and this gives us the average, or the mean. To better understand the measures of central tendency, let's look at an example. The data set over here is basically the cars data set, and it contains a few variables. It has the car name, miles per gallon, cylinder type, displacement, horsepower, and rear axle ratio. All of these measures are related to cars. So what you're going to do is use descriptive analysis to analyze each of the variables in the sample data set for the mean, standard deviation, median, mode, and so on. Let's say that you want to find the mean, or the average horsepower, of the cars among the population of cars. Like I mentioned earlier, you'll take the average of all the values. In this case we take the sum of the horsepower of each car and divide it by the total number of cars. That's exactly what I've done here in the calculation part. This 110 represents the horsepower of the first car. Similarly, I've added up the horsepower values of all the cars and divided by 8, where 8 is the number of cars in our data set. So 103.625 is our mean, or the average horsepower. Now let's understand what the median is with an example. To define it, the median is the central value of the sample set. You can see that it is the middle value.
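As a rough sketch, the three measures of center (mean, median, and mode) can be computed with Python's standard `statistics` module. The cars table itself is not part of this transcript, so every list below is hypothetical: the horsepower values were chosen only so that their mean equals the 103.625 quoted above, and the other two columns are invented to show the median of an even number of values and the mode of a repeated value.

```python
from statistics import mean, median, mode

# Hypothetical stand-ins for three columns of an 8-car table.
horsepower = [110, 110, 93, 96, 90, 110, 110, 110]
mpg = [21.0, 21.0, 22.8, 21.4, 23.0, 24.4, 24.4, 26.0]
cylinders = [6, 6, 4, 6, 6, 4, 6, 4]

avg_hp = mean(horsepower)        # sum of the values divided by their count
middle_mpg = median(mpg)         # even count: average of the two middle values
common_cyl = mode(cylinders)     # the value that occurs most often

print(avg_hp)                    # 103.625
print(round(middle_mpg, 1))      # 22.9
print(common_cyl)                # 6
```

Note that `median` sorts the values internally, so the list does not need to be arranged in ascending order first.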
So if we want to find the center value of the miles per gallon among the population of cars, first we arrange the MPG values in ascending or descending order and then choose the middle value. In this case we have eight values, which is an even number of entries. Whenever you have an even number of data points or samples in your data set, you take the average of the two middle values. If we had nine values over here, we could easily pick out the middle value and choose that as the median. But since there is an even number of values, we take the average of the two middle values. So 22.8 and 23 are my two middle values, and taking the mean of those two I get 22.9, which is my median. Lastly, let's look at how the mode is calculated. So what is the mode? The value that is most recurrent in the sample set, or basically the value that occurs most often, is known as the mode. Let's say that we want to find the most common type of cylinder among the population of cars. What we have to do is check which value is repeated the most number of times. Here we can see that the cylinders come in two types: type 4 and type 6. Take a look at the data set and count. We have three type 4 cylinders and five type 6 cylinders. So our mode is going to be 6, since 6 is more recurrent than 4. So guys, those were the measures of center, or the measures of central tendency. Now let's move on and look at the measures of spread. All right, what is a measure of spread?
A measure of spread, sometimes also called a measure of dispersion, is used to describe the variability in a sample or population. You can think of it as some sort of deviation in the sample. You measure this with the help of the different measures of spread: range, interquartile range, variance, and standard deviation. Now, range is pretty self-explanatory, right? It is a measure of how spread apart the values in a data set are. The range can be calculated as shown in this formula: you subtract the minimum value in your data set from the maximum value. That's how you calculate the range of the data. Next we have interquartile range. Before we discuss interquartile range, let's understand what a quartile is, right? Quartiles basically tell us about the spread of a data set by breaking the data set into quarters. Just like the median breaks the data into two parts, the quartiles break it into four quarters. To better understand how the quartiles and the interquartile range are calculated, let's look at a small example. This data set represents the marks of a hundred students ordered from the lowest to the highest score. The quartiles lie in the following ranges. The first quartile, also known as Q1, lies between the 25th and 26th observations. If you look at this, I've highlighted the 25th and the 26th observations. You calculate Q1 by taking the average of these two values. Since both values are 45, when you add them up and divide by two you still get 45. The second quartile, or Q2, is between the 50th and the 51st observations. So you take the average of 58 and 59 and you get a value of 58.5. That is my second quartile. The third quartile, Q3, is between the 75th and the 76th observations.
Here again, we'll take the average of the two values, the 75th and the 76th, and you'll get a value of 71. All right, so guys, this is exactly how you calculate the different quartiles. Now let's look at what the interquartile range is. The IQR, or interquartile range, is a measure of variability based on dividing a data set into quartiles. The interquartile range is calculated by subtracting Q1 from Q3, so your IQR is Q3 minus Q1. Each quartile represents a quarter, which is 25%. So guys, I hope all of you are clear on the interquartile range and what quartiles are. Now let's look at variance. Variance is basically a measure that shows how much a random variable differs from its expected value. It's basically the variation in any variable. Variance can be calculated using this formula right here: x represents any data point in your data set, n is the total number of data points in your data set, and x bar is the mean of the data points. This is how you calculate variance. Variance is basically computed from the squares of the deviations. That's why it says s squared there. Now let's look at what deviation is. Deviation is just the difference between each element and the mean. It can be calculated using this simple formula, where x i represents a data point and mu is the mean of the population. This is exactly how you calculate the deviation. Now, population variance and sample variance depend on whether you're calculating the variance of your population data set or of your sample data set. That's the difference between population and sample variance. The formula for population variance is pretty self-explanatory: x is each data point, mu is the mean of the population, and N is the number of data points in your population. All right, now let's look at sample variance.
Sample variance is based on the average of squared differences from the mean of the sample. Here x i is any data point or any sample in your data set, and x bar is the mean of your sample. It's not the mean of your population, it's the mean of your sample. And if you notice, n here is a smaller n: it is the number of data points in your sample, and in the usual sample formula the sum of squared differences is divided by n minus 1 rather than by n. This is basically the difference between sample and population variance. I hope that is clear. Coming to standard deviation, it is the measure of dispersion of a set of data from its mean. So it's basically the typical deviation from your mean, and it is the square root of the variance. To better understand how the measures of spread are calculated, let's look at a small use case. Let's say Daenerys has 20 dragons. They have the numbers 9, 2, 5, 4, and so on, as shown on the screen. What you have to do is work out the standard deviation. In order to calculate the standard deviation, you need to know the mean, right? So first you find the mean of your sample set. How do you calculate the mean? You add all the numbers in your data set and divide by the total number of samples, so you get a value of 7 here. Then you calculate the right-hand side of your standard deviation formula. From each data point you subtract the mean, and you square the result. When you do that, you get the following: 4, 25, 4, 9, 25, and so on. Finally, you find the mean of the squared differences and take its square root, so your standard deviation comes to 2.983. So guys, it's pretty simple. It's a simple mathematical technique. All you have to do is substitute the values into the formula. I hope this was clear to all of you. Now let's move on and discuss the next topic, which is information gain and entropy.
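The measures of spread covered in this part of the session (range, quartiles, interquartile range, variance, and standard deviation) can be sketched as follows. The transcript only gives the first four dragon numbers, so the full list of 20 values below is an assumption: it is illustrative data that starts 9, 2, 5, 4 and was chosen because it reproduces the mean of 7 and the standard deviation of 2.983 quoted above. The helper `quartiles_by_midpoints` is a hypothetical name for the "average the two neighboring observations" rule used in the marks example, and it assumes the number of values is divisible by four.

```python
import math

# Illustrative data only: not taken from the transcript beyond the first four values.
values = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

n = len(values)
mean = sum(values) / n

# Squared deviation of every data point from the mean.
squared_diffs = [(x - mean) ** 2 for x in values]

population_variance = sum(squared_diffs) / n      # divide by N
sample_variance = sum(squared_diffs) / (n - 1)    # divide by n - 1
population_std = math.sqrt(population_variance)

value_range = max(values) - min(values)           # maximum minus minimum

def quartiles_by_midpoints(data):
    """Q1, Q2, Q3 as the average of the two observations around each cut."""
    ordered = sorted(data)
    quarter = len(ordered) // 4   # assumes len(data) is divisible by 4
    cuts = (quarter, 2 * quarter, 3 * quarter)
    return tuple((ordered[c - 1] + ordered[c]) / 2 for c in cuts)

q1, q2, q3 = quartiles_by_midpoints(values)
iqr = q3 - q1

print(mean, round(population_std, 3))  # 7.0 2.983
print(value_range, q1, q2, q3, iqr)    # 10 4.0 7.0 9.0 5.0
```

The 2.983 figure corresponds to the population formula, which divides by the full count of 20. Dividing by 19 instead, as the sample formula does, gives a slightly larger value.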
This is one of my favorite topics in statistics. It's very interesting, and it is mainly involved in machine learning algorithms like decision trees and random forests. It's very important for you to know how information gain and entropy really work and why they are so essential in building machine learning models. We'll focus on the statistics part of information gain and entropy, and after that we'll discuss a use case and see how information gain and entropy are used in decision trees. For those of you who don't know what a decision tree is, it is basically a machine learning algorithm. You don't have to know anything about it. I'll explain everything in depth, so don't worry. Now let's look at what exactly entropy and information gain are. Guys, entropy is basically the measure of any sort of uncertainty that is present in the data. It can be measured using this formula. Here S is the set of all instances, or all the data items, in the data set, n is the number of different classes in your data set, and p i is the event probability. This might seem a little confusing to y'all, but when we go through the use case you'll understand all of these terms even better. Coming to information gain: as the words suggest, information gain indicates how much information a particular feature or variable gives us about the final outcome. It can be measured using this formula. Here H(S) is the entropy of the whole data set S, |S_j| is the number of instances with the jth value of an attribute A, |S| is the total number of instances in the data set, V is the set of distinct values of the attribute A, H(S_j) is the entropy of a subset of instances, and H(A, S) is the entropy of the attribute A. Even though this seems confusing, I'll clear up the confusion.
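The slide formulas themselves are not in the transcript, so the following is a reconstruction from the symbol definitions given above, written in the standard textbook form. The base-2 logarithm is an assumption (it gives entropy in bits and is the usual choice for decision trees).

```latex
% Entropy of a data set S with n classes, where p_i is the probability of class i
H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i

% Entropy of attribute A: the size-weighted entropy of the subsets S_j,
% one subset for each distinct value j in V
H(A, S) = \sum_{j \in V} \frac{|S_j|}{|S|} \, H(S_j)

% Information gain of attribute A on data set S
\mathrm{Gain}(A, S) = H(S) - H(A, S)
```

Read together, the gain is simply how much the uncertainty drops once the data set is split on the attribute A.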
All right, let's discuss a small problem statement where we will understand how information gain and entropy are used to study the significance of a variable in a model. Like I said, information gain and entropy are very important statistical measures that let us understand the significance of the variables in a predictive model. To get a clearer understanding, let's look at a use case. Suppose we are given a problem statement: you have to predict whether a match can be played or not by studying the weather conditions. The predictor variables here are outlook, humidity, and wind. Day is also a predictor variable. The target variable is play. The target variable is the variable that you're trying to predict. The value of the target variable will decide whether or not a game can be played. That's why play has two values: no and yes. No means that the weather conditions are not good and therefore you cannot play the game. Yes means that the weather conditions are good and suitable for you to play the game. All right, so that was our problem statement. I hope it is clear to all of you. To solve such a problem, we make use of something known as a decision tree. So guys, think of an inverted tree where each branch denotes some decision. Each branching point is known as a branch node, and at each branch node you take a decision in such a manner that you get an outcome at the end of the branch. Now, this figure here shows that out of 14 observations, 9 result in a yes, meaning that out of 14 days the match can be played on only nine days. If you look here, on day 1, day 2, day 8, day 9, and day 11 the outlook has been sunny. So basically we try to cluster the data set depending on the outlook. When the outlook is sunny, this is our data set. When the outlook is overcast,
This is what we have, and when the outlook is rain, this is what we have. So when it is sunny, we have two yeses and three nos. When the outlook is overcast, we have all four as yes, meaning that on the four days when the outlook was overcast, we can play the game. When it comes to rain, we have three yeses and two nos. If you notice here, the decision is being made by choosing the outlook variable as the root node. The root node is basically the topmost node in a decision tree. What we've done here is create a decision tree that starts with the outlook node, and then you split the decision tree further depending on its values: sunny, overcast, and rain. Let me explain this in a more in-depth manner. The outlook node has three branches coming out from it, which are sunny, overcast, and rain, because outlook can have three values: it can be sunny, it can be overcast, or it can be rainy. These three values are assigned to the immediate branch nodes, and for each of these values the probability of play being equal to yes is calculated. The sunny and the rain branches give you an impure output, meaning that there is a mix of yes and no: there are two yeses and three nos here, and three yeses and two nos over here. But the overcast value results in a hundred percent pure subset. This shows that the overcast value will result in a definite and certain output. This is exactly what entropy is used to measure: it calculates the impurity, or the uncertainty.
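To make the pure-versus-impure distinction concrete, this small Python sketch (an addition, not something shown in the session) computes the entropy of each outlook branch from the yes/no counts quoted above. The helper name `entropy` and the dictionary layout are assumptions made for illustration.

```python
from math import log2

def entropy(yes, no):
    """Entropy in bits of a subset with the given yes/no counts."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:  # a class with zero instances contributes nothing
            p = count / total
            result -= p * log2(p)
    return result

# (yes, no) counts per outlook value, as read out in the walkthrough.
outlook_branches = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}

branch_entropy = {name: entropy(*counts) for name, counts in outlook_branches.items()}
for name, value in branch_entropy.items():
    print(f"{name}: {value:.3f}")
```

The overcast branch comes out at exactly zero because it contains only yes days, while the sunny and rain branches are close to the two-class maximum of one bit.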
So the lesser the uncertainty, or the entropy, of a variable, the more significant that variable is. When it comes to overcast, there's literally no impurity in the data set; it is a hundred percent pure subset. We want variables like these in order to build a model. Now, we don't always get lucky, and we don't always find variables that result in pure subsets. That's why we have the measure entropy: the lesser the entropy of a particular variable, the more significant that variable will be. In a decision tree, the root node is assigned the best attribute so that the decision tree can predict the most precise outcome, meaning that at the root node you should have the most significant variable. That's why we've chosen outlook. Now some of you might ask me, why haven't you chosen overcast? Overcast is not a variable; it is a value of the outlook variable. That's why we've chosen outlook here, because it has a hundred percent pure subset, which is overcast. Now the question in your head is, how do I decide which variable or attribute best splits the data? Right now I looked at the data and told you that here we have a hundred percent pure subset, but what if it's a more complex problem and you're not able to tell which variable will best split the data? When it comes to decision trees, information gain and entropy help you understand which variable will best split the data set, that is, which variable you have to assign to the root node, because whichever variable is assigned to the root node should best split the data set and has to be the most significant variable. So to do this, we need to use information gain and entropy. From the total of the 14 instances that we saw, nine of them said yes and five of them said no, meaning that you cannot play on that particular day.
So how do you calculate the entropy? This is the formula; you just substitute the values in the formula, with 9 yeses and 5 nos out of 14 instances. When you substitute the values, you get a value of 0.940. This is the entropy, or the uncertainty, of the data present in the sample. Now, in order to ensure that we choose the best variable for the root node, let us look at all the possible combinations that you can use at the root node. You can have outlook, windy, humidity, or temperature. These are the four variables, and you can have any one of them as your root node. But how do you select which variable best fits the root node? That's what we are going to see by using information gain and entropy. So the task at hand is to find the information gain for each of these attributes: for outlook, for windy, for humidity, and for temperature. A point to remember is that the variable that results in the highest information gain must be chosen, because it gives us the most precise output information. We'll calculate the information gain for the attribute windy first. Here we have six instances of true and eight instances of false. When you substitute all the values in the formula, you get a value of 0.048. This is a very low value for information gain, so the information that you're going to get from the windy attribute is pretty low. Next, let's calculate the information gain of the attribute outlook. From the total of 14 instances, we have five instances that say sunny, four instances that are overcast, and five instances that are rainy. For sunny, we have two yeses and three nos; for overcast, we have all four as yes; and for rainy, we have three yeses and two nos.
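The substitutions described here can be checked with a short Python sketch (added for illustration; the session only shows the formulas on slides). Two things are assumptions: the yes/no split inside the windy branches is not read out in the session and is taken from the classic play-tennis data set (true: 3 yes, 3 no; false: 6 yes, 2 no), and sunny is taken as two yes and three no so that the totals add up to nine yes and five no.

```python
from math import log2

def entropy(counts):
    """Entropy in bits for a list of class counts, e.g. [yes, no]."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c)

def information_gain(parent, branches):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent)
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent) - remainder

dataset = [9, 5]  # 9 yes, 5 no out of 14 days

# [yes, no] counts per attribute value.
windy = [[3, 3], [6, 2]]             # true, false (assumed split, see note above)
outlook = [[2, 3], [4, 0], [3, 2]]   # sunny, overcast, rainy

print(f"H(S)          = {entropy(dataset):.3f}")
print(f"Gain(Windy)   = {information_gain(dataset, windy):.3f}")
print(f"Gain(Outlook) = {information_gain(dataset, outlook):.3f}")
```

Under those assumptions the three printed values come out at roughly 0.940, 0.048, and 0.247, which is why outlook looks like a much stronger candidate for the root node than windy.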
So when you calculate the information gain of the outlook variable, you get a value of 0.247. Compare this to the information gain of the windy attribute: this value is actually pretty good. Now, let's look at the information gain of the attribute humidity. Over here we have seven instances that say high and seven instances that say normal. Under the high branch node, we have three instances that say yes and the remaining four instances say no. Similarly, under the normal branch, we have six instances that say yes and one instance that says no. When you calculate the information gain for the humidity variable, you get a value of 0.151. This is also a pretty decent value, but when you compare it to the information gain of the attribute outlook, it is less. Now, let's look at the information gain of the attribute temperature. The temperature attribute can hold hot, mild, and cool. Under hot, we have two instances of yes and two instances of no; under mild, we have four instances of yes and two instances of no; and under cool, we have three instances of yes and one instance of no. When you calculate the information gain for this attribute, you get a value of 0.029, which is again very low. So what you can summarize from here is that if we look at the information gain for each of these variables, we see that outlook has the maximum gain. We have 0.247, which is the highest information gain value, and you must always choose the variable with the highest information gain to split the data at the root node. That's why we assign the outlook variable to the root node. All right, so guys, I hope this use case was clear. If any of you have doubts,
Please keep commenting those doubts. Now, let's move on and look at what exactly a confusion matrix is. The confusion matrix is the last topic for descriptive statistics. After this, I'll be running a short demo where I'll show you how you can calculate mean, median, mode, standard deviation, variance, and all of those values by using R. So let's talk about the confusion matrix. Now, don't get confused; this is not a complex topic. A confusion matrix is a matrix that is often used to describe the performance of a model. It is specifically used for classification models, or classifiers, and what it does is calculate the accuracy, or the performance, of your classifier by comparing your actual results and your predicted results. This is what it looks like: true positive, true negative, and all of that. This is a little confusing; I'll get back to what exactly true positive, true negative, and the rest stand for. For now, let's look at an example and try to understand what exactly a confusion matrix is. I made sure that I put examples after each and every topic, because it's important that you understand the practical part of statistics. Statistics is not only about theory; you need to understand how calculations are done in statistics. So let's look at a small use case. Let's consider that you're given data about 165 patients, out of which 105 patients have a disease and the remaining 60 patients don't have the disease. What you're going to do is build a classifier that predicts by using these 165 observations. You feed all of these 165 observations to your classifier, and it predicts the output every time a new patient's details are fed to it. Now, out of these 165 cases, let's say that the classifier predicted
Yes 110 times and no 55 times. Here, yes stands for yes, the person has the disease, and no stands for no, the person does not have the disease. That's pretty self-explanatory. So it predicted 110 times that the patient has the disease and 55 times that no, the patient doesn't have the disease. However, in reality only 105 patients in the sample have the disease and 60 patients do not have the disease. So how do you calculate the accuracy of your model? You basically build the confusion matrix. This is how the matrix looks. Here n denotes the total number of observations that you have, which is 165 in our case; actual denotes the actual values in the data set, and predicted denotes the values predicted by the classifier. Here the actual value is no and the predicted value is no, so your classifier was correctly able to classify 50 cases as no. But 10 of these cases it incorrectly classified, meaning that the actual value is no but your classifier predicted yes; that's why this 10 is over here. Similarly, it wrongly predicted that five patients do not have the disease whereas they actually did have it, and it correctly predicted 100 patients who have the disease. I know this is a little bit confusing, but look at these values: no-no, 50, means that it correctly predicted 50 values; no-yes means that it wrongly predicted yes for values where it was supposed to predict no. Now, what exactly are true positive, true negative, and all of that? True positives are the cases in which we predicted yes and they actually do have the disease. So it is basically this value: we predicted yes here and they did have the disease.
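The four cells of the matrix described above can be tallied in a few lines of Python. This is a sketch added for illustration, not the session's own code; the variable names are assumptions, and the accuracy formula used is the standard one (correct predictions divided by all predictions), which the session mentions but does not spell out.

```python
# Cell counts from the 165-patient example.
true_negative = 50    # actual no,  predicted no
false_positive = 10   # actual no,  predicted yes
false_negative = 5    # actual yes, predicted no
true_positive = 100   # actual yes, predicted yes

n = true_negative + false_positive + false_negative + true_positive

# Row and column totals should match the figures in the walkthrough.
actual_yes = true_positive + false_negative       # 105 patients with the disease
actual_no = true_negative + false_positive        # 60 patients without it
predicted_yes = true_positive + false_positive    # classifier said "yes" 110 times
predicted_no = true_negative + false_negative     # classifier said "no" 55 times

# Accuracy is the share of all cases that were classified correctly.
accuracy = (true_positive + true_negative) / n
print(n, actual_yes, actual_no, predicted_yes, predicted_no)
print(f"accuracy = {accuracy:.3f}")
```

Checking that the row and column totals reproduce 105, 60, 110, and 55 is a quick way to confirm that each count has been placed in the right cell.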
So we have 100 true positives, the cases where we predicted yes and they do have the disease. Similarly, true negative is where we predicted no and they don't have the disease, meaning that this is correct. False positive is where we predicted yes but they do not actually have the disease; this is also known as a type 1 error. False negative is where we predicted no but they actually do have the disease. So basically, true positives and true negatives are the correct classifications. So this was the confusion matrix, and I hope this concept is clear. Again, if you have doubts, please post them in the comment section. So guys, that was descriptive statistics. Now, before we go to probability, I promised you all that we'll run a small demo in R. We'll try to understand how mean, median, and mode work in R, so let's do that first. Again, what we've discussed so far was descriptive statistics. Next we're going to discuss probability, and then we'll move on to inferential statistics, which is basically the second type of statistics. Now, to make things clearer for you, let me just zoom in. It's always best to perform practical implementations in order to understand the concepts in a better way. So here we'll be executing a small demo that shows you how to calculate the mean, median, mode, variance, and standard deviation, and how to study the variables by plotting a histogram. Don't worry if you don't know what a histogram is; it's basically a frequency plot. There's no big science behind it. This is a very simple demo, but it also forms a foundation that every machine learning algorithm is built upon. You can say that most, actually all, of the machine learning algorithms and deep learning algorithms have these basic concepts behind them, so you need to know how mean, median, mode, and all of that are calculated. I'm using the R language to perform this, and I'm running it in RStudio.
For those of you who don't know the R language, I will leave a couple of links in the description box; you can go through those videos. What we're doing is randomly generating numbers and storing them in a variable called data. If you want to see the generated numbers, just run the line data; this variable basically stores all our numbers. Now, what we're going to do is calculate the mean. All you have to do in R is specify the word mean along with the data that you're calculating the mean of, and I've assigned this whole thing to a variable called mean, which just holds the mean value of this data. Now let's look at the mean; for that I'll use a function called print and pass it mean. So our mean is around 5.99. Next is calculating the median. It's very simple, guys: all you have to do is use the function median and pass the data as a parameter to this function. R provides functions for each and everything. Statistics is very easy when it comes to R, because R is basically a statistical language. All you have to do is name the function, and that function is already built into R. So your median is around 6.4. Similarly, we will calculate the mode. Let's run this function; I basically created a small function for calculating the mode. So this is our mode, meaning that this is the most recurrent value. Now we're going to calculate the variance and the standard deviation. For that, again, we have a function in R called var; all you have to do is pass the data to that function. Similarly, we'll calculate the standard deviation, which is basically the square root of your variance. Now we'll print the standard deviation. This is our standard deviation value.
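The demo itself runs in R and its code is not captured in this transcript. As a rough stand-in, the same quantities can be sketched with Python's built-in `statistics` module. The sample values below are made up for illustration, since the session generates its numbers randomly, so the results will not match the 5.99 and 6.4 quoted above.

```python
import statistics

# Hypothetical sample; the session generates its numbers randomly in R.
data = [2, 4, 4, 5, 6, 7, 7, 7, 9, 10]

mean = statistics.mean(data)          # arithmetic average
median = statistics.median(data)      # middle value of the sorted data
mode = statistics.mode(data)          # most frequent value
variance = statistics.variance(data)  # sample variance (n - 1 denominator, like R's var)
std_dev = statistics.stdev(data)      # sample standard deviation = sqrt(variance)

print(mean, median, mode, variance, std_dev)
```

Unlike base R, which needs the small custom function mentioned in the demo, Python's `statistics` module ships a `mode` function directly.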
Finally, we will just plot a small histogram. A histogram is nothing but a frequency plot; it shows you how frequently a data point occurs. So this is the histogram that we've just created. It's quite simple in R, because R has a lot of packages and a lot of built-in functions that support statistics. It is a statistical language that is mainly used by data scientists, data analysts, and machine learning engineers, because they don't have to sit and code these functions. All they have to do is mention the name of the function and pass the corresponding parameters. So guys, that was the entire descriptive statistics module, and now we will discuss probability. Before we understand what exactly probability is, let me clear up a very common misconception. People often ask me this question: what is the relationship between statistics and probability? Probability and statistics are related fields. Probability is a mathematical method used for statistical analysis. Therefore we can say that probability and statistics are interconnected branches of mathematics that deal with analyzing the relative frequency of events. They're very interconnected fields: probability makes use of statistics, and statistics makes use of probability. So that is the relationship between statistics and probability. Now, let's understand what exactly probability is. Probability is the measure of how likely an event is to occur. To be more precise, it is the ratio of desired outcomes to total outcomes. The probabilities of all outcomes always sum up to 1, and a probability cannot go beyond 1. So your probability can be 0, or it can be 1, or it can be in the form of decimals like 0.52 or 0.55, or it can be in the form of 0.5, 0.7, 0.9.
But its value will always stay in the range between 0 and 1. Another famous example of probability is rolling a dice. When you roll a dice, you get six possible outcomes: the one, two, three, four, five, and six faces of the dice. Each possibility is a single outcome. So what is the probability that on rolling a dice you will get a 3? The probability is 1 by 6, because out of six faces there's only one face which has the number 3 on it. Similarly, if you want to find the probability of getting a 5, again the probability is going to be 1 by 6. And all of these will sum up to 1. So guys, this is exactly what probability is. It's a very simple concept; we all learnt it from eighth standard onwards. Now, let's understand the different terminologies that are related to probability. There are three terminologies that you often come across when we talk about probability. First, we have something known as the random experiment. It's basically an experiment or a process for which the outcomes cannot be predicted with certainty; that's why you use probability, in order to predict the outcome with some sort of certainty. Sample space is the entire possible set of outcomes of a random experiment, and an event is one or more outcomes of an experiment. Consider the example of rolling a dice, and let's say that you want to find out the probability of getting a 2 when you roll the dice. Rolling the dice to find this probability is the random experiment. The sample space is basically your entire set of possibilities: 1, 2, 3, 4, 5, and 6 are there, and out of those you need to find the probability of getting a 2. So all the possible outcomes basically represent your sample space; 1 to 6 are all your possible outcomes.
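The dice example maps directly onto code. Below is a minimal Python sketch (added for illustration) that treats the sample space as a list of faces and an event as a set of outcomes. It assumes a fair die, so every face is equally likely; the function name `probability` is illustrative.

```python
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]  # all faces of a fair six-sided die

def probability(event):
    """P(event) = favourable outcomes / total outcomes, for equally likely outcomes."""
    favourable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favourable), len(sample_space))

p_three = probability({3})   # event: rolling a 3
p_two = probability({2})     # event: rolling a 2
total = sum(probability({face}) for face in sample_space)

print(p_three, p_two, total)  # 1/6 1/6 1
```

Using `Fraction` keeps the results exact, so the six single-face probabilities add up to exactly 1 rather than to a rounded decimal.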
This represents your sample space. Now, an event is one or more outcomes of an experiment. In this case, my event is to get a 2 when I roll a dice. So guys, this is basically what random experiment, sample space, and event really mean. Now, let's discuss the different types of events. There are two types of events that you should know about: disjoint and non-disjoint events. Disjoint events are events that do not have any common outcome. For example, if you draw a single card from a deck of cards, it cannot be a king and a queen; it can either be a king or it can be a queen. Non-disjoint events are events that have common outcomes. For example, a student can get a hundred marks in statistics and a hundred marks in probability. Also, the outcome of a ball delivered can be a no-ball and it can be a six. So this is what non-disjoint events are. These are very simple to understand. Now, let's move on and look at the different types of probability distribution. I'll be discussing three main topics here: the probability density function, the normal distribution, and the central limit theorem. The probability density function, also known as the PDF, is concerned with the relative likelihood for a continuous random variable to take on a given value. The PDF gives the probability of a variable that lies in the range between a and b. So basically what you're trying to do is find the probability of a continuous random variable over a specified range. Now, this graph denotes the PDF of a continuous variable. This graph is also famously called the bell curve because of its shape. There are three important properties that you need to know about a probability density function.
First, the graph of a PDF will be continuous over a range. This is because you're finding the probability that a continuous variable lies between the values a and b. The second property is that the area bounded by the curve of a density function and the x-axis is equal to 1. Basically, the area below the curve is equal to 1, because it denotes probability, and again, the probability cannot range beyond one; it has to be between 0 and 1. Property number three is that the probability that our random variable assumes a value between a and b is equal to the area under the PDF bounded by a and b. What this means is that the probability is denoted by the area under the graph. So whatever value you get for this area, which is at most one, is the probability that the random variable will lie in the range between a and b. So I hope all of you have understood the probability density function: it's basically used to find the probability of a continuous random variable taking a value between a and b. Now, let's look at our next distribution, which is the normal distribution. The normal distribution, which is also known as the Gaussian distribution, is a probability distribution that denotes the symmetric property of the mean, meaning that the idea behind this function is that the data near the mean occurs more frequently than the data away from the mean. What this means is that the data around the mean represents the entire data set. So if you just take a sample of data around the mean, it can represent the entire data set. Similar to the probability density function, the normal distribution appears as a bell curve. Now, when it comes to the normal distribution, there are two important factors: the mean of the population and the standard deviation. The mean determines the location of the center of the graph, and the standard deviation determines the height of the graph.
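The area properties of a PDF, and the effect of the standard deviation on the height of the curve, can be checked numerically. The following Python sketch is an illustration only: the choice of a normal distribution with mean 0 and standard deviation 1, the interval from -1 to 1, and the helper `area_under_pdf` are all assumptions, and the integration is a simple midpoint-rule approximation.

```python
from statistics import NormalDist

# Hypothetical normal distribution; mean 0 and standard deviation 1 are arbitrary choices.
dist = NormalDist(mu=0.0, sigma=1.0)

def area_under_pdf(distribution, a, b, steps=20_000):
    """Approximate the area under the PDF between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(distribution.pdf(a + (i + 0.5) * width) for i in range(steps)) * width

# Property 2: the total area under the curve is 1 (integrated over a wide range here).
total_area = area_under_pdf(dist, -10.0, 10.0)

# Property 3: P(a < X < b) equals the area between a and b,
# which should agree with the difference of the CDF at b and a.
a, b = -1.0, 1.0
prob_by_area = area_under_pdf(dist, a, b)
prob_by_cdf = dist.cdf(b) - dist.cdf(a)

# A larger standard deviation gives a lower peak: compare the heights at the mean.
narrow_peak = NormalDist(0.0, 0.5).pdf(0.0)
wide_peak = NormalDist(0.0, 2.0).pdf(0.0)

print(round(total_area, 4), round(prob_by_area, 4), round(prob_by_cdf, 4))
print(narrow_peak > wide_peak)
```

The mean only shifts the curve left or right, so changing `mu` leaves every value in this sketch unchanged as long as the interval is shifted with it.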
So if the standard deviation is large, the curve is going to look something like this: it'll be short and wide. And if the standard deviation is small, the curve is tall and narrow. So that was it about the normal distribution. Now, let's look at the central limit theorem. The central limit theorem states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough. Now, that's a little confusing, so let me break it down for you. In simple terms, if we had a large population and we divided it into many samples, then the mean of each of the samples from the population will be almost equal to the mean of the entire population, meaning that the sample means are approximately normally distributed around the population mean. So if you compare the mean of each of the samples, it will almost be equal to the mean of the population. This graph gives a clearer understanding of the central limit theorem: you can see each sample here, and the mean of each sample is almost along the same line. So this is exactly what the central limit theorem states. Now, the accuracy, or the resemblance to the normal distribution, depends on two main factors. The first is the number of sample points that you consider, and the second is the shape of the underlying population. The shape obviously depends on the standard deviation and the mean of a sample. So guys, the central limit theorem basically states that the sample means will be normally distributed in such a way that the mean of each sample will lie close to the mean of the actual population. In short, that's what the central limit theorem states. And this holds true only for a large data set. For a small data set, there are more deviations when compared to a large data set, because of the scaling factor, right?
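A small simulation can make the central limit theorem less abstract. The Python sketch below is an illustration under stated assumptions: the population is a made-up, clearly non-normal one (exponential values), the sample sizes 5 and 100 are arbitrary, and the seed is fixed only to make the run repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, clearly non-normal population: exponential values with mean 1.
population = [random.expovariate(1.0) for _ in range(100_000)]
population_mean = statistics.fmean(population)

def sample_means(sample_size, num_samples=2_000):
    """Draw many random samples and return the mean of each one."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(num_samples)
    ]

small = sample_means(sample_size=5)
large = sample_means(sample_size=100)

# The sample means centre on the population mean, and their spread shrinks
# as the sample size grows.
print(f"population mean       = {population_mean:.3f}")
print(f"mean of means (n=5)   = {statistics.fmean(small):.3f}, spread = {statistics.stdev(small):.3f}")
print(f"mean of means (n=100) = {statistics.fmean(large):.3f}, spread = {statistics.stdev(large):.3f}")
```

Both sets of sample means sit close to the population mean, but the means of the small samples scatter much more widely, which is the small-versus-large data set effect described here.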
Even the smallest deviation in a small data set will change the value very drastically, but in a large data set a small deviation will not matter at all. Now, let's move on and look at our next topic, which is the different types of probability. This is an important topic, because most of your problems can be solved by understanding which type of probability you should use to solve the problem. We have three important types of probability: marginal, joint, and conditional probability. Let's discuss each of these. The probability of an event occurring, unconditioned on any other event, is known as marginal or unconditional probability. Let's say that you want to find the probability that a card drawn is a heart. The probability will be 13 by 52, since there are 13 hearts in a deck of cards and there are 52 cards in total. So your marginal probability will be 13 by 52. That's about marginal probability. Now, let's understand what joint probability is. Joint probability is a measure of two events happening at the same time. Let's say that the two events are A and B; the probability of events A and B both occurring is the probability of the intersection of A and B. For example, if you want to find the probability that a card is a four and red, that would be a joint probability, because you're finding a card that is a 4 and the card has to be red in color. The answer to this would be 2 by 52, because we have one 4 in hearts and one 4 in diamonds, and both of these are red in color. Therefore our probability is 2 by 52, and if you reduce it further, it is 1 by 26. So this is what joint probability is all about. Moving on, let's look at what exactly conditional probability is. If the probability of an event or an outcome is based on the occurrence of a previous event or an outcome,
Then you call it a conditional probability. The conditional probability of an event B is the probability that the event will occur given that an event A has already occurred. If A and B are dependent events, then the expression for conditional probability is given by this: P(B|A) = P(A and B) / P(A). The term on the left-hand side, P(B|A), is basically the probability of event B occurring given that event A has already occurred. Like I said, if A and B are dependent events, then this is the expression, but if A and B are independent events, then the expression for conditional probability is like this: P(B|A) = P(B). And P(A) and P(B) are obviously the probability of A and the probability of B. Now, in order to understand conditional probability, joint probability, and marginal probability, let's look at a small use case. Basically, we're going to take a data set which examines the salary package and the training undergone by candidates. In this there are 60 candidates without training and 45 candidates who have enrolled for Edureka's training. The task here is to assess the training against the salary package. Let's look at this in a little more depth. In total we have 105 candidates, out of which 60 have not enrolled for Edureka's training and 45 have enrolled for Edureka's training. This is a small survey that was conducted, and this is the rating of the package, or the salary, that they got. If you read through the data, you can see that there were five candidates without Edureka training who got a very poor salary package. Similarly, there are 30 candidates with Edureka training who got a good package. So basically you're comparing the salary package of a person depending on whether or not they've enrolled for Edureka's training. This is our data set.
Let's look at our problem statement: find the probability that a candidate has undergone Edureka's training. Quite simple. Which type of probability is this? This is marginal probability. The probability that a candidate has undergone Edureka's training is 45 divided by 105, since 45 is the number of candidates with Edureka training and 105 is the total number of candidates. So you get a value of approximately 0.43; that's the probability that a candidate has undergone Edureka's training. Next question: find the probability that a candidate has attended Edureka's training and also has a good package. This is obviously a joint probability problem. So how do you calculate this? Since our table is well formatted, we can directly read off that the people who have gotten a good package along with Edureka training are 30. So out of 105 people, 30 have Edureka training and a good package. They're specifically asking for people with Edureka training, remember that. We need to consider two factors: a candidate who has attended Edureka's training and who has a good package. Clearly that number is 30, so it's 30 divided by the total number of candidates, which is 105, or roughly 0.29. Next, we have: find the probability that a candidate has a good package given that he has not undergone training. This is clearly conditional probability, because here you're defining a condition: you want the probability that a candidate has a good package given that he has not undergone any training. The number of people who have not undergone training is 60, and out of those, five have got a good package.
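The three questions in this use case reduce to three divisions over the counts quoted above. Here is a minimal Python sketch, added for illustration; it uses only the counts that are read out in the session (the full salary table is on a slide and is not reproduced here), and the variable names are assumptions.

```python
from fractions import Fraction

# Counts quoted in the use case; the full salary table is not reproduced here.
total_candidates = 105
with_training = 45
without_training = 60
good_package_with_training = 30
good_package_without_training = 5

# Marginal: P(training)
p_training = Fraction(with_training, total_candidates)

# Joint: P(training and good package)
p_training_and_good = Fraction(good_package_with_training, total_candidates)

# Conditional: P(good package | no training), restricted to the 60 untrained candidates
p_good_given_no_training = Fraction(good_package_without_training, without_training)

for label, p in [
    ("marginal", p_training),
    ("joint", p_training_and_good),
    ("conditional", p_good_given_no_training),
]:
    print(f"{label}: {p} = {float(p):.3f}")
```

The only structural difference between the joint and the conditional case is the denominator: all 105 candidates for the joint probability, but only the 60 untrained candidates once the condition is imposed.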
So that's why this is 5 by 60 and not 5 by 105: because they have clearly mentioned 'has a good package given that he has not undergone training', you have to consider only the people who have not undergone training. Only five people who have not undergone training have gotten a good package, so 5 divided by 60 gives you a probability of around 0.08, which is pretty low. So this was all about the different types of probability. Now, let's move on and look at our last topic in probability, which is Bayes' theorem. Bayes' theorem is a very important concept when it comes to statistics and probability. It is majorly used in the Naive Bayes algorithm. For those of you who aren't aware, Naive Bayes is a supervised learning classification algorithm, and it is mainly used in Gmail spam filtering. A lot of you might have noticed that if you open up Gmail, you have a folder called spam; that is carried out through machine learning, and the algorithm used there is Naive Bayes. So now let's discuss what exactly Bayes' theorem is and what it denotes. Bayes' theorem is used to show the relation between one conditional probability and its inverse. It is basically nothing but the probability of an event occurring based on prior knowledge of conditions that might be related to the same event. Mathematically, Bayes' theorem is represented as shown in this equation: P(A|B) = P(B|A) × P(A) / P(B). The term on the left-hand side is what is known as the posterior, which means the probability of occurrence of A given an event B.
The second term, P(B|A), is referred to as the likelihood. This measures the probability of occurrence of B given an event A. Now, P(A) is also known as the prior, which refers to the actual probability distribution of A, and P(B) is again the probability of B, right? This is Bayes' theorem. In order to better understand Bayes' theorem, let's look at a small example. Let's say that we have three bowls: bowl A, bowl B, and bowl C. Okay, bowl A contains two blue balls and four red balls, bowl B contains eight blue balls and four red balls, and bowl C contains one blue ball and three red balls. Now, if we draw one ball from each bowl, what is the probability of drawing a blue ball from bowl A if we know that we drew exactly two blue balls in total? If you didn't understand the question, please read it. I shall pause for a second or two. Right. So I hope all of you have understood the question. Okay. Now what I'm going to do is draw a blueprint for you and tell you how exactly to solve the problem, but I want you all to give me the solution to this problem, right? I'll draw a blueprint and tell you what exactly the steps are, but I want you to come up with a solution on your own. The formula is also given to you. Everything is given to you. All you have to do is come up with the final answer. Right? Let's look at how you can solve this problem. So first of all, let A be the event of picking a blue ball from bowl A, and let X be the event of picking exactly two blue balls, because these are the two events that we need to calculate the probability of. Now, there are two events that you need to consider here. One is the event of picking a blue ball from bowl A, and the other is the event of picking exactly two blue balls. Okay.
So these two are represented by A and X respectively. What we want is the probability of occurrence of event A given X, which means: given that we're picking exactly two blue balls, what is the probability that we are picking a blue ball from bowl A? So by the definition of conditional probability, this is exactly what our equation will look like: P(A|X) = P(A and X) / P(X). Correct? This is the occurrence of event A given an event X, this is the probability of A and X, and this is the probability of X alone. What we need to do is find these two probabilities, which are the probability of A and X occurring together and the probability of X. Okay, this is the entire solution. So how do you find the probability of X? X represents the event of picking exactly two blue balls, and this can happen in three ways. In the first case, you pick a blue ball from bowl A, a blue ball from bowl B, and a red ball from bowl C. In the second case, you pick a blue ball from bowl A, a red ball from bowl B, and a blue ball from bowl C. In the third case, you pick a red ball from bowl A, a blue ball from bowl B, and a blue ball from bowl C. Right? These are the three ways in which it is possible, so you need to find the probability of each of them and add them up. Step two is that you need to find the probability of A and X occurring together. This is the sum of terms 1 and 2. Okay, this is because in both of these cases we are picking a blue ball from bowl A, correct? So find out this probability and let me know your answer in the comment section. All right, we'll see if you get the answer right. I gave you the entire solution to this. All you have to do is substitute the values, right? If you want a second or two, I'm going to pause on the screen so that you can go through this more clearly. Right? Remember that you need to calculate two probabilities.
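If you want to check the answer you come up with, one option is to enumerate every possible draw and add up the probabilities. The sketch below does that with exact fractions. It is not the method from the slide, just a brute-force check, and the names `bowls`, `colour_probs`, `p_x`, and `p_a_and_x` are invented for illustration.

```python
from fractions import Fraction
from itertools import product

# (blue, red) ball counts per bowl, as stated in the example.
bowls = {"A": (2, 4), "B": (8, 4), "C": (1, 3)}


def colour_probs(blue, red):
    """Probability of drawing each colour from a bowl with these counts."""
    total = blue + red
    return {"blue": Fraction(blue, total), "red": Fraction(red, total)}


probs = {name: colour_probs(*counts) for name, counts in bowls.items()}

p_x = Fraction(0)        # P(X): exactly two blue balls in total
p_a_and_x = Fraction(0)  # P(A and X): blue from bowl A and exactly two blue

# One ball is drawn from each bowl, so there are 2 * 2 * 2 colour outcomes.
for a, b, c in product(("blue", "red"), repeat=3):
    p = probs["A"][a] * probs["B"][b] * probs["C"][c]
    if (a, b, c).count("blue") == 2:
        p_x += p
        if a == "blue":
            p_a_and_x += p

# Definition of conditional probability: P(A|X) = P(A and X) / P(X)
p_a_given_x = p_a_and_x / p_x
print("P(X) =", p_x)
print("P(A and X) =", p_a_and_x)
print("P(A|X) =", p_a_given_x)
```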
So, to repeat, the first probability that you need to calculate is that of picking a blue ball from bowl A while picking exactly two blue balls in total, which is P(A and X). Okay, the second probability you need to calculate is that of picking exactly two blue balls, which is P(X). All right. These are the two probabilities you need to calculate, so remember that, and this is the solution. All right, so guys, make sure you mention your answers in the comment section. For now, let's move on and look at our next topic, which is inferential statistics.
So guys, we just completed the probability module. Now we will discuss inferential statistics, which is the second type of statistics. We discussed descriptive statistics earlier. All right, so like I mentioned earlier, inferential statistics, also known as statistical inference, is a branch of statistics that deals with forming inferences and predictions about a population based on a sample of data taken from the population. All right, and the question you should ask is: how does one form inferences or predictions from a sample? The answer is that you use point estimation. Okay. Now you must be wondering, what is point estimation? Point estimation is concerned with the use of the sample data to measure a single value, which serves as an approximate value or the best estimate of an unknown population parameter. That's a little confusing, so let me break it down for you. For example, in order to calculate the mean of a huge population, what we do is first draw a sample from the population and then find the sample mean, right? The sample mean is then used to estimate the population mean. This is basically a point estimate: you're estimating the value of one of the parameters of the population, right? Basically the mean. You're trying to estimate the value of the mean. This is what point estimation is. There are two main terms in point estimation.
There's something known as the estimator, and there's something known as the estimate. The estimator is a function of the sample that is used to find the estimate. All right, in this example it's basically the sample mean. So a function that calculates the sample mean is known as the estimator, and the realized value of the estimator is the estimate, right? So I hope point estimation is clear. Now, how do you find the estimates? There are four common ways in which you can do this. The first one is the method of moments. Here, what you do is form an equation in the sample data set and then analyze the similar equation in the population data set as well, like the population mean, population variance, and so on. So in simple terms, what you're doing is taking down some known facts about the population and extending those ideas to the sample. All right, once you do that, you can analyze the sample and estimate more essential or more complex values. Next, we have maximum likelihood. This method basically uses a model to estimate a value. Now, maximum likelihood is majorly based on probability, so there's a lot of probability involved in this method. Next, we have the Bayes estimator. This works by minimizing the errors or the average risk. Okay, the Bayes estimator has a lot to do with Bayes' theorem. All right, let's not get into the depth of these estimation methods. Finally, we have the best unbiased estimators. In this method, there are several unbiased estimators that can be used to approximate a parameter. Okay. So guys, these were a couple of methods that are used to find the estimate, but the most well-known method to find the estimate is known as interval estimation. Okay, this is one of the most important estimation methods. All right, this is where the confidence interval also comes into the picture. Apart from interval estimation, we also have something known as the margin of error. So I'll be discussing all of this in the upcoming slides.
So first, let's understand what an interval estimate is. Okay, an interval, or range of values, which is used to estimate a population parameter is known as an interval estimate, right? That's very understandable. Basically, what they're trying to say is that you're going to estimate the value of a parameter. Let's say you're trying to find the mean of a population. What you're going to do is build a range, and your value will lie in that range or in that interval. All right. So this way your output is going to be more accurate, because you've not predicted a point estimate. Instead, you have estimated an interval within which your value might occur, right? Okay. Now, this image clearly shows how the point estimate and the interval estimate are different. The interval estimate is obviously more accurate, because you're not just focusing on a particular value or a particular point in order to predict the probability. Instead, you're saying that the value might be within this range, between the lower confidence limit and the upper confidence limit. All right, this denotes the range or the interval. Okay, if you're still confused about interval estimation, let me give you a small example. If I stated that I will take 30 minutes to reach the theater, this is known as point estimation. Okay, but if I stated that I will take between 45 minutes and an hour to reach the theater, this is an example of interval estimation. All right, I hope it's clear now. Now, interval estimation gives rise to two important statistical terminologies: one is known as the confidence interval and the other is known as the margin of error. All right, so it's important that you pay attention to both of these terminologies. The confidence interval is one of the most significant measures used to check how essential a machine learning model is. All right.
So what is a confidence interval? The confidence interval is the measure of your confidence that the interval estimated contains the population parameter, whether that's the population mean or any other such parameter. Now, statisticians use confidence intervals to describe the amount of uncertainty associated with the sample estimate of a population parameter. Now guys, this is a lot of definition. Let me just make you understand the confidence interval with a small example. Okay, let's say that you perform a survey of a group of cat owners to see how many cans of cat food they purchase in one year. You test your statistics at the 99 percent confidence level, and you get a confidence interval of (100, 200). This means that you think that the cat owners buy between 100 and 200 cans in a year, and since the confidence level is 99%, it shows that you're very confident that the results are correct. Okay, I hope all of you are clear with that. All right, so your confidence interval here will be 100 to 200, and your confidence level will be 99%, right? That's the difference between confidence interval and confidence level. So your value is going to lie within your confidence interval, and your confidence level will show how confident you are about your estimation, right? I hope that was clear. Let's look at the margin of error. Now, the margin of error for a given level of confidence is the greatest possible distance between the point estimate and the value of the parameter that it is estimating. You can say that it is a deviation from the actual point estimate, right? Now, the margin of error can be calculated using this formula: E = z_c × σ / √n. Here, z_c denotes the critical value for the chosen level of confidence, and it is multiplied by the standard deviation divided by the root of the sample size. All right, n is basically the sample size. Now, let's understand how you can estimate the confidence intervals.
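The margin of error formula can be sketched in Python as follows. The function names `critical_value` and `margin_of_error` are made up for illustration, and the critical value is computed with the standard library's `statistics.NormalDist` instead of being looked up in a printed table. The sample values in the usage lines (32 samples, standard deviation 23.44, 95% confidence) are the ones from the textbook problem worked a little later in this session.

```python
from math import sqrt
from statistics import NormalDist


def critical_value(confidence):
    """Two-sided critical value z_c for a confidence level such as 0.95."""
    # The leftover probability is split equally between the two tails.
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)


def margin_of_error(confidence, std_dev, n):
    """E = z_c * std_dev / sqrt(n)."""
    return critical_value(confidence) * std_dev / sqrt(n)


z_95 = critical_value(0.95)
margin = margin_of_error(0.95, std_dev=23.44, n=32)
print(round(z_95, 2), round(margin, 2))
```

Once a sample mean is known, the confidence interval is simply the sample mean minus and plus this margin.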
So guys, the level of confidence, which is denoted by c, is the probability that the interval estimate contains the population parameter. Let's say that you're trying to estimate the mean. All right, so the level of confidence is the probability that the interval estimate contains the population parameter. So this interval between minus z and z, or the area beneath this curve, is nothing but the probability that the interval estimate contains the population parameter. All right, it should basically contain the value that you are predicting. Now, these are known as critical values. These are basically your lower limit and your upper limit for the confidence level. Also, there's something known as the z-score. The z-score can be calculated by using the standard normal table, right? If you look it up anywhere on Google, you'll find the z-score table or the standard normal table. To understand how this is done, let's look at a small example. Okay, let's say that the level of confidence is 90%. This means that you are 90% confident that the interval contains the population mean. Okay, so the remaining 10%, out of a hundred percent, is equally distributed over the two tail regions. So you have 0.05 here and 0.05 over here, right? So on either side of c, you distribute the leftover percentage. Now, these z-scores are calculated from the table, as I mentioned before. All right, 1.645 is calculated from the standard normal table. Okay, so guys, that's how you estimate the level of confidence. So to sum it up, let me tell you the steps that are involved in constructing a confidence interval. First, you'll start by identifying a sample statistic. Okay, this is the statistic that you will use to estimate a population parameter.
This can be anything, like the mean of the sample. Next, you will select a confidence level. Now, the confidence level describes the uncertainty of a sampling method, right? After that, you'll find something known as the margin of error. We discussed the margin of error earlier, so you find this based on the equation that I explained in the previous slide. Then you'll finally specify the confidence interval. All right. Now, let's look at a problem statement to better understand this concept. A random sample of 32 textbook prices is taken from a local college bookstore. The mean of the sample is so-and-so, and the sample standard deviation is 23.44. Use a 95% confidence level and find the margin of error for the mean price of all textbooks in the bookstore. Okay. Now, this is a very straightforward question. If you want, you can read the question again. All you have to do is substitute the values into the equation. All right, so guys, we know the formula for the margin of error. You take the z-score from the table, which is 1.96 for a 95% confidence level. After that we have the standard deviation, which is 23.44, and n stands for the number of samples. Here the number of samples is 32, basically 32 textbooks. So your margin of error is going to be approximately 8.12. This is a pretty simple question. All right, I hope all of you understood this.
Now that you know the idea behind the confidence interval, let's move ahead to one of the most important topics in statistical inference, which is hypothesis testing, right? So basically, statisticians use hypothesis testing to formally check whether a hypothesis is accepted or rejected. Okay, hypothesis testing is an inferential statistical technique used to determine whether there is enough evidence in a data sample to infer that a certain condition holds true for an entire population. So to understand the characteristics of a general population, we take a random sample and we analyze the properties of the sample. We test whether or not the identified conclusion represents the population accurately, and finally we interpret the results. Now, whether or not to accept the hypothesis depends upon the percentage value that we get from the hypothesis test. Okay, so to better understand this, let's look at a small example. Before that, there are a few steps that are followed in hypothesis testing. You begin by stating the null and the alternative hypothesis. All right, I'll tell you what exactly these terms are. Then you formulate an analysis plan. After that, you analyze the sample data, and finally you interpret the results. Right, now to understand the entire hypothesis testing process, we'll look at a good example. Okay, now consider four boys: Nick, John, Bob, and Harry. These boys were caught bunking a class, and they were asked to stay back at school and clean the classroom as a punishment, right? So what John did is he decided that the four of them would take turns to clean the classroom. He came up with a plan of writing each of their names on chits and putting them in a bowl. Now, every day they had to pick a name from the bowl, and that person had to clean the class, right? That sounds fair enough. Now, it has been three days, and everybody's name has come up except John's. Assuming that this event is completely random and free of bias, what is the probability of John not cheating? That is, what is the probability that he's not actually cheating? This can be solved by using hypothesis testing. Okay, so we'll begin by calculating the probability of John not being picked for a day. All right, so we're going to assume that the event is free of bias, and we need to find out the probability of John not cheating, right? First, we'll find the probability that John is not picked for a day. We get 3 out of 4, which is basically 75%. 75% is fairly high. So if John is not picked for three days in a row, the probability will drop down to approximately 42%. Okay.
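The arithmetic behind these percentages can be sketched as below. It assumes, as the 75% per-day figure implies, that all four names go back into the bowl every day, so each draw is fair and independent. The function name `prob_not_picked` is made up for illustration.

```python
def prob_not_picked(days, n_names=4):
    """Probability that one given name is never drawn in `days` fair,
    independent draws from a bowl holding `n_names` names."""
    return ((n_names - 1) / n_names) ** days


for days in (1, 3):
    print(f"{days} day(s): {prob_not_picked(days):.1%}")
```

The same function can be evaluated for longer streaks to see how quickly the probability of an innocent explanation shrinks.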
So three days in a row means that the probability drops down to 42 percent. Now, let's consider a situation where John is not picked for 12 days in a row. The probability drops down to 3.2 percent. Okay, so the probability of John cheating becomes fairly high, right? So in order for statisticians to come to a conclusion, they define what is known as the threshold value. Considering the above situation, if the threshold value is set to 5 percent, it would indicate that if the probability lies below 5%, then John is cheating his way out of detention. But if the probability is above the threshold value, then John is just lucky and his name isn't getting picked. So probability and hypothesis testing give rise to two important components of hypothesis testing, which are the null hypothesis and the alternative hypothesis. The null hypothesis is basically approving the assumption. The alternative hypothesis is when your result disproves the assumption. Therefore, in our example, if the probability of an event occurring is less than 5%, which it is, then the event is biased, and hence it proves the alternative hypothesis.
Undoubtedly, machine learning is the most in-demand technology in today's market. Its applications range from self-driving cars to predicting deadly diseases such as ALS. The high demand for machine learning skills is the motivation behind today's session. So let me discuss the agenda with you first. Now, we're going to begin the session by understanding the need for machine learning and why it is important. After that, we'll look at what exactly machine learning is, and then we'll discuss a couple of machine learning definitions. Once we're done with that, we'll look at the machine learning process and how you can solve a problem by using the machine learning process. Next, we will discuss the types of machine learning, which include supervised, unsupervised, and reinforcement learning. After that, we'll discuss the different types of problems that can be solved by using machine learning. Finally, we will end this session by looking at a demo where we'll see how you can perform weather forecasting by using machine learning.
All right, so guys, let's get started with our first topic. So what is the importance of, or what is the need for, machine learning? Since the technical revolution, we've been generating an immeasurable amount of data. As per research, we generate around 2.5 quintillion bytes of data every single day, and it is estimated that by 2020, 1.7 MB of data will be created every second for every person on earth. Now, that is a lot of data, right? This data comes from sources such as the cloud, IoT devices, social media, and all of that. Since all of us are so into the internet right now, we're generating a lot of data. All right, you have no idea how much data we generate through social media: all the chatting that we do, all the images that we post on Instagram, the videos that we watch. All of this generates a lot of data. Now, how does machine learning fit into all of this? Since we're producing this much data, we need to find a method that can analyze, process, and interpret this much data. All right, we need to find a method that can make sense out of data, and that method is machine learning. Now, a lot of top-tier and data-driven companies, such as Netflix and Amazon, build machine learning models by using tons of data in order to identify profitable opportunities, and if they want to avoid any unwanted risks, they make use of machine learning. All right, so through machine learning you can predict risk, you can predict profits, and you can identify opportunities which will help you grow your business. So now I'll show you a couple of examples of where machine learning is used. All right, so I'm sure all of you binge-watch on Netflix. Now, the most important thing about Netflix is its recommendation engine.
All right. Most of Netflix's revenue comes from its recommendation engine. So the recommendation engine basically studies the movie viewing patterns of its users and then recommends relevant movies to them. All right, it recommends movies depending on users' interests, depending on the type of movies the user watches, and all of that. All right, so that is how Netflix uses machine learning. Next, we have Facebook's auto-tagging feature. Now, the logic behind Facebook's auto-tagging feature is machine learning and neural networks. I'm not sure how many of you know this, but Facebook makes use of the DeepFace face verification system, which is based on machine learning and neural networks. So DeepFace basically studies the facial features in an image and tags your friends and family. Another such example is Amazon's Alexa. Now, Alexa is basically an advanced virtual assistant that is based on natural language processing and machine learning. It can do more than just play music for you. All right, it can book your Uber, it can connect with other IoT devices at your house, it can track your health, it can order food online, and all of that. So data and machine learning are basically the main factors behind Alexa's power. Another such example is the Google spam filter. So guys, Gmail basically makes use of machine learning to filter out spam messages. If any of you just open your Gmail inbox, you'll see that there are separate sections: there's one for primary, there's social, there's spam, and there's general mail. Now, basically, Gmail makes use of machine learning algorithms and natural language processing to analyze emails in real time and then classify them as either spam or non-spam. This is another famous application of machine learning. So to sum this up, let's look at a few reasons why machine learning is so important. So the first reason is obviously the increase in data generation.
So because of the excessive production of data, we need a method that can be used to structure, analyze, and draw useful insights from data. This is where machine learning comes in. It uses data to solve problems and find solutions to the most complex tasks faced by organizations. Another important reason is that it improves decision-making. So by making use of various algorithms, machine learning can be used to make better business decisions. For example, machine learning is used to forecast sales, it is used to predict any downfalls in the stock market, it is used to identify risks and anomalies, and so on. Now, the next reason is that it uncovers patterns and trends in data. Finding hidden patterns and extracting key insights from data is the most essential part of machine learning. So by building predictive models and using statistical techniques, machine learning allows you to dig beneath the surface and explore the data at a minute scale. Now, understanding data and extracting patterns manually will take a lot of days. If you do this through machine learning algorithms, you can perform such computations in less than a second. Another reason is that it solves complex problems. So from detecting genes that are linked to the deadly ALS disease, to building self-driving cars and building face detection systems, machine learning can be used to solve the most complex problems.
So guys, now that you know why machine learning is so important, let's look at what exactly machine learning is. The term machine learning was first coined by Arthur Samuel in the year 1959. Looking back, that year was probably the most significant in terms of technological advancements. If you browse through the net for "what is machine learning", you'll get at least a hundred different definitions. Now, the first and very formal definition was given by Tom M. Mitchell. The definition says that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. All right. Now I know this is a little confusing, so let's break it down into simple words. In simple terms, machine learning is a subset of artificial intelligence which provides machines the ability to learn automatically and improve from experience without being explicitly programmed to do so. In that sense, it is the practice of getting machines to solve problems by gaining the ability to think. But wait, how can a machine think or make decisions? Well, if you feed a machine a good amount of data, it will learn how to interpret, process, and analyze this data by using machine learning algorithms. Okay. Now guys, look at this figure on top. This figure basically shows how a machine learning algorithm, or how the machine learning process, really works. So machine learning begins by feeding the machine lots and lots of data. Okay, by using this data, the machine is trained to detect hidden insights and trends. Now, these insights are then used to build a machine learning model by using an algorithm in order to solve a problem. Okay, so basically you're going to feed a lot of data to the machine. The machine is going to get trained by using this data. It's going to use this data to draw useful insights and patterns from it, and then it's going to build a model by using machine learning algorithms. Now, this model will help you predict the outcome or help you solve any complex problem or any business problem. So that's a simple explanation of how machine learning works. Now, let's move on and look at some of the most commonly used machine learning terms. So first of all, we have algorithm. Now, this is quite self-explanatory. Basically, an algorithm is a set of rules or statistical techniques which are used to learn patterns from data. An algorithm is the logic behind a machine learning model.
All right, an example of a machine learning algorithm is linear regression. I'm not sure how many of you have heard of linear regression. It's the most simple and basic machine learning algorithm. All right. Next, we have model. Now, the model is the main component of machine learning. All right, so the model will basically map the input to your output by using the machine learning algorithm and by using the data that you're feeding the machine. So basically the model is a representation of the entire machine learning process. The model is fed input, which has a lot of data, and then it will output a particular result or a particular outcome by using machine learning algorithms. Next, we have something known as the predictor variable. Now, a predictor variable is a feature of the data that can be used to predict the output. So for example, let's say that you're trying to predict the weight of a person depending on the person's height and their age. All right, so over here the predictor variables are the height and the age, because you're using the height and age of a person to predict the person's weight. All right, so the height and the age are the predictor variables. Now, weight, on the other hand, is the response or the target variable. So the response variable is the feature or the output variable that needs to be predicted by using the predictor variables. All right, after that we have something known as training data. So guys, the data that is fed to a machine learning model is always split into two parts. First, we have the training data, and then we have the testing data. Now, training data is basically used to build the machine learning model. So usually the training data is much larger than the testing data, because obviously if you're trying to train the machine, then you're going to feed it a lot more data. Testing data is just used to validate and evaluate the efficiency of the model. All right, so that was training data and testing data.
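To make these terms concrete, here is a minimal sketch using only Python's standard library. The ten records are invented for illustration, and `train_test_split` here is a small hand-written helper, not a library function. Height and age are the predictor variables, weight is the response variable, and 80% of the rows go to training.

```python
import random

# Hypothetical records: (height_cm, age_years, weight_kg).
# Height and age are the predictor variables; weight is the response variable.
records = [
    (150, 25, 52), (160, 31, 58), (165, 22, 61), (170, 45, 72), (172, 38, 70),
    (175, 29, 74), (178, 52, 80), (180, 34, 79), (183, 27, 82), (188, 41, 90),
]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (training, testing) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


train_rows, test_rows = train_test_split(records)

# Separate predictors (X) from the response (y) in each part.
X_train = [(height, age) for height, age, _ in train_rows]
y_train = [weight for _, _, weight in train_rows]
X_test = [(height, age) for height, age, _ in test_rows]
y_test = [weight for _, _, weight in test_rows]

print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```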
So guys, these were a few terms that I thought you should know before we move any further. Okay. Now, let's move on and discuss the machine learning process. This is going to get very interesting, because I'm going to give you an example and make you understand how the machine learning process works. So first of all, let's define the different stages, or the different steps, involved in the machine learning process. The machine learning process always begins with defining the objective, or defining the problem that you're trying to solve. Next is data gathering or data collection. The data that you need to solve this problem is collected at this stage. This is followed by data preparation or data processing. After that, you have data exploration and analysis, and the next stage is building a machine learning model. This is followed by model evaluation, and finally you have prediction, or your output. Now, let's try to understand this entire process with an example. So our problem statement here is to predict the possibility of rain by studying the weather conditions. Let's say that you're given a problem statement and you're asked to use the machine learning process to solve it. So let's get started. All right, so the first step is to define the objective of the problem statement. Our objective here is to predict the possibility of rain by studying the weather conditions. Now, in the first stage of a machine learning process, you must understand what exactly needs to be predicted. In our case, the objective is to predict the possibility of rain by studying weather conditions, right? So at this stage, it is also essential to take mental notes on what kind of data can be used to solve this problem, or the type of approach that you can follow to get to the solution. All right, a few questions that are worth asking during this stage are: What are we trying to predict? What are the target features, or what are the predictor variables?
What kind of input data do we need? And what kind of problem are we facing? Is it a binary classification problem, or is it a clustering problem? Now, don't worry if you don't know what classification and clustering are. I'll be explaining this in the upcoming slides. So guys, this was the first step of a machine learning process, which is to define the objective of the problem. All right. Now, let's move on and look at step number two. So step number two is basically data collection or data gathering. Now, at this stage, you must be asking questions such as: What kind of data is needed to solve the problem? Is the data available? And if it is available, how can I get the data? Okay, so once you know the type of data that is required, you must understand how you can derive this data. Data collection can be done manually or by web scraping, but if you're a beginner and you're just looking to learn machine learning, you don't have to worry about getting the data. Okay, there are thousands of data resources on the web. You can just go ahead and download the data sets from websites such as Kaggle. Okay, now coming back to the problem at hand, the data needed for weather forecasting includes measures such as humidity level, temperature, pressure, locality, whether or not you live in a hill station, and so on. So guys, such data must be collected and stored for analysis. Now, the next stage in machine learning is preparing your data. The data you collected is almost never in the right format, so basically you'll encounter a lot of inconsistencies in the data set. Okay, this includes missing values, redundant variables, duplicate values, and so on. Removing such values is very important, because they might lead to wrongful computations and predictions. So that's why at this stage you must scan the entire data set for any inconsistencies, and you have to fix them at this stage. Now, the next step is exploratory data analysis.
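Before moving on, the data preparation stage just described can be sketched in plain Python. The weather rows below are invented for illustration, and `clean` is a hypothetical helper. It uses the simplest possible policy, which is to drop any row with a missing value and drop exact duplicates. Real projects often fill in missing values instead of dropping them.

```python
# Hypothetical raw weather observations, with None marking a missing value.
raw_rows = [
    {"humidity": 82, "temperature": 18.5, "pressure": 1009, "rain": True},
    {"humidity": 82, "temperature": 18.5, "pressure": 1009, "rain": True},     # duplicate
    {"humidity": None, "temperature": 25.0, "pressure": 1015, "rain": False},  # missing value
    {"humidity": 40, "temperature": 30.1, "pressure": 1018, "rain": False},
]


def clean(rows):
    """Drop rows that have a missing value and rows that are exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # skip rows with missing values
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # skip rows we have already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned


clean_rows = clean(raw_rows)
print(len(raw_rows), "raw rows ->", len(clean_rows), "clean rows")
```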
Now, data analysis is all about diving deep into the data and finding all the hidden data mysteries. Okay. This is where you become a detective. EDA, or exploratory data analysis, is like the brainstorming stage of machine learning. Data exploration involves understanding the patterns and the trends in your data. So at this stage, all the useful insights are drawn and the correlations between the variables are understood. You might ask, what sort of correlations are you talking about? For example, in the case of predicting rainfall, we know that there is a strong possibility of rain if the temperature has fallen low. Okay. So such correlations have to be understood and mapped at this stage. This stage is followed by stage number five, which is building a machine learning model. All the insights and the patterns that you derive during data exploration are used to build the machine learning model. This stage always begins by splitting the data set into two parts: training data and testing data. Earlier in the session, I already told you what training and testing data are. The training data will be used to build and analyze the model, and the logic of the model will be based on the machine learning algorithm that is being implemented. Okay. Now, in the case of predicting rainfall, since the output will be in the form of true or false, we can use a classification algorithm like logistic regression. Choosing the right algorithm depends on the type of problem you're trying to solve, the data set you have, and the level of complexity of the problem. In the upcoming sections, we'll be discussing the different types of problems that can be solved by using machine learning. So don't worry if you don't know what a classification algorithm is or what logistic regression is. Okay. 
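The split into training and testing data described above can be sketched in a few lines of plain Python. The helper name `split_train_test`, the 20% test fraction, and the fixed seed are assumptions chosen for this illustration, not something the session specifies.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (train, test).

    test_fraction and seed are illustrative defaults, not values from the course.
    """
    shuffled = list(rows)                  # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # stand-in for ten observations
train, test = split_train_test(data)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting matters when the rows are stored in some order, for example by date, because otherwise the test set would only contain the most recent observations.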
So all you need to know is that at this stage, you'll be building a machine learning model by using a machine learning algorithm and the training data set. The next step in the machine learning process is model evaluation and optimization. After building a model by using the training data set, it is finally time to put the model to the test. Okay. So the testing data set is used to check the efficiency of the model and how accurately it can predict the outcome. Once you calculate the accuracy, any improvements in the model have to be implemented at this stage. Okay, so methods like parameter tuning and cross-validation can be used to improve the performance of the model. This is followed by the last stage, which is predictions. Once the model is evaluated and improved, it is finally used to make predictions. The final output can be a categorical variable or it can be a continuous quantity. In our case, for predicting the occurrence of rainfall, the output will be a categorical variable, in the sense that our output will be in the form of true or false, yes or no. Yes basically represents that it is going to rain, and no represents that it won't rain. Okay, as simple as that. So guys, that was the entire machine learning process. Linear regression is one of the easiest algorithms in machine learning. It is a statistical model that attempts to show the relationship between two variables with a linear equation. But before we drill down into the linear regression algorithm in depth, I'll give you a quick overview of today's agenda. We'll start the session with a quick overview of what regression is, as linear regression is one type of regression algorithm. We'll first learn about regression, its use cases, and its various types. 
We'll then learn about the algorithm from scratch, where I'll take you through its mathematical implementation first; then we'll drill down to the coding part and implement linear regression using Python. In today's session, we'll deal with the linear regression algorithm using the least squares method, check its goodness of fit, or how close the data is to the fitted regression line, using the R-square method, and then finally we'll optimize it using the gradient descent method. In the last part, the coding session, I'll teach you to implement linear regression using Python. The coding session will be divided into two parts: the first part will consist of linear regression using Python from scratch, where you will use the mathematical algorithm that you have learned in this session, and in the next part we'll be using scikit-learn for direct implementation of linear regression. All right. I hope the agenda is clear to you guys. All right, so let's begin our session with what regression is. Well, regression analysis is a form of predictive modeling technique which investigates the relationship between a dependent and an independent variable. A regression analysis involves graphing a line over a set of data points that most closely fits the overall shape of the data. In other words, regression shows the changes in a dependent variable on the y-axis relative to the changes in the explanatory variable on the x-axis. Fine. Now you would ask, what are the uses of regression? Well, there are three major uses of regression analysis. The first is determining the strength of predictors: the regression might be used to identify the strength of the effect that the independent variables have on the dependent variable. For example, you can ask questions like, what is the strength of the relationship between sales and marketing spending, or what is the relationship between age and income? The second is forecasting an effect: here, the regression can be used to forecast the effects or impact of changes. 
That is, regression analysis helps us understand how much the dependent variable changes with a change in one or more independent variables. Fine. For example, you can ask a question like, how much additional sales income will I get for each thousand dollars spent on marketing? The third is trend forecasting: here, regression analysis is used to predict trends and future values, and it can be used to get point estimates. For this, you can ask questions like, what will be the price of Bitcoin in the next six months? Right? So the next topic is linear versus logistic regression. By now, I hope that you know what regression is, so let's move on and understand its types. There are various kinds of regression, like linear regression, logistic regression, polynomial regression, and others. All right, but for this session we'll be focusing on linear and logistic regression. So let me tell you what linear regression is and what logistic regression is, and then we'll compare both of them. All right. Starting with linear regression: in simple linear regression, we are interested in things like y equals mx plus c. What we are trying to find is the correlation between the x and y variables. This means that every value of x has a corresponding value of y, and that value is continuous. All right. However, in logistic regression, we are not fitting our data to a straight line like in linear regression. Instead, we are mapping y versus x to a sigmoid function. In logistic regression, what we find out is whether y is 1 or 0 for a particular value of x, so we are essentially deciding a true or false value for a given value of x. Fine. So as a core concept, you can say that in linear regression the data is modeled using a straight line, whereas in logistic regression the data is modeled using a sigmoid function. Linear regression is used with continuous variables. On the other hand, consider logistic regression. 
It is used with categorical variables. The output, or the prediction, of a linear regression is the value of the variable; on the other hand, the output or prediction of a logistic regression is the probability of occurrence of the event. Now, how will you check the accuracy and goodness of fit? In the case of linear regression, we have various methods, like measuring by loss, R squared, adjusted R squared, et cetera, while in the case of logistic regression you have accuracy, precision, recall, the F1 score, which is nothing but the harmonic mean of precision and recall, the ROC curve for determining the probability threshold for classification, the confusion matrix, et cetera. There are many. All right. So, summarizing the difference between linear and logistic regression, you can say that the type of function you are mapping to is the main point of difference between them. A linear regression maps a continuous x to a continuous y; on the other hand, a logistic regression maps a continuous x to a binary y. So we can use logistic regression to make category or true/false decisions from the data. Fine, so let's move on ahead. Next is the linear regression selection criteria, or you can say, when will you use linear regression? The first criterion is classification and regression capabilities. Regression models predict a continuous variable, such as the sales made on a day or the temperature of a city. Their reliance on a polynomial, like a straight line, to fit a data set poses a real challenge when it comes to building a classification capability. Let's imagine that you fit a line with the training points that you have. Now imagine you add some more data points. In order to fit them, what do you have to do? You have to change your existing model; that is, maybe you have to change the threshold itself. This will happen with each new data point you add to the model. Hence, linear regression is not good for classification models. Fine. 
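Since the F1 score came up above as the harmonic mean of precision and recall, here is a small sketch of how the three numbers relate. The function name `precision_recall_f1` and the two label lists are invented for illustration; 1 stands for the positive class and 0 for the negative class.

```python
def precision_recall_f1(actual, predicted):
    """Compute precision, recall and F1 for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
actual = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

The harmonic mean is pulled towards the smaller of the two values, so a model cannot get a high F1 score by being strong on precision alone or on recall alone.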
Next is data quality. Each missing value removes one data point that could optimize the regression. In simple linear regression, outliers can significantly disrupt the outcome; for now, just know that removing the outliers will usually make your model fit much better. All right. So this is about data quality. Next is computational complexity. Linear regression is often not computationally expensive compared to a decision tree or a clustering algorithm. The order of complexity for n training examples and x features usually falls in either O(x²) or O(xn). Next is comprehensibility and transparency. Linear regression models are easily comprehensible and transparent in nature. They can be represented by a simple mathematical notation and can be understood very easily by anyone. So these are some of the criteria based on which you will select the linear regression algorithm. All right. Next is, where is linear regression used? The first use is evaluating trends and sales estimates. Linear regression can be used in business to evaluate trends and make estimates or forecasts. For example, if a company's sales have increased steadily every month for the past few years, then conducting a linear analysis on the sales data, with monthly sales on the y-axis and time on the x-axis, will give you a line that depicts the upward trend in sales. After creating the trend line, the company could use the slope of the line to forecast sales in future months. Next is analyzing the impact of price changes. Linear regression can be used to analyze the effect of pricing on consumer behavior. For instance, if a company changes the price of a certain product several times, it can record the quantity it sells at each price level and then perform a linear regression with sold quantity as the dependent variable and price as the independent variable. This would result in a line that depicts the extent to which customers reduce their consumption of the product as the price increases. 
So this result would help us in future pricing decisions. Next is the assessment of risk in the financial services and insurance domain. Linear regression can be used to analyze risk. For example, a health insurance company might conduct a linear regression analysis. How can it do that? It can do it by plotting the number of claims per customer against the customer's age, and it might discover that older customers tend to make more health insurance claims. The result of such an analysis might guide important business decisions. All right, so by now you have a rough idea of what the linear regression algorithm is: what it does, where it is used, and when you should use it. All right, now let's move on and understand the algorithm in depth. Suppose you have the independent variable on the x-axis and the dependent variable on the y-axis. All right. Suppose these are the data points: the independent variable is increasing on the x-axis, and so is the dependent variable on the y-axis. So what kind of linear regression line would you get? You would get a positive linear regression line, as the slope would be positive. Next, suppose you have an independent variable on the x-axis which is increasing and, on the other hand, a dependent variable on the y-axis that is decreasing. What kind of line will you get in that case? You will get a negative regression line, as the slope of the line is negative. This particular line, y equals mx plus c, shows the relationship between the independent variable and the dependent variable, and it is known as the line of linear regression. Okay? So let's add some data points to our graph. These are some observations, or data points, on our graph. Let's plot some more. Okay. Now all our data points are plotted, and our task is to create a regression line, or the best fit line. 
All right, now once our regression line is drawn, it's time for the task of prediction. Suppose this is our estimated or predicted value, and this is our actual value. Okay. Our main goal is to reduce this error, that is, to reduce the distance between the estimated or predicted value and the actual value. The best fit line would be the one which has the least error, or the least difference between the estimated value and the actual value. All right, in other words, we have to minimize the error. This was a brief understanding of the linear regression algorithm; soon we'll jump to the mathematical implementation. All right, but before that, let me tell you this. Suppose you draw a graph with speed on the x-axis and distance covered on the y-axis, with the time remaining constant. If you plot a graph between the speed of the vehicle and the distance traveled in a fixed unit of time, then you will get a positive relationship. All right. So suppose the equation of the line is y equals mx plus c. In this case, y is the distance traveled in a fixed duration of time, x is the speed of the vehicle, m is the positive slope of the line, and c is the y-intercept of the line. All right. Now suppose the distance remains constant, and you have to plot a graph between the speed of the vehicle and the time taken to travel a fixed distance. In that case, you will get a negative relationship, which we approximate here with a straight line. All right, the slope of the line is negative here. The equation of the line changes to y equals minus mx plus c, where y is the time taken to travel a fixed distance, x is the speed of the vehicle, minus m is the negative slope of the line, and c is the y-intercept of the line. All right. Now, let's get back to our independent and dependent variables. In those terms, y is our dependent variable and x is our independent variable. Now, let's move on and see the mathematical implementation. All right, so we have x equal to 1, 2, 3, 4, 5. Let's plot them on the x-axis. 
So 0, 1, 2, 3, 4, 5, 6, all right, and we have y as 3, 4, 2, 4, 5. So let's plot 1, 2, 3, 4, 5 on the y-axis. Now let's plot our coordinates one by one. We have x equal 1 and y equal 3, so this is the point (1, 3). Similarly, we have (1, 3), (2, 4), (3, 2), (4, 4), and (5, 5). All right. Moving on ahead, let's calculate the mean of x and y and plot it on the graph. The mean of x is 1 plus 2 plus 3 plus 4 plus 5, divided by 5, that is 3. Similarly, the mean of y is 3 plus 4 plus 2 plus 4 plus 5, that is 18, divided by 5, which is nothing but 3.6. All right. Next, we'll plot our mean, that is (3, 3.6), on the graph. Okay, so there's the point (3, 3.6). Our goal is to find the best fit line using the least squares method. In order to do that, we first need the equation of our regression line. All right. So let's suppose this is our regression line, y equals mx plus c. Now that we have an equation of the line, all we need to do is find the values of m and c, where m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)². Don't get confused; let me resolve it for you. All right. As a part of the formula, we'll first calculate x minus x bar. We have x as 1 and x bar as 3, so 1 minus 3 is minus 2. Next, we have x equal 2 minus its mean 3, that is minus 1. Similarly, 3 minus 3 is 0, 4 minus 3 is 1, and 5 minus 3 is 2. All right. So x minus x bar is nothing but the distance of each point from the line x equal 3. And what does y minus y bar imply? It is the distance of each point from the line y equal 3.6. Fine. So let's calculate the value of y minus y bar. Let's start with y equal 3 and y bar equal 3.6. 
So it is 3 minus 3.6, which is minus 0.6. Next is 4 minus 3.6, that is 0.4. Next, 2 minus 3.6, that is minus 1.6. Next is 4 minus 3.6, that is 0.4 again, and 5 minus 3.6 is 1.4. All right, so now we are done with y minus y bar. Fine. Next, we will calculate x minus x bar whole squared. So it is minus 2 whole squared, that is 4; minus 1 whole squared, that is 1; 0 squared is 0; 1 squared is 1; and 2 squared is 4. Fine. So now in our table we have x minus x bar, y minus y bar, and x minus x bar whole squared. What we need now is the product of x minus x bar and y minus y bar. All right, so let's see the products: minus 2 times minus 0.6 is 1.2; minus 1 times 0.4 is minus 0.4; 0 times minus 1.6 is 0; 1 times 0.4 is 0.4; and 2 times 1.4 is 2.8. All right. Now almost all the parts of our formula are done, so what we need to do is get the summation of the last two columns. The summation of x minus x bar whole squared is 10, and the summation of the product of x minus x bar and y minus y bar is 4, so the value of m will be equal to 4 by 10, that is 0.4. Fine. So let's put this value of m equal to 0.4 into our line y equals mx plus c. The regression line passes through the mean point, so let's fill the mean values into the equation and find the value of c. We have y as 3.6, remember, the mean of y; m as 0.4, which we calculated just now; and x as the mean value of x, that is 3. So we have the equation as 3.6 equals 0.4 times 3 plus c, that is, 3.6 equals 1.2 plus c. So what is the value of c? It is 3.6 minus 1.2, that is 2.4. All right. So we have m equal to 0.4 and c equal to 2.4, and finally the equation of the regression line is y equals 0.4 times x plus 2.4. So there is the regression line. All right, that's how you plot your points; this is your actual point. All right. 
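The table we just worked through by hand can be reproduced in a few lines of plain Python. This is only a sketch of the least squares calculation above, using the same five points; the variable names (`dx`, `dy`, `sum_dx_sq`, `sum_dx_dy`) are chosen here for readability and are not from the session.

```python
# The five data points from the worked example.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 4, 5]

x_bar = sum(xs) / len(xs)  # mean of x, 3.0
y_bar = sum(ys) / len(ys)  # mean of y, 3.6

dx = [x - x_bar for x in xs]  # the "x minus x bar" column
dy = [y - y_bar for y in ys]  # the "y minus y bar" column

sum_dx_sq = sum(d * d for d in dx)              # sum of (x - x_bar)^2, about 10
sum_dx_dy = sum(a * b for a, b in zip(dx, dy))  # sum of (x - x_bar)(y - y_bar), about 4

m = sum_dx_dy / sum_dx_sq  # slope
c = y_bar - m * x_bar      # intercept: the line passes through (x_bar, y_bar)

print(round(m, 2), round(c, 2))  # 0.4 2.4
```

The values are rounded when printing because floating-point arithmetic can leave tiny errors in the last decimal places, for example in 3 minus 3.6.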
Now for the given m equal to 0.4 and c equal to 2.4, let's predict the value of y for x equal to 1, 2, 3, 4, and 5. When x equals 1, the predicted value of y will be 0.4 times 1 plus 2.4, that is 2.8. Similarly, when x equals 2, the predicted value of y will be 0.4 times 2 plus 2.4, which equals 3.2. Similarly, for x equal 3, y will be 3.6; for x equal 4, y will be 4.0; and for x equal 5, y will be 4.4. So let's plot them on the graph. The line passing through all these predicted points and cutting the y-axis at 2.4 is the line of regression. Now your task is to calculate the distance between the actual and the predicted values, and your job is to reduce that distance. All right, in other words, you have to reduce the error between the actual and the predicted values. The line with the least error will be the line of linear regression, or regression line, and it will also be the best fit line. All right, so this is how things work in a computer. It performs a number of iterations for different values of m. For each value of m, it will calculate the equation of the line, y equals mx plus c. Right? As the value of m changes, the line changes. So the iterations will start from one, and after every iteration it will calculate the predicted values according to the line and compare the distance between the actual values and the predicted values. The value of m for which the distance between the actual and the predicted values is minimum will be selected for the best fit line. All right. Now that we have calculated the best fit line, it's time to check the goodness of fit, or how well the model is performing. In order to do that, we have a method called the R-square method. So what is this R square? Well, the R-squared value is a statistical measure of how close the data are to the fitted regression line. 
It is generally considered that a model with a high R-squared value is a good model, but you can also have a low R-squared value for a good model, or a high R-squared value for a model that does not fit at all. All right. It is also known as the coefficient of determination, or the coefficient of multiple determination. Let's move on and see how R square is calculated. These are our actual values plotted on the graph. We had calculated the predicted values of y as 2.8, 3.2, 3.6, 4.0, and 4.4. Remember, we calculated the predicted values of y from the equation y predicted equals 0.4 times x plus 2.4, for every x equal to 1, 2, 3, 4, and 5. From there, we got the predicted values of y. All right. So let's plot them on the graph. These are the points, and the line passing through these points is nothing but the regression line. All right. Now, what you need to do is check and compare the distance of actual minus mean versus the distance of predicted minus mean. So basically, you are comparing the distance of the actual values from the mean with the distance of the predicted values from the mean. All right, and that is nothing but R square. Mathematically, you can represent R square as R² = Σ(yp − ȳ)² / Σ(y − ȳ)², where y is the actual value, yp is the predicted value, and ȳ is the mean value of y, which is nothing but 3.6. Remember, this is our formula. So next, we'll calculate y minus y bar. We have y as 3 and y bar as 3.6, so we calculate it as 3 minus 3.6, that is nothing but minus 0.6. Similarly, for y equal 4 and y bar equal 3.6, we have y minus y bar as 0.4. Then 2 minus 3.6 is minus 1.6, 4 minus 3.6 is again 0.4, and 5 minus 3.6 is 1.4. So we got the values of y minus y bar. Now what we have to do is take their squares. 
So we have minus 0.6 squared as 0.36, 0.4 squared as 0.16, minus 1.6 squared as 2.56, 0.4 squared as 0.16, and 1.4 squared as 1.96. Now, as a part of the formula, we need our yp minus y bar values. So these are the yp values, and we have to subtract the mean from them. Right. So 2.8 minus 3.6 is minus 0.8. Similarly, 3.2 minus 3.6 is minus 0.4, 3.6 minus 3.6 is 0, 4.0 minus 3.6 is 0.4, and 4.4 minus 3.6 is 0.8. So we calculated the values of yp minus y bar; now it's our turn to calculate yp minus y bar whole squared. We have minus 0.8 squared as 0.64, minus 0.4 squared as 0.16, 0 squared as 0, 0.4 squared as again 0.16, and 0.8 squared as 0.64. All right. Now, the formula asks me to take the summation of yp minus y bar whole squared and the summation of y minus y bar whole squared. All right, let's see. On summing y minus y bar whole squared, what you get is 5.2, and on summing yp minus y bar whole squared, you get 1.6. So the value of R square can be calculated as 1.6 upon 5.2. Fine. The result we get is approximately equal to 0.3. Well, this is not a good fit. All right, it suggests that the data points are far away from the regression line. So this is how your graph will look when R square is 0.3. When you increase the value of R square to 0.7, you'll see that the actual values lie closer to the regression line. When it reaches 0.9, they come even closer, and when the value is approximately equal to 1, the actual values lie on the regression line itself. On the other hand, if you get a very low value of R square, suppose 0.02, then you'll see that the actual values are very far away from the regression line, or you can say that there are too many outliers in your data. You cannot forecast anything from such data. All right. 
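Here is a short sketch of the R-square calculation we just did by hand, again in plain Python and with the same five points and the fitted line y = 0.4x + 2.4. The variable names are chosen for this illustration. As a cross-check, it also computes the residual-based form, 1 minus the sum of squared residuals over the total sum of squares, which gives the same value for a least squares line with an intercept.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 4, 5]
m, c = 0.4, 2.4  # slope and intercept from the worked example

y_pred = [m * x + c for x in xs]  # 2.8, 3.2, 3.6, 4.0, 4.4 (up to rounding)
y_bar = sum(ys) / len(ys)         # 3.6

ss_pred = sum((yp - y_bar) ** 2 for yp in y_pred)  # sum of (yp - y_bar)^2, about 1.6
ss_total = sum((y - y_bar) ** 2 for y in ys)       # sum of (y - y_bar)^2, about 5.2

r_squared = ss_pred / ss_total
print(round(r_squared, 3))  # 0.308

# Cross-check with the residual-based form of R squared.
ss_res = sum((y - yp) ** 2 for y, yp in zip(ys, y_pred))  # about 3.6
r_squared_alt = 1 - ss_res / ss_total
print(round(r_squared_alt, 3))  # 0.308
```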
So this was all about the calculation of R square. Now, you might have a question like, are low values of R square always bad? Well, in some fields it is entirely expected that the R-squared values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than around 50%, from which you can conclude that humans are simply harder to predict than physical processes. Furthermore, if your R-squared value is low but you have statistically significant predictors, then you can still draw important conclusions about how changes in the predictor values are associated with changes in the response value. Regardless of the R-squared, the significant coefficients still represent the mean change in the response for one unit of change in the predictor while holding the other predictors in the model constant. Obviously, this type of information can be extremely valuable. All right. So this was all about the theoretical concept. Now, let's move on to the coding part and understand the code in depth. For implementing linear regression using Python, I will be using Anaconda with Jupyter Notebook installed on it. All right, so here's the Jupyter notebook, and we are using Python 3.0 on it. We are going to use a data set consisting of the head size and brain weight of different people. All right. So let's import our data set. We use %matplotlib inline, and we are importing numpy as np, pandas as pd, and, from matplotlib, pyplot as plt. Next, we will import our data, headbrain.csv, and store it in the data variable. Let's hit the Run button and see the output. This asterisk symbol indicates that it is still executing. So there's the output: our data set consists of 237 rows and 4 columns. We have the columns gender, age range, head size in cubic centimeters, and brain weight in grams. Fine. 
So there's our sample data set; that is how it looks. Now that we have imported our data, as you can see, there are 237 values in the training set, so we can find a linear relationship between the head size and the brain weight. What we'll do now is collect X and Y: X will consist of the head size values and Y will consist of the brain weight values. So, collecting X and Y, let's hit the Run button. Done. Next, we need to find the values of b1 and b0, or you can say m and c. For that, we'll need the means of the X and Y values, so first of all we'll calculate them. So mean_x equals np.mean(X); mean is a predefined function of NumPy. Similarly, mean_y equals np.mean(Y), which will return the mean of the Y values. Next, we'll check the total number of values, so m equals the length of X (in the code, this m is just the count of values, not the slope). All right, then we'll use the formula to calculate the values of b1 and b0, or m and c. Let's hit the Run button and see the result. As you can see here on the screen, we have got b1 as 0.263 and b0 as 325.57. All right, so now we have the coefficients. Comparing this with the equation y equals mx plus c, you can say that brain weight equals 0.263 times head size plus 325.57, so the value of m here is 0.263 and the value of c here is 325.57. All right, so there's our linear model. Now, let's plot it and see it graphically. Let's execute it. So this is how our plot looks. This model is not so bad, but we need to find out how good it is. There are many methods for that, like the root mean squared error or the coefficient of determination, that is, the R-square method. In this tutorial, I have told you about the R-square method, so let's focus on that and see how good our model is. So let's calculate the R-square value. 
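The notebook code itself is not reproduced in this transcript, so the following is only a sketch of the steps just narrated: compute the means, loop over the values to build the numerator and denominator, and derive b1 and b0. It is written in plain Python instead of NumPy so that it runs anywhere, and because headbrain.csv is not included here, it reuses the five points from the earlier worked example in place of the head size and brain weight columns. The helper names `fit_line` and `rmse` are made up for this illustration.

```python
import math


def fit_line(X, Y):
    """Return (b1, b0), the least squares slope and intercept, for paired values X, Y."""
    mean_x = sum(X) / len(X)
    mean_y = sum(Y) / len(Y)
    n = len(X)  # total number of values (called m in the narration)
    numer = 0.0
    denom = 0.0
    for i in range(n):
        numer += (X[i] - mean_x) * (Y[i] - mean_y)
        denom += (X[i] - mean_x) ** 2
    b1 = numer / denom
    b0 = mean_y - b1 * mean_x
    return b1, b0


def rmse(X, Y, b1, b0):
    """Root mean squared error of the line b1 * x + b0 on the given points."""
    squared_errors = [(y - (b1 * x + b0)) ** 2 for x, y in zip(X, Y)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Stand-in data: the five points from the worked example, not the head/brain data.
X = [1, 2, 3, 4, 5]
Y = [3, 4, 2, 4, 5]

b1, b0 = fit_line(X, Y)
print(round(b1, 2), round(b0, 2))     # 0.4 2.4
print(round(rmse(X, Y, b1, b0), 3))   # 0.849
```

With the real data set you would pass the head size column as X and the brain weight column as Y; the narration above reports b1 of about 0.263 and b0 of about 325.57 for that data.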
All right, here ss_t is the total sum of squares, ss_r is the total sum of squares of residuals, and R square, as per the formula, is 1 minus the sum of squares of residuals upon the total sum of squares. All right. Next, when you execute it, you will get the value of R square as 0.63, which is pretty good. Now that you have implemented a simple linear regression model using the least squares method, let's move on and see how you would implement the model using a machine learning library called scikit-learn. All right. So scikit-learn is a simple machine learning library in Python. Building machine learning models is very easy using scikit-learn. So suppose this is the Python code: using the scikit-learn library, your code shortens to this length. All right, so let's hit the Run button, and you will see that you get the same R2 score as well. Well, this was all for today's discussion. Most of the entities in this world are related in one way or another. At times, finding relationships between entities can help you take valuable business decisions. Today, I'm going to talk about logistic regression, which is one such approach towards predicting relationships. Now, let us see what all we are going to cover in today's training. We'll start off the session by getting a quick introduction to what regression is. Then we'll see the different types of regression, and we'll be discussing the what and why of logistic regression. In this part, we'll discuss what exactly it is, why it is used, and all those things. Moving ahead, we'll compare linear regression versus logistic regression, along with various real-life use cases, and finally, towards the end, I will be practically implementing the logistic regression algorithm. So let's quickly start off with the very first topic: what is regression? Regression analysis is a predictive modeling technique, so it always involves predictions. In this session, we'll just talk about predictive analysis and not prescriptive analysis. 
Why? Because for prescriptive analysis, you need to have a good base and a strong hold on the predictive part first. Now, regression estimates the relationship between a dependent variable and an independent variable. For those of you who are not aware of these terminologies, let me give you a quick summary. The dependent variable is nothing but the variable which you want to predict. Let's say I want to know what the sales will be on the 26th of this month. So sales becomes the dependent variable, or you can say the target variable. Now this dependent variable, or target variable, is going to depend on a lot of factors: the number of products you have sold till date, what the season is, the availability of the product, how the product quality is, and all these things. So these are the never-ending factors, which are nothing but the different features that lead to a sale. These variables are called independent variables, or you can say predictors. Now if you look at the graph over here, we have some values of x and some values of y, and as you can see, if x increases, the value of y also increases. Let me explain this with an example. Let's say we have values of x up to 6.75, and somebody asks you what the value of y is when the value of x is 7. The way you can do it, or how regression comes into the picture, is by fitting a straight line through all these points and getting the values of m and c. So this is a straight line, guys, and the formula for the straight line is y equals mx plus c. Using this, we can try to predict the value of y. Here, if you notice, the x variable can increase as much as it can, but the y variable will increase according to x, so y is basically dependent on your x variable. So for any arbitrary value of x, you can predict the value of y, and this is always done through regression. That is how regression is useful. 
Now regression is basically classified into three types: linear regression, logistic regression, and polynomial regression. Today we will be discussing logistic regression, so let's move forward and understand the what and why of logistic regression. This algorithm is most widely used when the dependent variable, or you can say the output, is in a binary format. Here you need to predict the outcome of a categorical dependent variable, so the outcome should always be discrete or categorical in nature. By discrete, I mean the value should be binary, or you can say you just have two values: it can be either 0 or 1, either yes or no, either true or false, or high or low. Only these can be the outcomes, so the value which you need to predict should be discrete, or you can say categorical, in nature. Whereas in linear regression, the value of y, or the value you need to predict, is in a range. That is the difference between linear regression and logistic regression. You must be having a question: why not linear regression? Now guys, in linear regression the value of y, or the value which you need to predict, is in a range, but in our case, as in logistic regression, we just have two values: it can be either 0 or 1. It should not entertain values which are below 0 or above 1. So in order to implement logistic regression, we need to clip this part: we don't need the values below 0 and we don't need the values above 1. Since the value of y will be between only 0 and 1, which is the main rule of logistic regression, the linear line has to be clipped at 0 and 1. Once we clip this graph, it looks somewhat like this. Here you're getting a curve which is nothing but three different straight lines. So we need a new way to solve this problem, and it has to be formulated into an equation.
And hence we come up with logistic regression. Here the outcome is either 0 or 1, which is the main rule of logistic regression. The clipped curve cannot be formulated as a single straight-line equation, so we need a different function that fulfills our main aim of bringing the values between 0 and 1. That is how we came up with logistic regression. Once it gets formulated into an equation, it looks somewhat like this. Guys, this is nothing but an S-curve, or you can say the sigmoid curve, the curve of the sigmoid function. This sigmoid function basically converts any value from minus infinity to infinity into a value between 0 and 1, which can then be turned into the discrete values that logistic regression wants, or you can say the values in binary format, either 0 or 1. If you see here, the values end up at either 0 or 1, and the part in between is nothing but the transition. But guys, there's a catch over here. Let's say I have a data point that is 0.8. How can you decide whether your value is 0 or 1? Here you have the concept of a threshold, which basically divides your line. The threshold value basically indicates the probability of either winning or losing. By winning I mean the value is equal to 1, and by losing I mean the value is equal to 0. But how does it do that? Let's take a data point over here; let's say my cursor is at 0.8. I check whether this value is less than the threshold value or not. If it is more than the threshold value, it should give me the result as 1; if it is less than that, it should give me the result as 0. Here my threshold value is 0.5. If my value is, let's say, 0.8, it is more than 0.5, so the value is rounded off to 1. And if it is less than 0.5, let's say 0.2, then it is reduced to 0. So you can use the concept of a threshold value to find the output, which should be discrete: it should be either 0 or 1. I hope you caught this curve of logistic regression. So guys, this is the sigmoid S-curve.
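The sigmoid and the threshold rule described above can be sketched as follows. This is a minimal illustration in plain Python; the function names `sigmoid` and `classify` are chosen here for clarity and are not from the session. The threshold of 0.5 is the one used in the explanation.

```python
import math


def sigmoid(z):
    """Squash any real number z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Turn a probability into a discrete 0/1 outcome using a threshold."""
    return 1 if probability >= threshold else 0


# A few inputs across the S-curve: large negatives approach 0, large positives approach 1.
for z in (-6, -1, 0, 1, 6):
    p = sigmoid(z)
    print(f"z = {z:>2}  sigmoid = {p:.3f}  class = {classify(p)}")

# The two data points from the explanation: 0.8 becomes 1, 0.2 becomes 0.
print(classify(0.8), classify(0.2))
```

Note the split of responsibilities: the sigmoid only produces a value between 0 and 1, and it is the threshold step that makes the final output discrete.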
So to make this curve we need to make an equation, so let me address that part as well. Let's see how an equation is formed to imitate this functionality. Over here we have the equation of a straight line, which is y = mx + c. In this case I have only one independent variable, but if we have many independent variables, then the equation becomes m1 x1 + m2 x2 + m3 x3 and so on till mn xn. Now let us replace m with b, so the equation becomes y = b1 x1 + b2 x2 + b3 x3 and so on till bn xn, plus c. Guys, the equation of the straight line has a range from minus infinity to infinity. But in our case, or you can say in logistic regression, the value which we need to predict, the y value, can have a range only from 0 to 1. So we need to transform this equation. To do that, we divide y by 1 minus y. If y is equal to 0, we get 0 over (1 minus 0), which is 0 over 1, and that is again 0. If we take y equal to 1, we get 1 over (1 minus 1), which is 1 over 0, and that is infinity. So my range is now between 0 and infinity. But again, we want the range from minus infinity to infinity, so for that we take the log of this expression. Let's go ahead and take the logarithm, which transforms it further to get the range between minus infinity and infinity. Over here we have the log of y over (1 minus y), and this is your final logistic regression equation. Guys, don't worry, you don't have to write this formula or memorize it. In Python, you just need to call the function, which is LogisticRegression, and everything will be done automatically for you. I don't want to scare you with the maths and the formulas behind it, but it is always good to know how this formula was generated. So I hope you guys are clear with how logistic regression comes into the picture.
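Written out, the transformation described above goes as follows. The symbols follow the spoken derivation (y for the predicted value, b_1 ... b_n for the coefficients, c for the intercept); the last line, which solves for y, is a standard algebraic step added here to show how the S-curve falls out of the log-odds equation.

```latex
% Step 1: odds. For y between 0 and 1 the ratio ranges from 0 to infinity.
\frac{y}{1-y} \in (0, \infty) \quad \text{for } y \in (0, 1)

% Step 2: log-odds. Taking the log stretches the range to (-infinity, infinity),
% so it can be set equal to the straight-line expression.
\log\!\left(\frac{y}{1-y}\right) = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + c

% Step 3: solving for y gives the sigmoid (S-curve).
y = \frac{1}{1 + e^{-(b_1 x_1 + b_2 x_2 + \dots + b_n x_n + c)}}
```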
Let us now see the major differences between linear regression and logistic regression. First of all, in linear regression the value of y is a continuous variable, or the variable we need to predict is continuous in nature. Whereas in logistic regression, we have a categorical variable, so the value which you need to predict should be discrete in nature. It should be either 0 or 1, or it should have just two values. For example, whether it is raining or not raining, whether it is humid outside or not humid outside, whether it is going to snow or not going to snow. These are a few examples where the values we need to predict are discrete, or where you just predict whether something is happening or not. Next, linear regression solves regression problems. Here you have the concept of an independent variable and a dependent variable, and you can calculate the value of y, which you need to predict, using the value of x. Your y variable, or the value that you need to predict, is in a range. Whereas in logistic regression you have discrete values, so logistic regression basically solves a classification problem. It can classify, and it can give you the result of whether an event is happening or not. I hope it is pretty clear till now. Next, in linear regression the graph that you have seen is a straight-line graph, so you can calculate the value of y with respect to the value of x. Whereas in logistic regression, the curve that we got was an S-curve, or you can say the sigmoid curve, so using the sigmoid function you can predict your y values. I hope you guys are clear with the differences between linear regression and logistic regression. Moving ahead, let us see the various use cases wherein logistic regression is implemented in real life. The very first is weather prediction. Logistic regression helps you to predict the weather.
For example, it is used to predict whether it is raining or not, whether it is sunny or not, whether it is cloudy or not. All these things can be predicted using logistic regression. You need to keep in mind that both linear regression and logistic regression can be used in predicting the weather. In that case, linear regression helps you to predict what the temperature will be tomorrow, whereas logistic regression will only tell you whether it is going to rain or not, whether it is cloudy or not, whether it is going to snow or not. These values are discrete. Whereas if you apply linear regression, you will be predicting things like what the temperature is tomorrow or what the temperature is the day after tomorrow, and all those things. These are the slight differences between linear regression and logistic regression. Moving ahead, we have classification problems. Logistic regression can also perform multi-class classification, so it can help you tell whether it's a bird or not a bird. Then you can classify different kinds of mammals, let's say whether it's a dog or not a dog. Similarly, you can check for reptiles, whether it's a reptile or not a reptile. So logistic regression can perform multi-class classification. This point I have already discussed, that it is used in classification problems. Next, it also helps you to determine illness. Let me take an example. Let's say a patient goes for a routine checkup in a hospital. The doctor will perform various tests on the patient and will check whether the patient is actually ill or not. So what will the features be? The doctor can check the sugar level, the blood pressure, the age of the patient (is it a very small child or an old person?), and the previous medical history of the patient. All of these features will be recorded by the doctor, and finally the doctor checks the patient data and determines the outcome of the illness and the severity of the illness.
So using all the data, a doctor can identify whether a patient is ill or not. These are the various use cases in which you can use logistic regression. Now, I guess that's enough of the theory part, so let's move ahead and see some practical implementations of logistic regression. Over here, I will be implementing two projects. In the first one, I have the data set of the Titanic, and we will predict what factors made people more likely to survive the sinking of the Titanic. In my second project, we'll see data analysis on SUV cars. Over here we have data on who can purchase an SUV and what factors made people more interested in buying one. These will be the major questions as to why you should implement logistic regression and what output you will get from it. So let's start with the very first project, that is, Titanic data analysis. Some of you might know that there was a ship called the Titanic, which hit an iceberg and sank to the bottom of the ocean. It was a big disaster at that time because it was the first voyage of the ship, and it was supposed to be really strongly built and one of the best ships of that time. And of course there is a movie about this as well, so many of you might have watched it. What we have is data on the passengers, those who survived and those who did not survive this particular tragedy. What you have to do is look at this data and analyze which factors would have contributed the most to the chances of a person's survival on the ship. Using logistic regression, we can predict whether the person survived or the person died. Apart from this, we will also have a look at the various features along with that.
So first let us explore the data set. Over here we have the index value, then the first column is PassengerId, and my next column is Survived. Over here we have two values, 0 and 1: 0 stands for did not survive and 1 stands for survived. So this column is categorical, where the values are discrete. Next we have the passenger class. Over here we have three values, 1, 2 and 3, which basically tell you whether a passenger is traveling in first class, second class or third class. Then we have the name of the passenger. We have the sex, or you can say the gender of the passenger, whether the passenger is male or female. Then we have the age. We have SibSp, which basically means the number of siblings or spouses aboard the Titanic, so over here we have values such as 1, 0 and so on. Then we have Parch, which is basically the number of parents or children aboard the Titanic, so over here we also have some values. Then we have the ticket number, the fare, the cabin number, and the Embarked column. In my Embarked column we have three values: S, C and Q. S stands for Southampton, C stands for Cherbourg and Q stands for Queenstown. These are the features that we'll be applying our model on. Here we'll perform various steps and then we'll implement logistic regression. These are the various steps which are required to implement any algorithm; in our case we are implementing logistic regression. The very first step is to collect your data, or to import the libraries that are used for collecting your data and taking it forward. My second step is to analyze your data. Over here I can go through the various fields and analyze the data. I can check: did the females or children survive better than the males? Did the rich passengers survive more than the poor passengers? Did money matter, as in, were those who paid more to get onto the ship evacuated first?
And what about the workers? Did the workers survive, or what is the survival rate if you were a worker on the ship and not just a traveling passenger? All of these are very interesting questions, and you would be going through all of them one by one. So in this stage you need to analyze your data and explore it as much as you can. Then the third step is to wrangle your data. Data wrangling basically means cleaning your data, so over here you can simply remove the unnecessary items, or if you have null values in the data set, you can clear that data and then take it forward. In the fourth step you build your model using the train data and then test it using the test data. Over here you will be performing a split, which basically splits your data set into a training and a testing data set. And finally, in the fifth step, you check the accuracy, so as to ensure how accurate your values are. I hope you guys got these five steps that you're going to implement in logistic regression. Now let's go into all these steps in detail. Number one, we have to collect the data, or you can say import the libraries. Let me show you the implementation part as well. I'll just open my Jupyter notebook and implement all of these steps side by side. So guys, this is my Jupyter notebook. First, let me rename the notebook to, let's say, Titanic data analysis. Our first step was to import all the libraries and collect the data, so let me import all the libraries first. First of all, I'll import pandas. Pandas is used for data analysis, so I'll say import pandas as pd. Then I will be importing NumPy, so I'll say import numpy as np. NumPy is a library in Python which basically stands for numerical Python, and it is widely used to perform scientific computation. Next, we will be importing Seaborn. Seaborn is a library for statistical plotting, so I'll say import seaborn as sns. I'll also import matplotlib.
The matplotlib library is again for plotting, so I'll say import matplotlib.pyplot as plt. Now, to show the plots from this library in the Jupyter notebook, all I have to write is %matplotlib inline. Next, I will be importing one more module so as to calculate the basic mathematical functions, so I'll say import math. These are the libraries that I will be needing in this Titanic data analysis. Now let me import my data set. I will take a variable, let's say titanic_data, and using pandas I will read my CSV, or you can say the data set. I'll write the name of my data set, that is, Titanic.csv. I have already shown you the data set, so over here let me just print the top 10 rows. For that I take the variable titanic_data, call head, and ask for the top ten rows. Now I'll run this. To run a cell you have to press Shift+Enter, or else you can just click the run button. Over here I have the index. We have the PassengerId, which is nothing but again an index, starting from 1. Then we have the Survived column, which has categorical values, or you can say discrete values, in the form of 0 or 1. Then we have the passenger class, the name of the passenger, sex, age and so on. This is the data set that I will be going forward with. Next, let us print the number of passengers in this original data set. For that I'll simply type print, and I'll say number of passengers. Using the len function I can calculate the total length, so I'll say len, and inside it I will pass the variable titanic_data, which I'll just copy and paste from here, followed by .index. Next, let me print this.
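The loading step can be sketched like this. It is a minimal, self-contained illustration: instead of reading the real `Titanic.csv`, it reads a tiny made-up sample from an in-memory string, and the column names are an assumption based on the commonly distributed version of the data set (the session only names the columns verbally).

```python
import io

import pandas as pd

# Hypothetical stand-in for Titanic.csv: four made-up rows, assumed column names.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger A,male,22,1,0,T1,7.25,,S\n"
    "2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C\n"
    "3,1,3,Passenger C,female,26,0,0,T3,7.92,,S\n"
    "4,0,3,Passenger D,male,,0,0,T4,8.46,,Q\n"
)

# With the real file this would be: titanic_data = pd.read_csv("Titanic.csv")
titanic_data = pd.read_csv(sample_csv)

# First rows of the data (head(10) returns fewer rows if the frame is shorter).
print(titanic_data.head(10))

# Number of rows, that is, number of passengers in the data set.
print("# of passengers in original data:", len(titanic_data.index))
```

Empty fields in the CSV (the missing age and cabins above) are read as NaN, which is what the later data wrangling step looks for.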
So the number of passengers in the original data set is 891; around this number were traveling on the Titanic. Over here my first step is done, where you have collected the data, imported all the libraries and found out the total number of passengers on the Titanic. Now let me go back to the presentation and see what my next step is. We're done with collecting data; the next step is to analyze your data. Over here we will be creating different plots to check the relationship between variables, as in how one variable is affecting the other. You can simply explore your data set by making use of the various columns and then plot a graph between them. You can either plot a correlation graph or a distribution curve; it's up to you guys. So let me go back to my Jupyter notebook and analyze some of the data. Over here, my second part is to analyze data, so I'll put this in heading 2. To put this in heading 2, I just have to change the cell from Code to Markdown and run it. First let us plot a count plot to compare the passengers who survived and who did not survive. For that I will be using the Seaborn library. I have imported seaborn as sns, so I don't have to write the whole name. I'll simply say sns.countplot, I set x to Survived, and the data that I'll be using is titanic_data, or you can say the name of the variable in which you have stored your data set. Now let me run this. As you can see, I have the Survived column on my x-axis and the count on the y-axis. 0 stands for did not survive and 1 stands for the passengers who did survive. Over here you can see that around 550 passengers did not survive and only around 350 passengers survived. So here you can basically conclude that there are far fewer survivors than non-survivors.
So this was the very first plot. Now let us draw another plot to compare the sexes: out of all the passengers who survived and who did not survive, how many were men and how many were women? To do that, I'll simply say sns.countplot, with x as Survived and the hue as Sex, because I want to know how many females and how many males survived. Then I'll specify the data, so I'm using the Titanic data set. Let me run this. I have made a mistake over here; let me fix it. Now you can see I have the Survived column on the x-axis and the count on the y-axis. The blue color stands for your male passengers and orange stands for your female passengers. As you can see, among the passengers who did not survive, that is, with value 0, the majority were male, and among the people who survived, the majority were female. So this basically shows the survival rate by gender. It appears that, on average, women were more than three times more likely to survive than men. Next, let us draw another plot where we have the hue as the passenger class, so we can see which class the passenger was traveling in, whether class one, two or three. For that I'll write the same command. I'll say sns.countplot, I keep my x-axis as Survived, and I'll change my hue to passenger class; my variable is named Pclass. The data set that I'll be using is titanic_data. This is my result. Over here you can see I have blue for first class, orange for second class and green for third class. The passengers who did not survive were mostly from third class, or you can say the lowest or cheapest class to get onto the Titanic, and the people who did survive mostly belonged to the higher classes. Among the survivors, classes 1 and 2 have higher bars than the passengers who were traveling in third class. So here we have concluded that the passengers who did not survive were mostly from third class.
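The counts that these count plots draw as bars can also be inspected as plain tables. The sketch below is an illustration on a tiny made-up sample, not the real Titanic data, and it uses `pandas.crosstab` as a numeric stand-in for `sns.countplot` with a `hue`; the column names are assumed.

```python
import pandas as pd

# Made-up sample with the three columns used in the plots above.
titanic_data = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 0, 1, 0],
    "Sex": ["male", "male", "female", "female", "female", "male", "male", "male"],
    "Pclass": [3, 3, 3, 1, 2, 2, 1, 3],
})

# Equivalent of the first count plot: how many did not survive (0) vs survived (1).
survived_counts = titanic_data["Survived"].value_counts().sort_index()

# Equivalent of adding hue="Sex": one count per (Survived, Sex) pair.
by_sex = pd.crosstab(titanic_data["Survived"], titanic_data["Sex"])

# Equivalent of adding hue="Pclass": one count per (Survived, Pclass) pair.
by_class = pd.crosstab(titanic_data["Survived"], titanic_data["Pclass"])

print(survived_counts)
print(by_sex)
print(by_class)
```

Each cell of a crosstab corresponds to the height of one bar in the matching count plot, which makes it easy to check a visual impression against actual numbers.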
In other words, the lowest class fared worst, and the passengers who were traveling in first and second class tended to survive more. Next, let me plot a graph for the age distribution. Over here I can simply use my data, and we'll be using the pandas library for this. I take titanic_data, pass in the column, that is, Age, and since I want a histogram I'll say plot.hist. You can notice over here that we have more young passengers, or you can say children between the ages of 0 and 10, then we have the average-aged people, and as you go ahead, the older the age, the smaller the population. So this is the analysis on the Age column: we saw that we have more young passengers and more middle-aged passengers traveling on the Titanic. Next, let me plot a graph of the fare as well. I'll say titanic_data, I select Fare, and again I want a histogram, so I'll say hist. Here you can see the fares mostly lie between 0 and 100. Now let me add the bin size so as to make it clearer. Over here I'll say bins equals, let's say, 20, and I'll increase the figure size as well, so I'll say figsize and give the dimensions as 10 by 5. This is clearer now. Next, let us analyze the other columns as well. I'll type titanic_data.info, because I want the information as to which columns are left. Here we have PassengerId, which I guess is of no use. Then we have Survived, where we saw how many passengers survived and how many did not. We also did the analysis on a gender basis and saw whether the females or the males tended to survive more. Then we saw the passenger class, whether the passenger is traveling in first, second or third class. Then we have the name; on the name we cannot do any analysis. We saw the sex and we saw the age as well. Then we have SibSp, which stands for the number of siblings or spouses aboard the Titanic, so let us do this as well. I'll say sns.countplot.
I set x to SibSp, and I will be using the Titanic data. You can see the plot over here. It has the maximum value at zero, so you can conclude that most passengers had neither a sibling nor a spouse on board the Titanic. The second highest value is 1, and then we have various values for 2, 3, 4 and so on. Next, if I go back up to the list of columns, we have Parch, which is the number of parents or children aboard the Titanic, so similarly you can do this for Parch as well. Then we have the ticket number; I don't think any analysis is required for the ticket. Then we have the fare, which we have already discussed, as in the people traveling in first class would pay the highest fare. Then we have the cabin number and we have Embarked. These are the columns that we'll be doing data wrangling on. So we have analyzed the data and we have seen quite a few graphs from which we can conclude which variable matters more than another, or what the relationship is. Now the third step is data wrangling. Data wrangling basically means cleaning your data. If you have a large data set, you might be having some null values, or you can say NaN values, so it's very important that you remove all the unnecessary items that are present in your data set, because this directly affects your accuracy. So I'll go ahead and clean my data by removing all the NaN values and the unnecessary columns which have null values in the data set. So next I'm performing data wrangling. First of all, I'll check whether my data set has nulls or not. I'll say titanic_data, which is the name of my data set, and I'll say isnull. This will basically tell me which values are null and will return a Boolean result.
So this basically checks for missing data, and your result will be in Boolean format, as in the result will be True or False. False means the value is not null and True means the value is null. Let me run this. Over here you can see the values as False or True: False is where the value is not null and True is where the value is null. You can see in the Cabin column that the very first value is null, so we have to do something about this. We have a large data set, so the listing goes on and on; instead, we can take the sum of it and print the number of passengers who have a NaN value in each column. So I'll say titanic_data.isnull, and since I want the sum of it, I'll add .sum. This basically prints the number of passengers who have NaN values in each column. We can see that we have missing values in the Age column, that is, 177. Then we have the maximum number in the Cabin column, and we have very few in the Embarked column, that is, 2. If you don't want to look at these numbers, you can also plot a heat map and analyze it visually, so let me do that as well. I'll say sns.heatmap, pass in titanic_data.isnull, and set yticklabels to False. Let's run this. As we have already seen, there were three columns in which missing data was present. The first is Age, where almost 20% of the column has a missing value. Then we have the Cabin column, where the share is quite large, and then we have two values for the Embarked column as well. Let me add a cmap for color coding, so I'll set cmap. If I do this, the graph becomes more attractive. Over here, yellow stands for True, or you can say the values that are null. So here we have found that we have missing values in Age, a lot of missing values in the Cabin column, and very few, not even visible, in the Embarked column.
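The missing-value check described above can be sketched on a small made-up frame. The data below is invented for illustration (the real counts quoted in the session are 177 for Age and 2 for Embarked); `isnull().sum()` gives the count per column and `isnull().mean()` gives the share per column.

```python
import numpy as np
import pandas as pd

# Made-up sample with some missing entries; not the real Titanic data.
titanic_data = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Boolean frame: True where a value is missing, False otherwise.
is_missing = titanic_data.isnull()

# Number of missing values in each column (True counts as 1 when summed).
missing_per_column = is_missing.sum()

# Fraction of missing values in each column (mean of a Boolean column).
missing_share = is_missing.mean()

print(missing_per_column)
print(missing_share)
```

The same Boolean frame is what gets passed to the heat map in the walkthrough, so the table and the picture always agree.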
So to remove these missing values, you can either replace the values by putting in some dummy values, or you can simply drop the column. Here, let us pick the Age column. First, let me plot a box plot to analyze the Age column. I'll say sns.boxplot, I'll say x equals the passenger class, so it's Pclass, I'll say y equals Age, and the data set that I'll be using is the Titanic one, so I'll say data equals titanic_data. You can see that the age in first class and second class tends to be older than in third class. Well, that may depend on experience, on how much you earn, or there might be any number of reasons. So here we concluded that passengers who were traveling in class one and class two tend to be older than those in class three. We have found that we have some missing values in Age. One way is to just drop the column, or you can simply fill in some values for them; this method is called imputation. Now, to perform data wrangling or cleaning, let us first print the head of the data set. I'll say titanic_data.head, and let's say I just want five rows. Here we have Survived, which is again categorical, so on this particular column I can apply logistic regression. This can be my y value, or the value that you need to predict. Then we have the passenger class, the name, the ticket number, and the cabin. Over here we have seen that in Cabin we have a lot of null values, or you can say NaN values, which is quite visible as well. So first of all, we'll just drop this column. For dropping it, I'll say titanic_data, I'll simply type drop and the column which I need to drop, so I have to drop the Cabin column. I mention axis equals 1 and I'll say inplace equals True.
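The two options mentioned above, dropping a mostly empty column and imputing missing values, can be sketched as follows. The frame is made up for illustration, and filling with the column mean is just one simple imputation choice, not necessarily what the session goes on to use. The sketch returns new frames rather than using `inplace=True`, so both variants can be compared side by side.

```python
import numpy as np
import pandas as pd

# Made-up sample; not the real Titanic data.
titanic_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Option 1: drop the Cabin column (axis=1 means "drop a column, not a row").
without_cabin = titanic_data.drop("Cabin", axis=1)

# Option 2 (imputation): fill the missing ages with the mean of the known ages.
imputed = without_cabin.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].mean())

print(without_cabin)
print(imputed)
```

Dropping is reasonable when most of a column is empty, as with Cabin; imputation keeps the rows when only a minority of values are missing, as with Age.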
Now again, I'll print the head and let us see whether this column has been removed from the data set or not. I'll say titanic_data.head. As you can see, we don't have the Cabin column anymore. Now you can also drop the NA values. I'll say titanic_data.dropna to drop all the NA values, or you can say NaN, which is not a number, and I'll say inplace equals True. Over here, let me again plot the heat map and see whether the values which were earlier showing as null have been removed or not. I'll say sns.heatmap, I'll pass in the data set with isnull, and I'll say yticklabels equals False. I don't want the color bar, so again I say cbar equals False. This will basically help me check whether the null values have been removed from the data set or not. As you can see, I don't have any null values, so it's entirely black. You can actually check the sum as well. I'll go above, copy this part, and use the sum function to calculate the sum. This tells me that the data set is clean, as in the data set does not contain any null value or any NaN value. So now we have wrangled the data, or you can say cleaned the data. Here we have done just one step in data wrangling, that is, removing one column. You can do a lot of other things: you can fill in the missing values with some other values, or you can calculate the mean and then fill in the null values with it. But now, if I look at my data set, so I'll say titanic_data.head, I have a lot of string values. These have to be converted to categorical variables in order to implement logistic regression. So what we will do is convert these into dummy variables, and this can be done using pandas, because logistic regression just takes two values.
Whenever you apply machine learning, you need to make sure that there are no string values present, because the model won't take these as input variables. Using strings you cannot predict anything. In my case I have the Survived column to tell how many people survived and how many did not: 0 stands for did not survive and 1 stands for survived. Now let me convert these variables into dummy variables. I'll use pandas and say pd.get_dummies; you can simply press Tab to autocomplete. I'll say titanic_data and pass the Sex column. You can press Shift+Tab to get more information on this. Here we have the type DataFrame, and we have the PassengerId, Survived and passenger class. If I run this, you'll see that in the female column 0 stands for not a female and 1 stands for a female. Similarly, for male, 0 stands for not male and 1 stands for male. Now, we don't require both these columns, because one column by itself is enough to tell us whether it's a male or a female. Let's say I want to keep only male: if the value of male is 1, then it is definitely a male and not a female. That is why you don't need both of these columns. So I'll remove the first column, let's say female, by saying drop_first equals True. It has given me just one column, which is male, with the values 0 and 1. Let me store this in a variable named sex. Over here I can say sex.head, since I just want to see the first five rows. Sorry, it's a dot. So this is how my data looks now. Here we have done it for Sex. Then we have the numerical values in Age, the numerical values in SibSp, then the ticket number, the fare, and Embarked as well. In Embarked the values are S, C and Q, so here also we can apply this get_dummies function. Let's say I take a variable named embark. I'll use the pandas library and enter the column name, that is, Embarked.
Let me just print the head of it. So I'll say embark.head(). So over here we have C, Q and S. Now here also we can drop the first column, because two values are enough to tell where the passenger boarded: either Q, that is Queenstown, or S for Southampton, and if both the values are 0 then definitely the passenger is from Cherbourg. That is the third value, so you can again drop the first column. So I'll say drop_first is equal to True. Let me just run this. So this is how my output looks now. Similarly you can do it for the passenger class as well. So here also we have three classes, one, two and three, so I'll just copy the whole statement. So let's say I want the variable name to be pcl. I'll pass in the column name, that is Pclass, and I'll just drop the first column. So here the values will be 1, 2 or 3 and I'll just remove the first column. So here we are just left with 2 and 3, so if both the values are 0 then definitely the passenger is travelling in first class. Now we have made the values categorical. My next step would be to concatenate all these new columns into the data set, that is titanic_data. Using pandas we'll just concatenate all these columns. So I'll say pd.concat, and then say we have to concatenate sex, embark and pcl, and then I will mention the axis as 1. I'll just run this and print the head, so over here you can see that these columns have been added. So we have the male column, which basically tells whether the person is male or female, then we have the embark columns, which are Q and S, so if the passenger boarded at Queenstown the value would be 1, else it would be 0, and if both of these values are 0, the passenger definitely boarded at Cherbourg. Then we have the passenger class as 2 and 3. So if the value of both of these is 0, then the passenger is travelling in class one.
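To make the drop-first idea above concrete, here is a minimal pure-Python sketch of what dummy encoding with the first category dropped produces. The helper name dummies_drop_first and the toy values are assumptions for illustration only; in the session itself this is done with pd.get_dummies and drop_first=True.

```python
def dummies_drop_first(values):
    """One-hot encode values, dropping the first category in sorted order.

    Hypothetical helper for illustration; it only mimics the idea behind
    pd.get_dummies(..., drop_first=True) on a single column.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is implied by all zeros
    return [{cat: int(value == cat) for cat in kept} for value in values]


# Toy values, not rows from the real Titanic data set.
embarked = ["S", "C", "Q", "S"]
for value, row in zip(embarked, dummies_drop_first(embarked)):
    # "C" (Cherbourg) shows up as Q == 0 and S == 0.
    print(value, row)
```

The same helper applied to a sex column keeps only a male indicator, and applied to the passenger class keeps only the columns for 2 and 3, which matches the columns described above.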
So I hope you got this till now. Now there are irrelevant columns that we still have over here, so we can just drop these columns: we'll drop Pclass, the Embarked column and the Sex column. So I'll just type in titanic_data.drop and mention the columns that I want to drop. I'll even drop the passenger ID, because it's nothing but the index value, which starts from one. So I'll drop this as well. Then I don't want the name either, so I'll delete Name as well. Then what else can we drop? We can drop the ticket as well. And then I'll just mention the axis and say inplace is equal to True. Okay, so my column names start with uppercase. So these have been dropped. Now let me just print my data set again. So this is my final data set, guys. We have the Survived column, which has the values 0 and 1, then we have the passenger class. Oh, we forgot to drop this as well. So no worries, I'll drop this again. So now let me just run this. So over here we have Survived, we have Age, we have SibSp, we have Parch, we have Fare, male, and these columns we have just converted. So here we have just performed data wrangling, or you can say cleaned the data, and then we have converted the values of gender to male, then Embarked to Q and S, and the passenger class to 2 and 3. So this was all about data wrangling, or just cleaning the data. Then my next step is training and testing your data. So here we will split the data set into a train subset and a test subset. And then we'll build a model on the train data and predict the output on the test data set. So let me just go back to Jupyter and implement this as well. Over here I need to train my data set, so I'll just put this in as heading 3. So over here you need to define your dependent variable and independent variable. So here my y is the output, or you can say the value that you need to predict, so over here I will write titanic_data and take the column which is Survived.
So basically I have to predict this column, whether the passenger survived or not. And as you can see we have a discrete outcome, which is in the form of 0 and 1, and all the rest we can take as features, or you can say independent variables. So I'll say titanic_data.drop, so we just simply drop Survived and all the other columns will be my independent variables. So everything else is a feature which leads to the survival rate. So once we have defined the independent variable and the dependent variable, the next step is to split your data into training and testing subsets. So for that we will be using sklearn. I'll just type in from sklearn.cross_validation import train_test_split. Now here if you press Shift and Tab, you can go to the documentation and see the examples over here. I'll click the plus to open it, and then I'll go to the examples and see how you can split your data. So over here you have X_train, X_test, y_train, y_test, and then using train_test_split you just pass in your independent variable and dependent variable and define a size and a random state for it. So let me just copy this and paste it over here. Over here we have X train and test, then we have the dependent variable train and test, and using the split function we'll pass in the independent and dependent variables and then set a test size. So let's say I'll put it as 0.3. So this basically means that your data set is divided in a 70/30 ratio, and then I can add any random state to it. So let's say I'm applying 1. This is not necessary; if you want the same result as mine, you can add the random state. So this will basically take exactly the same sample every time. Next I have to train and predict by creating a model. So here we'll grab logistic regression from the linear model module. So next I'll just type in from sklearn.linear_model import LogisticRegression.
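As a rough illustration of what a 70/30 split with a fixed random state means, here is a small standard-library sketch. The function split_rows and its parameters are hypothetical and are not scikit-learn's implementation; they only mirror the idea of shuffling with a seed and holding out 30 percent of the rows. Note also that in current scikit-learn releases train_test_split is imported from sklearn.model_selection rather than sklearn.cross_validation.

```python
import random


def split_rows(rows, test_size=0.3, seed=1):
    """Shuffle rows reproducibly and hold out a test_size fraction.

    Hypothetical helper: same idea as a seeded train/test split,
    not the library's actual algorithm.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed, same order every run
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten passengers
train, test = split_rows(rows)
print(len(train), len(test))  # 7 rows to train on, 3 held out
```

Running the function twice with the same seed returns the same split, which is the point of setting a random state when you want to reproduce someone else's result.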
Next I'll just create an instance of this logistic regression model. So I'll say logmodel is equal to LogisticRegression. Now I just need to fit my model. So I'll say logmodel.fit and I'll just pass in my X_train and y_train. Alright, so here it gives me all the details of the logistic regression. So here it gives me the class_weight, dual, fit_intercept and all those things. Then what I need to do is make predictions. So I'll take a variable called predictions and I'll pass the model's output to it. So I'll say logmodel.predict and I'll pass in the value, that is X_test. So here we have just created a model, fit that model, and then made predictions. So now to evaluate how my model has been performing, you can simply calculate the accuracy or you can also calculate a classification report. So don't worry guys, I'll be showing both of these methods. So I'll say from sklearn.metrics import classification_report. So over here I'll use classification_report, and inside this I'll be passing in y_test and the predictions. So guys, this is my classification report. So over here I have the precision, I have the recall, we have the F1 score and then we have support. So here we have the values of precision as 75, 72 and 73, which is not that bad. Now in order to calculate the accuracy as well, you can also use the concept of the confusion matrix. So if you want to print the confusion matrix, I will simply say from sklearn.metrics import confusion_matrix first of all, and then we'll just print this. So now my function has been imported successfully, so I'll say confusion_matrix and again pass in the same variables, which are y_test and predictions. So I hope you guys already know the concept of the confusion matrix. So I'll just tell you in brief what the confusion matrix is all about. So a confusion matrix is nothing but a 2 by 2 matrix which has four outcomes. This basically tells us how accurate your values are. So here we have the columns as predicted no and predicted yes, and we have the rows as actual no and actual yes.
So this is the concept of the confusion matrix. So here let me just fill in these values which we have just calculated. So here we have 105, 21, 25 and 63. So as you can see here, we have got four outcomes. Now 105 is the value where the model has predicted no, and in reality it was also a no, so that is predicted no and actual no. Similarly, we have 63 as predicted yes. So here the model predicted yes, and actually also it was a yes. So in order to calculate the accuracy, you just need to add these two values and divide by the sum of all four. So here these two values tell me where the model has actually predicted the correct output. This value is also called true negative, this is called false positive, this is called true positive and this is called false negative. Now in order to calculate the accuracy, you don't have to do it manually. So in Python, you can just import the accuracy_score function and you can get the results from that. So I'll just do that as well. So I'll say from sklearn.metrics import accuracy_score and I'll simply print the accuracy and pass in the same variables, that is y_test and predictions. So over here, it tells me the accuracy is 78 percent, which is quite good. So over here, if you want to do it manually, you add these two numbers, which is 105 plus 63. So this comes out to 168, and then you have to divide by the sum of all four numbers. So 105 plus 63 plus 21 plus 25, so this gives me a result of 214. So now if you divide these two numbers, you'll get the same accuracy, that is 78 percent, or you can say 0.78. So that is how you can calculate the accuracy. So now let me just go back to my presentation and let's see what all we have covered till now.
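The manual accuracy calculation just described can be written out in a few lines of plain Python. The four counts come from the confusion matrix quoted above; which of the two off-diagonal counts is the false positive and which is the false negative is an assumption here, and it does not change the accuracy.

```python
# Counts read off the confusion matrix above.
# Assumption: 21 is the false-positive cell and 25 the false-negative cell.
true_negative, false_positive = 105, 21
false_negative, true_positive = 25, 63

correct = true_negative + true_positive                # 105 + 63
total = correct + false_positive + false_negative      # all four cells
accuracy = correct / total

print(correct, total, round(accuracy, 3))  # 168 214 0.785
```

So the 78 percent reported by accuracy_score is simply the correct predictions on the diagonal divided by all test rows.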
So here we have first split our data into train and test subsets, then we have built a model on the train data and predicted the output on the test data set, and then my fifth step is to check the accuracy. So here we have calculated the accuracy as almost 78 percent, which is quite good. You cannot say that the accuracy is bad. So here it tells me how accurate your results are, so the accuracy score defines that, and hence we got a good accuracy. So now moving ahead, let us see the second project, that is SUV data analysis. So in this, a car company has released a new SUV in the market and, using the previous data about the sales of their SUVs, they want to predict the category of people who might be interested in buying this. So using logistic regression, you need to find what factors made people more interested in buying this SUV. So for this let us see the data set, where I have user ID, I have gender as male and female, then we have the age, we have the estimated salary, and then we have the purchased column. So this is my discrete column, or you can say the categorical column. So here we just have the values 0 and 1, and this column we need to predict, that is whether a person can actually purchase an SUV or not. So based on these factors, we will be deciding whether a person can actually purchase an SUV or not. So we know the salary of a person, we know the age, and using these we can predict whether the person can actually purchase an SUV or not. So let me just go to my Jupyter notebook and implement logistic regression. So guys, I will not be going through all the details of the data cleaning and analyzing part. I'll just leave that to you. So just go ahead and practice as much as you can. Alright, so the second project is SUV predictions. So first of all, I have to import all the libraries, so I say import numpy as np and similarly I'll do the rest of it. Alright, so now let me just print the head of this data set.
So this we have already seen: we have columns as user ID, we have gender, we have the age, we have the salary, and then we have to calculate whether a person can actually purchase an SUV or not. So now let us just simply go on to the algorithm part. So we'll directly start off with logistic regression and how you can train a model. So for doing all those things, we first need to define the independent variable and the dependent variable. So in this case, I want my X, that is the independent variable, to be dataset.iloc, so here I will be specifying all the rows, the colon basically stands for that, and in the columns I want only 2 and 3, dot values. So this should fetch me all the rows and only the second and third columns, which are age and estimated salary. So these are the factors which will be used to predict the dependent variable, that is purchased. So here my dependent variable is purchased and the independent variables are age and salary, so I'll say dataset.iloc, I'll have all the rows and just the fourth column, that is my purchased column, dot values. All right, so I just forgot one square bracket over here. Alright, so over here I have defined my independent variable and dependent variable. So here my independent variables are age and salary and the dependent variable is the purchased column. Now, you must be wondering what this iloc function is. So iloc is basically an indexer of a pandas DataFrame, and it is used for integer-based indexing, or you can also say selection by index. Now let me just print these independent variables and the dependent variable. If I print the independent variable I have age as well as salary. Next, let me print the dependent variable as well. So over here you can see I just have the values 0 and 1, so 0 stands for did not purchase. Next, let me just divide my data set into training and test subsets. So I'll simply write from sklearn.cross_validation
import train_test_split. Next I'll just press Shift+Tab and over here I'll go to the examples and just copy the same line. So I'll just copy this. Now, I want the test size to be, let's say, 0.25, so I have divided the train and test sets in a 75/25 ratio. Now let's say I'll take the random state as 0. So the random state basically ensures the same result, or you can say the same samples are taken whenever you run the code. So let me just run this. Now, you can also scale your input values for better performance, and this can be done using StandardScaler. So let me do that as well. So I'll say from sklearn.preprocessing import StandardScaler. Now, why do we scale it? If you see the data set, we are dealing with large numbers. Well, although we are using a very small data set here, whenever you're working in a prod environment you'll be working with a large data set, with thousands and hundreds of thousands of tuples, so there scaling down will definitely affect the performance to a large extent. So here let me just show you how we can scale down these input values. The preprocessing module contains all the methods and functionality required to transform your data. So now let us scale down the test as well as the training data set. So I'll first make an instance of it. So I'll say sc is equal to StandardScaler. Then I have X_train is equal to sc.fit_transform, and I'll pass in my X_train. And similarly I can do it for test, wherein I'll pass X_test. All right. Now my next step is to import logistic regression. So I'll simply apply logistic regression by first importing it. So I'll say from sklearn.linear_model import LogisticRegression. Over here I'll be using a classifier. So I'll say classifier is equal to LogisticRegression, so over here I just make an instance of it. So I'll say LogisticRegression and over here I just pass in the random state, which is 0. Now I simply fit the model, and I simply pass in X_train and y_train.
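The scaling step mentioned above can be sketched without any library to show what standardization does: subtract the column mean and divide by the column standard deviation, with both statistics learned from the training data. The helper names fit_scaler and transform and the toy salary figures are assumptions for illustration, not the session's data or scikit-learn's code.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Rescale a feature to zero mean and unit variance using mu and sigma."""
    return [(x - mu) / sigma for x in column]


# Toy salary values, not taken from the SUV data set.
train_salary = [20000, 40000, 60000, 80000]
test_salary = [50000]

mu, sigma = fit_scaler(train_salary)              # statistics come from train only
scaled_train = transform(train_salary, mu, sigma)
scaled_test = transform(test_salary, mu, sigma)   # test reuses the train statistics

print(scaled_train)
print(scaled_test)
```

A common design choice, assumed in this sketch, is to reuse the training mean and standard deviation for the test set so that no information from the test rows leaks into the preprocessing.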
So here it tells me all the details of the logistic regression. Then I have to predict the values. So I'll say y_pred is equal to classifier, then the predict function, and then I just pass in X_test. So now we have created the model, we have scaled down our input values, then we have applied logistic regression, we have predicted the values, and now we want to know the accuracy. So for the accuracy, first we need to import accuracy_score. So I'll say from sklearn.metrics import accuracy_score, and using this function we can calculate the accuracy, or you can manually do that by creating a confusion matrix. So I'll just pass in my y_test and my y_pred. All right, so over here I get the accuracy as 0.89. So we want to know the accuracy in percentage, so I just have to multiply it by a hundred, and if I run this it gives me 89 percent. So I hope you guys are clear with whatever I have taught you today. So here I have taken my independent variables as age and salary, and then we have calculated how many people can purchase the SUV, and then we have evaluated our model by checking the accuracy, so over here we get the accuracy as 89 percent, which is great. Alright guys, that is it for today. So let's discuss what all we have covered in today's training. First of all, we had a quick introduction to what regression is and where regression is actually used, then we understood the types of regression and then got into the details of the what and why of logistic regression, and we compared linear versus logistic regression. We have also seen the various use cases where you can implement logistic regression in real life, and then we picked up two projects, that is Titanic data analysis and SUV prediction, so over here we have seen how you can collect your data, analyze your data, then perform modeling on that data, train the data, test the data and then finally calculate the accuracy.
So in your SUV prediction, you can actually analyze and clean your data and you can do a lot of things, so you can just go ahead, pick up any data set and explore it as much as you can. Open your eyes and look around: you will find dozens of applications of machine learning which you are using and interacting with in your daily life, be it using the face detection on Facebook or getting recommendations for similar products from Amazon. Machine learning is applied almost everywhere. So hello and welcome all to this YouTube session, where we'll learn how to build a decision tree. This session is designed in a way that you get the most out of it. Alright. So the decision tree is a type of classification algorithm which comes under the supervised learning technique. So before learning about decision trees, I'll give you a short introduction to classification, where we'll learn what classification is, what its various types are, where it is used and what its use cases are. Now, once you get your fundamentals clear, we'll jump to the decision tree part. Under this, first of all, I will teach you to mathematically create a decision tree from scratch. Then once you get your concepts clear, we'll see how you can write a decision tree classifier from scratch in Python using the CART algorithm. All right. I hope the agenda is clear to you guys. What is classification? I hope every one of you must have used Gmail. So how do you think a mail is getting classified as spam or not spam? Well, that's nothing but classification. So what is it? Well, classification is the process of dividing the data set into different categories or groups by adding labels. In other words, you can say that it is a technique of categorizing the observations into different categories. So basically what you are doing is taking the data, analyzing it, and on the basis of some condition finally dividing it into various categories. Now, why do we classify it?
Well, we classify it to perform predictive analysis on it. Like when you get a mail, the machine predicts it to be spam or not spam, and on the basis of that prediction it adds the irrelevant or spam mail to the respective folder. In general, classification algorithms handle questions like: does this data belong to category A or category B? Like, is this a male or is this a female, something like that. Now the question arises, where will you use it? Well, you can use this for fraud detection, to check whether a transaction is genuine or not. Suppose I am using a credit card here in India. Now due to some reason I had to fly to Dubai. Now if I'm using the credit card over there, I will get a notification alert regarding my transaction. They would ask me to confirm the transaction. So this is also a kind of predictive analysis, as the machine predicts that something fishy is going on in the transaction, since 24 hours ago I made a transaction using the same credit card in India, and 24 hours later the same credit card is being used for a payment in Dubai. So the machine predicts that something fishy is going on in the transaction. So in order to confirm it, it sends you a notification alert. All right. Well, this is one of the use cases of classification. You can even use it to classify different items like fruits on the basis of taste, color, size or weight. A machine well trained using a classification algorithm can easily predict the class or the type of fruit whenever new data is given to it. Not just fruit, it can be any item. It can be a car, it can be a house, it can be a signboard or anything. Have you noticed that when you visit some sites or try to log in to some, you get a picture CAPTCHA, right, where you have to identify whether the given image is of a car or of a pole or not? You have to select them; for example, there are 10 images and you're selecting three images out of them.
So in a way you are training the machine, right? You are telling it that these three are pictures of a car and the rest are not, so who knows, you are training it for something big, right? So moving on ahead, let's discuss the types of classification. All right. Well, there are several different ways to perform the same task. Like in order to predict whether a given person is a male or a female, the machine has to be trained first. All right, but there are multiple ways to train the machine and you can choose any one of them. Just for predictive analytics there are many different techniques, but the most common of them all is the decision tree, which we'll cover in depth in today's session. So as a part of classification algorithms we have decision tree, random forest, Naive Bayes, k-nearest neighbor, logistic regression, linear regression, support vector machines and so on. There are many. Alright, so let me give you an idea about a few of them, starting with the decision tree. Well, a decision tree is a graphical representation of all the possible solutions to a decision. The decisions which are made can be explained very easily. For example, here is a task which says: should I go to a restaurant or should I buy a hamburger? You are confused about that. So for that, what you will do is create a decision tree for it, starting with the root node. First of all, you will check whether you are hungry or not. All right, if you're not hungry then just go back to sleep, right? If you are hungry and you have $25, then you will decide to go to a restaurant. And if you're hungry and you don't have $25, then you will just go and buy a hamburger. That's it. All right. So that's about the decision tree. Now moving on ahead, let's see what a random forest is. Well, a random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. All right, most of the time a random forest is trained with the bagging method.
The bagging method is based on the idea that a combination of learning models increases the overall result. If you are combining the learning from different models and then clubbing it together, it will increase the overall result. Fine. Just one more thing: if the size of your data set is huge, then in that case one single decision tree would lead to an overfit model, the same way a single person might have his own perspective on the complete population, as the population is very huge, right? However, if we implement a voting system and ask different individuals to interpret the data, then we would be able to cover the pattern in a much more meticulous way. Even from the diagram, you can see that in section A we have a large training data set. What we do is first divide our training data set into n sub-samples and create a decision tree for each sub-sample. Now in the B part, what we do is take the vote out of every decision made by every decision tree. And finally we club the votes to get the random forest decision. Fine, let's move on ahead. Next we have Naive Bayes. So Naive Bayes is a classification technique which is based on Bayes' theorem. It assumes that the presence of any particular feature in a class is completely unrelated to the presence of any other feature. Naive Bayes is a simple and easy-to-implement algorithm, and due to its simplicity this algorithm might outperform more complex models when the size of the data set is not large enough. All right, a classical use case of Naive Bayes is document classification. In that, what you do is determine whether a given text corresponds to one or more categories. In the text case, the features used might be the presence or absence of any keyword. So this was about Naive Bayes. From the diagram, you can see that using Naive Bayes we have to decide whether we have a disease or not.
First, what we do is check the probability of having the disease and not having the disease, right? The probability of having the disease is 0.1, while on the other hand the probability of not having the disease is 0.9. Okay, first let's see the case when we have the disease and we go to the doctor. All right, so when we visit the doctor and the test is positive, the probability of having a positive test when you have the disease is 0.80, and the probability of a negative test when you already have the disease is 0.20. This is a false negative statement, as the test is detecting negative but you still have the disease, right? So it's a false negative statement. Now, let's move ahead to when you don't have the disease at all. So the probability of not having the disease is 0.9. And when you visit the doctor and the doctor is like, yes, you have the disease, but you already know that you don't have the disease, it's a false positive statement. So the probability of a positive test when you actually know there is no disease is 0.1, and the probability of a negative test when you actually know there is no disease is around 0.90. Fine. It is the case where you don't have the disease and the test is showing the same result, so it is a true negative statement, and it is 0.9. All right. So let's move on ahead and discuss the KNN algorithm. So this KNN algorithm, or k-nearest neighbor, stores all the available cases and classifies new cases based on a similarity measure. The K in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. For example, if k equals 1, then the object is simply assigned to the class of that single nearest neighbor. From the diagram, you can see the difference in the image when k equals 1, k equals 3 and k equals 5, right?
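The disease-test probabilities quoted above can be combined with Bayes' theorem to answer the question the diagram poses: how likely is the disease given a positive test? The session only states the branch probabilities, so the posterior below is a worked illustration rather than a figure from the talk, and the variable names are assumptions.

```python
# Probabilities as quoted above.
p_disease = 0.1            # prior: having the disease
p_pos_given_disease = 0.8  # positive test when the disease is present
p_pos_given_healthy = 0.1  # positive test when there is no disease

# Total probability of a positive test over both branches of the diagram.
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive test).
posterior = p_pos_given_disease * p_disease / p_positive

print(round(p_positive, 2), round(posterior, 3))
```

With these numbers a positive test raises the chance of disease from 10 percent to a little under 50 percent, because false positives among the much larger healthy group are about as common as true positives.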
Well, systems are now able to use k-nearest neighbor for visual pattern recognition to scan and detect hidden packages in the bottom bin of a shopping cart at the checkout. If an object is detected which exactly matches an object listed in the database, then the price of the spotted product could even be added automatically to the customer's bill. This automated billing practice is not used extensively at this time, but the technology has been developed and is available for use if you want to use it. And yeah, one more thing: k-nearest neighbor is also used in retail to detect patterns in credit card usage. Many new transaction-scrutinizing software applications use KNN algorithms to analyze register data and spot unusual patterns that indicate suspicious activity. For example, if register data indicates that a lot of customer information is being entered manually rather than through automated scanning and swiping, then in that case this could indicate that the employees who are using the register are in fact stealing customers' personal information. Or if register data indicates that a particular good is being returned or exchanged multiple times, this could indicate that employees are misusing the return policy or trying to make money from doing fake returns, right? So this was about the KNN algorithm. Since our main focus for this session will be on the decision tree, we'll start with what a decision tree is, but first let me tell you why we chose the decision tree to start with. Well, decision trees are really very easy to read and understand. It belongs to one of the few models that are interpretable, where you can understand exactly why the classifier has made that particular decision, right? Let me tell you a fact: for a given data set, you cannot say that this algorithm performs better than that. It's like you cannot say that decision tree is better than Naive Bayes, or that Naive Bayes is performing better than decision tree.
It depends on the data set, right? You have to apply the hit and trial method with all the algorithms one by one and then compare the results. The model which gives the best result is the model which you can use for better accuracy on your data set. All right. So let's start with what a decision tree is. Well, a decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. Now, you might be wondering why this thing is called a decision tree. Well, it is called so because it starts with the root and then branches off to a number of solutions, just like a tree, right? Even a tree starts from a root and starts growing its branches once it gets bigger and bigger. Similarly, a decision tree has a root which keeps on growing with an increasing number of decisions and conditions. Now, let me tell you a real-life scenario. I won't say all of you, but most of you must have used it. Remember whenever you dial the toll-free number of your credit card company, it redirects you to its intelligent computerised assistant, where it asks you questions like press 1 for English or press 2 for Hindi, press 3 for this, press 4 for that, right? Now once you select one, it again redirects you to a certain set of questions like press 1 for this, press 2 for that, and so on, right? So this keeps on repeating until you finally get to the right person, right? You might think that you are caught in voicemail hell, but what the company was actually doing was just using a decision tree to get you to the right person. All right. I'd like you to focus on this particular image for a moment. On this particular slide you can see an image where the task is: should I accept a new job offer or not? All right, so you have to decide that, and for that you created a decision tree starting with the base condition or the root node, which was that the basic salary or the minimum salary should be $50,000. If it is not $50,000,
Then you are not accepting the offer at all. All right. So if your salary is greater than $50,000, then you will further check whether the commute is more than one hour or not. If it is more than one hour, you will just decline the offer. If it is less than one hour, then you are getting closer to accepting the job offer. Further, what you will do is check whether the company is offering free coffee or not, right? If the company is not offering free coffee, then you will just decline the offer, and if it is offering free coffee then, yeah, you will happily accept the offer, right? That's just an example of a decision tree. Now, let's move ahead and understand a decision tree. Well, here is a sample data set that I will be using to explain the decision tree to you. Alright, in this data set each row is an example, and the first two columns provide features or attributes that describe the data, and the last column gives the label or the class we want to predict. And if you like, you can just modify this data by adding additional features and more examples, and our program will work in exactly the same way. Fine. Now this data set is pretty straightforward except for one thing. I hope you have noticed that it is not perfectly separable. Let me tell you something more about that: the second and fifth examples have the same features but different labels. Both have yellow as the color and a diameter of three, but the labels are mango and lemon, right? Let's move on and see how our decision tree handles this case. All right, in order to build a tree we'll use a decision tree algorithm called CART. CART stands for classification and regression tree algorithm. All right, let's see a preview of how it works. All right, to begin with we'll add a root node for the tree. All the nodes receive a list of rows as input, and the root will receive the entire training data set. Now each node will ask a true or false question about one of the features.
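The job-offer tree described above maps directly onto nested conditions, one per node. The sketch below is only an illustration of that idea; the function name accept_offer is hypothetical, and how the boundary cases (exactly $50,000, exactly one hour) are handled is an assumption, since the talk does not specify them.

```python
def accept_offer(salary, commute_hours, free_coffee):
    """Walk the job-offer decision tree from the root down to a leaf."""
    if salary < 50_000:      # root node: minimum salary not met
        return False         # leaf: decline
    if commute_hours > 1:    # second node: commute longer than one hour
        return False         # leaf: decline
    return free_coffee       # last node: accept only if coffee is free


print(accept_offer(60_000, 0.5, True))   # reaches the accept leaf
print(accept_offer(60_000, 2.0, True))   # declined at the commute node
```

Each return statement corresponds to a leaf of the tree and each if statement to an internal node, which is why a decision tree is easy to read back as a set of rules.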
And in response to that question will split or partition the data set into two different subsets these subsets then become input to child node. We are to the tree and the goal of the question is to finally unmix the labels as we proceed down or in other words to produce the purest possible distribution of the labels at each node. For example, the input of this node contains only one single type of label. So we See that it's perfectly unmixed. There is no uncertainty about the type of label as it consists of only grapes right on the other hand the labels in this node are still mixed up. So we would ask another question to further drill it down. Right but before that we need to understand which question to ask and when and to do that we need to conduct by how much question helps to unmix the label and we can quantify the amount of uncertainty at a single node using a metric. Called gini impurity and we can quantify how much a question reduces that uncertainty using a concept called Information Gain will use these to select the best question to ask at each point. And then what we'll do we'll iterate the steps will recursively build the tree on each of the new node will continue dividing the data until they are no further question to ask and finally we reach to our Leaf. Alright, alright. So this was about decision tree. So in order to create a decision tree, first of all what you have to do you have to identify A different set of questions that you can ask to a tree like is this color green and what will be these question? These questions will be decided by your data set like as this colored green is the diameter greater than equal to 3 is the color yellow right questions resembles to your data set remember that? All right. So if my color is green, then what it will do it will divide into two parts. First. The Green Mango will be in the true while on the false. We have lemon and the Mac. All right if the color is green or the diameter. 
Meter is greater than equal to 3 or the color is yellow Asian tree terminologies. So starting with root node root node is a base node of a tree the entire tree starts from a root node. In other words. It is the first node of a tree it represents the entire population or sample and this entire population is further segregated or divided into two or more homogeneous set fine. Next is the leaf node. Well Leaf node is the one when you reach at the The tree right that is you cannot further segregated down to any other level that is the leaf node. Next is splitting splitting is dividing your root node or node into different sub part on the basis of some condition. All right, then comes the branch or the sub tree. Well, this Branch or subtree gets formed when you split the tree suppose when you split a root node, it gets divided into two branches or two subtrees. Right? Next is the concept of pruning. Well you can Say that pruning is just opposite of splitting what we are doing here. We are just removing the sub node of a decision tree will see more about pruning later in this session. All right, let's move on ahead. Next is parent or child node. Well, first of all root node is always the parent node and all other nodes associated with that is known as chalky node. Well, you can understand it in a way that all the top node belongs to a parent node and all the bottom node, which are derived from a top node is a child node. Node producing a further note is a child node and the node which is producing it as a parent node simple concept, right? It's use the cart algorithm and design a tree manually. So first of all what you will do you decide which question to ask and when so how will you do that? So let's first of all visualize the decision tree. So there's the decision tree which will be creating manually or like first of all, let's have a look at the data set. 
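The slide with the data set is not reproduced in this transcript. As a stand-in, here is a sketch that uses the classic 14-row play-tennis table with the same four attributes. The rows, and the names HEADER, DATA, label_counts and counts_by_value, are assumptions for illustration, although the counts they produce agree with the figures quoted in this session (nine yes and five no overall).

```python
from collections import Counter

# Assumed rows: the classic play-tennis table, not copied from the slide.
# Columns: outlook, temperature, humidity, windy, play (the label).
HEADER = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [
    ["sunny", "hot", "high", False, "no"],
    ["sunny", "hot", "high", True, "no"],
    ["overcast", "hot", "high", False, "yes"],
    ["rainy", "mild", "high", False, "yes"],
    ["rainy", "cool", "normal", False, "yes"],
    ["rainy", "cool", "normal", True, "no"],
    ["overcast", "cool", "normal", True, "yes"],
    ["sunny", "mild", "high", False, "no"],
    ["sunny", "cool", "normal", False, "yes"],
    ["rainy", "mild", "normal", False, "yes"],
    ["sunny", "mild", "normal", True, "yes"],
    ["overcast", "mild", "high", True, "yes"],
    ["overcast", "hot", "normal", False, "yes"],
    ["rainy", "mild", "high", True, "no"],
]


def label_counts(rows):
    """Count how often each label (last column) appears."""
    return Counter(row[-1] for row in rows)


def counts_by_value(rows, column):
    """Label counts for each distinct value of one attribute column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], Counter())[row[-1]] += 1
    return groups


print(label_counts(DATA))
print(counts_by_value(DATA, HEADER.index("outlook")))
```

Counting the labels per attribute value like this is the raw material for every entropy and information gain figure worked out by hand in the rest of the walkthrough.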
You have outlook, temperature, humidity and windy as your attributes, and on the basis of those you have to predict whether you can play or not. So which one among them should you pick first? The answer: determine the attribute that best classifies the training data. So how will you choose the best attribute, or how does a tree decide where to split, or how will the tree decide its root node? Before we move on and split a tree, there are some terms that you should know. The first is the Gini index. The Gini index is a measure of impurity, or purity, used in building a decision tree in the CART algorithm. Next is information gain. Information gain is the decrease in entropy after a data set is split on the basis of an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain, so you will select the node that gives you the highest information gain. Next is reduction in variance. Reduction in variance is a criterion used for continuous target variables, that is, regression problems. The split with the lower variance is selected as the criterion to split the population. In general terms, what do you mean by variance? Variance is how much your data is varying. So if your data is less impure, or more pure, the variation will be less, as all the data is almost similar. So this is also a way of splitting a tree. Next is chi-square. It is a test used to find the statistical significance of the differences between the sub-nodes and the parent node. Fine, let's move ahead.
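To make the reduction-in-variance idea a little more concrete, here is a minimal sketch. The function name variance_reduction and the target values are made up for illustration and do not come from the session; the point is only that a split which groups similar target values together leaves less variance in the children.

```python
from statistics import pvariance


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * pvariance(left) + (len(right) / n) * pvariance(right)
    return pvariance(parent) - weighted


# Hypothetical continuous target values, invented for this example.
targets = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

# A split that separates the low values from the high values.
good = variance_reduction(targets, targets[:3], targets[3:])
# A split that mixes low and high values in both children.
poor = variance_reduction(targets, targets[::2], targets[1::2])

print(good > poor)
```

A regression tree would prefer the first split, because the children it produces have the lower weighted variance.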
The main question is how you will decide the best attribute. For now, just understand that you need to calculate something known as information gain; the attribute with the highest information gain is considered the best. I know your next question might be: what is this information gain? But before we move on and see what exactly information gain is, let me first introduce you to a term called entropy, because this term is used in calculating the information gain. Entropy is just a metric that measures the impurity of something, so the first thing to understand before you solve a decision tree problem is impurity. Suppose you have a basket full of apples and a bowl full of labels that all say apple. If you are asked to pick one item from the basket and one from the bowl, then the probability of getting an apple and its correct label is 1, so in this case you can say that the impurity is zero. Now what if there are four different fruits in the basket and four different labels in the bowl? Then the probability of matching the fruit to a label is obviously not 1; it is something less than that. It could be that I pick a banana from the basket, and when I randomly pick a label from the bowl it says cherry. Any random combination is possible. So in this case I'd say that the impurity is nonzero. I hope the concept of impurity is clear. Coming back to entropy: as I said, entropy is the measure of impurity. From the graph on your left, you can see that when the probability is 0 or 1, that is, when the sample is entirely one class or entirely the other, the value of entropy is zero. And when the probability is 0.5, the value of entropy is maximum.
Well, what is impurity? Impurity is the degree of randomness, how random the data is. If the data is completely pure, whether it is all yes or all no, the randomness equals zero, so in both of those cases the value of impurity is zero. A question like why the value of entropy is maximum at 0.5 might arise in your mind, right? So let me discuss that and derive it mathematically. Suppose S is our total sample space and it is divided into two parts, yes and no, like in our data set, where the result for playing is either yes or no, which we have to predict. For that particular case, you can define the formula of entropy as: Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no), where S is your total sample space, P(yes) is the probability of yes and P(no) is the probability of no. If the number of yes equals the number of no, that is, the probability of yes equals 0.5, then the value of entropy will be 1; just put the values in the formula. All right, let me just move to the next slide and show you this. And if the sample contains all yes or all no, that is, the probability of yes is either 1 or 0, then the entropy will be equal to 0. Let's see that mathematically, one case at a time. Let's start with the first condition, where the probability is 0.5. This is our formula for entropy, and this is the first case we discussed, where the probability of yes equals the probability of no, that is, in our data set we have an equal number of yes and no. All right.
So the probability of yes equals the probability of no, and that equals 0.5; in other words, yes plus no equals the total sample space. When you put these values into the formula you get Entropy(S) = -0.5 log2 0.5 - 0.5 log2 0.5, and since log2 0.5 is -1, when you calculate it you get the entropy of the total sample space as 1. All right, let's see the next case: either you have all yes or you have all no. Let's see the formula when we have all yes and zero no. The probability of yes equals 1, since yes makes up the total sample space. When you put that into the formula, you get Entropy(S) = -1 × log2 1, and as the value of log2 1 equals 0, the whole thing results in 0. The same is the case with all no: there too you get the entropy of the total sample space as 0. So this was all about entropy. Next is: what is information gain? Information gain measures the reduction in entropy, and it decides which attribute should be selected as the decision node. If S is our total collection, then information gain equals the entropy of S, which we calculated just now, minus the weighted average of the entropy of each value of the attribute. Don't worry, we'll see how to calculate it with an example. Let's manually build a decision tree for our data set. This is our data set, which consists of 14 instances, of which nine are yes and five are no. We have the formula for entropy, so just put the values in: the probability of yes equals 9/14 and the probability of no equals 5/14. When you put in the values and calculate the result, you get the value of entropy as 0.94. All right.
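The three entropy cases worked out above can be checked with a few lines of Python. This is only a sketch: the function name entropy and the choice to pass class counts rather than probabilities are assumptions made for this example.

```python
from math import log2


def entropy(counts):
    """Base-2 entropy of a list of class counts, e.g. [yes, no]."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count:  # an empty class contributes nothing (0 * log2(0) is taken as 0)
            p = count / total
            result -= p * log2(p)
    return result


print(round(entropy([7, 7]), 3))   # equal number of yes and no
print(round(entropy([14, 0]), 3))  # all yes
print(round(entropy([9, 5]), 3))   # nine yes and five no, as in the data set
```

The equal split gives the maximum of 1, the pure sample gives 0, and the nine-to-five split rounds to the 0.94 quoted above.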
So this was your first step: compute the entropy for the entire data set. Now you have to select, out of outlook, temperature, humidity and windy, which node you should select as the root node. Big question, right? How will you decide that a particular node should be chosen as the base node, on the basis of which you will create the entire tree? Let's see. You have to do it one by one: you have to calculate the entropy and information gain for each of the attributes. Starting with outlook: outlook has three different values, sunny, overcast and rainy. First of all, count how many yes and how many no there are when it is sunny. In total we have two yes and three no in the case of sunny. In the case of overcast we have all yes, so if it is overcast we will surely go to play. And when it is rainy, the total number of yes equals 3 and the total number of no equals 2.
Next, we calculate the entropy for each value. Here we are calculating the entropy when outlook equals sunny. We are assuming that outlook is our root node, and for that we are calculating the information gain. To calculate the information gain, remember the formula: it is the entropy of the total sample space minus the weighted average times the entropy of each feature. So the number of yes when it was sunny was 2 and the number of no was 3. Let's put that into the formula: the probability of yes is 2/5 and the probability of no is 3/5, so you get the entropy of sunny as 0.971. Next we calculate the entropy for overcast. When it was overcast, remember, it was all yes, so the probability of yes equals 1, and when you put that in you get the value of entropy as 0. And rainy has three yes and two no, so the probability of yes in the case of rainy is 3/5 and the probability of no is 2/5. When you put these probabilities into the formula, you get the entropy of rainy as 0.971.
Now you have to calculate how much information you are getting from outlook, which is the weighted average. So what is this weight? It is the number of yes and no for that value, divided by the total. The information from outlook starts with 5/14. Where does this 5 come from? We are counting the samples within that particular outlook value: in the case of sunny there were two yes and three no, so the weight for sunny is 5/14. The entropy calculated for sunny is 0.971, so we multiply 5/14 by 0.971. That was the calculation for the information when outlook equals sunny, but outlook also equals overcast and rainy, so we calculate the same thing for those. For overcast, the weight is 4/14 multiplied by its entropy, which is 0. For rainy it is again 5/14, three yes and two no, multiplied by its entropy, which is 0.971. Finally we take the sum of all of them, which equals 0.693.
Next we calculate the information gained. What we did earlier was the information taken from outlook; now we are calculating the information we are gaining from outlook. This information gain equals the total entropy minus the information taken from outlook. The total entropy was 0.94, and the information we took from outlook was 0.693.
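The outlook arithmetic above can be reproduced with a short sketch. The yes and no counts are the ones quoted in the session; the function names entropy and information_gain are assumptions chosen for this example.

```python
from math import log2


def entropy(counts):
    """Base-2 entropy of a list of class counts, e.g. [yes, no]."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of each child group."""
    total = sum(parent_counts)
    taken = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - taken


# [yes, no] counts as quoted in the session.
whole_set = [9, 5]
outlook = {"sunny": [2, 3], "overcast": [4, 0], "rainy": [3, 2]}

outlook_information = sum(
    sum(counts) / sum(whole_set) * entropy(counts) for counts in outlook.values()
)
outlook_gain = information_gain(whole_set, list(outlook.values()))

print(round(outlook_information, 3), round(outlook_gain, 3))
```

The same information_gain call works for any other attribute once its per-value yes and no counts are known.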
So the information gained from outlook comes to 0.247. Next, let's assume that windy is our root node. Windy has two values, false and true. Let's see how many yes and how many no there are for true and for false. When windy is false, there are six yes and two no, and when it is true, there are three yes and three no. So let's similarly calculate the information taken from windy and finally the information gained from windy. First of all, we calculate the entropy of each value, starting with windy equals true. In the case of true we had an equal number of yes and no. Remember the graph: when the probability is 0.5, as the number of yes equals the number of no, the entropy equals 1. So we can directly write the entropy of true when it's windy as 1, as we have already shown that when the probability equals 0.5 the entropy is at its maximum, which is 1. Next is the entropy of false when it is windy. Similarly, just put the probabilities of yes and no into the formula and calculate the result. Since you have six yes and two no, the probability of yes is 6/8 and the probability of no is 2/8, and when you calculate it, you get the entropy of false as 0.811.
Now let's calculate the information from windy. The total information taken from windy equals the information taken when windy equals true plus the information taken when windy equals false. We calculate the weighted term for each one of them and then sum them up. In this case it equals 8/14 multiplied by 0.811, plus 6/14 multiplied by 1. What is this 8? It is the total number of yes and no when windy equals false: the number of yes equals 6 and the number of no equals 2, which sums to 8. That is why the weight comes to 8/14. Similarly, for windy equals true, it is 3 plus 3, that is three yes and three no, which equals 6, divided by the total sample space of 14, multiplied by 1, the entropy of true. So it is 8/14 multiplied by 0.811, plus 6/14 multiplied by 1, which comes to 0.892. This is the information taken from windy. Now, how much information are you gaining from windy? The information gained from windy equals the total entropy minus the information taken from windy, that is, 0.94 minus 0.892, which equals 0.048. So 0.048 is the information gained from windy.
Similarly, we calculated it for the rest too. For outlook, as you can see, the information was 0.693 and its information gain was 0.247. In the case of temperature, the information was around 0.911 and the information gain was 0.029. In the case of humidity, the information gain was 0.152, and in the case of windy, the information gain was 0.048. So we select the attribute with the maximum information gain. We have selected outlook as our root node, and it is further subdivided into three parts: sunny, overcast and rainy. In the case of overcast we have seen that it consists of all yes, so we can consider it a leaf node. But in the case of sunny and rainy it's doubtful, as they contain both yes and no, so for those nodes you need to recalculate things: you again select the attribute with the maximum information gain.
So this is how your complete tree will look. Let's see when you can play. You can play when the outlook is overcast; in that case you can always play. If the outlook is sunny, you drill down further to check the humidity condition: if the humidity is normal, you will play, and if the humidity is high, you won't play. When the outlook is rainy, you further check whether it's windy or not: if the wind is weak, you will go out and play, but if the wind is strong, you won't play. So this is how your entire decision tree looks at the end.
Now comes the concept of pruning. Say you ask: what should I do to play? Well, you have to do pruning; pruning will decide how you will play. So what is this pruning? Pruning is nothing but cutting down the nodes in order to get the optimal solution. What pruning does is reduce the complexity. As you can see on the screen, it is showing only the results for yes, that is, all the paths on which you can play.
Before we drill down to a practical session, a common question might come to your mind: are tree-based models better than linear models? You might think, if I can use logistic regression for classification problems and linear regression for regression problems, then why is there a need to use a tree? Many of us have this question in mind, and it's a valid question too. As I said earlier, you can use any algorithm; it depends on the type of problem you're solving. Let's look at some key factors that will help you decide which algorithm to use and when. First, if the relationship between the dependent and independent variables is well approximated by a linear model, then linear regression will outperform a tree-based model. Second, if there is high non-linearity and a complex relationship between the dependent and independent variables, a tree model will outperform a classical regression method. Third, if you need to build a model that is easy to explain to people, a decision tree model will usually do better than a linear model, as decision tree models are simpler to interpret than linear regression.
Now let's move on and see how you can write a decision tree classifier from scratch in Python using the CART algorithm. For this I will be using Jupyter Notebook with Python 3.0 installed. So let's open Anaconda and the Jupyter Notebook. This is our Anaconda Navigator, and I will jump directly to Jupyter Notebook and hit the launch button. I guess everyone knows that Jupyter Notebook is a web-based interactive computing notebook environment where you can run your Python code. My Jupyter Notebook opens on my localhost, port 8891, so I will be using this Jupyter Notebook to write my decision tree classifier in Python. For this decision tree classifier I have already written the code, so let me explain it to you step by step. We start by initializing our training data set. This is our sample data set, in which each row is an example; the last column is the label and the first two columns are the features. If you want, you can add some more features and examples for your practice. An interesting fact is that this data set is designed in such a way that the second and fifth examples have the same features but different labels. So let's move on and see how the tree handles this case. As you can see here, the second and the fifth rows have the same features; what differs is just their label. So this is our training data set. Next, we add some column labels, which are used only to print the tree. We add a header for the columns: the first column is color, the second is diameter and the third is the label column.
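The notebook itself is not shown in the transcript, so the following is only a sketch of what the training data and header might look like. The layout (color, diameter, label) and the yellow, diameter 3 conflict between the second and fifth rows come from the narration; the remaining values, such as a diameter of 1 for the grapes, are assumptions.

```python
# Layout described in the session: color, diameter, label.
# The exact values are an assumption reconstructed from the narration.
training_data = [
    ["Green", 3, "Mango"],
    ["Yellow", 3, "Mango"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]

# Column labels, used only when printing questions and trees.
header = ["color", "diameter", "label"]

# The second and fifth examples share their features but not their label.
second, fifth = training_data[1], training_data[4]
same_features = second[:-1] == fifth[:-1]
different_labels = second[-1] != fifth[-1]
print(same_features, different_labels)
```

Because those two rows cannot be told apart by any question about color or diameter, no tree can separate them, which is why the data set is described as not perfectly separable.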
All right, next we define a function for unique values, to which we pass the rows and a column. This function finds the unique values for a column in the data set. Here is an example: we pass the training data as the rows and 0 as the column number, so we are finding the unique values of color. And here, where the rows are the training data and the column is 1, we are finding the unique values of diameter. Next we define a function class_count and pass the rows into it. It counts the number of each type of example in the data set, in other words the number of rows for each unique label. As a sample, you can see here that we pass the entire training data set to class_count, and it finds all the different labels within the training data set: mango, grape and lemon. Next we define a function is_numeric and pass a value into it. It just tests whether the value is numeric, returning true if the value is an integer or a float. For example, if we pass 7 to is_numeric, it is an integer, so it returns true, and if we pass Red, which is not a numeric value, it returns false.
Moving on, we define a class named Question. A question is used to partition the data set. This class just records a column number, for example 0 for color, and a column value, for example Green. Next we define a match method, which is used to compare the feature value in an example to the feature value stored in the question. Let's see how. First we define an __init__ function, to which we pass self, the column and the value as parameters. Then we define the match function, which compares the feature value in an example to the feature value in this question. Next we define a __repr__ function, which is just a helper method to print the question in a readable format.
Next we define a function partition. This function is used to partition the data set. For each row in the data set, it checks whether the row matches the question; if it does, it adds the row to the true rows, and if not, it adds it to the false rows. For example, as you can see, we partition the training data based on whether the rows are red or not. Here we create a Question and pass the values 0 and Red to it, so it assigns all the red rows to true_rows and everything else to false_rows. Next we define a Gini impurity function and pass it the list of rows. It just calculates the Gini impurity for the list of rows. Next we define a function for information gain. It calculates the information gain as the uncertainty of the starting node minus the weighted impurity of the child nodes. The next function is find_best_split. This function finds the best question to ask by iterating over every feature and value and calculating the information gain. For a detailed explanation of the code, you can find the code in the description given below.
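Since the notebook code is only described and not shown, here is a minimal sketch of the pieces narrated above: the question, the partition step, Gini impurity, information gain and the search for the best split. The function and class names follow the narration where it spells them out and are otherwise assumptions, as are the data values; this is not the presenter's actual code.

```python
# Assumed training data, reconstructed from the narration.
training_data = [
    ["Green", 3, "Mango"],
    ["Yellow", 3, "Mango"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]
header = ["color", "diameter", "label"]


def class_counts(rows):
    """Number of rows for each label (the last column)."""
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return counts


def is_numeric(value):
    """True for integers and floats, False for anything else."""
    return isinstance(value, (int, float))


class Question:
    """Records a column number and a value to compare rows against."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        # Numeric features are tested with >=, categorical ones with ==.
        val = example[self.column]
        return val >= self.value if is_numeric(val) else val == self.value

    def __repr__(self):
        condition = ">=" if is_numeric(self.value) else "=="
        return f"Is {header[self.column]} {condition} {self.value}?"


def partition(rows, question):
    """Split rows into those that match the question and those that do not."""
    true_rows, false_rows = [], []
    for row in rows:
        (true_rows if question.match(row) else false_rows).append(row)
    return true_rows, false_rows


def gini(rows):
    """Gini impurity: 1 minus the sum of squared label proportions."""
    return 1 - sum((n / len(rows)) ** 2 for n in class_counts(rows).values())


def info_gain(left, right, current_uncertainty):
    """Uncertainty of the starting node minus the weighted child impurity."""
    p = len(left) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)


def find_best_split(rows):
    """Try every feature/value pair and keep the question with the top gain."""
    best_gain, best_question = 0, None
    current_uncertainty = gini(rows)
    for column in range(len(rows[0]) - 1):
        for value in dict.fromkeys(row[column] for row in rows):
            question = Question(column, value)
            true_rows, false_rows = partition(rows, question)
            if not true_rows or not false_rows:
                continue  # this question does not divide the data
            gain = info_gain(true_rows, false_rows, current_uncertainty)
            if gain >= best_gain:
                best_gain, best_question = gain, question
    return best_gain, best_question


red_rows, other_rows = partition(training_data, Question(0, "Red"))
best_gain, best_question = find_best_split(training_data)
print(best_question, round(best_gain, 3))
```

Note the design choice inside match: numeric columns get a greater-than-or-equal test while categorical columns get an equality test, which is what lets one Question class cover both color and diameter.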
All right, next we define a class Leaf for classifying the data. It holds a dictionary of class, like mango, to the number of times it appears in the rows from the training data that reach this leaf. Next is the decision node. A decision node asks a question; it holds a reference to the question and to the two child nodes, and on the basis of the question you decide which branch a row goes down. Next we define a function build_tree, to which we pass our rows. This is the function that is used to build the tree. Initially we defined all the various functions that we'll be using to build the tree. We start by partitioning the data set on each unique attribute value, calculate the information gain, and return the question that produces the highest gain, and on the basis of that we split the tree. Now, if gain equals 0, return Leaf(rows): if you are getting no gain, that is, gain equals 0, then no further question can be asked, so it returns a leaf. Next, true_rows and false_rows equal partition of rows and the question. If we reach this point, we have already found a feature and value that will be used to partition the data set. Then you recursively build the true branch and similarly recursively build the false branch. Finally we return Decision_Node, passing it the question, the true branch and the false branch. So it returns a question node, and this question node records the best feature and value to ask at this point. Fine.
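The recursive build described above can be sketched compactly as follows. To keep the block runnable on its own it repeats small helper definitions and stores each question as a plain (column, value) tuple; those choices, the names matches, DecisionNode and classify, and the data values are assumptions for illustration rather than the presenter's code. The small classify helper is included only to exercise the finished tree.

```python
# Assumed training data, reconstructed from the narration.
training_data = [
    ["Green", 3, "Mango"],
    ["Yellow", 3, "Mango"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]


def class_counts(rows):
    """Number of rows for each label (the last column)."""
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return counts


def gini(rows):
    """Gini impurity of a list of rows."""
    return 1 - sum((n / len(rows)) ** 2 for n in class_counts(rows).values())


def matches(row, column, value):
    """True/false question: >= for numbers, == for everything else."""
    cell = row[column]
    return cell >= value if isinstance(value, (int, float)) else cell == value


def find_best_split(rows):
    """Return (gain, (column, value)) for the most informative question."""
    best_gain, best_question = 0.0, None
    for column in range(len(rows[0]) - 1):
        for value in dict.fromkeys(row[column] for row in rows):
            true_rows = [r for r in rows if matches(r, column, value)]
            false_rows = [r for r in rows if not matches(r, column, value)]
            if not true_rows or not false_rows:
                continue  # this question does not divide the data
            p = len(true_rows) / len(rows)
            gain = gini(rows) - p * gini(true_rows) - (1 - p) * gini(false_rows)
            if gain >= best_gain:
                best_gain, best_question = gain, (column, value)
    return best_gain, best_question


class Leaf:
    """Holds the label counts of the training rows that reach it."""

    def __init__(self, rows):
        self.predictions = class_counts(rows)


class DecisionNode:
    """Holds a question and the two child nodes it leads to."""

    def __init__(self, question, true_branch, false_branch):
        self.question = question
        self.true_branch = true_branch
        self.false_branch = false_branch


def build_tree(rows):
    """Recursively split until no question yields any gain."""
    gain, question = find_best_split(rows)
    if gain == 0:
        return Leaf(rows)
    true_rows = [r for r in rows if matches(r, *question)]
    false_rows = [r for r in rows if not matches(r, *question)]
    return DecisionNode(question, build_tree(true_rows), build_tree(false_rows))


def classify(row, node):
    """Follow the true or false branch until a leaf is reached."""
    if isinstance(node, Leaf):
        return node.predictions
    branch = node.true_branch if matches(row, *node.question) else node.false_branch
    return classify(row, branch)


tree = build_tree(training_data)
print(tree.question, classify(["Yellow", 3, "Lemon"], tree))
```

The two yellow rows with diameter 3 end up in the same leaf with one count each, which is how this kind of tree copes with examples that share features but not labels.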
Now that we have built our tree, next we define a print_tree function, which is used to print the tree. Next is the classify function, which decides whether to follow the true branch or the false branch by comparing the feature values stored in the node to the example we are considering. And last, we print the prediction at the leaf. So let's execute it and see. We printed the leaves as well. Now that we have trained our algorithm on our training data set, it's time to test it. This is our testing data set, so let's finally execute it and see the result. This is the result you will get. The first question asked by the algorithm is: is the diameter greater than or equal to 3? If it is true, then it further asks if the color is yellow. If that is true, it predicts mango with a count of 1 and lemon with a count of 1, and if it is false, it just predicts mango. That was the true part. Now, if the diameter is not greater than or equal to 3, then it's false, and it just predicts grape. Okay, so this was all about the coding part.
Now let's conclude this session. But before concluding, let me show you one more thing. This is the scikit-learn algorithm cheat sheet, which explains which algorithm you should use and when. It is built in a decision tree format, so let's see how it is built. The first condition checks whether you have 50 samples or not. If you have more than 50 samples, you move ahead; if you have fewer than 50, you need to collect more data. If you have more than 50 samples, you then have to decide whether you want to predict a category or not. If you want to predict a category, you further check whether you have labeled data or not. If you have labeled data, it is a classification problem. If you don't have labeled data, it is a clustering problem. Now, if you don't want to predict a category, then what do you want to predict? A quantity? If you want to predict a quantity, it is a regression problem. If you don't want to predict a quantity and you just want to keep looking, then you should go for dimensionality reduction. And if you don't want to look either and predicting structure is not working, then you have tough luck. I hope this decision tree session clarifies all your doubts about the decision tree algorithm.
Let's begin this tutorial by looking at the topics that we'll be covering today. First of all, we'll start off with a brief introduction to random forest, and then we'll go on to see why we actually need random forest: why not anything else, but random forest? Once we understand its need in the first place, we'll go on to learn more about what random forest is, and we'll also look at various examples of random forest so that we get a very clear understanding of it. Further, we'll also delve into the working of random forest, as to how exactly random forest works. We'll also walk through the random forest algorithm step by step, so that you are able to write any piece of code, any domain-specific algorithm, on your own. Now, I personally believe that any learning is really incomplete if it's not put into application, so to complete it we'll also implement random forest in R with a very simple use case, that is, diabetes prevention. So let's get started with the introduction then. Now, random forest is actually one of the classifiers used for solving classification problems.
Now, since some of you might not be really aware of what classification is, let's quickly understand classification first, and then we'll try to relate it to random forest. Basically, classification is a machine learning technique in which you already have predefined categories under which you can classify your data. So it's nothing but a supervised learning model, where you already have data based on which you can train your machine. Your machine learns from this data. All that predefined data that you already have works as fuel for your machine. Let's take an example. Ever wondered how your Gmail gets to know about spam emails and filters them out from the rest of the genuine emails? Any guesses? All right, I'll give you a hint. Try to think along these lines: what would it actually look for? What can be the possible parameters based on which you can decide whether this is a genuine email or a spam email? There are certain parameters that your classifier will look for, like the subject line, the text, the HTML tags, and also the IP address of the source the mail is coming from. It will analyze all these variables and then classify the mail into the spam folder or the genuine folder. Let's say, for example, your subject line contains words like mad or cute or pretty, and some other absurd keywords. Your classifier is smart enough, and it's trained in such a manner, that it will get to know that this is a spam email, and it will automatically filter it out from your genuine emails. So that is basically how a classifier works. That's pretty much it about classification. Now let's move forward and see what all ways there are through which you can perform classification.
So we have three classifiers, namely decision tree, random forest, and Naive Bayes. Speaking briefly about the decision tree first: a decision tree splits your entire data set into the structure of a tree, and it makes a decision at every node, hence the name decision tree. So no big bang theory, right? You have a certain data set and there are certain nodes. Each node will further split into child nodes, and at each node it will make a decision. The final decision will be in the form of positive or negative. Let's say, for example, you want to purchase a car. What will the parameters be? Let's say I go to purchase a car and I keep certain parameters in my mind: what exactly is my income, what is my budget, what is the particular brand that I want to go for, what is the mileage of the car, what is the cylinder capacity of the car, and so on and so forth. I'll make my decision based on all these parameters, and that is how you make decisions. Further, if you really want to know more about how exactly a decision tree works, you can also check out our decision tree tutorial. So let's move on to the random forest now. Random forest is an ensemble classifier. Now, let's understand what this word ensemble means. Ensemble methods use multiple machine learning algorithms to obtain better predictive performance. Talking particularly about random forest: random forest uses multiple decision trees for prediction. So you are ensembling a lot of decision trees to come up with your final outcome. As you can see here in the image, your entire data set is further split into three subsets, and each subset leads to a particular decision tree. So here you have three decision trees, and each decision tree will lead to a certain outcome.
Now what random forest will do is compile the results from all the decision trees, and then it will lead to a final outcome. So it's a compilation of all the multiple decision trees. That's all about the random forest. Now let's see what lies there in Naive Bayes. Naive Bayes is a very famous classifier, which is built on a very famous rule called Bayes' theorem. You might have studied Bayes' theorem in your 10th standard as well. So let's just see what Bayes' theorem describes. Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if cancer is related to age, then a person's age can be used to assess the probability of having cancer more accurately than without knowledge of the age. So if you know the age, it becomes handy in predicting the occurrence of cancer for a particular person. The outcome of the first event here is affecting your final outcome, isn't it? Yeah. So this is how the Naive Bayes classifier works. That was just to give an overview of the Naive Bayes classifier, and that is pretty much it about the types of classifiers. Now we'll try to find the answer to this particular question: why do we need random forest? Human beings learn from past experiences. Unlike human beings, a computer does not have experiences, so how does a machine take decisions? Where does it learn from? Well, a computer system learns from data, which represents some past experiences of an application domain. So now let's see how random forest helps in building up a learning model, with a very simple use case of credit risk detection. Needless to say, credit card companies have a very vested interest in identifying financial transactions that are illegitimate and criminal in nature.
And also I would like to mention this point: according to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012, and the estimated loss due to unauthorized transactions that year was US$6.1 billion. Now, in the banking industry, measuring risk is very critical because the stakes are too high. So the overall goal is to figure out who can be fraudulent before too much financial damage has been done. For this, a credit card company receives thousands of applications for new cards, and each application contains information about an applicant. So here, as you can see, from all those applications what we can figure out is the predictor variables: what is the marital status of the person, what is the gender of the person, what is the age of the person, and the status, which is whether he is a defaulter or a non-defaulter. Default payments are basically when payments are not made in time and according to the agreement signed by the cardholder; that account is then said to be in default. So you can easily figure out the history of the particular cardholder from this. Then we can also look at the time of payment, whether he has been a regular payer or a non-regular one, what the source of income for that particular person is, and so on and so forth. So to minimize loss, the bank needs a certain decision rule to predict whether or not to approve the loan of that particular person. Now, here is where random forest comes into the picture. Let's see how random forest can help us in this particular scenario. We have taken two parameters at random out of all the predictor variables that we saw previously. So we have two predictor variables here.
The first one is the income and the second one is the age, and in parallel, two decision trees have been implemented upon those predictor variables. Let's first take the case of the income variable. Here we have divided income into three categories: the first one being a person earning over 35,000 dollars, the second from 15 to 35 thousand dollars, and the third one earning in the range of 0 to 15 thousand dollars. Now, if a person is earning over $35,000, which is a pretty good, pretty decent income, we'll check out the credit history. Here the probability is that if a person is earning a good amount, then there is very low risk that he won't be able to pay back, since he is already earning well. So the probability is that his loan application will get approved. There is low risk or moderate risk, but there's no real issue of high risk as such, so we can approve the applicant's request here. Now, let's move on and look at the second category, where the person is earning from 15 to 35 thousand dollars. Here the person may or may not pay back. So in such scenarios we'll look at the credit history, as to what his previous history has been. If his previous history has been bad, like he has been a defaulter in previous transactions, we will definitely not consider approving his request, and he will be at high risk, which is not good for the bank. If the previous history of that particular applicant is really good, then, just to clarify our doubt, we will consider another parameter as well, and that will be debt. If he is already in really high debt, then the risk again increases, and there are chances that he might not repay in the future. So here we will not accept the request of a person having high debt. If the person is in low debt and he has been a good payer in his past history.
Then there are chances that he might pay back, and we can consider approving the request of this particular applicant. Now let's look at the third category, which is a person earning from 0 to 15 thousand dollars. This is something which raises an eyebrow, and this person will lie in the category of high risk. So the probability is that his loan application would get rejected. So we'll get one final outcome from this income parameter. Now let us look at our second variable, that is age, which will lead into the second decision tree. Let us say the person is young. Then we will look at whether he is a student. If he is a student, then the chances are high that he won't be able to repay, because he has no earning source. So here the risks are too high, and the probability is that his loan application will get rejected. Now, if the person is young and he's not a student, then we'll go on and look at another variable, that is bank balance. If the bank balance is less than 5 lakhs, then again the risk arises, and the probability is that his loan application will get rejected. If the person is young, is not a student, and his bank balance is greater than 5 lakhs, so he has a pretty good and stable bank balance, then the probability is that his loan application will get approved. Now let us take another scenario: if he's a senior. If he is a senior, we'll go and check his credit history. How well has he done in his previous transactions? What kind of a person is he, a defaulter or a non-defaulter? If he has been only a fair kind of person in his previous transactions, then again the risk arises, and the probability of his application getting rejected increases. If he has been an excellent person as per his transactions in the previous history.
Then here again the risk is lowest, and the probability is that his loan application will get approved. So here these two variables, income and age, have led to two different decision trees, and these two different decision trees led to two different results. Now what random forest does is compile these two different results from these two different decision trees, and then finally it will lead to a final outcome. That is how random forest works, and that is the motive of the random forest. Now let us move forward and see what random forest is. You can get an idea of the mechanism from the name itself: random forests. A collection of trees is a forest, which is probably why it's called a forest, and here the trees are being trained on subsets which are selected at random. Therefore they are called random forests. So a random forest is a collection, or an ensemble, of decision trees. A single decision tree is built using the whole data set, considering all features. But in a random forest, only a fraction of the rows is selected, and that too at random, and a particular number of features, which are also selected at random, are trained upon. That is how the decision trees are built. Similarly, a number of decision trees will be grown, and each decision tree will result in a certain final outcome. Random forest will do nothing but compile the results of all those decision trees to bring up the final result. As you can see in this particular figure, a particular instance has been run through three different decision trees. Tree one results in a final outcome called class A, and tree two results in class B. Similarly, tree three results in class B. So random forest will compile the results of all these decision trees.
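The two credit-risk trees walked through in this example can be sketched as plain if/else rules. This is only an illustration: the function names, argument names, and the "approve"/"reject" labels are assumptions, and the transcript does not say what happens exactly at the boundary values (15,000 or 35,000 dollars, or a balance of exactly 5 lakhs), so the comparisons below are one possible reading.

```python
# Illustrative rules only; thresholds come from the walkthrough, names are assumed.
def income_tree(income, good_credit_history, high_debt):
    """Decision path for the income variable."""
    if income > 35_000:                 # good income: low or moderate risk
        return "approve"
    if income >= 15_000:                # 15 to 35 thousand: look at history, then debt
        if not good_credit_history:
            return "reject"
        return "reject" if high_debt else "approve"
    return "reject"                     # 0 to 15 thousand: high risk


def age_tree(is_young, is_student, bank_balance_lakhs, credit_rating):
    """Decision path for the age variable (young versus senior)."""
    if is_young:
        if is_student:                  # no earning source
            return "reject"
        return "approve" if bank_balance_lakhs > 5 else "reject"
    # Senior applicant: decided on credit history ("fair" or "excellent").
    return "approve" if credit_rating == "excellent" else "reject"


# A hypothetical applicant, run through both trees.
print(income_tree(20_000, good_credit_history=True, high_debt=False))
print(age_tree(is_young=True, is_student=False, bank_balance_lakhs=6, credit_rating="fair"))
```

Each function returns its own tree's outcome; the forest's job is then to combine those per-tree outcomes into one decision.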
And it will go by the rule of majority voting. Now, since here two decision trees have voted in favor of class B, that is decision trees two and three, the final outcome will be in favor of class B. That is how random forest works. Now, one really beautiful thing about this particular algorithm is that it is one of the versatile algorithms, capable of performing both regression and classification. Let's try to understand random forest further with a very beautiful example; this is my favorite one. Let's say you want to decide whether you want to watch Edge of Tomorrow or not. In this particular scenario you have two different courses of action. Either you can go straight to your best friend and ask him: should I go for Edge of Tomorrow? Will I like this movie? Or you can ask a bunch of your friends, take their opinions into consideration, and then, based on the final results, you can go out and watch Edge of Tomorrow. So let's take the first scenario, where you go to your best friend and ask whether you should go out to watch Edge of Tomorrow or not. Your friend will probably ask you certain questions, the first one being about the genre. Let's say your friend asks you if you really like adventure kind of movies or not. You say yes, definitely, I would love to watch an adventure kind of movie. So the probability is that you will like Edge of Tomorrow as well, since Edge of Tomorrow is also a movie of the adventure and sci-fi kind of genre. Let's say you do not like adventure genre movies. Then again the probability reduces, and you might really not like Edge of Tomorrow. So from here you can come to a certain conclusion. Let's say your best friend puts you into another situation, where he'll ask you: do you like Emily Blunt?
And you say, definitely I like Emily Blunt. Then he puts another question to you: do you like Emily Blunt to be in the main lead? And you say yes. Then again, the probability rises that you will definitely like Edge of Tomorrow as well, because Edge of Tomorrow has Emily Blunt in the main lead cast. And if you say, oh, I do not like Emily Blunt, then again the probability reduces that you would like Edge of Tomorrow. So this is one way, where you have one decision tree, and your final decision will be based on that one decision tree, or you can say your final outcome will be based on just one friend. Now, you're definitely not really convinced. You want to consider the opinions of your other friends also, so that you can make a very precise and crisp decision. So you go out and approach a bunch of other friends of yours. Let's say you go to three of your friends and you ask them the same question: would I like to watch Edge of Tomorrow or not? So you go out and approach three friends: friend one, friend two, and friend three. You will consider each of their responses, and your decision will now depend on the compiled results of all three of your friends. Now here, let's say you go to your first friend and you ask him whether you would like to watch Edge of Tomorrow or not, and your first friend puts one question to you: did you like Top Gun? You say yes, definitely I did like the movie Top Gun, and the probability is that you would like Edge of Tomorrow as well, because Top Gun is a military action drama which also stars Tom Cruise. So again the probability rises that yes, you will like Edge of Tomorrow as well. And if you say no, I didn't like Top Gun, then again the chances are that you wouldn't like Edge of Tomorrow. Then another question that he puts across to you is: do you really like to watch action movies? And you say yes, I would love to watch them. Then again.
The chances are that you would like to watch Edge of Tomorrow. So from friend one you can come to one conclusion. Here, since the ratio of liking the movie to not liking it is 2 to 1, the final result is that you would like Edge of Tomorrow. Now you go to your second friend and you ask the same question. Your second friend asks you: did you like Far and Away when we went out and watched it the last time? And you say no, I really didn't like Far and Away. Then he would say, then you are definitely going to like Edge of Tomorrow. Why so? Because Far and Away, since most of you might not know it, is of the romance genre, and it revolves around a girl and a guy falling in love with each other and so on. So the probability is that you would like Edge of Tomorrow. Then he asks you another question: did you like Oblivion, and do you really like to watch Tom Cruise? And you say yes. Again, the probability is that you would like to watch Edge of Tomorrow. Why? Because Oblivion again is a science fiction film casting Tom Cruise, full of strange experiences, where Tom Cruise is the savior of the masses. Well, that is the same kind of plot as in Edge of Tomorrow. So here it is a pure yes that you would like to watch Edge of Tomorrow. So you get a second decision from your second friend. Now you go to your third friend and ask him. Probably your third friend is not really interested in having any sort of conversation with you, so he just simply asks you: did you like Godzilla? And you say no, I didn't like Godzilla. So he says, definitely you wouldn't like Edge of Tomorrow. Why so? Because Godzilla is also a science fiction movie from the adventure genre. So now you have got three results from three different decision trees, from three different friends. Now you compile the results of all those friends, and then you make a final call on whether you would like to watch Edge of Tomorrow or not.
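Compiling the three friends' answers is just a majority vote, which can be sketched in a few lines. The function names and the "like"/"dislike" labels are assumptions for illustration. The second helper shows the usual counterpart for regression, where a forest averages the numbers its trees predict instead of voting; the numeric values passed to it are made up.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Classification: return the label predicted by the most trees (or friends)."""
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Regression: return the mean of the numeric predictions."""
    return mean(predictions)


# Friend one and friend two say you would like the movie, friend three says you would not.
friend_opinions = ["like", "like", "dislike"]
print(majority_vote(friend_opinions))

# Made-up numeric outputs from three regression trees.
print(average_prediction([2.0, 3.0, 4.0]))
```

With an odd number of voters and two possible answers there is always a clear winner, which is one reason the example uses three friends.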
So this is a very real-life and very interesting example of how random forest plays out in ground reality. Now let us look at the various domains where random forest is used. Because of its versatility, random forest is used in various diverse domains: be it banking, be it medicine, be it land use, be it marketing, you name it and random forest is there. In banking particularly, random forest is used to make out whether an applicant will be a defaulter or a non-defaulter, so that the bank can accordingly approve or reject the loan application. That is how random forest is used in banking. Talking about medicine, random forest is widely used in the medical field to predict beforehand the probability that a person will have a particular disease or not. So it's used to look at various disease trends. Let's say you want to figure out the probability that a person will have diabetes or not. What would you do? You'd probably look at the medical history of the patient, and then you will see: all right, this has been the glucose concentration, what was the BMI, what were the insulin levels in the patient in the previous three months, what is the age of this particular person. You'll make different decision trees based on each one of these predictor variables, and then you'll finally compile the results of all those variables and make a final decision as to whether the person will have diabetes in the near future or not. That is how random forest is used in the medical sector. Now, moving on, random forest is also used to find out land use. For example, I want to set up a particular industry in a certain area. So what would I probably look for? I'd look at: what is the vegetation over there? What is the urban population over there?
And how much is the distance from the nearest modes of transport, like the bus station or the railway station? Accordingly, I will split my parameters and make a decision on each one of these parameters, and finally I'll compile my decisions on all these parameters, and that will be my final outcome. That is how I am finally going to predict whether I should put my industry at this particular location or not. So these three examples have been majorly classification problems, because we are trying to answer a whether-or-not question. Now, let's move forward and look at how marketing is revolving around random forest. Particularly in marketing, we try to identify the customer churn. This is a regression kind of problem. Now how? Let's see. Customer churn is nothing but the number of customers you are losing, the ones who are going out of your market. You want to identify what your customer churn will be in the near future. Mostly the e-commerce industries are using this, like Amazon, Flipkart, etc. They look at your every behavior: what has been your past history, what has been your purchasing history, what do you like, based on your activity around certain things, around certain ads, around certain discounts, and around certain kinds of materials. If you like a particular top, your activity will be more around that particular top. That is how they track each and every move of yours, and then they try to predict whether you will be moving out or not. So that is how they identify the customer churn. These are all various domains where random forest is used, and this is not the only list. There are numerous other examples which are using random forests, and that is what makes it so special.
Now, let's move forward and see how random forest actually works. Let us start with the random forest algorithm first, and see step by step how the algorithm works. The first step is to select certain m features from T, where m is less than T. Here T is the total number of predictor variables that you have in your data set, and out of those total predictor variables you will select a few features at random. Now, why are we selecting only a few features? The reason is that if you select all the predictor variables, then each of your decision trees will be the same. So the model is not learning something new. It is learning the same previous thing, because all those decision trees will be similar. But if you split your predictor variables and select a few of them at random, let's say there are 14 variables in total and out of those you randomly pick just three, then every time you will get a new decision tree, so there will be variety. So the classification model will be much more intelligent than the previous one. Now it has got varied experiences, so it will make different decisions each time. And when you compile all those different decisions, you get a more accurate and efficient result. So the first important step is to select a certain number of features out of all the features. Now, let's move on to the second step. Let's say for any node d, the next step is to calculate the best split at that point. You know how a decision tree is implemented: you pick the most significant variable, and then you split that particular node into further child nodes. That is how the split takes place. So you will do it for the m variables that you've selected.
Let's say you have selected three, so you will implement the split on all those three variables in one particular decision tree. The third step is to split the node into two daughter nodes. Now, you can split your root node into as many nodes as you want to, but here we'll split our node into two daughter nodes, as in this or that. So it will be an answer in terms of this or that. The fourth step will be to repeat all three steps that we've done previously, and we'll repeat all this splitting until we have reached all the n nodes. So we need to repeat until we have reached the leaf nodes of the decision tree. That is how we will do it. Now, after these four steps, we will have our one decision tree. But random forest is about multiple decision trees. So here our fifth step comes into the picture, which is to repeat all these previous steps D times, where D is the number of decision trees. Let's say I want to implement five decision trees. So my fifth step will be to carry out all the previous steps five times; the iteration runs five times. Now, once I have created these five decision trees, my task is still not complete. My final task will be to compile the results of all these five different decision trees, and I will make a call on the majority voting. Here, as you can see in this picture, I had n different instances, then I created n different decision trees, and finally I will compile the results of all these n different decision trees and take my call on the majority voting. So whatever my majority vote says will be my final result. This is basically an overview of how the random forest algorithm works. Let's just have a look at this example to get a much better understanding of what we have learnt. So let's say I have this data set which consists of 14 different instances, right?
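The five steps described above can be put together in a rough, pure-Python sketch. It is not the implementation used in the session (that comes later, in R) and it is not any library's API. The function names, the Gini impurity split criterion, the equality-test splits on categorical features, sampling rows with replacement, and the six toy rows are all assumptions made to keep the example short. Following the description, each tree gets its own random subset of m features.

```python
import random
from collections import Counter


def gini(rows, target):
    """Gini impurity of the target labels in rows (0.0 means the rows are pure)."""
    counts = Counter(r[target] for r in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())


def best_split(rows, features, target):
    """Step 2: find the (feature, value) test that lowers impurity the most, or None."""
    best, best_score = None, gini(rows, target)
    for f in features:
        for v in {r[f] for r in rows}:
            yes = [r for r in rows if r[f] == v]
            no = [r for r in rows if r[f] != v]
            if not yes or not no:
                continue
            score = (len(yes) * gini(yes, target) + len(no) * gini(no, target)) / len(rows)
            if score < best_score:
                best, best_score = (f, v), score
    return best


def build_tree(rows, features, target):
    """Steps 3 and 4: split into two daughter nodes and repeat until the leaves."""
    split = best_split(rows, features, target)
    if split is None:  # leaf: return the most common label
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    f, v = split
    yes = [r for r in rows if r[f] == v]
    no = [r for r in rows if r[f] != v]
    return (f, v, build_tree(yes, features, target), build_tree(no, features, target))


def predict_tree(tree, row):
    """Follow the yes/no branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, v, yes, no = tree
        tree = yes if row[f] == v else no
    return tree


def build_forest(rows, features, target, n_trees, m, seed=0):
    """Steps 1 and 5: for each tree pick m random features and a random sample of rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        chosen = rng.sample(features, m)
        sample = [rng.choice(rows) for _ in rows]  # sampling with replacement (assumed)
        forest.append(build_tree(sample, chosen, target))
    return forest


def predict_forest(forest, row):
    """Final step: majority vote over the trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Made-up toy rows, for illustration only.
FEATURES = ["outlook", "humidity", "wind"]
rows = [
    {"outlook": "sunny", "humidity": "high", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "humidity": "high", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "humidity": "high", "wind": "weak", "play": "no"},
    {"outlook": "rainy", "humidity": "normal", "wind": "weak", "play": "yes"},
    {"outlook": "rainy", "humidity": "normal", "wind": "strong", "play": "yes"},
    {"outlook": "overcast", "humidity": "normal", "wind": "weak", "play": "yes"},
]
forest = build_forest(rows, FEATURES, "play", n_trees=5, m=2)
print(predict_forest(forest, {"outlook": "rainy", "humidity": "high", "wind": "weak"}))
```

Because every tree sees a different random sample of rows and a different pair of features, the trees differ from one another, which is the variety the first step is meant to create.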
So basically it consists of the weather information of the previous 14 days, from D1 till D14. Outlook, humidity, and wind give me the weather conditions of those 14 days, and finally I have play, which is my target variable: whether the match did take place on that particular day or not. Now, my main goal is to find out whether the match will take place if I have the following weather conditions on any particular day. Let's say the outlook is rainy that day, the humidity is high, and the wind is very weak. So now I need to predict whether I will be able to play the match that day or not. All right, so this is the problem statement. Now, let's see how random forest is used to sort it out. Here the first step is to split my entire data set into subsets. I have split my 14 instances into smaller subsets. These subsets may or may not overlap; for instance, there is an overlap between D1 till D3 and D3 till D6, since both contain D3. So overlapping might happen, and you need not really worry about it, but you have to make sure that all those subsets are different. So here I have taken three different subsets. My first subset consists of D1 till D3, my second subset consists of D3 till D6, and my third subset consists of D7 till D9. Now I will first focus on my first subset. Here, let's ask whether on that particular day the outlook was overcast. If yes, it was overcast, then the probability is that the match will take place. Overcast is basically when your weather is too cloudy. So if that is the condition, then definitely the match will take place. And let's say it wasn't overcast. Then you will consider the second most probable option, which is the wind, and you will make a decision based on whether the wind was weak or strong. If the wind was weak, then you will definitely go out.
And play the match; else you would not. So now the final outcome of this decision tree will be play, because here the ratio between play and no play is 2 to 1. So we get a certain decision from the first decision tree. Now, let us look at the second subset. Since the second subset has a different number of instances, this decision tree is absolutely different from what we saw for our first subset. Let's say if it was overcast, then you will play the match. If it wasn't overcast, then you would go and look at humidity. It further splits into two: whether it was high or normal. We'll take the first case: if the humidity was high and the wind was weak, then you will play the match; else if the humidity was high but the wind was too strong, then you would not go out and play the match. Now let us look at the second daughter node of humidity: if the humidity was normal and the wind was weak, then you will definitely go out and play the match; else you won't go out and play the match. Here, if you look at the final result, the ratio of play to no play is 3 to 2, so again the final outcome is play. So from the second subset, we get the final decision of play. Now let us look at our third subset, which consists of D7 till D9. Here again, if it is overcast, then you will play the match; else you will go and check the humidity. If the humidity is really high, then you won't play the match; else you will play the match. Again the outcome is play, because the ratio of play to no play is 2 to 1. So: three different subsets, three different decision trees, three different outcomes, and one final outcome after compiling all the results from these three decision trees. I hope this gives a better perspective and a better understanding of how random forest really works. All right, so now let's just have a look at the various features of random forest.
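The three trees from this example can be sketched as small functions and applied to the day posed in the problem statement (outlook rainy, humidity high, wind weak). The function names, dictionary keys, and label strings are assumptions for illustration. Note one difference from the walkthrough: there, each tree's outcome is read off from the ratio of play to no-play leaves, whereas this sketch evaluates each tree on the specific day and then takes a majority vote.

```python
from collections import Counter


def tree_one(day):
    """First subset: overcast, then wind."""
    if day["outlook"] == "overcast":
        return "play"
    return "play" if day["wind"] == "weak" else "no play"


def tree_two(day):
    """Second subset: overcast, then humidity, then wind."""
    if day["outlook"] == "overcast":
        return "play"
    if day["humidity"] == "high":
        return "play" if day["wind"] == "weak" else "no play"
    # Normal humidity: as described, the decision again depends on the wind.
    return "play" if day["wind"] == "weak" else "no play"


def tree_three(day):
    """Third subset: overcast, then humidity."""
    if day["outlook"] == "overcast":
        return "play"
    return "no play" if day["humidity"] == "high" else "play"


# The day from the problem statement.
day = {"outlook": "rainy", "humidity": "high", "wind": "weak"}
votes = [tree(day) for tree in (tree_one, tree_two, tree_three)]
final = Counter(votes).most_common(1)[0][0]
print(votes, "->", final)
```

For this day the first two trees vote play and the third votes no play, so the majority decision is to play the match.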
So the first and foremost feature is that it is one of the most accurate learning algorithms. Why is that so? Because single decision trees are prone to having high variance or high bias, and on the contrary, random forest averages the variance across the decision trees. Let's say the variance is X for one decision tree, and for random forest we have implemented n decision trees in parallel. Then my variance gets averaged over n, and my final variance becomes X upon n. That is how the variance goes down as compared to other algorithms. Now, the second most important feature is that it works well for both classification and regression problems, and of the algorithms I have come across so far, this is the one and only algorithm which works equally well for both of them, be it a classification kind of problem or a regression kind of problem. Then, it runs efficiently on large databases. So basically it's really scalable, whether you work with a smaller amount of data or with a really huge volume of data. That's a very good part about it. Then the fourth most important point is that it requires almost no input preparation. Why am I saying this? Because it has got certain built-in methods that help it cope with outliers and missing data, so you don't have to worry about all of that as much while you are in the input preparation stage. Random forest is there to take care of it. And next, it performs implicit feature selection. While we are implementing multiple decision trees, it has got an implicit method which will automatically pick some random features out of all your parameters, and then it will go on implementing different decision trees.
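The variance argument made at the start of this passage (X for one tree, X upon n for n trees) can be checked with a small simulation. This is a sketch under a strong assumption: each "tree" is modeled as an independent noisy guess with variance 1, and the constants and names below are made up for the illustration. Real trees in a forest are trained on overlapping data and are correlated, so the reduction seen in practice is smaller than a clean X upon n.

```python
import random
from statistics import mean, pvariance

rng = random.Random(42)   # fixed seed so the run is repeatable
N_TREES = 10              # number of independent "trees" being averaged
TRIALS = 5000             # how many times the experiment is repeated

# One noisy prediction per trial: the variance should be close to 1.
single = [rng.gauss(0.0, 1.0) for _ in range(TRIALS)]

# The average of N_TREES independent noisy predictions per trial:
# the variance should be close to 1 / N_TREES.
averaged = [mean(rng.gauss(0.0, 1.0) for _ in range(N_TREES)) for _ in range(TRIALS)]

print("variance of a single prediction:", round(pvariance(single), 3))
print("variance of the averaged prediction:", round(pvariance(averaged), 3))
```

The second number comes out roughly ten times smaller than the first, which is the X upon n effect for independent predictors.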
So for example, if you just give one simple command that, all right, I want to implement 500 decision trees, no matter how, random forest will automatically take care of it and implement all those 500 decision trees, and all those 500 decision trees will be different from each other. This is because it has got implicit methods which will automatically collect different parameters by itself out of all the variables that you have, right? Then, it can easily be grown in parallel. Why is that so? Because we are actually implementing multiple decision trees, and all those decision trees are getting implemented in parallel. So if you say, I want a thousand trees to be implemented, all those thousand trees are getting implemented in parallel. So that is how the computation time reduces, right? And the last point is that it has got methods for balancing error in unbalanced data sets. Now what exactly are unbalanced data sets? Let me just give you an example of that. So let's say you're working on a data set, fine, and you create a random forest model and get 90% accuracy immediately. Fantastic, you think, right? So now you start diving deep, you go a little deeper, and you discover that ninety percent of that data actually belongs to just one class. Then your entire data set, your entire decision, is actually biased towards just one particular class. So random forest actually takes care of this thing, and it is really not biased towards any particular decision tree or any particular variable or any class. So it has got methods which look after it and do all the balancing of errors in your data sets. So that's pretty much about the features of random forests.
K-nearest neighbors is a simple algorithm which uses the entire data set as its training phase and consults it whenever a prediction is required for unseen data.
What it does is search through the entire training data set for the k most similar instances, and the label of those most similar instances is finally returned as the prediction. So hello and welcome all to this YouTube session. In today's session we'll be dealing with the KNN algorithm. So without any further ado, let's move on and discuss the agenda for today's session. We'll start our session with what is KNN, where I'll brief you about the topic, and we'll move ahead to see its popular use cases, or how the industry is using KNN for its benefit. Once we are done with that, we will drill down to the working of the algorithm, and while learning the algorithm you will also understand the significance of K, or what this K stands for in the k-nearest neighbor algorithm. Then we'll see how the prediction is made using the KNN algorithm manually, or mathematically. All right. Now once we are done with the theoretical concepts, we'll start the practical or demo session, where we'll learn how to implement the KNN algorithm using Python. So let's start our session. So starting with what is the KNN algorithm: well, k-nearest neighbor is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It suggests that if you are similar to your neighbors, then you are one of them, right? For example, if an apple looks more similar to a banana, orange, or melon rather than a monkey, rat, or cat, then most likely the apple belongs to the group of fruits. All right. Well, in general KNN is used in search applications where you are looking for similar items, that is, when your task is some form of "find items similar to this one". Then you call this search a KNN search. But what is this K in KNN? Well, this K denotes the number of nearest neighbors which are voting on the class of the new data or the testing data. For example, if k equals 1, then the testing data is given the same label as the closest example in the training set.
If k equals 3, the labels of the three closest instances are checked and the most common label is assigned to the testing data. So this is what the K in the KNN algorithm means. So moving on ahead, let's see some example scenarios where KNN is used in the industry. So, let's see the industrial applications of the KNN algorithm, starting with recommender systems. Well, the biggest use case of KNN search is a recommender system. A recommender system is like an automated form of a shop counter guy: when you ask him for a product, he not only shows you the product but also suggests or displays a relevant set of products which are related to the item you're already interested in buying. The KNN algorithm applies to recommending products, like on Amazon, or recommending media, like in the case of Netflix, or even recommending advertisements to display to a user. If I'm not wrong, almost all of you must have used Amazon for shopping, right? So just to tell you, more than 35% of amazon.com's revenue is generated by its recommendation engine. So what's the strategy? Amazon uses recommendation as a targeted marketing tool in both its email campaigns and on most of its website pages. Amazon will recommend many products from different categories based on what you have browsed, and it will pull those products in front of you which you are likely to buy, like the frequently bought together option that comes at the bottom of the product page to tempt you into buying the combo. Well, this recommendation has just one main goal, that is to increase average order value, or to upsell and cross-sell customers by providing product suggestions based on the items in the shopping cart or based on the product they're currently looking at on site. So the next industrial application of the KNN algorithm is concept search, or searching semantically similar documents and classifying documents containing similar topics. So as you know, the data on the Internet is increasing exponentially every single second.
There are billions and billions of documents on the internet, and each document contains multiple concepts that could be a potential concept. Now, this is a situation where the main problem is to extract concepts from a set of documents. As each page could have thousands of combinations that could be potential concepts, an average document could have millions of concepts. Combine that with the vast amount of data on the web, and we are talking about an enormous amount of data sets and samples. So what we need is to find a concept from this enormous amount of data sets and samples, right? So for this purpose, we'll be using the KNN algorithm. More advanced examples could include handwriting detection, like in OCR, or image recognition, or even video recognition. All right. So now that you know various use cases of the KNN algorithm, let's proceed and see how it works. So how does a KNN algorithm work? Let's start by plotting these blue and orange points on our graph. So these blue points belong to class A and the orange ones belong to class B. Now you get a star as a new point, and your task is to predict whether this new point belongs to class A or to class B. So to start the prediction, the very first thing that you have to do is select the value of K. Just as I told you, the K in the KNN algorithm refers to the number of nearest neighbors that you want to select. For example, in this case k equals 3. So what does it mean? It means that I am selecting the three points which are the least distance from the new point, or you can say I am selecting three different points which are closest to the star. Well, at this point of time you can ask, how will you calculate the least distance? So once you calculate the distance, you will get one blue and two orange points which are closest to this star.
Since in this case we have a majority of orange points, you can say that for k equals 3 the star belongs to class B, or you can say that the star is more similar to the orange points. Moving on ahead, well, what if k equals 6? Well, for this case, you have to look for six different points which are closest to this star. So in this case, after calculating the distance, we find that we have four blue points and two orange points which are closest to the star. Now, as you can see that the blue points are in the majority, you can say that for k equals 6 this star belongs to class A, or the star is more similar to the blue points. So by now, I guess you know how a KNN algorithm works and what the significance of K in the KNN algorithm is. So how will you choose the value of K? Keep in mind that K is the most important parameter in the KNN algorithm. So, let's see, when you build a k-nearest neighbor classifier, how will you choose a value of K? Well, you might have a specific value of K in mind, or you could divide up your data and use something like the cross-validation technique to test several values of K in order to determine which works best for your data. For example, if n equals 2,000 cases, then in that case the optimal value of K lies somewhere in between 1 and 19. But yes, unless you try it you cannot be sure of it. So, you know how the algorithm works at a higher level. Let's move on and see how things are predicted using the KNN algorithm. Remember I told you the KNN algorithm uses the least distance measure in order to find its nearest neighbors. So, let's see how this distance is calculated. Well, there are several distance measures which can be used. To start with, we'll mainly focus on Euclidean distance and Manhattan distance in this session. So what is this Euclidean distance? Well, the Euclidean distance is defined as the square root of the sum of squared differences between a new point x and an existing point y. So for example, here we have points P1 and P2:
P1 is (1, 1) and P2 is (5, 4). So what is the Euclidean distance between both of them? You can see that the Euclidean distance is the direct distance between two points. So what is the distance between the points P1 and P2? We can calculate it as 5 minus 1 whole squared plus 4 minus 1 whole squared, that is 16 plus 9, and take the square root of that, which results in 5. So next is the Manhattan distance. Well, the Manhattan distance is used to calculate the distance between real vectors using the sum of their absolute differences. In this case the Manhattan distance between the points P1 and P2 is the mod of 5 minus 1 plus the mod of 4 minus 1, which results in 4 plus 3, that is 7. So this slide shows the difference between the Euclidean and Manhattan distance from point A to point B. The Euclidean distance is nothing but the direct or least possible distance between A and B, whereas the Manhattan distance is the distance between A and B measured along the axes at right angles. Let's take an example and see how things are predicted using the KNN algorithm, or how the KNN algorithm works. Suppose we have a data set which consists of the height, weight, and T-shirt size of some customers. Now when a new customer comes, we only have his height and weight as information, and our task is to predict the T-shirt size of that particular customer. So for this we'll be using the KNN algorithm. So the very first thing that we need to do is calculate the Euclidean distance. So now you have new data of height 161 centimeters and weight 61 kg. The Euclidean distance is nothing but the square root of 161 minus 158 whole squared plus 61 minus 58 whole squared, and the square root of that is 4.24. Let's drag and drop it. So these are the various Euclidean distances of the other points.
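The two distance measures and the worked numbers above can be reproduced in a few lines. This is a minimal sketch; the helper names `euclidean` and `manhattan` are illustrative and are not taken from the session's notebook.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p1, p2 = (1, 1), (5, 4)
print(euclidean(p1, p2))  # 5.0
print(manhattan(p1, p2))  # 7

# New customer (161 cm, 61 kg) against the (158, 58) point from the example.
print(round(euclidean((161, 61), (158, 58)), 2))  # 4.24
```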
Now, let's suppose k equals 5. Then what the algorithm does is search for the five customers closest to the new customer, that is, most similar to the new data in terms of its attributes. For k equals 5, let's find the top five minimum Euclidean distances. So these are the distances which we are going to use: one, two, three, four, and five. So let's rank them in order: this is first, this is second, this is third, then this one is fourth, and again this one is fifth. So there is our order. So for k equals 5 we have four T-shirts which come under size M and one T-shirt which comes under size L. So obviously the best guess, or the best prediction, for the T-shirt size of height 161 centimeters and weight 61 kg is M, or you can say that the new customer fits into size M. Well, this was all about the theoretical session, but before we drill down to the coding part, let me just tell you why people call KNN a lazy learner. Well, KNN for classification is a very simple algorithm, but that's not why it is called lazy. KNN is a lazy learner because it doesn't learn a discriminative function from the training data. What it does instead is memorize the training data; there is no learning phase of the model, and all of the work happens at the time a prediction is requested. So that's the reason why KNN is often referred to as a lazy learning algorithm. So this was all about the theoretical session. Now, let's move on to the coding part. So for the practical implementation or the hands-on part, I'll be using the Iris data set. This data set consists of 150 observations. We have four features and one class label. The four features include the sepal length, sepal width, petal length, and petal width, whereas the class label decides which flower belongs to which category. So this was the description of the data set which we are using. Now, let's move on and see the step-by-step solution to perform the KNN algorithm.
So first we'll start by handling the data: we have to open the data set from the CSV format and split the data set into train and test parts. Next we'll take the similarity, where we have to calculate the distance between two data instances. Once we calculate the distance, next we'll look for the neighbors and select the K neighbors which have the least distance from a new point. Now once we get our neighbors, we'll generate a response from the set of data instances. So this will decide whether the new point belongs to class A or class B. Finally, we'll create the accuracy function, and in the end we'll tie it all together in the main function. So let's start with our code for implementing the KNN algorithm using Python. I'll be using Jupyter Notebook with Python 3.0 installed on it. Now, let's move on and see how the KNN algorithm can be implemented using Python. So there's my Jupyter Notebook, which is a web-based interactive computing notebook environment with Python 3.0 installed on it. So it's launching. So there's our Jupyter Notebook, and we'll be writing our Python code on it. So the first thing that we need to do is load our file. Our data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module. So let's write the code to load our data file. Let's execute the Run button. So once you execute the Run button, you can see the entire training data set as the output. Next, we need to split the data into a training data set that KNN can use to make predictions and a test data set that we can use to evaluate the accuracy of the model. So we first need to convert the flower measurements that were loaded as strings into numbers that we can work with. Next, we need to split the data set randomly into train and test sets. A ratio of 67 to 33 for train to test is a standard ratio which is used for this purpose.
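The loading, conversion, and random split just described can be sketched as follows. The notebook code is not shown in the transcript, so this is a reconstruction under assumptions: the signature here returns the two sets instead of filling lists passed in, and a tiny in-memory sample in the Iris column layout stands in for the real CSV file so the sketch runs on its own.

```python
import csv
import io
import random

def load_dataset(csv_file, split, rng):
    """Read headerless CSV rows, convert the four measurements to floats,
    and randomly assign each row to the training or the test set."""
    training_set, test_set = [], []
    for row in csv.reader(csv_file):
        if not row:  # skip blank lines
            continue
        record = [float(value) for value in row[:4]] + [row[4]]
        # Each row goes to training with probability `split`.
        if rng.random() < split:
            training_set.append(record)
        else:
            test_set.append(record)
    return training_set, test_set

# Illustrative sample rows only; pass an opened file object for real data.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
train, test = load_dataset(sample, split=0.67, rng=random.Random(1))
print(len(train), len(test))
```

Because each row is assigned independently, the resulting sizes only approximate the requested ratio, which is consistent with the uneven counts reported when the function is tested later in the session.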
So let's define a function, load data set, that loads a CSV with the provided file name and splits it randomly into training and test data sets using the provided split ratio. So this is our function load data set, which takes the file name, split ratio, training data set, and testing data set as its input. All right. So let's execute the Run button and check for any errors. So it's executed with zero errors. Let's test this function. So there's our training set, testing set, and load data set. Inside the function we are passing our Iris data file with a split ratio of 0.66, along with the training data set and test data set. Let's see what our training data set and test data set are divided into. So it's giving a count of the training data set and testing data set. The total number of training rows it has split into is 97, and the total number of test rows we have is 53. All right. Okay. So our function load data set is performing well. So let's move on to step two, which is similarity. In order to make predictions, we need to calculate the similarity between any two given data instances. This is needed so that we can locate the k most similar data instances in the training data set and in turn make a prediction. Given that all four flower measurements are numeric and have the same units, we can directly use the Euclidean distance measure, which is nothing but the square root of the sum of squared differences between two arrays of numbers. Additionally, we want to control which fields to include in the distance calculation. So specifically, we only want to include the first four attributes.
So our approach will be to limit the Euclidean distance to a fixed length. All right. So let's define our Euclidean function. So this is our Euclidean distance function, which takes instance one, instance two, and length as parameters. Instance one and instance two are the two points between which you want to calculate the Euclidean distance, whereas the length denotes how many attributes you want to include. Okay. So there's our Euclidean function. Let's execute it. It's executing fine without any errors. Let's test the function. Suppose data one, or the first instance, consists of the data point 2, 2, 2 and it belongs to class A, and data two consists of 4, 4, 4 and it belongs to class B. So when we calculate the Euclidean distance from data one to data two, we have to consider only the first three features of them. All right. So let's print the distance. As you can see here, the distance comes out to be 3.464. All right. So this distance is nothing but the Euclidean distance, and it is calculated as the square root of 4 minus 2 whole squared plus 4 minus 2 whole squared plus 4 minus 2 whole squared, that is nothing but 3 times 4 minus 2 whole squared, that is 12, and the square root of 12 is nothing but 3.464. All right. So now that we have calculated the distance, we need to look for the K nearest neighbors. Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance. Well, this is a straightforward process of calculating the distance for all the instances and selecting a subset with the smallest distance values. So for that, we'll be defining a function, get neighbors. What it will do is return the K most similar neighbors from the training set for a given test instance. All right.
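The length-limited distance function tested above can be sketched like this. The parameter names follow the spoken description (instance one, instance two, length); the body is a reconstruction, not the code from the notebook.

```python
import math

def euclidean_distance(instance1, instance2, length):
    """Euclidean distance over the first `length` attributes only,
    so a trailing class label is never included in the sum."""
    total = 0.0
    for i in range(length):
        total += (instance1[i] - instance2[i]) ** 2
    return math.sqrt(total)

data1 = [2, 2, 2, "a"]
data2 = [4, 4, 4, "b"]
distance = euclidean_distance(data1, data2, 3)
print(round(distance, 3))  # 3.464
```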
So this is how our get neighbors function looks. It takes the training data set, a test instance, and K as its input. Here K is nothing but the number of nearest neighbors you want to check for. All right. So basically what you'll be getting from this get neighbors function is K different points having the least Euclidean distance from the test instance. All right, let's execute it. So the function executed without any errors. So let's test our function. Suppose the training data set includes data like 2, 2, 2, which belongs to class A, and 4, 4, 4, which belongs to class B, and our test instance is 5, 5, 5. Now we have to predict whether this test instance belongs to class A or to class B. All right. For k equals 1 we have to find its nearest neighbor and predict whether this test instance will belong to class A or class B. All right. So let's execute the Run button. So on executing the Run button you can see that the output is 4, 4, 4 and B. The new instance 5, 5, 5 is closest to the point 4, 4, 4, which belongs to class B. All right. Now once you have located the most similar neighbors for a test instance, the next task is to predict a response based on those neighbors. So how can we do that? Well, we can do this by allowing each neighbor to vote for its class attribute and taking the majority vote as the prediction. Let's see how we can do that. So we have a function, get response, which takes neighbors as the input. Well, this neighbors input is nothing but the output of the get neighbors function. The output of the get neighbors function will be fed to get response. All right, let's execute the Run button. It's executed. Let's move ahead and test our function get response. So we have neighbors 1, 1, 1, which belongs to class A; 2, 2, 2, which belongs to class A; and 3, 3, 3, which belongs to class B. So this response variable will store the value of get response when we pass it these neighbors. All right.
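The neighbor search and the majority vote described above can be sketched together. Function names mirror the ones spoken in the session (get neighbors, get response), but the bodies are reconstructions under the assumption that the class label sits in the last column of every training row.

```python
import math
from collections import Counter

def euclidean_distance(instance1, instance2, length):
    """Euclidean distance over the first `length` attributes."""
    return math.sqrt(
        sum((instance1[i] - instance2[i]) ** 2 for i in range(length))
    )

def get_neighbors(training_set, test_instance, k):
    """Return the k training rows closest to the test instance."""
    length = len(training_set[0]) - 1  # exclude the class label column
    by_distance = sorted(
        training_set,
        key=lambda row: euclidean_distance(test_instance, row, length),
    )
    return by_distance[:k]

def get_response(neighbors):
    """Majority vote over the class labels of the neighbors."""
    votes = Counter(row[-1] for row in neighbors)
    return votes.most_common(1)[0][0]

train = [[2, 2, 2, "a"], [4, 4, 4, "b"]]
print(get_neighbors(train, [5, 5, 5], 1))  # [[4, 4, 4, 'b']]

neighbors = [[1, 1, 1, "a"], [2, 2, 2, "a"], [3, 3, 3, "b"]]
print(get_response(neighbors))  # a
```

Sorting the whole training set is the simplest way to find the k closest rows; for large data sets a heap or a spatial index would usually be preferred.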
So what we want to check is whether that test instance 5, 5, 5 belongs to class A or class B when the neighbors are 1, 1, 1 in A, 2, 2, 2 in A, and 3, 3, 3 in B. So let's check our response. Now that we have created all the different functions which are required for a KNN algorithm, an important concern is how you evaluate the accuracy of the predictions. An easy way to evaluate the accuracy of the model is to calculate the ratio of the total correct predictions to all the predictions made. So for this I will be defining a function, get accuracy, and inside that I'll be passing my test data set and the predictions. Let's run the get accuracy function and check it. It executed without any error. Let's check it on a sample data set. So we have our test data set as 1, 1, 1, which belongs to class A; 2, 2, 2, which again belongs to class A; and 3, 3, 3, which belongs to class B. And my predictions are: for the first test data, it predicted that it belongs to class A, which is true; for the next, it predicted that it belongs to class A, which is again true; and for the next, it again predicted that it belongs to class A, which is false in this case, because the test data belongs to class B. All right. So in total we have two correct predictions out of three. All right. So the ratio will be 2 by 3, which is nothing but 66.6 percent. So our accuracy rate is 66.6 percent. So now that you have created all the functions that are required for the KNN algorithm, let's compile them into one single main function. Alright, so this is our main function, and we are using the Iris data set with a split of 0.67, and the value of K is 3. Let's see what the accuracy score of this is, and check how accurate our model is. So in the training data set we have 113 values, and in the test data set we have 37 values. These are the predicted and the actual values of the output. Okay. So in total, we got an accuracy of 97.29 percent, which is really very good.
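The accuracy check on the three-row sample above can be sketched as follows. The function name matches the one spoken in the session; the body is a reconstruction that assumes the true class label is the last element of each test row.

```python
def get_accuracy(test_set, predictions):
    """Percentage of test rows whose last element matches the prediction."""
    correct = sum(
        1 for row, predicted in zip(test_set, predictions)
        if row[-1] == predicted
    )
    return 100.0 * correct / len(test_set)

test_set = [[1, 1, 1, "a"], [2, 2, 2, "a"], [3, 3, 3, "b"]]
predictions = ["a", "a", "a"]
accuracy = get_accuracy(test_set, predictions)
print(round(accuracy, 2))  # 66.67
```

By the same formula, assuming the test split holds the remaining 37 of the 150 Iris rows, 36 correct predictions give about 97.3 percent, in line with the figure quoted for the full run.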
Alright, so I hope the concept of this KNN algorithm is clear to you guys.
In a world full of machine learning and artificial intelligence surrounding almost everything around us, classification and prediction is one of the most important aspects of machine learning. So before moving forward, let's have a quick look at the agenda. I'll start off this video by explaining to you guys what exactly naive Bayes is. Then we'll understand what Bayes theorem is, which serves as the logic behind the naive Bayes algorithm. Moving forward, I'll explain the steps involved in the naive Bayes algorithm one by one, and finally I'll finish off this video with a demo on naive Bayes using the sklearn package. Now, naive Bayes is a simple but surprisingly powerful algorithm for predictive analysis. It is a classification technique based on Bayes theorem with an assumption of independence among predictors. It comprises two parts, which are naive and Bayes. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability of whether a fruit is an apple or an orange or a banana. So that is why it is known as naive. Now, a naive Bayes model is easy to build and particularly useful for very large data sets. In probability theory and statistics, Bayes theorem, which is alternatively known as Bayes law or Bayes rule, describes the probability of an event based on prior knowledge of the conditions that might be related to the event. Now, Bayes theorem is a way to figure out conditional probability. The conditional probability is the probability of an event happening given that it has some relationship to one or more other events. For example, your probability of getting a parking space is connected to the time of day you park.
The same goes for where you park and what conventions are going on at that time. Bayes theorem is slightly more nuanced. In a nutshell, it gives you the actual probability of an event given information about tests. Now, if you look at the definition of Bayes theorem, we can see that given a hypothesis H and the evidence E, Bayes theorem states that the relationship between the probability of the hypothesis before getting the evidence, which is P(H), and the probability of the hypothesis after getting the evidence, that is P(H given E), is defined as the probability of E given H into the probability of H divided by the probability of E. It's rather confusing, right? So let's take an example to understand this theorem. So suppose I have a deck of cards, and a single card is drawn from the deck of playing cards. The probability that the card is a king is 4 by 52, since there are four kings in a standard deck of 52 cards. Now if King is the event "this card is a king", the probability of King is given as 4 by 52, that is equal to 1 by 13. Now if evidence is provided, for instance someone looks at the card and says that the single card is a face card, the probability of King given that it's a face card can be calculated using Bayes theorem by this formula. Now since every king is also a face card, the probability of Face given that it's a king is equal to 1, and since there are three face cards in each suit, that is the jack, king, and queen, the probability of a face card is equal to 12 by 52, that is 3 by 13. Now using Bayes theorem we can find out the probability of King given that it's a face card. So our final answer comes to 1 by 3, which is also true: if you have a deck of cards which has only faces, there are three types of faces, which are the jack, king, and queen, so the probability that it's a king is 1 by 3. Now, this is a simple example of how Bayes theorem works. Now let's look at the proof, as in how this Bayes theorem evolved.
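Before the proof, the playing-card example above can be checked with exact fractions. This is a small sketch; the variable names are illustrative, and the numbers are the ones stated in the example (4 kings, 12 face cards, 52 cards).

```python
from fractions import Fraction

p_king = Fraction(4, 52)         # prior: P(King)
p_face = Fraction(12, 52)        # evidence: P(Face)
p_face_given_king = Fraction(1)  # every king is a face card

# Bayes theorem: P(King | Face) = P(Face | King) * P(King) / P(Face)
p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)  # 1/3
```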
So here we have the probability of A given B and the probability of B given A. Now for a joint probability distribution over the events A and B, given the probability of A intersection B, the conditional probability of A given B is defined as the probability of A intersection B divided by the probability of B, and similarly the probability of B given A is defined as the probability of B intersection A divided by the probability of A. Now we can equate the probability of A intersection B and the probability of B intersection A, as both are the same thing. Now from this, as you can see, we get our final Bayes theorem proof, which is: the probability of A given B equals the probability of B given A into the probability of A divided by the probability of B. Now while this is the equation that applies to any probability distribution over the events A and B, it has a particularly nice interpretation in the case where A is represented as the hypothesis H and B is represented as some observed evidence E. In that case the formula is P(H given E) is equal to P(E given H) into the probability of H divided by the probability of E. Now this relates the probability of the hypothesis before getting the evidence, which is P(H), to the probability of the hypothesis after getting the evidence, which is P(H given E). For this reason P(H) is known as the prior probability, while P(H given E) is known as the posterior probability, and the factor that relates the two, P(E given H) divided by P(E), is known as the likelihood ratio. Now using these terms, Bayes theorem can be rephrased as: the posterior probability equals the prior probability times the likelihood ratio. So now that we know the maths which is involved behind Bayes theorem, let's see how we can implement this in a real-life scenario. So suppose we have a data set in which we have the outlook, the humidity, and the wind, and we need to find out whether we should play or not on that day.
So the outlook can be sunny, overcast, or rainy, the humidity is high or normal, and the wind is categorized into two classes, which are weak and strong winds. First of all, we'll create a frequency table using each attribute of the data set. So the frequency table for the outlook looks like this: we have sunny, overcast, and rainy. The frequency table of humidity looks like this, and the frequency table of wind looks like this: we have strong and weak for wind and high and normal for humidity. So for each frequency table, we will generate a likelihood table. Now the likelihood table contains the probability of a particular day. Suppose we take sunny and we take play as yes and no. So the probability of sunny given that we play, yes, is 3 by 10, which is 0.3. The probability of X, which is the probability of sunny, is equal to 5 by 14. Now these are all terms which are just generated from the data which we have here. And finally the probability of yes is 10 out of 14. So if we have a look at the likelihood of yes given that it's sunny, we can see using Bayes theorem that it's the probability of sunny given yes into the probability of yes divided by the probability of sunny. So we have all the values here calculated. If you put them in our Bayes theorem equation, we get the likelihood of yes as 0.59. Similarly, the likelihood of no can also be calculated; here it is 0.40. Now similarly, we are going to create the likelihood tables for both the humidity and the wind. So for humidity, the likelihood of yes given that the humidity is high is equal to 0.42, and the probability of playing no given that the humidity is high is 0.58. Similarly, for the wind table, the probability of yes given that the wind is weak is 0.75, and the probability of no given that the wind is weak is 0.25. Now suppose we have a day which has rain, high humidity, and weak wind. So should we play or not?
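The single-attribute calculation for the sunny outlook can be reproduced from the counts quoted above (14 days, 10 yes, 5 sunny, 3 sunny and yes). This is a sketch with one labeled inference: the "no" counts are derived by subtraction and are not stated in the session. With exact fractions the result for yes is 0.6; the 0.59 quoted in the session is what you get when the intermediate values are rounded to 0.71 and 0.36 first.

```python
from fractions import Fraction

# Counts quoted in the session.
total_days = 14
yes_days = 10
sunny_days = 5
sunny_and_yes = 3

# Derived by subtraction (an inference, not stated in the session).
no_days = total_days - yes_days            # 4
sunny_and_no = sunny_days - sunny_and_yes  # 2

def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

p_sunny = Fraction(sunny_days, total_days)

p_yes_given_sunny = posterior(
    Fraction(sunny_and_yes, yes_days),  # P(sunny | yes) = 3/10
    Fraction(yes_days, total_days),     # P(yes) = 10/14
    p_sunny,                            # P(sunny) = 5/14
)
p_no_given_sunny = posterior(
    Fraction(sunny_and_no, no_days),    # P(sunny | no) = 2/4
    Fraction(no_days, total_days),      # P(no) = 4/14
    p_sunny,
)
print(float(p_yes_given_sunny), float(p_no_given_sunny))  # 0.6 0.4
```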
For that, we use Bayes theorem here again. The likelihood of yes on that day is equal to the probability of outlook rain given that it's a yes, into the probability of high humidity given that it's a yes, into the probability of weak wind given that we are playing, yes, into the probability of yes, which equals 0.019. Similarly, the likelihood of no on that day is equal to 0.016. Now if we look at the probability of yes for that day of playing, we just need to divide it by the sum of the likelihoods of both yes and no. So the probability of playing tomorrow, which is yes, is 0.55, whereas the probability of not playing is equal to 0.45. Now, this is based upon the data which we already have with us. So now that you have an idea of what exactly naive Bayes is and how it works, and we have seen how it can be implemented on a particular data set, let's see where it is used in the industry. Let's start with our first industrial use case, which is news categorization, or we can use the term text classification to broaden the spectrum of this algorithm. News on the web is rapidly growing in the era of the information age, where each news site has its own different layout and categorization for grouping news. Now this heterogeneity of layout and categorization cannot always satisfy an individual user's needs. Removing this heterogeneity and classifying the news articles according to the user's preference is a formidable task. Companies use web crawlers to extract useful text from the HTML pages of the news articles, and each of these news articles is then tokenized. Now these tokens are nothing but the categories of the news. Now in order to achieve better classification results, we remove the less significant words, which are the stop words, from the documents or the articles, and then we apply the naive Bayes classifier for classifying the news contents based on the news. Now the next one is by far one of the best examples of the naive Bayes classifier, which is spam filtering.
Now, Naive Bayes classifiers are a popular statistical technique for email filtering. They typically use bag-of-words features to identify spam email, an approach commonly used in text classification as well. It works by correlating the use of tokens with spam and non-spam emails, and then Bayes' theorem, which I explained earlier, is used to calculate the probability that an email is or is not spam. So Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of an individual user and give low false-positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with its roots in the 1990s. Particular words have particular probabilities of occurring in spam and in legitimate email as well. For instance, most email users will frequently encounter the word lottery or lucky draw in a spam email, but will seldom see it in other emails. The filter doesn't know these probabilities in advance and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all the words in each training email, the filter will adjust the probability that each word will appear in spam or in legitimate email in its database. Now after training, the word probabilities, also known as the likelihood functions, are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email contributes to the email's spam probability. This contribution is called the posterior probability and is computed again using Bayes' theorem. Then the email's spam probability is computed over all the words in the email, and if the total exceeds a certain threshold, say 95%, the filter will mark the email as spam.
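The training and scoring steps just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the training emails are made up, the function names `train`, `spam_probability` and `is_spam` are hypothetical, and add-one (Laplace) smoothing is an extra assumption so that a word never seen in one class does not force that class's probability to zero.

```python
import math
from collections import Counter

def train(emails):
    """Count words per class from (text, is_spam) pairs labelled by the user."""
    word_counts = {True: Counter(), False: Counter()}
    email_counts = {True: 0, False: 0}
    for text, spam_label in emails:
        email_counts[spam_label] += 1
        word_counts[spam_label].update(text.lower().split())
    return word_counts, email_counts

def spam_probability(text, word_counts, email_counts):
    """Posterior probability that `text` is spam (needs at least one email per class)."""
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    total_emails = sum(email_counts.values())
    log_score = {}
    for label in (True, False):
        # Prior: fraction of training emails carrying this label.
        score = math.log(email_counts[label] / total_emails)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Word likelihood with add-one smoothing (an assumption of this sketch).
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        log_score[label] = score
    # Normalize the two scores so they sum to one.
    top = max(log_score.values())
    spam = math.exp(log_score[True] - top)
    ham = math.exp(log_score[False] - top)
    return spam / (spam + ham)

THRESHOLD = 0.95  # the "say 95%" cut-off mentioned in the talk

def is_spam(text, word_counts, email_counts):
    return spam_probability(text, word_counts, email_counts) > THRESHOLD

# Made-up training emails, labelled by hand (True means spam).
training = [
    ("win the lottery now", True),
    ("lucky draw winner claim your prize", True),
    ("lottery prize waiting claim now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday with the team", False),
    ("project report attached for review", False),
]
word_counts, email_counts = train(training)
print(is_spam("claim your lottery prize", word_counts, email_counts))
print(is_spam("meeting on monday", word_counts, email_counts))
```

Working in log space avoids multiplying many tiny numbers together, which is a common design choice for this kind of filter.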
Now object detection is the process of finding instances of real-world objects, such as faces, bicycles, and buildings, in images or video. Object detection algorithms typically use extracted features and a learning algorithm to recognize instances of an object category. Here again, Naive Bayes plays an important role in the categorization and classification of objects. Now, the medical area. There is an increasingly voluminous amount of electronic data, which is becoming more and more complicated. The medical data produced has certain characteristics that make the analysis very challenging and attractive as well. Among all the different approaches, Naive Bayes is widely used. It is one of the most effective and efficient classification algorithms and has been successfully applied to many medical problems. An empirical comparison of Naive Bayes versus five popular classifiers on medical data sets shows that Naive Bayes is well suited for medical applications and has high performance in most of the examined medical problems. Now in the past, various statistical methods have been used for modeling in the area of disease diagnosis. These methods require prior assumptions and are less capable of dealing with massive, complicated, nonlinear, and dependent data. One of the main advantages of the Naive Bayes approach, which is appealing to physicians, is that all the available information is used to explain the decision. This explanation seems natural for medical diagnosis and prognosis; that is, it is very close to the way physicians diagnose patients. Now weather is one of the most influential factors in our daily life, to an extent that it may affect the economy of a country that depends on occupations like agriculture.
Therefore, as a countermeasure to reduce the damage caused by uncertainty in weather behavior, there should be an efficient way to predict the weather. Now weather forecasting has been a challenging problem in the meteorological department for years. Even after all the technological and scientific advancement, the accuracy of weather prediction has never been sufficient. Even in the current day this domain remains a research topic in which scientists and mathematicians are working to produce a model or an algorithm that will accurately predict the weather. Now a Bayesian-approach-based model is created, where posterior probabilities are used to calculate the likelihood of each class label for an input data instance, and the one with the maximum likelihood is considered the resulting output. Now earlier we saw a small implementation of this algorithm as well, where we predicted whether we should play or not based on the data which we had collected earlier. Now, there is a Python library known as scikit-learn. It helps to build a Naive Bayes model in Python. Now, there are three types of Naive Bayes model in the scikit-learn library. The first one is the Gaussian. It is used in classification and it assumes that the features follow a normal distribution. Next we have the multinomial. It is used for discrete counts. For example, let's say we have a text classification problem. Here we take Bernoulli trials one step further, and instead of whether a word occurs in the document, we count how often the word occurs in the document. You can think of it as the number of times an outcome is observed in the given number of trials. And finally we have the Bernoulli type of Naive Bayes. The Bernoulli model is useful if your feature vectors are binary, that is, a bag-of-words model where the ones and the zeros stand for the words which occur in the document and the words which do not occur in the document, respectively.
Depending on your data set, you can choose any of the models discussed here, which are the Gaussian, the multinomial, or the Bernoulli. So let's understand how this algorithm works and what the different steps are that one can take to create a Bayesian model and use Naive Bayes to predict the output. So here, to understand better, we are going to predict the onset of diabetes. Now this problem comprises 768 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient, such as the age, the number of times pregnant, and the blood workup. Now all the patients are women aged 21 and older, all the attributes are numeric, and the units vary from attribute to attribute. Each record has a class value that indicates whether the patient suffered an onset of diabetes within five years of the measurements. Now, these are classified as one if she did and zero if she did not. Now, I've broken the whole process down into the following steps. The first step is handling the data, in which we load the data from the CSV file and split it into training and test data sets. The second step is summarizing the data, in which we summarize the properties in the training data set so that we can calculate the probabilities and make predictions. The third step is making a particular prediction: we use the summaries of the data set to generate a single prediction. After that, we generate predictions given a test data set and a summarized training data set. Then we evaluate the accuracy of the predictions made for a test data set as the percentage correct out of all the predictions made. And finally, we tie it all together and form our own model of the Naive Bayes classifier. Now, the first thing we need to do is load our data. The data is in the CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module.
Now, we also need to convert the attributes that were loaded as strings into numbers so that we can work with them. So let me show you how this can be implemented. Now for that you need to install Python on your system and use the Jupyter notebook or the Python shell. Here I'm using the Anaconda Navigator, which has all the things required to do the programming in Python. We have JupyterLab, we have the notebook, we have the Qt console, and we even have RStudio as well. So what you need to do is just install the Anaconda Navigator. It comes with Python preinstalled. So the moment you click launch on the Jupyter notebook, it will take you to the Jupyter home page on your local system, and here you can do programming in Python. So let me just rename it as Pima Indian diabetes. So first, we need to load the data set. So I'm creating here a function, load CSV. Now before that, we need to import the csv, the math, and the random modules. So as you can see, I've created a load CSV function which will take the Pima Indian diabetes data CSV file using the csv.reader method, and then we are converting every element of that data set into a float. Originally all the elements are strings, but we need to convert them into floats for our calculation purposes. Now next we need to split the data into a training data set that Naive Bayes can use to make the predictions and a test data set that we can use to evaluate the accuracy of the model. We need to split the data set randomly into training and testing data sets, usually in the ratio of 70 to 30, but for this example I am going to use 67 and 33. Now 70 to 30 is a common ratio for testing algorithms, so you can play around with this number. So this is our split data set function. Now the Naive Bayes model is comprised of a summary of the data in the training data set. This summary is then used while making predictions.
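The loading and splitting steps described above can be sketched as follows. The notebook code itself is not reproduced in the transcript, so the function names `load_csv` and `split_dataset`, the `seed` parameter, and the three sample rows are assumptions made for illustration.

```python
import csv
import random

def load_csv(lines):
    """Parse header-less CSV rows and convert every field to float.

    `lines` can be an open file or any iterable of text lines.
    """
    return [[float(value) for value in row] for row in csv.reader(lines) if row]

def split_dataset(dataset, split_ratio, seed=None):
    """Randomly split rows into (train, test); train gets int(len * split_ratio) rows."""
    rows = list(dataset)
    random.Random(seed).shuffle(rows)
    train_size = int(len(rows) * split_ratio)
    return rows[:train_size], rows[train_size:]

# Made-up rows standing in for the real file: two attributes, then the class value.
sample_lines = ["1,20,0", "2,21,1", "3,22,0"]
dataset = load_csv(sample_lines)
print(dataset)

# With 768 rows and a ratio of 0.67 the split comes out as 514 and 254 rows.
train, test = split_dataset([[i] for i in range(768)], 0.67, seed=1)
print(len(train), len(test))
```

To read the real data, the same function can be handed an open file, for example `with open(path, newline="") as f: dataset = load_csv(f)`, where `path` points at wherever the Pima Indians diabetes CSV was saved.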
Now the summary of the training data involves the mean and the standard deviation of each attribute by class value. For example, if there are two class values and seven numerical attributes, then we need a mean and a standard deviation for each combination of the seven attributes and the two class values, which makes 14 attribute summaries. So we can break the preparation of this summary down into the following subtasks, which are separating data by class, calculating the mean, calculating the standard deviation, summarizing the data set, and summarizing attributes by class. So the first task is to separate the training data set instances by class value so that we can calculate statistics for each class. We can do that by creating a map of each class value to a list of instances that belong to that class, and sorting the entire data set of instances into the appropriate lists. Now the separate by class function does just that. So as you can see, the function assumes that the last attribute is the class value, and the function returns a map of class value to the list of data instances. Next, we need to calculate the mean of each attribute for a class value. Now, the mean is the central tendency of the data, and we use it as the middle of our Gaussian distribution when calculating the probabilities. So this is our function for the mean. Now, we also need to calculate the standard deviation of each attribute for a class value. The standard deviation is calculated as the square root of the variance, and the variance is calculated as the average of the squared differences of each attribute value from the mean. One thing to note here is that we are using the n minus one method, which subtracts one from the number of attribute values when calculating the variance. Now that we have the tools to summarize the data for a given list of instances, we can calculate the mean and standard deviation for each attribute.
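A minimal sketch of the three helpers just described, assuming (as the talk does) that the last element of each row is the class value. The names `separate_by_class`, `mean` and `stdev` are illustrative, since the notebook code is not shown in the transcript.

```python
import math

def separate_by_class(dataset):
    """Map each class value (last element of a row) to the rows that carry it."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return separated

def mean(numbers):
    """Arithmetic mean of a list of numbers."""
    return sum(numbers) / len(numbers)

def stdev(numbers):
    """Sample standard deviation: the variance divides by n - 1, then take the root."""
    avg = mean(numbers)
    variance = sum((x - avg) ** 2 for x in numbers) / (len(numbers) - 1)
    return math.sqrt(variance)

rows = [[1, 20, 1], [2, 21, 0], [3, 22, 1]]
print(separate_by_class(rows))
print(mean([1, 2, 3, 4, 5]), stdev([1, 2, 3, 4, 5]))
```

Because of the n minus one divisor, `stdev` needs at least two values per attribute and class.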
Now the zip function groups the values for each attribute across our data instances into their own lists so that we can compute the mean and standard deviation values for each attribute. Next comes summarizing attributes by class. We can pull it all together by first separating our training data set into instances grouped by class and then calculating the summaries for each attribute. Now we are ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction. Now we can divide this whole method into four tasks, which are calculating the Gaussian probability density function, calculating the class probabilities, making a prediction, and then estimating the accuracy. Now to calculate the Gaussian probability density function, we use the Gaussian function to estimate the probability of a given attribute value, given the known mean and standard deviation of the attribute estimated from the training data. As you can see, the parameters are x, the mean, and the standard deviation. Now in the calculate probability function, we calculate the exponent first and then calculate the main division. This lets us fit the equation nicely into two lines. Now, the next task is calculating the class probabilities. Now that we can calculate the probability of an attribute belonging to a class, we can combine the probabilities of all the attribute values for a data instance and come up with a probability of the entire data instance belonging to the class. So now we have calculated the class probabilities.
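The summarizing and probability steps can be sketched like this. It is a sketch under assumptions: the function names are illustrative, `statistics.mean` and `statistics.stdev` from the standard library stand in for the hand-written helpers (`statistics.stdev` also uses the n minus one formula), and, as in the description, only the per-attribute densities are multiplied together; a fuller Naive Bayes would also multiply by the prior probability of each class.

```python
import math
from statistics import mean, stdev  # stdev is the sample (n - 1) version

def summarize(rows):
    """(mean, stdev) per attribute column; zip(*rows) turns rows into columns."""
    columns = list(zip(*rows))[:-1]  # drop the last column, the class value
    return [(mean(column), stdev(column)) for column in columns]

def summarize_by_class(dataset):
    """Map each class value to the attribute summaries of its rows."""
    by_class = {}
    for row in dataset:
        by_class.setdefault(row[-1], []).append(row)
    return {label: summarize(rows) for label, rows in by_class.items()}

def gaussian_pdf(x, mu, sigma):
    """Gaussian density: compute the exponent first, then the main division."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (math.sqrt(2 * math.pi) * sigma)

def class_probabilities(summaries, input_vector):
    """Multiply the attribute densities together for every class."""
    probabilities = {}
    for label, attribute_summaries in summaries.items():
        probabilities[label] = 1.0
        for (mu, sigma), x in zip(attribute_summaries, input_vector):
            probabilities[label] *= gaussian_pdf(x, mu, sigma)
    return probabilities

# Made-up rows: two attributes followed by the class value.
toy = [[1, 20, 1], [2, 21, 0], [3, 22, 1], [4, 22, 0]]
summaries = summarize_by_class(toy)
print(summaries)
print(class_probabilities(summaries, [1.5, 20.5]))
```

With many attributes the product of densities can become very small, so implementations often sum logarithms instead; that variation is not shown here.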
It's time to finally make our first prediction. Now we can calculate the probability of the data instance belonging to each class value, and we can look for the largest probability and return the associated class. For that we are going to use this predict function, which uses the summaries and the input vector, which is basically all the attribute values that are being input for a particular instance. Now finally, we can estimate the accuracy of the model by making predictions for each data instance in our test data. For that we use the get predictions method. Now this method is used to calculate the predictions based upon the test data set and the summary of the training data set. Now, the predictions can be compared to the class values in our test data set, and the classification accuracy can be calculated as an accuracy ratio between 0 and 100 percent. Now the get accuracy method will calculate this accuracy ratio. Now finally, to sum it all up, we define our main function, where we call all these methods which we have defined earlier, one by one, to get the accuracy of the model which we have created. So as you can see, this is our main function, in which we have the file name, we have defined the split ratio, we have the data set, and we have the training and test data sets. We are using the split data set method, next we are using the summarize by class function, and then the get predictions and the get accuracy methods as well. So guys, as you can see, the output of this one shows that we are splitting the 768 rows into 514, which are the training rows, and 254, which are the test data set rows, and the accuracy of this model is 68%. Now we can play with the amount of training and test data to be used, so we can change the split ratio to 70 to 30 or 80 to 20 to get a different accuracy. So suppose I change the split ratio from 0.67 to 0.8. So as you can see, we get an accuracy of 62 percent.
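The prediction and accuracy steps can be sketched as below. The names `predict`, `get_predictions` and `get_accuracy` follow the ones mentioned in the talk, but the bodies are an assumption, and the summaries and test rows are made up. The summaries are a map from class value to a list of (mean, standard deviation) pairs, one pair per attribute.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density of x for a given mean and standard deviation."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (math.sqrt(2 * math.pi) * sigma)

def predict(summaries, input_vector):
    """Return the class whose product of attribute densities is largest."""
    best_label, best_probability = None, -1.0
    for label, attribute_summaries in summaries.items():
        probability = 1.0
        # zip stops at the shorter sequence, so a trailing class value is ignored.
        for (mu, sigma), x in zip(attribute_summaries, input_vector):
            probability *= gaussian_pdf(x, mu, sigma)
        if probability > best_probability:
            best_label, best_probability = label, probability
    return best_label

def get_predictions(summaries, test_set):
    """Predict a class for every row of the test set."""
    return [predict(summaries, row) for row in test_set]

def get_accuracy(test_set, predictions):
    """Percentage of rows whose last element matches the prediction."""
    correct = sum(1 for row, guess in zip(test_set, predictions) if row[-1] == guess)
    return correct / len(test_set) * 100.0

# Made-up summaries: one attribute, two classes.
summaries = {"A": [(1, 0.5)], "B": [(20, 5.0)]}
test_set = [[1.1, "A"], [19.1, "B"], [18.0, "A"]]
predictions = get_predictions(summaries, test_set)
print(predictions, get_accuracy(test_set, predictions))
```

The third made-up row is labelled "A" but sits much closer to the "B" mean, so it is misclassified on purpose to show an accuracy below 100 percent.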
So splitting it at 0.67 gave us a better result, which was 68 percent. So this is how you can implement a Gaussian Naive Bayes classifier. These are the step-by-step methods which you need to follow when using the Naive Bayes classifier. But don't worry, we do not need to write this many lines of code to make a model. This is where the scikit-learn library really comes into the picture. The scikit-learn library has a predefined method, or say a predefined function, for Naive Bayes, which converts all of these lines of code into merely two or three lines of code. So let me just open another Jupyter notebook. Let me name it sklearn Naive Bayes. Now here we are going to use the most famous data set, which is the iris data set. Now, the iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher, and based on Fisher's linear discriminant model, this data set became a typical test case for many statistical classification techniques in machine learning. As I mentioned earlier, there are three types of Naive Bayes, which are the Gaussian, the multinomial, and the Bernoulli. So here we are going to use the GaussianNB model, which is already present in the sklearn library, which is the scikit-learn library. So first of all, what we need to do is import the sklearn datasets and the metrics, and we also need to import GaussianNB. Now once all these libraries are loaded, we need to load the data set, which is the iris data set. Next, what we need to do is fit a Naive Bayes model to this data set. So as you can see, we have so easily defined the model, which is GaussianNB, which contains all the programming which I just showed you earlier: all the methods which take the input, calculate the mean and the standard deviation, separate it by class, and finally make predictions and calculate the prediction accuracy.
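A minimal sketch of the scikit-learn version, assuming scikit-learn is installed. It covers the import, load and fit steps, and then evaluates on the same data the model was fitted on, so the scores are training scores rather than held-out ones. The variable names are illustrative.

```python
from sklearn import datasets, metrics
from sklearn.naive_bayes import GaussianNB

# Load the iris data set that ships with scikit-learn.
dataset = datasets.load_iris()

# Fit a Gaussian Naive Bayes model to the features and class labels.
model = GaussianNB()
model.fit(dataset.data, dataset.target)

# Compare the known labels with the model's predictions on the same rows.
expected = dataset.target
predicted = model.predict(dataset.data)

print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```

For an honest estimate of accuracy, the data would normally be split into training and test parts first, as was done by hand for the diabetes example.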
All of this comes under the GaussianNB method, which is already present in the sklearn library. We just need to fit it to the data set which we have. So next, if we print the model, we see that it is the GaussianNB model. The next thing we need to do is make the predictions. So the expected output is dataset.target, and the predicted output comes from the model's predict method, and the model we are using here is GaussianNB. Now to summarize the model which we created, we calculate the confusion matrix and the classification report. So guys, as you can see in the classification report, we have a precision of 0.96, we have a recall of 0.96, we have the F1 score and the support, and finally if we print our confusion matrix, as you can see, it gives us this output. So as you can see, using the GaussianNB method, fitting the model which you created to a particular data set, and getting the desired output is so easy with the scikit-learn library. So guys, this is it. I hope you understood a lot about the Naive Bayes classifier, how it is used, where it is used, what the different steps involved in the classification technique are, and how scikit-learn makes all of those techniques very easy to implement on any data set which we have.
SVM, or support vector machine, is one of the most effective machine learning classifiers, and it has been used in various fields such as face recognition, cancer classification, and so on. Today's session is dedicated to how SVM works, the various features of SVM, and how it is used in the real world. So without any further ado, let's take a look at the agenda for today. We're going to begin the session with an introduction to machine learning and the different types of machine learning.
Next we'll discuss what exactly support vector machines are, and then we'll move on and see how SVM works and how it can be used to classify linearly separable data. We'll also briefly discuss how nonlinear SVMs work, and then we'll move on and look at the use case of SVM in colon cancer classification, and finally we'll end the session by running a demo where we'll use SVM to predict whether a patient is suffering from a heart disease or not. Okay, so that was the agenda. Let's get started with our first topic. So what is machine learning? Machine learning is the science of getting computers to act by feeding them data and letting them learn a few tricks on their own. Okay, we're not going to explicitly program the machine; instead, we're going to feed it data and let it learn. The key to machine learning is the data. Machines learn just like us humans. We humans need to collect information and data to learn; similarly, machines must also be fed data in order to learn and make decisions. Let's say that you want a machine to predict the value of a stock. All right, in such situations you just feed the machine with relevant data, after which you develop a model which is used to predict the value of the stock. Now one thing to keep in mind is that the more data you feed the machine, the better it will learn and the more accurate its predictions will be. Obviously machine learning is not so simple. In order for a machine to analyze and get useful insights from data, it must process and study the data by running different algorithms on it. All right, and today we'll be discussing one of the most widely used algorithms, called the support vector machine. Okay, now that you have a brief idea about what machine learning is, let's look at the different ways in which machines learn. First, we have supervised learning. In this type of learning, the machine learns under guidance. All right, that's why it's called supervised learning.
At school, our teachers guided us and taught us. Similarly, in supervised learning, machines learn by being fed labeled data, explicitly telling them, hey, this is the input and this is how the output must look. Okay, so guys, the teacher in this case is the training data. Next we have unsupervised learning. Here the data is not labeled and there is no guide of any sort. Okay, the machine must figure out the data set given and must find hidden patterns in order to make predictions about the output. An analogy for unsupervised learning is adults like you and me. We don't need a guide to help us with our daily activities. We figure things out on our own without any supervision. All right, that's exactly how unsupervised learning works. Finally, we have reinforcement learning. Let's say you were dropped off at an isolated island. What would you do? Now initially you would panic and you'd be unsure of what to do, where to get food from, how to live, and all of that. But after a while you will have to adapt. You must learn how to live on the island, adapt to the changing climate, and learn what to eat and what not to eat. You're basically following the hit-and-trial method, because you're new to the surroundings and the only way to learn is to experience and then learn from your experience. This is exactly what reinforcement learning is. It is a learning method wherein an agent interacts with its environment by producing actions and discovers errors or rewards. All right, and once it gets trained, it is ready to predict on the new data presented to it. Now in our case the agent was you, basically stuck on the island, and the environment was the island. All right? Okay, now let's move on and see what the SVM algorithm is all about.
So guys, SVM, or support vector machine, is a supervised learning algorithm which is mainly used to classify data into different classes. Now unlike most algorithms, SVM makes use of a hyperplane which acts like a decision boundary between the various classes. In general, SVM can be used to generate multiple separating hyperplanes so that the data is divided into segments. Okay, and each of these segments will contain only one kind of data. It's mainly used for classification purposes, wherein you want to classify your data into two different segments depending on the features of the data. Now before moving any further, let's discuss a few features of SVM. Like I mentioned earlier, SVM is a supervised learning algorithm. This means that SVM trains on a set of labeled data. SVM studies the labeled training data and then classifies any new input data depending on what it learned in the training phase. A main advantage of the support vector machine is that it can be used for both classification and regression problems. All right, now even though SVM is mainly known for classification, the SVR, which is the support vector regressor, is used for regression problems. All right, so SVM can be used both for classification and for regression. Now, this is one of the reasons why a lot of people prefer SVM, because it's a very good classifier and along with that it is also used for regression. Another feature is the SVM kernel functions. SVM can be used for classifying nonlinear data by using the kernel trick. The kernel trick basically means to transform your data into another dimension so that you can easily draw a hyperplane between the different classes of the data. All right, nonlinear data is basically data which cannot be separated with a straight line. All right, so SVM can even be used on nonlinear data sets. You just have to use kernel functions to do this. All right, so guys, I hope you all are clear with the basic concepts of SVM.
Let's move on and look at how SVM works. So guys, in order to understand how SVM works, let's consider a small scenario. Now for a second, pretend that you own a farm. Okay, and let's say that you have a problem and you want to set up a fence to protect your rabbits from a pack of wolves. Okay, but where do you build your fence? One way to get around the problem is to build a classifier based on the position of the rabbits and wolves in your pasture. So what I'm telling you is you can classify the group of rabbits as one group and draw a decision boundary between the rabbits and the wolves. All right, so if I do that, and if I try to draw a decision boundary between the rabbits and the wolves, it looks something like this. Okay, now you can clearly build a fence along this line. In simple terms, this is exactly how SVM works. It draws a decision boundary, which is a hyperplane, between any two classes in order to separate them or classify them. Now, I know you're thinking, how do you know where to draw a hyperplane? The basic principle behind SVM is to draw a hyperplane that best separates the two classes, in our case the two classes of the rabbits and the wolves. So you start off by drawing a random hyperplane, and then you check the distance between the hyperplane and the closest data points from each class. These closest, or nearest, data points to the hyperplane are known as support vectors, and that's where the name comes from, support vector machine. So basically the hyperplane is drawn based on these support vectors. So guys, an optimum hyperplane will have a maximum distance from each of these support vectors. All right, so basically the hyperplane which has the maximum distance from the support vectors is the most optimal hyperplane, and this distance between the hyperplane and the support vectors is known as the margin. All right.
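The idea of picking the hyperplane with the largest margin can be illustrated with a small sketch. Everything here is made up for illustration: the rabbit and wolf coordinates, the three candidate lines, and the function names. A real SVM does not compare a handful of candidates; it solves an optimization problem to find the maximum-margin hyperplane directly.

```python
import math

def margin(w, b, points):
    """Smallest distance from any point to the line w[0]*x + w[1]*y + b = 0."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)

def separates(w, b, class_a, class_b):
    """True if the two classes fall on opposite sides of the line."""
    side_a = [w[0] * x + w[1] * y + b for x, y in class_a]
    side_b = [w[0] * x + w[1] * y + b for x, y in class_b]
    return (all(s < 0 for s in side_a) and all(s > 0 for s in side_b)) or \
           (all(s > 0 for s in side_a) and all(s < 0 for s in side_b))

# Made-up positions on the farm.
rabbits = [(1, 1), (2, 1), (1, 2)]
wolves = [(4, 4), (5, 4), (4, 5)]

# Three hypothetical fences, each written as (w, b) for w[0]*x + w[1]*y + b = 0.
candidates = [((1, 1), -5.0), ((1, 1), -5.5), ((1, 0), -3.5)]

# Keep only fences that separate the classes, then take the widest margin.
valid = [(w, b) for w, b in candidates if separates(w, b, rabbits, wolves)]
best = max(valid, key=lambda wb: margin(wb[0], wb[1], rabbits + wolves))
for w, b in valid:
    print(w, b, round(margin(w, b, rabbits + wolves), 3))
print("best fence:", best)
```

The points that attain the smallest distance to the chosen line play the role of the support vectors in this picture.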
So to sum it up, SVM is used to classify data by using a hyperplane such that the distance between the hyperplane and the support vectors is maximum. So basically your margin has to be maximum. All right, that way you know that you're actually separating your classes well, because the distance between the two classes is maximum. Okay, now let's try to solve a problem. So let's say that I input a new data point. Okay, this is a new data point, and now I want to draw a hyperplane such that it best separates the two classes. Okay, so I start off by drawing a hyperplane like this, and then I check the distance between the hyperplane and the support vectors. Okay, so I'm trying to check if the margin is maximum for this hyperplane. But what if I draw a hyperplane which is like this? All right, now I'm going to check the support vectors over here. Then I'm going to check the distance from the support vectors, and with this hyperplane it's clear that the margin is more, right? When you compare the margin of the previous one to this hyperplane, it is more. So the reason why I'm choosing this hyperplane is because the distance between the support vectors and the hyperplane is maximum in this scenario. Okay, so guys, this is how you choose a hyperplane. You basically have to make sure that the hyperplane has a maximum margin. All right, it has to best separate the two classes. All right, okay, so far it was quite easy. Our data was linearly separable, which means that you could draw a straight line to separate the two classes. All right, but what will you do if the data set is like this? You possibly can't draw a hyperplane like this. All right, it doesn't separate the two classes at all. So what do you do in such situations? Now earlier in the session I mentioned how a kernel can be used to transform data into another dimension that has a clear dividing margin between the classes of data.
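One way to picture transforming data into another dimension is to add a third coordinate computed from the first two. The sketch below uses the squared distance from the origin, x squared plus y squared, as that extra coordinate. That choice, the function name `lift`, and the sample points are assumptions made for illustration; the talk does not say which transformation is used. Note also that real kernel functions achieve the same effect implicitly, without building the new coordinates explicitly.

```python
def lift(point):
    """Map a 2D point (x, y) to 3D by adding z = x^2 + y^2 (an assumed mapping)."""
    x, y = point
    return (x, y, x * x + y * y)

# Made-up data: one class sits near the origin, the other forms a ring around it,
# so no straight line in the x-y plane can separate them.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.4, 0.3), (0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.6, 1.2)]

inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]
print(max(inner_z), min(outer_z))

# In the lifted space a flat plane such as z = 2 separates the two classes.
plane_z = 2.0
print(all(z < plane_z for z in inner_z) and all(z > plane_z for z in outer_z))
```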
All right, so kernel functions offer the user this option of transforming nonlinear spaces into linear ones. A nonlinear data set is one that you can't separate using a straight line. All right, in order to deal with such data sets, you're going to transform them into linear data sets and then use SVM on them. So guys, so far we were plotting our data on a two-dimensional space, correct? We were only using the X and the Y axes, so we had only those two variables, X and Y. Now in order to deal with this kind of data, a simple trick would be to transform the two variables X and Y into a new feature space involving a new variable called Z. Okay, so we're basically visualizing the data on a three-dimensional space. Now when you transform the 2D space into a 3D space, you can clearly see a dividing margin between the two classes of data, right? Now you can go ahead and separate the two classes by drawing the best hyperplane between them. Okay, that's exactly what we discussed in the previous slides. So guys, why don't you try this yourself? Try drawing the hyperplane which is the most optimum for these two classes. All right, so guys, I hope you have a good understanding of nonlinear SVMs now. Let's look at a real-world use case of support vector machines. So guys, SVM as a classifier has been used in cancer classification since the early 2000s. So there was an experiment held by a group of professionals who applied SVM to colon cancer tissue classification. The data set consisted of about 2,000 transmembrane protein samples, and only about 50 to 200 gene samples were input into the SVM classifier. Now this sample which was input into the SVM classifier had both colon cancer tissue samples and normal colon tissue samples, right? Now the main objective of this study was to classify gene samples based on whether they are cancerous or not.
Okay, so SVM was trained using the 50 to 200 samples in order to discriminate non-tumor from tumor specimens. The performance of the SVM classifier was very accurate even for a small data set. All right, we had only 50 to 200 samples, and even for this small data set SVM was pretty accurate with its results. Not only that, its performance was compared to other classification algorithms like Naive Bayes, and in each case SVM outperformed Naive Bayes. So after this experiment it was clear that SVM classified the data more effectively and worked exceptionally well with small data sets.
Let's go ahead and understand what exactly unsupervised learning is. Sometimes the given data is unstructured and unlabeled, so it becomes difficult to classify the data into different categories. Unsupervised learning helps to solve this problem. This learning is used to cluster the input data into classes on the basis of their statistical properties. For example, we can cluster different bikes based upon their speed limit, their acceleration, or the average that they give. So unsupervised learning is a type of machine learning algorithm used to draw inferences from data sets consisting of input data without labeled responses. So if you have a look at the workflow, or the process flow, of unsupervised learning, the training data is a collection of information without any labels. We have the machine learning algorithm, and then we have the clustering models. What it does is distribute the data into different clusters, and again, if you provide any new unlabeled data, it will make a prediction and find out which cluster that particular data point belongs to. So one of the most important algorithms in unsupervised learning is clustering. So let's understand exactly what clustering is. Clustering basically is the process of dividing the data set into groups consisting of similar data points.
It means grouping of objects based on the information found in the data describing the objects or their relationships. So clustering models focus on identifying groups of similar records and labeling records according to the group to which they belong. Now this is done without the benefit of prior knowledge about the groups and their characteristics. In fact, we may not even know exactly how many groups there are to look for. Now these models are often referred to as unsupervised learning models, since there's no external standard by which to judge the model's classification performance. There are no right or wrong answers for these models. And if we talk about why clustering is used, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Sometimes the partitioning itself is the goal, or the purpose of the clustering algorithm is to make sense of and extract value from large sets of structured and unstructured data. So that is why clustering is used in the industry. And if you have a look at the various use cases of clustering in industry, first of all, it's being used in marketing: discovering distinct groups in customer databases, such as customers who make a lot of long-distance calls or customers who use the internet more than calls. It's also used by insurance companies for identifying groups of corporation insurance policy holders with a high average claim rate, and for farmers' cash crops, to find which is profitable. It's used in seismic studies to define probable areas of oil or gas exploration based on seismic data. And it's also used in the recommendation of movies, it's used for Flickr photos, and it's used by Amazon for recommending products and finding which category a product lies in. So basically, if we talk about clustering, there are three types of clustering.
So first of all, we have exclusive clustering, which is hard clustering. Here an item belongs exclusively to one cluster, not several clusters; each data point belongs exclusively to one cluster. An example of this is k-means clustering; k-means does this exclusive kind of clustering. Secondly, we have overlapping clustering, which is also known as soft clustering. In this, an item can belong to multiple clusters, and its degree of association with each cluster is shown. For example, we have fuzzy or c-means clustering, which is used for overlapping clustering. And finally we have hierarchical clustering. When two clusters have a parent-child relationship or a tree-like structure, it is known as hierarchical clustering. As you can see here from the example, we have a parent-child kind of relationship in the clusters given here. So let's understand what exactly k-means clustering is. K-means clustering is an algorithm whose main goal is to group similar elements or data points into a cluster. It is a process by which objects are classified into a predefined number of groups so that they are as dissimilar as possible from one group to another group, but as similar as possible within each group. Now if you have a look at how the algorithm works here: first of all, it starts with identifying the number of clusters, which is K. Then we find the centroids, and we find the distance of each object to the centroids. Then we find the grouping based on the minimum distance. Has the centroid converged? If true, then we make the clusters; if false, we again find the centroids and repeat all of the steps again and again. So let me show you how exactly clustering works with an example here.
So first we need to decide the number of clusters to be made. Now another important task here is how to decide the number of clusters; we'll get into that later. So first, let's assume that the number of clusters we have decided on is three. After that, we provide the centroids for all the clusters, which is guessing, and the algorithm calculates the Euclidean distance of each point from each centroid and assigns the data point to the closest cluster. Now Euclidean distance, as all of you know, is the square root of the sum of the squared differences between the coordinates. Next, the centroids are calculated again, so we have new centroids for each cluster. Then again the distances from the points to the new centroids are calculated, and again the points are assigned to the closest cluster. And then again we have the new centroids calculated. Now these steps are repeated until we have a repetition, that is, the new centroids are very close to the previous ones. So unless our output gets repeated or the outputs are very, very close, we do not stop this process. We keep on calculating the Euclidean distance of all the points to the centroids, then we calculate the new centroids, and that is how k-means clustering works, basically. So an important part here is to understand how to decide the value of K, or the number of clusters, because it does not make any sense if you do not know how many clusters you are going to make. To decide the number of clusters, we have the elbow method. So first of all, compute the sum of squared errors, which is SSE, for some values of K, for example 2, 4, 6 and 8. Now the SSE, which is the sum of squared errors, is defined as the sum of the squared distance between each member of a cluster and its centroid. Mathematically, it is given by the equation which is provided here.
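The loop just described (guess centroids, assign each point to the nearest one, recompute the centroids, repeat until they stop moving) can be sketched in a few lines of NumPy. This is a minimal illustration and not the code from the session: the function name `kmeans`, the tolerance `tol`, and the synthetic blob data are all assumptions made for the example. The SSE it returns follows the definition above, SSE = Σ over clusters j of Σ over points x in cluster j of ||x − c_j||².

```python
import numpy as np

def kmeans(points, k, n_iters=100, tol=1e-6, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initial guess: pick k distinct data points as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Euclidean distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of its assigned points (keep the old one if a cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        moved = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if moved < tol:  # centroids repeat (or nearly repeat), so stop
            break
    # Final assignment and SSE: squared distance of each point to its own centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sse = float(np.sum(dists[np.arange(len(points)), labels] ** 2))
    return centroids, labels, sse

# Synthetic data: three well-separated blobs (made up for this sketch).
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 2))
    for center in [(0, 0), (5, 5), (0, 5)]
])
centroids, labels, sse = kmeans(blobs, k=3)
print(centroids)
print("SSE:", sse)
```

Because the starting centroids are random, a different `seed` can give a different final clustering, which is why running k-means several times is usually recommended.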
And if you plot K against the SSE, you will see that the error decreases as K gets larger. This is because as the number of clusters increases, the clusters should be smaller, so the distortion is also smaller. Now, the idea of the elbow method is to choose the K at which the decrease in SSE slows down abruptly. So for example, if we have a look at the figure given here, we see that the best number of clusters is at the elbow. As you can see, the graph changes abruptly after the number four. So for this particular example, we're going to use four as the number of clusters. Now while working with k-means clustering there are two key points to know. First of all, be careful about where you start: choose the first center at random, choose the second center far away from the first center, and similarly choose the nth center as far away as possible from the closest of the other centers. The second idea is to do many runs of k-means, each with different random starting points, so that you get an idea of how many clusters you need to make, where exactly the centroids lie, and how the data is getting converged. Now k-means is not exactly a very good method, so let's understand the pros and cons of k-means clustering. We know that k-means is simple and understandable; everyone gets it in the first go. The items are automatically assigned to the clusters. Now if we have a look at the cons, first of all, one needs to define the number of clusters, and that's a very heavy task. Whether we have three, four, or ten categories, if you do not know what the number of clusters is going to be, it's very difficult for anyone to guess it. Next, all the items are forced into clusters, whether or not they actually belong to any cluster or category.
They are forced to lie in the category to which they are closest. This again happens because of not defining the correct number of clusters, or not being able to guess the correct number of clusters. And most of all, it's unable to handle noisy data and outliers. Machine learning engineers and data scientists have to clean the data anyway, but then again it comes down to the analysis they're doing and the method they're using. Typically people do not clean the data for k-means clustering, or even if they clean it, there's sometimes noisy data and outliers which affect the whole model. So that was all for k-means clustering. What we're going to do now is use k-means clustering on the movie data set, so we have to find out the number of clusters and divide it accordingly. The use case is that we have a data set of five thousand movies, and what we want to do is group the movies into clusters based on their Facebook likes. So guys, let's have a look at the demo here. First of all, what we're going to do is import deepcopy, NumPy, pandas, Seaborn, the various libraries which we're going to use, and from matplotlib we're going to use pyplot, and we're going to use the ggplot style. Next, what we're going to do is import the data set and look at its shape. If you have a look at the shape of the data set, we can see that it has 5,043 rows with 28 columns. And if you have a look at the head of the data set, we can see its 5,043 data points. So what we're going to do is place the data points in the plot. We have a look at the data columns: face number in poster, cast total Facebook likes, director Facebook likes. So what we have done here now is take the director Facebook likes and the actor 3 Facebook likes, right?
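The "find out the number of clusters and divide it accordingly" step can be sketched with scikit-learn on stand-in data. This is only an illustration under stated assumptions: the array `likes` is synthetic (four made-up blobs standing in for two like-count columns), not the movie data set from the demo, and the final K of 4 is what suits this synthetic data rather than the value used in the session.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for two like-count columns (NOT the real movie data).
rng = np.random.default_rng(0)
likes = np.vstack([
    rng.normal(loc=center, scale=40.0, size=(100, 2))
    for center in [(100, 100), (600, 150), (300, 700), (900, 800)]
])

# Elbow method: fit k-means for several K and record the SSE for each.
# scikit-learn exposes the SSE of a fitted model as the `inertia_` attribute.
sse_by_k = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(likes)
    sse_by_k[k] = model.inertia_

for k, sse in sse_by_k.items():
    print(k, round(sse))

# After reading the elbow off the printed values (or a plot), fit the final model.
final = KMeans(n_clusters=4, n_init=10, random_state=0).fit(likes)
print(final.cluster_centers_)  # one row per cluster center
print(final.labels_[:10])      # cluster index assigned to the first ten points
```

Plotting `sse_by_k` against K gives the elbow curve; the `n_init=10` argument also covers the earlier advice to run k-means several times from different starting points and keep the best run.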
So we have 5,043 rows and two columns. Now we're going to use k-means from sklearn. First we're going to import KMeans from sklearn.cluster. Remember guys, sklearn is a very important library in Python for machine learning. For the number of clusters, what we're going to do is provide five. Now again, the number of clusters depends upon the SSE, which is the sum of squared errors, or we can use the elbow method, so I'm not going to go into the details of that again. We're going to fit the data with KMeans.fit, and then we find the cluster centers for the k-means and print them. What we find is an array of five cluster centers, and we print the labels of the k-means clusters. Next, what we're going to do is plot the data with the new clusters which we have found, and for this we're going to use Seaborn. As you can see here, we have plotted the data into the grid, and you can see we have five clusters. So probably what I would say is that cluster 3 and cluster 0 are very, very close. See, that's exactly what I was going to say: the main challenge in k-means clustering is to define the number of centers, which is K. As you can see here, the third cluster and the zeroth cluster are very, very close to each other, so guys, they probably could have been one cluster. Another disadvantage was that we do not exactly know how the points are to be arranged, so it's very difficult to force the data into any other cluster, which makes our analysis a little difficult. It works fine, but sometimes it might be difficult to code the k-means clustering. Now, let's understand what exactly c-means clustering is. So fuzzy c-means is an extension of k-means clustering, which is popular and simple.
As a clustering technique, fuzzy clustering, also referred to as soft clustering, is a form of clustering in which each data point can belong to more than one cluster. K-means tries to find hard clusters, where each point belongs to one cluster, whereas fuzzy c-means discovers soft clusters. In a soft cluster, any point can belong to more than one cluster at a time, with a certain affinity value towards each. Fuzzy c-means assigns a degree of membership, which ranges from 0 to 1, of an object to a given cluster. There is a stipulation that the sum of the memberships of an object to all the clusters it belongs to must be equal to 1. So say the degrees of membership of this particular point to two of these clusters are 0.6 and 0.4; if you add them up, we get 1. That is one piece of the logic behind fuzzy c-means. And this affinity depends on the distance from the point to the center of a cluster: the closer the point, the higher the membership. Now then again we have the pros and cons of fuzzy c-means. First of all, it allows a data point to be in multiple clusters; that's a pro. It's a more natural representation of the behavior of genes; genes usually are involved in multiple functions, so it is a very good type of clustering when we're talking about genes. And if we talk about the cons, again we have to define c, which is the number of clusters, same as K. Next, we need to determine the membership cutoff value also, so that takes a lot of time. And the clusters are sensitive to the initial assignment of centroids, so a slight change or deviation in the centers is going to result in a very different kind of output from fuzzy c-means. And one of the major disadvantages of c-means clustering is that it's a non-deterministic algorithm, so it does not give you one particular output as such. So that's that. Now let's have a look at the third type of clustering, which is hierarchical clustering.
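Before moving on, the fuzzy c-means membership idea above (values between 0 and 1 that add up to 1 for each point, higher for nearer centers) can be made concrete with a short sketch. It uses the standard fuzzy c-means membership formula with a fuzziness parameter `m`; the session does not show a formula, so the formula choice, the function name `fuzzy_memberships`, and the sample points are assumptions for illustration. It computes memberships for fixed centers only and is not a full fuzzy c-means implementation.

```python
import numpy as np

def fuzzy_memberships(points, centers, m=2.0):
    """Membership of each point in each cluster for fixed centers (fuzziness m > 1)."""
    # Distance from every point to every center, shape (n_points, n_clusters).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    dists = np.fmax(dists, 1e-12)  # avoid division by zero if a point sits on a center
    # u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))
    ratio = dists[:, :, None] / dists[:, None, :]
    return 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)

centers = np.array([[0.0, 0.0], [10.0, 0.0]])
points = np.array([[1.0, 0.0], [5.0, 0.0], [9.0, 0.0]])
u = fuzzy_memberships(points, centers)
print(u.round(3))       # one row per point, one column per cluster
print(u.sum(axis=1))    # each row sums to 1
```

The middle point is equally far from both centers, so it gets a membership of 0.5 in each cluster, while the points near a center belong mostly, but not entirely, to that center's cluster.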
So hierarchical clustering is an alternative approach which builds a hierarchy from the bottom up or from the top down, and does not require us to specify the number of clusters beforehand. Now, the algorithm works like this: first of all, we put each data point in its own cluster, identify the closest two clusters and combine them into one cluster, and repeat the above step till all the data points are in a single cluster. Now, there are two types of hierarchical clustering: one is agglomerative clustering and the other one is divisive clustering. Agglomerative clustering builds the dendrogram from the bottom level, while divisive clustering starts with all the data points in one cluster, the root cluster. Now again, hierarchical clustering also has some pros and cons. In the pros, no assumption of a particular number of clusters is required, and it may correspond to meaningful taxonomies. Whereas if we talk about the cons, once a decision is made to combine two clusters, it cannot be undone, and one of the major disadvantages of hierarchical clustering is that it becomes very slow on very, very large data sets. Nowadays I think every industry is using large data sets and collecting large amounts of data, so hierarchical clustering is not the apt or the best method someone might go for. So there's that. Hello everyone, and welcome to this interesting session on the Apriori algorithm. Now many of us have visited retail shops such as Walmart or Target for our household needs. Well, let's say that we are planning to buy a new iPhone from Target. What we would typically do is search for the model by visiting the mobile section of the store, then select the product and head towards the billing counter. But in today's world, the goal of the organization is to increase revenue. Can this be done by just pitching one product at a time to the customer?
The answer to this is clearly no. Hence organizations began mining data relating to frequently bought items. So market basket analysis is one of the key techniques used by large retailers to uncover associations between items. Examples could be: customers who purchase bread have a 60 percent likelihood of also purchasing jam; customers who purchase laptops are more likely to purchase laptop bags as well. They try to find associations between different items and products that can be sold together, which assists in the right product placement. Typically, it figures out what products are being bought together, and organizations can place products accordingly. For example, people who buy bread also tend to buy butter, right? And the marketing team at retail stores should target customers who buy bread and butter and provide an offer to them so that they buy a third item, suppose eggs. So if a customer buys bread and butter and sees a discount offer on eggs, he will be encouraged to spend more and buy the eggs, and this is what market basket analysis is all about. This is what we are going to talk about in this session, which is association rule mining and the Apriori algorithm. Now an association rule can be thought of as an if-then relationship. Just to elaborate on that, we have come up with a rule: suppose an item A is being bought by the customer, then the chance of item B being picked by the customer too under the same transaction ID is found out. You need to understand here that it's not causality; rather, it's a co-occurrence pattern that comes to the fore. Now, there are two elements to this rule: first the if, and second the then. The if is also known as the antecedent; this is an item or a group of items that are typically found in the item set. And the latter one is called the consequent; this comes along as an item with an antecedent or a group of antecedents in a purchase.
Now if we look at the image here, A arrow B, it means that if a person buys an item A, then he will also buy an item B, or he will most probably buy an item B. Now the simple example that I gave you about the bread and butter and the eggs is just a small example, but what if you have thousands and thousands of items? If you go to any professional data scientist with that data, you can just imagine how much profit you can make if the data scientist provides you with the right examples and the right placement of the items, and you can get a lot of insights. That is why association rule mining is a very good algorithm which helps the business make a profit. So let's see how this algorithm works. Association rule mining is all about building the rules, and we have just seen one rule: if you buy A, then there is a chance that you might buy B also. This type of relationship, in which we find the relationship between these two items, is known as single cardinality. But what if the customer who bought A and B also wants to buy C, or a customer who bought A, B and C also wants to buy D? In these cases the cardinality increases, and we can have a lot of combinations around this data. If you have 10,000 or more items, just imagine how many rules you're going to create for each product. That is why association rule mining has measures, so that we do not end up creating tens of thousands of rules. Now that is where the Apriori algorithm comes in. But before we get into the Apriori algorithm, let's understand the maths behind it. Now there are three types of metrics which help to measure the association: we have support, confidence and lift. Support is the frequency of item A or of the combination of items A and B. It's basically how frequently the items, and the combinations of items, have been bought.
So with this, what we can do is filter out the items which have been bought less frequently. This is one of the measures, which is support. Now what does confidence tell us? Confidence gives us how often the items A and B occur together, given the number of times A occurs. Now this also helps us solve a lot of other problems, because if somebody is buying A and B together and not buying C, we can just rule out C at that point of time. So this solves another problem: we obviously do not need to analyze the products which people buy only rarely. So what we can do is, according to the sales, define our minimum support and confidence, and when you have set these values, we can put them in the algorithm, filter out the data, and create different rules. And suppose even after filtering you have, like, five thousand rules, and for every item we create these 5,000 rules; that's practically impossible. So for that we need the third calculation, which is the lift. Lift is basically the strength of any rule. Now, let's have a look at the denominator of the formula given here. If you see here, we have the independent support values of A and B, so this gives us the independent occurrence probability of A and B. And obviously there's a lot of difference between random co-occurrence and association. If the denominator of the lift is larger, what it means is that the co-occurrence is due more to randomness than to any association. So lift is the final verdict, where we know whether we have to spend time on this particular rule or not. Now, let's have a look at a simple example of association rule mining. Suppose we have a set of items A, B, C, D and E and a set of transactions T1, T2, T3, T4 and T5. As you can see here, we have the transactions: T1 has A, B, C; T2 has A, C, D; T3 has B, C, D; T4 has A, D, E; and T5 has B, C, E. Now what we generally do is create some rules:
Association rules such as A gives D, C gives A, A gives C, and B and C give A. What this basically means is that if a person buys A, then he's most likely to buy D, and if a person buys C, then he's most likely to buy A. And if you have a look at the last one, if a person buys B and C, he's most likely to buy the item A as well. Now we calculate the support, confidence and lift for these rules; as you can see here in the table, we have the rules and the support, confidence and lift values. Let's discuss Apriori. So the Apriori algorithm uses frequent item sets to generate the association rules, and it is based on the concept that a subset of a frequent item set must also be a frequent item set itself. Now this raises the question: what exactly is a frequent item set? A frequent item set is an item set whose support value is at least a threshold value. Just now we discussed that the marketing team, according to the sales, has a minimum threshold value for the confidence as well as the support. So a frequent item set is an item set whose support value is at least the threshold value already specified. For example, if {A, B} is a frequent item set, then A and B should also be frequent item sets individually. Now, let's consider the following transactions to make things easier. Suppose we have transactions 1, 2, 3, 4, 5 and these are the items: T1 has 1, 3 and 4; T2 has 2, 3 and 5; T3 has 1, 2, 3, 5; T4 has 2 and 5; and T5 has 1, 3 and 5. Now the first step is to build a list of item sets of size 1 by using this transactional data. One thing to note here is that the minimum support count which is given here is 2. So the first step is to create item sets of size 1 and calculate their support values. As you can see here, we have the table C1, in which we have the item sets 1, 2, 3, 4, 5 and the support values. If you remember the formula of support, it was frequency divided by the total number of transactions.
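The support, confidence and lift values for the four example rules over T1 to T5 (the lettered transactions above) can be checked with a small sketch. The function names are made up for this illustration; the formulas are the ones described in the session: support is the fraction of transactions containing the items, confidence is support(A and B) divided by support(A), and lift is support(A and B) divided by support(A) times support(B).

```python
# The five lettered transactions from the example.
transactions = {
    "T1": {"A", "B", "C"},
    "T2": {"A", "C", "D"},
    "T3": {"B", "C", "D"},
    "T4": {"A", "D", "E"},
    "T5": {"B", "C", "E"},
}

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions.values() if items <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears in transactions that contain the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how often the consequent occurs on its own."""
    return confidence(antecedent, consequent) / support(consequent)

# Each rule is (antecedent, consequent); "BC" means the item set {B, C}.
rules = [("A", "D"), ("C", "A"), ("A", "C"), ("BC", "A")]
for lhs, rhs in rules:
    print(f"{sorted(lhs)} -> {sorted(rhs)}: "
          f"support={support(lhs + rhs):.2f}, "
          f"confidence={confidence(lhs, rhs):.2f}, "
          f"lift={lift(lhs, rhs):.2f}")
```

A lift above 1 (as for A gives D) suggests the items occur together more often than chance would predict, while a lift below 1 (as for C gives A) suggests the opposite.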
So as you can see here, for the item set {1} the support is 3, as item 1 appears in T1, T3 and T5, so its frequency is 3. Now the item set {4} has a support of 1, as it occurs only once, in transaction 1, but the minimum support value is 2; that's why it's going to be eliminated. So we have the final table, which is the table F1, in which we have the item sets 1, 2, 3 and 5 and the support values 3, 3, 4 and 4. Now the next step is to create item sets of size 2 and calculate their support values. All the combinations of the item sets in F1, which is the final table in which we discarded the 4, are going to be used for this iteration. So we get the table C2. As you can see here, we have {1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5} and {3, 5}. Now if we calculate the support here again, we can see that the item set {1, 2} has a support of 1, which is again less than the specified threshold, so we're going to discard that. So if we have a look at the table F2, we have {1, 3}, {1, 5}, {2, 3}, {2, 5} and {3, 5}. Again, we're going to move forward and create the item sets of size 3 and calculate their support values. All the combinations from the item sets in F2 are going to be used for this particular iteration. Now before calculating support values, let's perform pruning on the data set. Now what is pruning? After the combinations are made, we divide the C3 item sets into subsets to check if there is any subset whose support is less than the minimum support value. That is what a frequent item set means. So if you have a look here, the first item set we have is {1, 2, 3}, with the subsets {1, 2}, {1, 3} and {2, 3}. As you can see, among the subsets of {1, 2, 3} we have {1, 2}, which was not frequent, so we are going to discard this whole item set. Same goes for the second one: we have {1, 2, 5}, and we have {1, 2} in that, which was discarded in the previous step.
That's why we're going to discard that also, which leaves us with only two item sets, which are {1, 3, 5} and {2, 3, 5}, and the support for these is 2 and 2 as well. Now if we create the table C4 using four elements, we're going to have only one item set, which is {1, 2, 3, 5}, and if you have a look at the transaction table, {1, 2, 3, 5} appears only once. So the support is 1, and since the support of the whole table C4 is less than 2, we're going to stop here and return to the previous item sets, that is, C3. So the frequent item sets are {1, 3, 5} and {2, 3, 5}. Now let's assume our minimum confidence value is 60 percent. For that, we're going to generate all the non-empty subsets for each frequent item set. For I = {1, 3, 5}, which is the item set, we get the subsets {1, 3}, {1, 5}, {3, 5}, {1}, {3} and {5}. Similarly for {2, 3, 5} we get {2, 3}, {2, 5}, {3, 5}, {2}, {3} and {5}. Now the rule states that for every subset S of I, the output of the rule is S gives I minus S, that is, S recommends I minus S, and this is only possible if the support of I divided by the support of S is greater than or equal to the minimum confidence value. Now applying these rules to the item sets of F3, we get rule 1, which is {1, 3} gives {1, 3, 5} minus {1, 3}; it means 1 and 3 give 5. So the confidence is equal to the support of {1, 3, 5} divided by the support of {1, 3}, which equals 2 by 3, which is 66%, and that is greater than 60 percent. So rule 1 is selected. Now if we come to rule 2, which is {1, 5} gives {1, 3, 5} minus {1, 5}, it means if we have 1 and 5, it implies we're also going to have 3. If we calculate the confidence of this one, we have the support of {1, 3, 5} divided by the support of {1, 5}, which gives us a hundred percent, which means rule 2 is selected as well.
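The walkthrough so far (C1 and F1, C2 and F2, pruning, the size-3 item sets, and the confidence check) can be reproduced with a short pure-Python sketch on the same five numbered transactions. This is an illustrative version written for this example, not the library code used later in the demo; the function names and the join-and-prune structure are assumptions. One small difference from the slides: here the prune step removes {1, 2, 3, 5} before its support is counted, which leads to the same stopping point.

```python
from itertools import combinations

# The five numbered transactions from the walkthrough.
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
    {1, 3, 5},
]
MIN_SUPPORT_COUNT = 2
MIN_CONFIDENCE = 0.6

def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def apriori_levels():
    """Return [F1, F2, ...], each a dict mapping frequent item sets to support counts."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    levels = []
    while candidates:
        counts = {c: support_count(c) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= MIN_SUPPORT_COUNT}
        if not frequent:
            break
        levels.append(frequent)
        size = len(next(iter(frequent))) + 1
        # Join step: unions of frequent item sets that are exactly one item larger.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == size}
        # Prune step: drop any candidate that has an infrequent subset.
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
        ]
    return levels

levels = apriori_levels()
largest = levels[-1]  # the size-3 frequent item sets in this example

# Rule generation: S -> (I - S) for every non-empty proper subset S of I.
rules = []
for itemset, count in largest.items():
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = count / support_count(lhs)
            if conf >= MIN_CONFIDENCE:
                rules.append((sorted(lhs), sorted(itemset - lhs), conf))

for lhs, rhs, conf in rules:
    print(lhs, "->", rhs, f"confidence={conf:.2f}")
```

Only rules that meet the 60 percent minimum confidence are printed, so {1, 3} gives {5} and {1, 5} gives {3} both appear in the output.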
But again, if you have a look at rules 5 and 6 over here: similarly, {3} gives {1, 3, 5} minus {3}; it means if you have 3, we also get 1 and 5. The confidence for this comes to 50%, which is less than the given 60 percent target, so we're going to reject this rule, and the same goes for rule number 6. Now one thing to keep in mind here is that although rule 1 and rule 5 look a lot alike, they are not the same; it really depends on what's on the left-hand side of the arrow and what's on the right-hand side of the arrow. It's the if-then possibility. I'm sure you guys can understand what exactly these rules are and how to proceed with them. So let's see how we can implement the same in Python, right? For that, what I'm going to do is create a new Python notebook, and I'm going to use the Jupyter notebook; you're free to use any sort of IDE. I'm going to name it Apriori. So the first thing: we will be using the online transactional data of a retail store for generating association rules. Firstly, what we need to do is get the pandas and mlxtend libraries imported and read the file. As you can see here, we are using the Online Retail .xlsx file, and from mlxtend we're going to import apriori and association_rules; it all comes under mlxtend. As you can see here, we have the invoice number, the stock code, the description, the quantity, the invoice date, unit price, customer ID and the country. Next, in this step, what we're going to do is data cleanup, which includes removing the spaces from some of the descriptions, dropping the rows that do not have invoice numbers, and removing the credit transactions, because those are of no use to us. As you can see here in the output, we have about five hundred and thirty-two thousand rows with eight columns. After the cleanup, we need to consolidate the items into one transaction per row with each product. And for the sake of keeping the data set small:
We are only looking at the sales for France. As you can see here, we have excluded all the other sales; we're just looking at the sales for France. Now, there are a lot of zeros in the data, but we also need to make sure any positive values are converted to 1 and anything less than or equal to zero is set to 0. As you can see here, we still have 392 rows. We're going to encode it and check again. Now that you have structured the data properly, in this step what we're going to do is generate frequent item sets that have a support of at least seven percent. This number is chosen so that you can get enough item sets and generate rules with the corresponding support, confidence and lift. As you can see here, the minimum support is 0.07. Now what if we add another constraint on the rules, such as the lift being greater than 6 and the confidence greater than 0.8? As you can see here, we have the left-hand side and the right-hand side of the association rule, which are the antecedents and the consequents. We have the support, the confidence, the lift, the leverage and the conviction. So guys, that's it for this session. That is how you create association rules using the Apriori algorithm, which helps a lot in the marketing business. It runs on the principle of market basket analysis, which is exactly what big companies like Walmart, Reliance, Target and even Ikea do. I hope you got to know what exactly association rule mining is, what lift, confidence and support are, and how to create association rules. So guys, reinforcement learning is a part of machine learning where an agent is put in an environment and he learns to behave in this environment by performing certain actions.
Okay, so it basically performs actions, and it either gets rewards for the actions or it gets a punishment, and it observes the reward which it gets from those actions. Reinforcement learning is all about taking an appropriate action in order to maximize the reward in a particular situation. So guys, in supervised learning the training data comprises the input and the expected output, and so the model is trained with the expected output itself. But when it comes to reinforcement learning, there is no expected output. Here the reinforcement agent decides what actions to take in order to perform a given task. In the absence of a training data set, it is bound to learn from its experience itself. Alright, so reinforcement learning is all about an agent who's put in an unknown environment, and he's going to use a hit-and-trial method in order to figure out the environment and then come up with an outcome. Okay, now let's look at reinforcement learning with an analogy. Consider a scenario wherein a baby is learning how to walk. The scenario can go about in two ways. Now in the first case, the baby starts walking and makes it to the candy. Here the candy is basically the reward it's going to get. Since the candy is the end goal, the baby is happy; it's positive. Okay, so the baby is happy and it gets rewarded a set of candies. Now another way in which this could go is that the baby starts walking but falls due to some hurdle in between. The baby gets hurt and it doesn't get any candy, and obviously the baby is sad. So this is a negative reward, okay, or you can say this is a setback. So just like how we humans learn from our mistakes by trial and error, reinforcement learning is also similar. Okay, so we have an agent, which is basically the baby, and a reward, which is the candy over here. And with many hurdles in between, the agent is supposed to find the best possible path to reach the reward. So guys,
I hope you are all clear with reinforcement learning. Now, let's look at the reinforcement learning process. Generally, a reinforcement learning system has two main components. The first is an agent and the second one is an environment. In the previous case, we saw that the agent was the baby and the environment was the living room in which the baby was crawling. The environment is the setting that the agent is acting on, and the agent over here represents the reinforcement learning algorithm.
So guys, the reinforcement learning process starts when the environment sends a state to the agent, and then the agent takes some action based on its observations. In turn, the environment sends the next state and the respective reward back to the agent. The agent updates its knowledge with the reward returned by the environment and uses that to evaluate its previous action. This loop keeps continuing until the environment sends a terminal state, which means that the agent has accomplished all his tasks and he finally gets the reward. This is exactly what was depicted in this scenario: the agent keeps climbing up ladders until he reaches his reward.
To understand this better, let's suppose that our agent is learning to play Counter-Strike. Let's break it down. Initially the RL agent, which is basically the player, let's say player 1, is trying to learn how to play the game. He collects some state from the environment. This could be the first stage of Counter-Strike. Based on the state, the agent will take some action, and this action can be anything that causes a result. So if the player moves left or right, that's also considered an action. Initially the action is going to be random, because obviously the first time you pick up Counter-Strike, you're not going to be a master at it.
So you're going to try different actions, and you just pick a random action in the beginning. Now the environment is going to give a new state. After the player clears that stage, the environment gives a new state to the agent, or to the player. So maybe he has crossed stage 1 and now he's in stage 2. The player will get a reward R1 from the environment because he cleared stage 1. This reward can be anything: additional points, coins, or anything like that. Basically, this loop keeps going until the player is dead or reaches the destination, and it continuously outputs a sequence of states, actions, and rewards.
So guys, this was a small example to show you how the reinforcement learning process works. You start with an initial state, and once the player clears that state he gets a reward. After that, the environment gives another state to the player, and after he clears that state he gets another reward, and this keeps happening until the player reaches his destination. I hope this is clear.
Now, let's move on and look at the reinforcement learning definitions. There are a few concepts that you should be aware of while studying reinforcement learning. First we have the agent. An agent is basically the reinforcement learning algorithm that learns from trial and error. An agent takes actions: for example, a soldier in Counter-Strike navigating through the game, that's an action. If he moves left or right, or if he shoots at somebody, that's also an action. So the agent is responsible for taking actions in the environment. Now, the environment is the whole Counter-Strike game. It's basically the world through which the agent moves. The environment takes the agent's current state and action as input, and it returns the agent's reward and its next state as output.
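The loop described above (the environment sends a state, the agent acts, the environment answers with the next state and a reward, until a terminal state) can be sketched in a few lines. This is only an illustration: the StageEnvironment class, its three-stage layout, the rule that only "forward" clears a stage, and the names reset, step, and run_episode are all made up for this sketch and are not taken from the session's slides.

```python
import random


class StageEnvironment:
    """Hypothetical toy environment: the player must clear a fixed number of stages."""

    def __init__(self, num_stages=3):
        self.num_stages = num_stages
        self.stage = 1

    def reset(self):
        self.stage = 1
        return self.stage  # the initial state sent to the agent

    def step(self, action):
        # Assumed rule for illustration: only "forward" clears the current stage.
        if action == "forward":
            self.stage += 1
            reward = 1
        else:
            reward = 0
        done = self.stage > self.num_stages  # terminal state reached
        return self.stage, reward, done


def run_episode(env, rng, max_steps=1000):
    """Run one episode with purely random actions and record what happened."""
    state = env.reset()
    history = []  # sequence of (state, action, reward)
    for _ in range(max_steps):
        action = rng.choice(["left", "right", "forward"])  # random at first
        next_state, reward, done = env.step(action)
        history.append((state, action, reward))
        state = next_state
        if done:
            break
    return history


history = run_episode(StageEnvironment(num_stages=3), random.Random(0))
total_reward = sum(reward for _, _, reward in history)
print(len(history), "steps, total reward", total_reward)
```

The history list is the "sequence of states, actions, and rewards" mentioned above. A learning agent would use it to stop choosing at random, which is what the Q-learning part of the session works towards.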
Next we have action. All the possible steps that an agent can take are called actions. Like I said, it can be moving right, moving left, shooting, or any of that. Then we have state. State is basically the current condition returned by the environment. Whichever state you are in, whether you're in state 1 or in state 2, that represents your current condition. Next we have reward. A reward is basically an instant return from the environment to appraise your last action. It can be anything, like coins or additional points. So basically a reward is given to an agent after it clears specific stages. Next we have policy. Policy is basically the strategy that the agent uses to find out his next action based on his current state. Policy is just the strategy with which you approach the game.
Then we have value. Value is the expected long-term return with discount. Value and action value can be a little bit confusing for you right now, but as we move further, you'll understand what I'm talking about. So value is basically the long-term return that you get with discount. Discount I'll explain in the further slides. Then we have action value. Action value is also known as Q value. It's very similar to value, except that it takes an extra parameter, which is the current action. So basically here you'll find out the Q value depending on the particular action that you took.
So guys, don't get confused with value and action value. We'll look at examples in the further slides and you will understand this better. Make sure that you're familiar with these terms, because you'll be seeing a lot of them in the further slides. Now before we move any further, I'd like to discuss a few more concepts. First we will discuss reward maximization.
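The definition of value above talks about a "long-term return with discount". One common way to make that concrete, shown here as a small sketch rather than as the session's own code, is to weight the reward received t steps in the future by gamma to the power t, where gamma is a discount factor between 0 and 1 (the same gamma the later slides discuss). The function name discounted_return and the reward list are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Two small rewards now, one big reward two steps away (made-up numbers).
rewards = [1, 1, 10]

low_gamma = discounted_return(rewards, 0.1)   # the far reward barely counts
high_gamma = discounted_return(rewards, 0.9)  # the far reward counts heavily
print(low_gamma, high_gamma)
```

With a small gamma the big reward in the future is discounted heavily, so the total is dominated by the immediate rewards. With a gamma close to 1 the future reward contributes almost its full size.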
So if you haven't already realized it, the basic aim of the RL agent is to maximize the reward. Now, how does that happen? Let's try to understand this in depth. The agent must be trained in such a way that he takes the best action so that the reward is maximum, because the end goal of reinforcement learning is to maximize your reward based on a set of actions.
Let me explain this with a small game. In the figure you can see there is a fox, there's some meat, and there's a tiger. Our agent is basically the fox, and his end goal is to eat the maximum amount of meat before being eaten by the tiger. Since the fox is a clever fellow, he eats the meat that is closer to him rather than the meat that is closer to the tiger. This is because the closer he is to the tiger, the higher are his chances of getting killed. Because of this, the rewards that are near the tiger, even if they are bigger meat chunks, will be discounted. This is exactly what discounting means. Our agent is not going to eat the meat chunks that are closer to the tiger because of the risk. Even though the meat chunks might be larger, he does not want to take the chance of getting killed. This is called discounting: you just eat the meat that is closer to you instead of taking risks and eating the meat that is closer to your opponent.
Now, the discounting of reward works based on a value called gamma. We'll be discussing gamma in our further slides, but in short, the value of gamma is between 0 and 1. The smaller the gamma, the larger the discount. So if the gamma value is smaller, it means that the agent is not going to explore and he's not going to try to eat the meat chunks that are closer to the tiger.
But if the gamma value is closer to 1, it means that our agent is actually going to explore, and it's going to try to eat the meat chunks that are closer to the tiger. I'll be explaining this in depth in the further slides, so don't worry if you haven't got a clear concept yet. Just understand that reward maximization is a very important step when it comes to reinforcement learning, because the agent has to collect maximum rewards by the end of the game.
Now, let's look at another concept, which is called exploration and exploitation. Exploration, like the name suggests, is about exploring and capturing more information about an environment. On the other hand, exploitation is about using the already known information to heighten the rewards. Consider the fox and tiger example that we discussed. Here the fox eats only the meat chunks that are close to him, and he does not eat the meat chunks that are closer to the tiger, even though they might give him more rewards. If the fox only focuses on the closest rewards, he will never reach the big chunks of meat. This is what exploitation is about: you just use the currently known information and you try to get rewards based on that information. But if the fox decides to explore a bit, it can find the bigger reward, which is the big chunks of meat. This is exactly what exploration is. The agent is not going to stick to one corner. Instead, he's going to explore the entire environment and try to collect bigger rewards. So guys, I hope you are all clear with exploration and exploitation.
Now, let's look at the Markov decision process. This is basically a mathematical approach for mapping a solution in reinforcement learning. In a way, the purpose of reinforcement learning is to solve a Markov decision process. There are a few parameters that are used to get to the solution.
The parameters include the set of actions, the set of states, the rewards, the policy that you're taking to approach the problem, and the value that you get. To sum it up, the agent must take an action A to transition from the start state to the end state S. While doing so, the agent will receive a reward R for each action that he takes. The series of actions taken by the agent defines the policy, or the approach, and the rewards that are collected define the value. So the main goal here is to maximize the rewards by choosing the optimum policy.
Now, let's try to understand this with the help of the shortest path problem. I'm sure a lot of you might have gone through this problem when you were in college. Look at the graph over here. Our aim is to find the shortest path between A and D with the minimum possible cost. The value that you see on each of these edges denotes the cost. So if I want to go from A to C, it's going to cost me 15 points.
Let's look at how this is done. In this problem, the set of states is denoted by the nodes, which are A, B, C, and D, and the action is to traverse from one node to another. If I'm going from A to B, that's an action. Similarly, A to C is an action. The reward is basically the cost, which is represented by each edge over here. The policy is basically the path that I choose to reach the destination. Let's say I choose A, C, D: that's one policy to get to D. Choosing another path, such as A, B, D, is also a policy. It's basically how I'm approaching the problem.
So guys, here you can start off at node A and take baby steps to your destination. Initially you're clueless, so you can just take the next possible node that is visible to you. If you're smart enough, you're going to choose A to C instead of A, B, C, D or A, B, D. So now, if you are at node C, you want to traverse to node D.
You must again choose a wise path. You just have to calculate which path has the highest cost, or which path will give you the maximum reward. So guys, this is a simple problem. We're just trying to calculate the shortest path between A and D by traversing through these nodes. If I travel along A, C, D, it gives me the maximum reward. It gives me 65, which is more than any other policy would give me. If I go along A, B, D, it would be 40. Compared to that, A, C, D gives me more reward, so obviously I'm going to go with A, C, D. So guys, that was a simple problem to understand how a Markov decision process works.
Now I want to ask you a question. What do you think I did here: did I perform exploration or did I perform exploitation? The policy for the above example is one of exploitation, because we didn't explore the other nodes. We just selected three nodes and traversed through them. That's why this is called exploitation. We must always explore the different nodes so that we can find a more optimal policy. In this case, obviously A, C, D has the highest reward and we're going with A, C, D, but generally it's not so simple. There are a lot of nodes, hundreds of nodes to traverse, and there are 50 or 60 different policies. So make sure you explore all the policies and then decide on an optimum policy that will give you the maximum reward.
So guys, before we perform the hands-on part, let's try to understand the math behind our demo. In our demo we'll be using the Q-learning algorithm, which is a type of reinforcement learning algorithm. It's simple: it just means that you take the best possible actions to reach your goal or to get the most rewards. Let's try to understand this with an example. This is exactly what we'll be running in our demo, so make sure you understand this properly.
So our goal here is to place an agent in any one of the rooms. These squares you see here are rooms: 0 is a room, 4 is a room, 3 is a room, 1 is a room, 2 is a room, and 5 is also a room. Room 5 is basically the way outside the building. What we're going to do is place an agent in any one of these rooms, and the goal is to reach outside the building, which is room number 5. These spaces are basically doors, which means that you can go from 0 to 4, from 4 to 3, from 3 to 1, from 1 to 5, and similarly from 3 to 2, but you can't go from 5 to 2 directly. So there are certain rooms that are not connected directly. Like I've mentioned here, each room is numbered from 0 to 4, and the outside of the building is numbered 5. One thing to note here is that room 1 and room 4 lead directly to room number 5.
So basically our goal over here is to get to room number 5. To set this room as the goal, we'll associate a reward value with each door. Don't worry, I'll explain what I'm saying. If you represent these rooms in a graph, this is how the graph is going to look. For example, from 2 you can go to 3, then from 3 to 1, and from 1 to 5, which leads us to our goal. These arrows represent the links between the doors. Now, this is quite understandable.
Our next step is to associate a reward value with each of these doors. The rooms that are directly connected to our end room, which is room number 5, will get a reward of 100. So basically room number 1 will have a reward of 100. This is obviously because it's directly connected to 5. Similarly, 4 will also be associated with a reward of 100, because it's directly connected to 5. If you go out from 4, it will lead to 5. Now, the other nodes are not directly connected to 5, so you can't directly go from 0 to 5. For these we'll be assigning a reward of zero. So basically, other doors not directly connected to the target room have a zero reward. Because the doors are two-way, two arrows are assigned to each room. You can see two arrows assigned to each room: 0 leads to 4 and 4 leads back to 0. We have assigned 0 over here because 0 does not directly lead to 5, but 1 directly leads to 5, and that's why you can see a hundred over here. Similarly, 4 directly leads to our goal state, and that's why we've assigned a hundred over here, and obviously 5 to 5 is a hundred as well. So here all the direct connections to room number 5 are rewarded 100 and all the indirect connections are rewarded zero.
So guys, in Q-learning the end goal is to reach the state with the highest reward, so that the agent arrives at the goal. Let me explain this graph to you in detail. These rooms over here, labeled 0, 1, 2, 3, up to 5, represent the state an agent is in. If I say state 1, it means that the agent is in room number 1. Similarly, the agent's movement from one room to another represents the action. So if I say 1 to 3, it represents an action. Basically, the state is represented as a node and the action is represented by these arrows. That is what this graph is about: these nodes represent the rooms and these arrows represent the actions.
Let's look at a small example. Let's set the initial state to 2. So my agent is placed in room number 2, and he has to travel all the way to room number 5. If I set the initial state to 2, he can travel to state 3.
From 3 he can either go to 1, or go back to 2, or go to 4. If he chooses to go to 4, it will directly take him to room number 5, which is our end goal, and even if he goes from room number 3 to 1, it will take him to room number 5. So this is how our algorithm works: it's going to traverse different rooms in order to reach the goal room, which is room number 5.
Now, let's try to depict these rewards in the form of a matrix, because we'll be using this R matrix, or reward matrix, to calculate the Q value, or the Q matrix. We'll see what the Q value is in the next step, but for now, let's see how this reward matrix is calculated. The -1s that you see in the table represent the null values. A -1 basically means that there is no link between the nodes. So 0 to 0 is -1. From 0 to 1 there is no direct link, so it's represented as -1. Similarly, from 0 to 2 or 0 to 3 there is no link. You can see there's no line over here, so this is also -1. But when it comes to 0 to 4, there is a connection, and we have entered 0, because the reward for a state that is not directly connected to the goal is zero.
But if you look at 1 comma 5, which is basically traversing from node 1 to node 5, you can see the reward is 100. That's because 1 and 5 are directly connected and 5 is our end goal. Any node that is directly connected to our goal state will get a reward of 100. That's why I've put 100 over here. Similarly, if you look at the row for room 4, I've assigned 100 over here. This is because from 4 to 5 there is a direct connection, which gives a reward of 100. You can see that from room number 4 to room number 5 you can go directly. That's why there's a reward of 100 over here.
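The reward matrix described above can be built directly from the list of doors. The sketch below uses plain Python lists so that it runs without any extra libraries, whereas the demo later in the session uses NumPy. The names NUM_ROOMS, GOAL, DOORS, and build_reward_matrix are made up for this illustration, and the door list is read off the walkthrough (0-4, 4-3, 3-1, 3-2, 1-5, 4-5, plus 5 back to itself).

```python
NUM_ROOMS = 6
GOAL = 5

# Two-way doors as described in the walkthrough (assumed complete).
DOORS = [(0, 4), (4, 3), (3, 1), (3, 2), (1, 5), (4, 5)]


def build_reward_matrix():
    """Rows are the current room, columns are the room being moved into."""
    # -1 everywhere means "no door between these two rooms".
    R = [[-1] * NUM_ROOMS for _ in range(NUM_ROOMS)]
    for a, b in DOORS:
        # A move that lands in the goal room earns 100, any other valid move 0.
        R[a][b] = 100 if b == GOAL else 0
        R[b][a] = 100 if a == GOAL else 0
    R[GOAL][GOAL] = 100  # staying in the goal room is also rewarded
    return R


R = build_reward_matrix()
for row in R:
    print(row)
```

Reading row 1 of the result, the only entries that are not -1 are column 3 (reward 0) and column 5 (reward 100), which matches the statement that from room 1 you can only go to room 3 or room 5.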
So guys, this is how the reward matrix is made. I hope this is clear to you all. Now that we have the reward matrix, we need to create another matrix called the Q matrix. Here you'll store all the Q values that we calculate. This Q matrix basically represents the memory of what the agent has learned through experience. Once he traverses from one room to the final room, whatever he has learned is stored in this Q matrix, so that he remembers it the next time he travels. It's basically like a memory. The rows of the Q matrix represent the current state of the agent, the columns represent the possible actions, and to calculate the Q value we use this formula.
I'll show you what the Q matrix looks like, but first, let's understand this formula. We're calculating this Q value because we want to fill in the Q matrix. So this is basically a matrix over here. Initially it's all 0, but as the agent traverses from different nodes to the destination node, this matrix gets filled up. Basically it will be like a memory for the agent. He'll know that when he traversed a particular path, his value was maximum, or his reward was maximum over there, so next time he'll choose that path. This is exactly what the Q matrix is.
Let's go back. Now guys, don't worry about this formula for now, because we'll be implementing it in an example in the next slide. Just remember that Q represents the Q matrix, R represents the reward matrix, and gamma is the gamma value, which I'll talk about shortly, and here you're just finding the maximum from the Q matrix. The gamma parameter has a range from 0 to 1, so you can have a value of 0.1, 0.3, 0.5, 0.8, and so on.
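The formula itself is shown on the slide and does not appear in the transcript. Based on how it is read out later in the session (Q of current state and action equals R of current state and action plus gamma times the maximum Q value of the next state over all actions), it can be written as follows. The symbols s, a, s', and a' are notation chosen here, not taken from the slide.

```latex
Q(s, a) = R(s, a) + \gamma \cdot \max_{a'} Q(s', a')
```

Here s is the current room, a is the chosen action, s' is the room the agent ends up in after taking a (in this example the action is simply the number of the next room, so s' = a), and the maximum runs over all actions a' available from s'.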
If the gamma is closer to zero, it means that the agent will consider only the immediate rewards, which means that the agent will not explore the surroundings. Basically, it won't explore different rooms. It will just choose a particular room and then try sticking to it. But if the value of gamma is high, meaning that it's closer to one, the agent will consider future rewards with greater weight. This means that the agent will explore all the possible approaches, or all the possible policies, in order to get to the end goal. So guys, this is what I was talking about when I mentioned exploitation and exploration. If the gamma value is closer to 1, it basically means that you're actually exploring the entire environment and then choosing an optimum policy. But if your gamma value is closer to zero, it means that the agent will only stick to a certain set of policies and will calculate the maximum reward based on those policies.
Next, we have the Q-learning algorithm that we're going to use to solve this problem. Now, this is going to look very confusing to you all, so let me explain it with an example. We'll see what we're actually going to run in our demo, we'll do the math behind it, and then I'll tell you what this Q-learning algorithm is. You'll understand it as I'm showing you the example.
So guys, in the Q-learning algorithm the agent learns from his experience. Each episode, which is basically when the agent traverses from an initial room to the end goal, is equivalent to one training session, and in every training session the agent will explore the environment and receive some reward until it reaches the goal state, which is 5. The purpose of training is to enhance the brain of our agent.
Only if he knows the environment very well will he know which action to take, and this is why we calculate the Q matrix. In the Q matrix we're going to calculate the value of traversing from every state to the end state, from every initial room to the end room. When we calculate all the values, or how much reward we're getting from each policy, we know the optimum policy that will give us the maximum reward. That's why we have the Q matrix. This is very important, because the more you train the agent, the more optimum your output will be. So basically here the agent will not perform exploitation. Instead, he'll explore around, go back and forth through the different rooms, and find the fastest route to the goal.
Now, let's look at an example and see how the algorithm works. Let's go back to the previous slide. Here it says that the first step is to set the gamma parameter, so let's do that. The first step is to set the value of the learning parameter, which is gamma, and we have randomly set it to 0.8. The next step is to initialize the matrix Q to 0, so we've set matrix Q to 0 over here. The third step is to select a random initial state, and here we've selected room number 1 as the initial state.
After you initialize Q as a zero matrix, from room number 1 you can either go to room number 3 or room number 5. If you look at the reward matrix, you can see that from room number 1 you can only go to room number 3 or room number 5. The other values are -1 here, which means that there is no link from 1 to 0, 1 to 1, 1 to 2, or 1 to 4. So the only possible actions from room number 1 are to go to room number 3 and to go to room number 5. Let's select room number 5. So from room number 1 you can go to 3 or 5, and we have randomly selected 5.
You could also select 3, but for this example let's select 5. Now, from room 5 you're going to calculate the maximum Q value for the next state based on all possible actions. From room number 5, the next state can be room number 1, 4, or 5. So you're going to calculate the Q value for traversing 5 to 1, 5 to 4, and 5 to 5, and you're going to find out which has the maximum Q value. That's how you compute the Q value.
So let's implement our formula. This is the Q-learning formula. Right now we're traversing from room number 1 to room number 5, so here I've written Q of 1 comma 5. The 1 represents our current state, which is room number 1. Our initial state was room number 1, and we are traversing to room number 5, as shown in this figure. Next in our formula is the reward matrix for the state and action. For 1 comma 5, the reward matrix gives 100, so R of 1 comma 5 is 100.
Then you add the gamma value. We initialized gamma to 0.8, so that's what we have written over here, and we multiply it by the maximum Q value that we get for the next state based on all possible actions. From 5, the next state can be 1, 4, or 5. So you calculate the Q value of 5 to 1, 5 to 4, and 5 to 5. That's what I've written over here: Q of 5 comma 1, 5 comma 4, and 5 comma 5 are the next possible actions that you can take from state 5. So R of 1 comma 5 is 100, because you can see that in the reward matrix, and 0.8 is the value of gamma.
Next, we calculate Q of 5 comma 1, 5 comma 4, and 5 comma 5. Like I mentioned earlier, we initialize matrix Q as a zero matrix. We're setting the values to 0 because initially the agent obviously doesn't have any memory of what is happening. He's just starting from scratch. That's why all these values are 0. So Q of 5 comma 1 is 0, Q of 5 comma 4 is 0, and Q of 5 comma 5 is also 0, and the maximum of these is obviously 0. When you compute this equation, you get 100, so the Q value of 1 comma 5 is 100. If the agent goes from room number 1 to room number 5, he's going to have a maximum reward, or Q value, of 100. In the next slide you can see that I've updated the value of Q of 1 comma 5. It's set to 100.
Now let's look at another example so that you understand this better. So guys, this is exactly what we're going to do in our demo. It's only going to be coded. I'm just explaining our code right now and telling you the math behind it.
This time we'll again start with a randomly chosen initial state. Let's say that we've chosen state 3. From room 3, you can go to room number 1, 2, or 4. Randomly, we'll select room number 1, and from room number 1 you're going to calculate the maximum Q value for the next state based on all possible actions. The possible actions from 1 are to go to 3 and to go to 5. Now let's calculate the Q value using this formula. Let me explain it once again. Here 3 comma 1 represents that we're in room number 3 and we are going to room number 1. So going from 3 to 1 is our action, and 3 is our current state. Next we look at the reward of going from 3 to 1. If you go to the reward matrix, 3 comma 1 is 0.
This is because this move does not lead directly into room 5: there's no direct link between 3 and 5. That's why the reward here is zero, so the value here will be 0. After that we have the gamma value, which is 0.8, and then we calculate the maximum of Q of 1 comma 3 and Q of 1 comma 5. Whichever has the maximum value, we use that. Q of 1 comma 3 is 0, as you can see here, and Q of 1 comma 5, if you remember, we just calculated in the previous slide: it's 100. So here I'm going to put 100. The maximum here is 100, and 0.8 times 100 gives us 80. That's the Q value you get if you traverse from 3 to 1. I hope that was clear.
So now we have traversed from room number 3 to room number 1 with a Q value of 80, but we still haven't reached the end goal, which is room number 5. So for our next episode the state will be room number 1. Like I said, we repeat this in a loop, because room number 1 is not our end goal. Our end goal is room number 5, so now we need to figure out how to get from room number 1 to room number 5. From room number 1, you can go to either 3 or 5. That's what I've drawn over here. If we select 5, we know that it's our end goal. From room number 5, you then have to calculate the maximum Q value for the next possible actions. The next possible actions from 5 are to go to room number 1, room number 4, or room number 5. So you calculate the Q value of 5 to 1, 5 to 4, and 5 to 5, find out which is the maximum Q value, and use that value.
Let's look at the formula now. Again, we're in room number 1 and we want to go to room number 5, so that's exactly what I've written here: Q of 1 comma 5. Next is the reward matrix: the reward of 1 comma 5 is 100. Then we have added the gamma value, which is 0.8.
And then we find the maximum Q value from 5 to 1, 5 to 4, and 5 to 5. This is what we're performing over here. Q of 5 comma 1, 5 comma 4, and 5 comma 5 are all 0, because we initially set all the values of the Q matrix to 0. So you get 100 over here, and the matrix remains the same, because we had already calculated Q of 1 comma 5. The value of 1 comma 5 has already been fed to the agent, so when he comes back here, he knows he's already done this before. Now he's going to try another method. He's going to try to take another route, or another policy. He's going to go through different rooms and finally land up in room number 5.
So guys, this is exactly how our code runs. We're going to traverse through each and every node, because we want an optimum policy. An optimum policy is attained only when you traverse through all possible actions. Only if you go through all the possible actions that you can perform will you understand which is the best action that will lead us to the reward. I hope this is clear.
Now, let's move on and look at our code. This is our code, it is executed in Python, and I'm assuming that all of you have a good background in Python. If you don't understand Python very well, I'm going to leave a link in the description. You can check out that video on Python and then maybe come back to this later. I'll be explaining the code to you anyway, but I'm not going to spend a lot of time explaining each and every line, because I'm assuming that you know Python.
So let's look at the first line of code over here. What we're going to do is import NumPy. NumPy is basically a Python library that adds support for large multi-dimensional arrays and matrices, and it's basically for computing mathematical functions. So first we import that. After that we create the R matrix.
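Before going further into the demo code, the three hand calculations above (Q of 1 comma 5, then Q of 3 comma 1, then Q of 1 comma 5 again) can be checked with a few lines of plain Python. This is a sketch, not the session's code: the helper name q_update is made up, and the R matrix is typed in from the description of the doors.

```python
GAMMA = 0.8

# Reward matrix from the walkthrough: -1 = no door, 0 = door, 100 = door into room 5.
R = [
    [-1, -1, -1, -1, 0, -1],
    [-1, -1, -1, 0, -1, 100],
    [-1, -1, -1, 0, -1, -1],
    [-1, 0, 0, -1, 0, -1],
    [0, -1, -1, 0, -1, 100],
    [-1, 0, -1, -1, 0, 100],
]

# The agent's memory starts out empty (all zeros).
Q = [[0.0] * 6 for _ in range(6)]


def q_update(state, action):
    """Apply Q(state, action) = R(state, action) + gamma * max Q(next state, ...)."""
    next_state = action  # in this example the action is the room moved into
    valid_next = [a for a in range(6) if R[next_state][a] >= 0]
    best_next = max(Q[next_state][a] for a in valid_next)
    Q[state][action] = R[state][action] + GAMMA * best_next
    return Q[state][action]


first = q_update(1, 5)   # 100 + 0.8 * max(Q[5][1], Q[5][4], Q[5][5])
second = q_update(3, 1)  # 0 + 0.8 * max(Q[1][3], Q[1][5])
third = q_update(1, 5)   # same inputs as the first update
print(first, second, third)
```

The first and third updates give the same value because nothing in row 5 of Q has been learned yet, which is the "matrix remains the same" remark in the walkthrough.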
So this is the R matrix. Next we're going to create a Q matrix, and it's a 6 by 6 matrix, because obviously we have six states, from 0 to 5, and we're going to initialize the values to zero. So basically the Q matrix is initialized to zero over here. After that we're setting the gamma parameter to 0.8. You can play with this parameter, move it up to 0.9 or move it lower, and see what happens. Then we'll set an initial state; the initial state is set as 1. After that we're defining a function called available_actions. Basically, since our initial state is 1, we're going to check row number 1 of the R matrix. This is row number 0, this is row number 1, and so on. So we check row number 1 and find the values which are greater than or equal to 0, because these values are the nodes that we can travel to. If you select -1, you can't traverse there; I explained this earlier: the -1 entries represent all the nodes that we can't travel to, but we can travel to these nodes. So over here we're checking all the values which are equal to 0 or greater than 0; these will be our available actions. If our initial state is 1, we can travel to the other states whose value is equal to 0 or greater than 0, and this is stored in the variable called available_act. This basically gets the available actions in the current state, so we're just storing the possible actions in this available_act variable. Since our initial state is 1, we find the next possible states we can go to, and that is stored in available_act. Next is this function that chooses at random which action is to be performed within the range. So if you remember, initially we are in room number 1.
Our available actions are to go to room number 3 or room number 5. Now we need to randomly choose one room, and for that we're using this line of code. So here we randomly choose one of the actions from available_act; like I said earlier, available_act stores all our possible actions from the initial state. Once it chooses an action, it stores it in next_action, so this represents the next action to take. Next is our Q matrix. Remember the formula that we used? That formula is what we're going to calculate in the next few lines of code. This block of code computes the value of Q. This is our formula for computing the value of Q: Q(current state, action) = R(current state, action) + gamma times the maximum Q value of the next state. So here we're basically going to calculate the maximum index, meaning that we're going to check which of the possible actions will give us the maximum Q value. If you remember, in our explanation we had this value over here, the max of Q(5, 1), Q(5, 4) and Q(5, 5), and we had to choose the maximum Q value from these three. That's exactly what we're doing in this line of code: calculating the index which gives us the maximum value. After we finish computing the value of Q, we just have to update our matrix. After that, we'll be updating the Q value and choosing a new initial state. So this is the update function that is defined over here, and I've just called the function over here. This whole set of code just calculates the Q value; this is exactly what we did in our examples. After that we have the training phase. Remember, the more you train an algorithm, the better it's going to learn. So over here I have provided around 10,000 iterations.
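The helper functions described above might look roughly like the sketch below. The reward matrix R is an assumption: it is a six-room layout chosen to agree with the links mentioned in the walkthrough (room 1 leads to 3 and 5, room 5 leads to 1, 4 and itself), not a copy of the matrix on screen. The names follow the ones spoken in the video (available_actions, available_act, update); sample_next_action is an assumed name, and a plain NumPy array is used here, whichever array type the original code uses.

```python
import numpy as np

# Assumed reward matrix: -1 = no door, 0 = door, 100 = door into the goal (room 5).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
Q = np.zeros((6, 6))           # Q matrix starts at zero
gamma = 0.8                    # discount parameter
rng = np.random.default_rng(0)


def available_actions(state):
    """Rooms reachable from `state`: columns of that row with reward >= 0."""
    return np.flatnonzero(R[state] >= 0)


def sample_next_action(available_act):
    """Pick one of the available actions uniformly at random."""
    return int(rng.choice(available_act))


def update(current_state, action):
    """Apply Q(s, a) = R(s, a) + gamma * max(Q(a, :)) and return the new value."""
    Q[current_state, action] = R[current_state, action] + gamma * Q[action].max()
    return Q[current_state, action]


initial_state = 1
available_act = available_actions(initial_state)
action = sample_next_action(available_act)
new_q = update(initial_state, action)
print(available_act, action, new_q)
```

Because Q is still all zeros at this point, the first update simply copies the reward: 100 if room 5 was sampled, 0 if room 3 was.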
So my range is 10,000 iterations, meaning that my agent will take 10,000 possible scenarios and go through 10,000 iterations to find out the best policy. What I'm doing here is choosing the current state randomly; after that, I'm choosing the available actions from the current state, so, for example, I can go either to room 3 or to room 5; then I'm calculating the next action, and finally I'm updating the value in the Q matrix. Next, we just normalize the Q matrix. Sometimes the values in our Q matrix might get large; let's say they hit 500 or 600. At that point you want to normalize the matrix, to bring it down a little bit, because larger numbers are harder to read and to compute with. That's why we perform normalization: you take your calculated values, divide them by the maximum Q value, and multiply by 100. So you are normalizing it over here. Then, this is the testing phase. Here you just set a current state, and you won't be given any other data, because you've already trained the model. You give a current state, and you tell your agent: listen, you're in room number 1, now you need to go to room number 5. It has to figure out how to go to room number 5, because we have trained it now. So here we have set the current state to 1, and we need to make sure that it's not equal to 5, because 5 is the end goal. This is the same loop that we executed earlier, so we're going to do the same iterations again. Now if I run this entire code, let's look at the result. Our current state here we've chosen as 1, and if we go back to our matrix, you can see that there is a direct link from 1 to 5, which means that the route the agent should take is 1 to 5. It should go directly from 1 to 5, because it will get the maximum reward that way.
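Putting the training, normalization and testing phases together, a minimal sketch could look like this. It again assumes the six-room reward matrix used for illustration (not shown in this part of the video), uses a fixed random seed so runs are repeatable, and breaks ties between equally good rooms at random; greedy_path and max_steps are hypothetical names added for the sketch.

```python
import numpy as np

# Assumed reward matrix: -1 = no door, 0 = door, 100 = door into the goal (room 5).
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma = 0.8
goal = 5
rng = np.random.default_rng(42)
Q = np.zeros(R.shape)

# Training phase: random state, random available action, then the Q update.
for _ in range(10_000):
    state = int(rng.integers(0, 6))
    action = int(rng.choice(np.flatnonzero(R[state] >= 0)))
    Q[state, action] = R[state, action] + gamma * Q[action].max()

# Normalization: divide by the largest Q value and scale to 100.
Q_norm = Q / Q.max() * 100


def greedy_path(start, max_steps=20):
    """Testing phase: from `start`, keep moving to the room with the highest Q value."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        row = Q_norm[path[-1]]
        best = np.flatnonzero(np.isclose(row, row.max()))  # all equally good rooms
        path.append(int(rng.choice(best)))
    return path


print(np.round(Q_norm))
print(greedy_path(1))
print(greedy_path(2))
```

Under this assumed layout, rooms 1 and 4 are equally good next steps from room 3, so a start in room 2 can come out as 2, 3, 1, 5 or 2, 3, 4, 5 depending on the tie-break.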
Let's see if that's happening. If I run this, it should give me a direct path from 1 to 5. And that's exactly what happened: this is the selected path, so it went directly from 1 to 5, and it calculated the entire Q matrix for me. So guys, this is exactly how it works. Now let's try setting the initial state as, let's say, 2. If I set the initial state as 2 and run the code, let's see the path that it gives. The selected path is 2, 3, 4, 5. It chose this path because it gives us the maximum reward. This is the Q matrix that it calculated, and this is the selected path. All right, so with this we come to the end of this demo. Basically, what we did was place an agent in a random room and ask it to traverse through and reach the end room, which is room number 5. We trained our agent and made sure that it went through all the possible paths to calculate the best path.

... for a robot, and the environment is the place where it has been put to use. Remember, this robot is itself the agent; for example, in an automobile factory where a robot is used to move materials from one place to another. Now, the tasks we discussed just now have a property in common: these tasks involve an environment and expect the agent to learn from that environment. This is where traditional machine learning fails, and hence the need for reinforcement learning. It is good to have an established overview of the problem that is to be solved using Q-learning or reinforcement learning, as it helps to define the main components of a reinforcement learning solution: the agent, environment, actions, rewards and states. So let's suppose we are to build a few autonomous robots for an automobile building factory. These robots will help the factory personnel by conveying to them the necessary parts that they need in order to build the car.
Now, these different parts are located at nine different positions within the factory warehouse. The car parts include the chassis, wheels, dashboard, the engine and so on, and the factory workers have prioritized the location that contains the body, or the chassis, as the topmost; but they have provided priorities for the other locations as well, which we'll look into in a moment. The locations within the factory look somewhat like this. As you can see here, we have L1, L2, L3, all of these stations. One thing you might notice is that there are little obstacles present in between the locations. L6 is the top priority location, which contains the chassis for preparing the car bodies. Now the task is to enable the robots to find the shortest route from any given location to another location on their own. The agents in this case are the robots, and the environment is the automobile factory warehouse. Let's talk about the states. The states are the locations in which a particular robot is present at a particular instant of time, which we'll denote as its state. Machines understand numbers rather than letters, so let's map the location codes to numbers. As you can see here, we have mapped location L1 to state 0, L2 to 1 and so on; we have L8 as state 7 and L9 as state 8. Next, what we're going to talk about are the actions. In our example, the actions will be the direct locations that a robot can go to from a particular location. Consider a robot that is at the L2 location; the direct locations to which it can move are L5, L1 and L3. The figure here may come in handy to visualize this. As you might have already guessed, the set of actions here is nothing but the set of all possible states of the robot. For each location, the set of actions that a robot can take will be different; for example, the set of actions will change if the robot is in L1 rather than L2.
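The location-to-number mapping described above can be written down in a couple of lines. This is only a sketch: the dictionary name location_to_state and the small neighbours_of_L2 example are illustrative names, and only the L2 neighbours named in the talk are listed.

```python
# Map the location codes L1..L9 to the states 0..8.
location_to_state = {f"L{i + 1}": i for i in range(9)}

# The action set is the set of all states the robot could move to.
actions = list(range(9))

# Direct moves from L2 as stated in the talk: L5, L1 and L3.
neighbours_of_L2 = ["L1", "L3", "L5"]
playable_from_L2 = sorted(location_to_state[loc] for loc in neighbours_of_L2)

print(location_to_state)
print(playable_from_L2)
```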
So if the robot is in L1, it can only go to L4 and L2 directly. Now that we are done with the states and the actions, let's talk about the rewards. The states are basically 0, 1, 2, 3, 4 and so on, and the actions are also 0, 1, 2, 3, 4, up to 8. A reward will be given to a robot if a location, which is the state, is directly reachable from a particular location. Let's take an example: suppose L9 is directly reachable from L8. If a robot goes from L8 to L9 or vice versa, it will be rewarded with 1, and if a location is not directly reachable from a particular location, we do not give any reward, that is, a reward of 0. The reward is just a number and nothing else; it enables the robots to make sense of their movements, helping them decide which locations are directly reachable and which are not. With this cue, we can construct a reward table which contains all the reward values, mapping between all possible states. As you can see in the table, the positions which are marked green have a positive reward, and we have all the possible rewards that a robot can get by moving between the different states. Now comes an interesting decision. Remember that the factory administrator prioritized L6 as the topmost. So how do we incorporate this fact in the above table? This is done by associating the topmost priority location with a much higher reward than the usual ones. So let's put 999 in the cell (L6, L6). The table of rewards with a higher reward for the topmost location now looks something like this. We have now formally defined all the vital components of the solution we are aiming for. Now we will shift gears a bit and study some of the fundamental concepts that prevail in the world of reinforcement learning and Q-learning. First of all, we'll start with the Bellman equation. Now consider the following square
rooms, which are analogous to the actual environment from our original problem, but without the barriers. Now suppose a robot needs to go to the room marked in green from its current position A, using the specified direction. How can we enable the robot to do this programmatically? One idea would be to introduce some kind of footprint which the robot will be able to follow. Here a constant value is specified in each of the rooms that come along the robot's way if it follows the direction specified above. In this way, if it starts at location A, it will be able to scan through this constant value and move accordingly. But this will only work if the direction is fixed in advance and the robot always starts at location A. Now consider that the robot starts at this location rather than its previous one. The robot now sees footprints in two different directions, and it is therefore unable to decide which way to go in order to get to the destination, which is the green room. This happens primarily because the robot does not have a way to remember the directions to proceed. So our job now is to equip the robot with a memory, and this is where the Bellman equation comes into play. As you can see here, the main purpose of the Bellman equation is to provide the robot with a memory; that's the thing we're going to use. The equation goes something like this: V(s) = max over a of (R(s, a) + gamma times V(s')), where s is a particular state, which is a room; a is the action of moving between the rooms; s' is the state to which the robot goes from s; and gamma is the discount factor, which we'll get into in a moment. And obviously R(s, a) is a reward function which takes a state s and an action a and outputs the reward. V(s) is the value of being in a particular state, which is the footprint. We consider all the possible actions and take the one that yields the maximum value. Now, there is one constraint.
Regarding the value footprint, the room marked in yellow, just below the green room, will always have the value 1, to denote that it is one of the nearest rooms adjacent to the green room. This is also to ensure that a robot gets a reward when it goes from the yellow room to the green room. Let's see how to make sense of the equation we have here. Let's assume a discount factor of 0.9; remember, gamma is the discount value, or the discount factor, so let's take 0.9. Now, for the room just below the yellow room, the one marked with an asterisk, what will V(s) be, that is, the value of being in that particular state? For this room, V(s) would be the max over a of (0 + 0.9 times 1), where 0 is R(s, a) and 0.9 is gamma, and that gives us 0.9. Here the robot will not get any reward for going to the state marked in yellow, hence R(s, a) is 0, but the robot knows the value of being in the yellow room, hence V(s') is 1. Following this for the other states, we should get 0.9; then, if we put 0.9 in the equation, we get 0.81, then 0.729, and then we again reach the starting point. So this is how the table looks with some value footprints computed from the Bellman equation. A couple of things to notice here: the max function helps the robot always choose the state that gives it the maximum value of being in that state, and the discount factor gamma notifies the robot about how far it is from the destination. Gamma is typically specified by the developer of the algorithm that would be installed in the robot. The other states can also be given their respective values in a similar way. As you can see here, the boxes adjacent to the green one have 1, and as we move away from 1 we get 0.9, 0.81, 0.729.
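The footprint values being read off here follow a simple pattern, which the short sketch below reproduces. It assumes a single corridor of rooms leading to the goal, where the best action always moves one room closer and every such move has reward 0, so the max in the Bellman equation has only one sensible candidate. The function name value_footprints is made up for illustration.

```python
gamma = 0.9  # discount factor used in the example


def value_footprints(max_distance, gamma):
    """V for rooms that are 1..max_distance steps away from the green room."""
    values = {1: 1.0}  # the room adjacent to the goal is fixed at 1
    for distance in range(2, max_distance + 1):
        reward = 0.0   # no reward for a move that does not enter the goal
        # Bellman equation with one useful action: V(s) = R(s, a) + gamma * V(s')
        values[distance] = reward + gamma * values[distance - 1]
    return values


footprints = value_footprints(5, gamma)
for distance, value in footprints.items():
    print(distance, round(value, 4))
```

Each extra step away from the goal multiplies the footprint by gamma, which is why gamma tells the robot how far it is from the destination.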
And finally we reach 0.66. The robot can now make its way to the green room using these value footprints, even if it's dropped at any arbitrary room in the given layout. If a robot lands up in the highlighted sky blue area, it will still find two options to choose from, but eventually either of the paths will be good enough for the robot to take, because of the way the value footprints are now laid out. One thing to note is that the Bellman equation is one of the key equations in the world of reinforcement learning and Q-learning. Now, if we think realistically, our surroundings do not always work in the way we expect; there is always a bit of stochasticity involved. This applies to a robot as well. Sometimes it might so happen that the robot's machinery gets corrupted; sometimes the robot may come across some hindrance on its way which was not known to it beforehand; and sometimes, even if the robot knows that it needs to take the right turn, it will not. So how do we introduce this stochasticity in our case? Here comes the Markov decision process. Consider that the robot is currently in the red room and needs to go to the green room. Let's now consider that the robot has a slight chance of dysfunctioning and might take the left, the right or the bottom turn instead of taking the upper turn to get to the green room from where it is now, the red room. The question is, how do we enable the robot to handle this when it is out in the given environment? This is a situation where the decision-making regarding which turn is to be taken is partly random and partly under the control of the robot: partly random because we are not sure when exactly the robot might dysfunction, and partly under the control of the robot because it is still making the decision of taking a turn on its own, with the help of the program embedded into it.
So a Markov decision process is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision-making in situations where the outcomes are partly random and partly under the control of the decision maker. Now we need to give this concept a mathematical shape, most likely an equation, which can then be taken further. You might be surprised that we can do this with the help of the Bellman equation with a few minor tweaks. If we have a look at the original Bellman equation, V(s) = max over a of (R(s, a) + gamma times V(s')), what needs to be changed so that we can introduce some amount of randomness? As long as we are not sure when the robot might not take the expected turn, we are also not sure in which room it might end up, which is nothing but the room it moves to from its current room. At this point, according to the equation, we are not sure of s', which is the next state, or room, but we do know all the probable turns the robot might take. In order to incorporate each of these probabilities into the above equation, we need to associate a probability with each of the turns, to quantify that the robot has got an x percent chance of taking this turn. If we do so, we get V(s) = max over a of (R(s, a) + gamma times the sum over s' of P(s, a, s') times V(s')). Here P(s, a, s') is the probability of moving from room s to room s' with the action a, and the summation is the expectation over the situations the robot incurs, which is the randomness. Now let's take a look at this example here. When we associate the probabilities with each of these turns, we essentially mean that there is an 80% chance that the robot will take the upper turn.
Now, if we put all the required values in our equation, we get V(s) = max over a of (R(s, a) + gamma times (0.8 V(room up) + 0.1 V(room down) + 0.03 V(room left) + 0.03 V(room right))). Note that the value footprints will not change due to the fact that we are incorporating stochasticity here, but this time we will not calculate those value footprints ourselves; instead, we will let the robot figure them out. Up until this point, we have not considered rewarding the robot for its action of going into a particular room; we are only rewarding the robot when it gets to the destination. Ideally, there should be a reward for each action the robot takes, to help it better assess the quality of its actions. The rewards need not always be the same, but having some amount of reward for the actions is much better than having no rewards at all. This idea is known as the living penalty. In reality, the reward system can be very complex, and modeling sparse rewards in particular is an active area of research in the domain of reinforcement learning. So by now we have got the equation; what we're going to do now is transition to Q-learning. This equation gives us the value of going to a particular state, taking the stochasticity of the environment into account. We have also learned very briefly about the idea of the living penalty, which deals with associating each move of the robot with a reward. Q-learning poses the idea of assessing the quality of an action that is taken to move to a state, rather than determining the possible value of the state being moved to. So earlier we had 0.8 V(s1), 0.03 V(s2), 0.1 V(s3) and so on. If we incorporate the idea of assessing the quality of the action for moving to a certain state, the environment with the agent and the quality of the actions will look something like this.
So instead of 0.8 V(s1) we'll have Q(s1, a1), and likewise Q(s2, a2) and Q(s3, a3). The robot now has four different states to choose from, and along with that there are four different actions for the current state it is in. So how do we calculate Q(s, a), that is, the cumulative quality of the possible actions the robot might take? Let's break it down. From the equation V(s) = max over a of (R(s, a) + gamma times the sum over s' of P(s, a, s') times V(s')), if we discard the max function we have R(s, a) + gamma times the sum over s' of P(s, a, s') times V(s'). Essentially, in the equation that produces V(s) we are considering all possible actions and all possible states from the current state the robot is in, and then taking the maximum value caused by taking a certain action. Without the max, the equation produces a value footprint for just one possible action; in fact, we can think of it as the quality of the action. So Q(s, a) = R(s, a) + gamma times the sum over s' of P(s, a, s') times V(s'). Now that we have an equation to quantify the quality of a particular action, we are going to make a little adjustment to it. We can say that V(s) is the maximum of all the possible values of Q(s, a). So let's use this fact and replace V(s') with a function of Q: Q(s, a) = R(s, a) + gamma times the sum over s' of P(s, a, s') times the max over a' of Q(s', a'). So the equation for V has now turned into an equation for Q, the quality. But why would we do that? It is done to ease our calculations, because now we have only one function, Q, to calculate, which is also at the core of dynamic programming, and R(s, a) is a quantified metric which produces the reward for moving to a certain state. The qualities of the actions are called the Q values, and from now on we will refer to the value footprints as the Q values. An important piece of the puzzle is the temporal difference.
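To make the step from V to Q concrete, here is a small sketch of Q(s, a) = R(s, a) + gamma times the expected value of the next state, with V(s) taken as the maximum Q over actions. The transition probabilities and room values below are illustrative assumptions chosen to sum to 1; they are not the figures quoted in the talk, and the function names are made up for the sketch.

```python
def q_value(reward, gamma, transition_probs, next_values):
    """Q(s, a) = R(s, a) + gamma * sum over s' of P(s, a, s') * V(s')."""
    expected_next = sum(p * next_values[room] for room, p in transition_probs.items())
    return reward + gamma * expected_next


def state_value(q_by_action):
    """V(s) is the largest Q(s, a) over the available actions."""
    return max(q_by_action.values())


gamma = 0.9
# Assumed outcome probabilities when the robot tries to go up.
probs_up = {"up": 0.8, "down": 0.1, "left": 0.05, "right": 0.05}
# Assumed values of the four neighbouring rooms.
room_values = {"up": 1.0, "down": 0.81, "left": 0.9, "right": 0.9}

q_up = q_value(0.0, gamma, probs_up, room_values)
v_here = state_value({"up": q_up, "down": 0.5})  # 0.5 is a made-up Q for another action

print(round(q_up, 4), round(v_here, 4))
```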
Now, temporal difference is the component that will help the robot calculate the Q values with respect to the changes in the environment over time. Consider that our robot is currently in the marked state and wants to move to the upper state. One thing to note here is that the robot already knows the Q value of making that action, that is, moving to the upper state. We also know that the environment is stochastic in nature, so the reward that the robot gets after moving to the upper state might be different from an earlier observation. So how do we capture this change? With the temporal difference: we calculate the new Q(s, a) with the same formula and subtract the previously known Q(s, a) from it, that is, TD(a, s) = R(s, a) + gamma times the max over a' of Q(s', a') minus Q(s, a). The equation that we just derived gives the temporal difference in the Q values, which helps to capture the random changes that the environment may impose. The new Q(s, a) is then updated as follows: Q_t(s, a) = Q_{t-1}(s, a) + alpha times TD_t(a, s). Here alpha is the learning rate, which controls how quickly the robot adapts to the random changes imposed by the environment; Q_t(s, a) is the current Q value and Q_{t-1}(s, a) is the previously recorded Q value. If we replace TD_t(a, s) with its full form, we get Q_t(s, a) = Q_{t-1}(s, a) + alpha times (R(s, a) + gamma times the max over a' of Q(s', a') minus Q_{t-1}(s, a)). Now that we have all the little pieces of Q-learning together, let's move forward to the implementation part. This is the final equation of Q-learning. So let's see how we can implement it and obtain the best path for any robot to take. To implement the algorithm, we need to understand the warehouse locations and how they can be mapped to different states. So let's start by recollecting the sample environment.
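The final update rule can be sketched as one small function. The name td_update and the sample numbers are illustrative; the gamma of 0.75 and alpha of 0.9 are the values chosen later in the session.

```python
def td_update(q_prev, reward, gamma, max_next_q, alpha):
    """Q_t(s, a) = Q_{t-1}(s, a) + alpha * (R(s, a) + gamma * max Q(s', a') - Q_{t-1}(s, a))."""
    temporal_difference = reward + gamma * max_next_q - q_prev
    return q_prev + alpha * temporal_difference


gamma, alpha = 0.75, 0.9

# First visit: nothing is known yet, so Q moves 90% of the way to the reward of 1.
q_first = td_update(0.0, 1.0, gamma, 0.0, alpha)

# Second visit with the same observation: Q closes 90% of the remaining gap.
q_second = td_update(q_first, 1.0, gamma, 0.0, alpha)

print(q_first, q_second)
```

With alpha close to 1 the robot adapts almost immediately to a new observation; a smaller alpha would blend new observations in more slowly.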
As you can see here, we have L1, L2, L3 up to L9, and we have certain borders as well. First of all, let's map each of these locations in the warehouse to numbers, the states, so that it will ease our calculations. What I'm going to do is create a new Python 3 file in the Jupyter notebook and give it a name. So let's define the states. But before that, we need to import numpy, because we're going to use numpy for this purpose, and initialize the parameters, that is, gamma and alpha. Gamma is 0.75, which is the discount factor, whereas alpha is 0.9, which is the learning rate. Next, we're going to define the states and map them to numbers. As I mentioned earlier, L1 is 0, and so on up to L9. We have now defined the states in numerical form. The next step is to define the actions, which, as mentioned above, represent the transition to the next state. As you can see here, we have an array of actions from 0 to 8. Now we're going to define the reward table. As you can see, it is the same matrix that I showed you just now. If you understood it correctly, there isn't any real barrier limitation as depicted in the image; for example, the transition from L4 to L1 is allowed, but the reward will be 0 to discourage that path. In a tougher situation, what we can do is add a -1 there so that it gets a negative reward. In the above code snippet, as you can see, we took each of the states and put ones in the respective states that are directly reachable from that state. If you refer once again to the reward table that we created above, this construction will be easy to understand. One thing to note here is that we did not consider the top priority location, L6, yet.
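The setup described above might be written roughly as follows. The reward matrix is an assumption: the figure is not available in the transcript, so the adjacency below is reconstructed from the connections named in the talk (L2 to L1, L3 and L5; L8 to L9) and from the route reported at the end of the session, with the remaining entries filled in for illustration.

```python
import numpy as np

gamma = 0.75  # discount factor
alpha = 0.9   # learning rate

# States: L1 -> 0, L2 -> 1, ..., L9 -> 8.
location_to_state = {f"L{i + 1}": i for i in range(9)}

# Actions: moving to any of the nine states.
actions = list(range(9))

# Assumed reward table: 1 where a location is directly reachable, 0 otherwise.
rewards = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0, 0],  # L1
    [1, 0, 1, 0, 1, 0, 0, 0, 0],  # L2
    [0, 1, 0, 0, 0, 1, 0, 0, 0],  # L3
    [0, 0, 0, 0, 0, 0, 1, 0, 0],  # L4
    [0, 1, 0, 0, 0, 0, 0, 1, 0],  # L5
    [0, 0, 1, 0, 0, 0, 0, 0, 0],  # L6
    [0, 0, 0, 1, 0, 0, 0, 1, 0],  # L7
    [0, 0, 0, 0, 1, 0, 1, 0, 1],  # L8
    [0, 0, 0, 0, 0, 0, 0, 1, 0],  # L9
])

print(rewards.shape)
```

Note that the L4 to L1 entry is 0 rather than -1, matching the point that the move is not forbidden, only unrewarded.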
We also need an inverse mapping from the states back to their original locations; it will become clearer why when we reach the deeper parts of the algorithm. So for that, we're going to have the inverse map, state to location: we take each distinct state and location and convert it back. Now we'll define a function, get_optimal_route, which takes a start location and an end location. Don't worry, the code is big, but I'll explain each and every bit of it. The get_optimal_route function takes two arguments, the starting location in the warehouse and the end location in the warehouse, respectively, and it returns the optimal route for reaching the end location from the starting location, in the form of an ordered list containing the location letters. We'll start the function by initializing the Q values to be all zeros. But before that, we need to copy the reward matrix to a new one, so this is rewards_new. Next, we need to get the ending state corresponding to the ending location, and with this information we automatically set the priority of the given ending state to the highest one. We are not defining it by hand; instead, the function automatically sets the priority of the given ending state to 999. Then we initialize the Q values to 0. In the learning process, as you can see here, we are taking i in range 1000, and in each iteration we're going to pick a state randomly.
So we're going to use np.random.randint for that. For traversing through the neighbouring locations in the same maze, we iterate through the new reward matrix and get the actions which are greater than 0. After that, we pick an action randomly from the list of playable actions, which leads us to the next state. Then we compute the temporal difference, TD, which is the reward plus gamma times the Q value of the next state, where we take np.argmax of the Q values of the next state to pick its best action, minus the Q value of the current state and action. We then update the Q values using the Bellman equation; as you can see here, we have the Bellman equation, and we update the Q values. After that, we initialize the optimal route with the starting location. Here we do not know the next location yet, so we initialize it with the value of the starting location. We also do not know the exact number of iterations needed to reach the final location, hence a while loop is a good choice for the iteration. Inside the loop, we fetch the starting state and fetch the highest Q value pertaining to the starting state; that gives us the index of the next state, but we need the corresponding letter, so we use the state-to-location mapping we just mentioned. After that, we update the starting location for the next iteration, and finally we return the route. So let's take a starting location of L9 and an end location of L1 and see what path we actually get. As you can see here, we get L9, L8, L5, L2 and L1. And if you have a look at the image, if we start from L9 and go to L1, we get L9, L8, L5, L2, L1, which is the route that would yield the maximum reward for the robot.
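Pulling the pieces described above into one place, a sketch of get_optimal_route could look like the following. Several things here are assumptions rather than the code on screen: the reward matrix is reconstructed from the connections mentioned in the talk, a seeded NumPy random generator replaces np.random.randint so the run is repeatable, the iterations and seed parameters are added for convenience, and the guard against endless loops is an extra safety check.

```python
import numpy as np

gamma, alpha = 0.75, 0.9
location_to_state = {f"L{i + 1}": i for i in range(9)}
state_to_location = {state: loc for loc, state in location_to_state.items()}

# Assumed reward table: 1 where a location is directly reachable, 0 otherwise.
rewards = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0],
])


def get_optimal_route(start_location, end_location, iterations=1000, seed=0):
    rng = np.random.default_rng(seed)
    rewards_new = rewards.copy()
    ending_state = location_to_state[end_location]
    rewards_new[ending_state, ending_state] = 999  # top priority for the end location

    Q = np.zeros((9, 9))
    for _ in range(iterations):
        current_state = int(rng.integers(0, 9))
        playable_actions = np.flatnonzero(rewards_new[current_state] > 0)
        next_state = int(rng.choice(playable_actions))
        # Q[next_state].max() equals Q[next_state, np.argmax(Q[next_state])].
        td = (rewards_new[current_state, next_state]
              + gamma * Q[next_state].max()
              - Q[current_state, next_state])
        Q[current_state, next_state] += alpha * td

    route = [start_location]
    while route[-1] != end_location:
        if len(route) > 9:  # safety guard, not part of the described algorithm
            raise RuntimeError("Q values did not converge to a valid route")
        state = location_to_state[route[-1]]
        route.append(state_to_location[int(np.argmax(Q[state]))])
    return route


print(get_optimal_route("L9", "L1"))
```

Copying the reward matrix inside the function means each call can choose its own end location without the 999 from an earlier call leaking into the next one.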
So now we have come to the end of this Q-learning session. I hope you got to know what exactly Q-learning is, with the analogy all the way starting from the number of rooms, and I hope the example and the analogy I took were good enough for you to understand Q-learning: the Bellman equation, how to make quick changes to the Bellman equation, how to create the reward table and the Q table, how to update the Q values using the Bellman equation, what alpha does and what gamma does.