Transcript for:
Machine Learning for Beginners in 2024

this machine learning course is created for beginners who are learning in 2024. The course begins with a machine learning roadmap for 2024, emphasizing career paths and beginner-friendly theory. Then the course moves on to hands-on practical applications and a comprehensive end-to-end project using Python. Tatev has created this course; she is an experienced data science professional, and her aim is to demystify machine learning concepts, making them accessible and actionable for newcomers, and to bridge the gap in existing educational resources, setting you on a path to success in the evolving field of machine learning. Looking to step into machine learning or data science? It's about starting somewhere practical yet powerful. In this introductory course, Machine Learning for Beginners, we are going to cover the basics of machine learning, and we're going to put that into practice by implementing it in a real-world case study. I'm Tatev, founder of LunarTech, where we are making data science and AI more accessible for individuals and businesses. If you're looking for machine learning, deep learning, data science, or AI resources, then check out the free resources section at lunartech.ai or our YouTube channel, where you can find more content and dive into machine learning and AI. We're going to start with the machine learning roadmap. In this detailed section we are going to discuss the exact skill set that you need to get into machine learning; we're also going to cover the definition of machine learning, what a common career path looks like, and a lot of resources that you can use in order to get into machine learning. Then we are going to start with the actual theory: we are going to touch on the basics and learn the different fundamentals of machine learning. Once we have learned the theory and looked into the machine learning roadmap, we're going to put our theory into practice. We are going to conduct an end-to-end, basic yet powerful case study where we are going to implement a linear regression model. We're going to use it both for causal analysis and for predictive analytics of Californian house prices; we are going to find out the features that drive Californian house values, and we are going to discuss the step-by-step approach for conducting a real-world data science project. At the end of this course you are going to know the exact machine learning roadmap for 2024, the exact skill set and the action plan that you can use to get into machine learning and data science. You are going to learn the basics of machine learning, and you're going to implement them in an actual end-to-end machine learning project, including using pandas, NumPy, scikit-learn, statsmodels, Matplotlib, and seaborn in Python for a real-world data science project. Dive into machine learning with us: start simple, start strong. Let's get started. Hi there, in this video we are going to talk about how you can get into machine learning in 2024. First we are going to start with all the skills that you need in order to get into machine learning, step by step: the topics that you need to cover and study in order to get into machine learning. We are going to talk about what machine learning is. Then we are going to cover, step by step, the exact topics and skills that you need in order to become a machine learning researcher or just get into machine learning. Then we're going to cover the exact type of projects you can complete, so examples of portfolio
projects that you can put on your resume to start applying for machine learning related jobs. Then we are also going to talk about the types of industries that you can get into once you have all the skills and want to get into machine learning, so the exact career path and what kind of business titles are usually related to machine learning. We are also going to talk about the average salary that you can expect for each of those different machine learning related positions. At the end of this video you are going to know what exactly machine learning is, where it is used, what kind of skills you need in order to get into machine learning in 2024, and what kind of career path and compensation you can expect, with the corresponding business titles, when you want to start your career in machine learning. So we will first start with the definition of machine learning: what machine learning is and the different sorts of applications of machine learning that you most likely have heard of without knowing they were based on machine learning. So what is machine learning? Machine learning is a branch of artificial intelligence, of AI, that helps to build models based on data and then learn from this data in order to make different decisions. It's being used across different industries, from healthcare to entertainment, in order to improve the customer experience, identify customer behavior, improve sales for businesses, and it also helps governments make decisions, so it really has a wide range of applications. Let's start with healthcare, for instance. Machine learning is being used in healthcare to help with the diagnosis of diseases; it can help to diagnose cancer. During COVID it helped many hospitals to identify whether people were getting more severe side effects or developing pneumonia based on medical images, and that was all based on machine learning and specifically computer vision. In healthcare it is also being used for drug discovery, for personalized medicine and personalizing treatment plans, and to improve the operations of hospitals: to understand the number of patients a hospital can expect on each day of the week, to estimate the number of doctors that need to be available, and the number of people the hospital can expect in the emergency room based on the day or the time of day, and this is basically another machine learning application. Then we have machine learning in finance. Machine learning is being used heavily in finance for different applications, starting from fraud detection in credit cards or other sorts of banking operations. It's also being used in trading, specifically in combination with quantitative finance, to help traders decide whether they need to go short or long on different stocks, bonds, or other assets, and in general to estimate the price of those assets in real time in the most accurate way. It's also being used in retail: it helps you estimate demand for certain products in certain warehouses, and it helps you understand the most appropriate or closest warehouse from which the items for a given customer should be shipped, so it's optimizing the operations. It's also being used to build different recommender systems and search engines, like the famous Amazon
is doing. So every time you go to Amazon and you are searching for a product, you will most likely see many item recommendations, and that's based on machine learning, because Amazon is gathering the data and comparing your behavior, so what you have bought and what you are searching for, to other customers, and those items to other items, in order to understand which items you will most likely be interested in and eventually buy. That's exactly based on machine learning, and specifically on different sorts of recommender system algorithms. Then we have marketing, where machine learning is being heavily used, because it can help to understand the different tactics and the specific targeting groups that you belong to and how retailers can target you, in order to reduce their marketing cost and to result in higher conversion rates, so to ensure that you buy their product. Then we have machine learning in autonomous vehicles; that's based on machine learning and specifically deep learning applications. Then we also have natural language processing, which is highly related to the famous ChatGPT. I'm sure you are using it, and that's based on machine learning, and specifically on large language models, so the Transformer-based large language models, where you go and provide your text and question and ChatGPT provides an answer to you; in fact any other virtual assistant or chatbot is also based on machine learning. Then we also have smart home devices, so Alexa is based on machine learning. Also in agriculture machine learning is being used heavily these days to estimate what the weather conditions will be, to understand what the production of different plants and the resulting yield will be, to make decisions about how to optimize crop yields, to monitor soil health, and for different sorts of applications that can in general improve the revenue for the farmer. Then we have, of course, entertainment. The vivid example is Netflix, which uses the data that you are providing related to the movies, and, based on what kind of movies you are watching, Netflix is building a super smart recommender system to recommend movies that you will most likely be interested in and will also like. So in all of this, machine learning is being used, and it's actually a super powerful field to get into, and in the upcoming 10 years it is only going to grow. So if you have made that decision, or you are about to make that decision, to get into machine learning, continue watching this video, because I'm going to tell you exactly what kind of skills you need and what kind of practice projects you can complete in order to get into machine learning in 2024. So you first need to start with mathematics, you also need to know Python, you also need to know statistics, you will need to know machine learning, and you will need to know some NLP to get into machine learning. Let's now unpack each of those skill sets. Independent of the type of machine learning you are going to do, you need to know mathematics, and specifically you need to know linear algebra. So you need to know what matrix multiplication is, what vectors and matrices are, the dot product; you need to know how you can multiply different matrices, or a matrix with a vector, what the different rules are, the dimensions, and also what
it means to transpose a matrix, the inverse of a matrix, the identity matrix, and the diagonal matrix; those are all concepts from linear algebra that you need to know as part of your mathematical skill set in order to understand the different machine learning algorithms. Then, as part of your mathematics, you also need to know calculus, and specifically differentiation theory. So you need to know the different rules, such as the chain rule, the rule for differentiating a sum of terms or a constant multiplied by a term, and also the subtraction, division, and multiplication of two terms when you need to take the derivative. You need to know the idea of a derivative, the idea of a partial derivative, and the idea of the Hessian, so first-order and second-order derivatives, and it would also be great to know basic integration theory: we have differentiation, and the opposite of it is integration. This is quite basic; you don't need to know too much when it comes to calculus, but those are the basic things that you need in order to succeed in machine learning. Then come concepts from discrete mathematics: you need to know the idea of graph theory, combinations and combinatorics, and the idea of complexity, which is important when you want to become a machine learning engineer, because you need to understand Big O notation, so the complexity of n squared, the complexity of n, the complexity of n log n. Beyond that, you need to know some basic mathematics that usually comes from high school: multiplication, division, multiplying quantities within parentheses, the different symbols that represent mathematical values, the idea of using x's and y's, what x squared is, what y squared is, what x to the power of 3 is, so different exponents of different variables. Then you need to know what a logarithm is, the logarithm with base two, with base e, and with base 10, what e is, what pi is, the ideas of the exponential and the logarithm, and how those transform when it comes to taking the derivative of the logarithm or the derivative of the exponential. These are all topics that are actually quite basic; they might sound complicated, but they are not, so if someone explains them to you clearly, you will definitely understand them on the first go. To understand all those different mathematical concepts, so linear algebra, calculus and differentiation theory, and then discrete mathematics and those different symbols, you can go and look for courses or YouTube tutorials about basic mathematics for machine learning and AI; don't go and look further. You can check, for instance, Khan Academy, which is quite a favorite when it comes to learning math, both for university students and for people who just want to learn mathematics, and this will be your guide, or you can check our resources at lunartech.ai, because we are also going to provide these resources for you in case you want to learn mathematics for your machine learning journey.
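To make a few of these linear algebra ideas concrete, here is a minimal NumPy sketch (an illustrative example, not part of the course materials) showing matrix multiplication, the dot product, the transpose, the inverse, and the identity matrix:

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])        # a 2x2 matrix
v = np.array([1.0, 2.0])          # a vector

print(A @ v)                      # matrix-vector multiplication
print(A @ A.T)                    # multiply A by its transpose
print(np.dot(v, v))               # dot product of a vector with itself
print(np.linalg.inv(A))           # the inverse of A
print(A @ np.linalg.inv(A))       # A times its inverse gives (approximately) the identity
print(np.eye(2))                  # the 2x2 identity matrix

Playing with small examples like this is an easy way to check that the dimension rules and the matrix identities you study actually behave the way you expect.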
The next skill set that you need to gain in order to break into machine learning is statistics. This is a must: you need to know statistics if you want to get into machine learning and AI in general. There are a few topics that you must study when it comes to statistics, and those are descriptive statistics, multivariate statistics, inferential statistics, probability distributions, and some Bayesian thinking. Let's start with descriptive statistics. Here you need to know the ideas of mean, median, standard deviation, and variance, and in general how you can analyze data using these descriptive measures, so location measures but also variation measures. The next topic area that you need to know as part of your statistical journey is inferential statistics. You need to know the famous theorems such as the central limit theorem and the law of large numbers, the ideas of population, sample, and unbiased sample, and also hypothesis testing, confidence intervals, statistical significance, and how you can test different theories using this idea of statistical significance: what the power of a test is, what a type I error is, and what a type II error is. This is super important for understanding different sorts of machine learning applications if you want to get into machine learning. Then you have probability distributions and the idea of probabilities. To understand the different machine learning concepts, you need to know what probabilities are: the idea of probability, the idea of sample versus population, what it means to estimate a probability, and the different rules of probability, such as conditional probability and the rules that apply to the probability of a product of events or of a sum of events. Then you need to know some popular probability distribution functions, and those are the Bernoulli distribution, binomial distribution, normal distribution, uniform distribution, and exponential distribution. Those are all super important distributions that you need to know in order to understand the idea of normality and normalization, the idea of Bernoulli trials, and how to relate different probability distributions to higher-level statistical concepts, for example how rolling a die and the probabilities of its outcomes relate to the Bernoulli distribution or the binomial distribution. Those are super important when it comes to hypothesis testing, but also for many other machine learning applications. Then we have Bayesian thinking. This is super important for more advanced machine learning, but also for some basic machine learning. You need to know the Bayes theorem, which is arguably one of the most popular statistical theorems out there, comparable to the central limit theorem; you need to know what conditional probability is, what the Bayes theorem is and how it relates to conditional probability, and the idea of Bayesian statistics at a very high level. You don't need to know everything in great detail, but you need to know these concepts at least at a high level in order to understand machine learning.
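As a small illustrative sketch of the descriptive statistics and the Bayes theorem ideas above (the numbers are invented purely for the example, and only NumPy is assumed):

import numpy as np

data = np.array([12, 15, 14, 10, 18, 20, 14, 16])
print(np.mean(data))     # mean
print(np.median(data))   # median
print(np.var(data))      # variance
print(np.std(data))      # standard deviation

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Example: probability of having a disease given a positive test, with assumed rates.
p_disease = 0.01                 # prior P(A)
p_pos_given_disease = 0.95       # P(B|A), the sensitivity of the test
p_pos_given_healthy = 0.05       # false-positive rate
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))    # total probability P(B)
print(p_pos_given_disease * p_disease / p_pos)       # posterior P(A|B), roughly 0.16

This tiny example already shows why conditional probability matters: even with a 95 percent sensitive test, a positive result for a rare disease still leaves a fairly low posterior probability.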
To learn statistics and the fundamental concepts of statistics, you can check out the Fundamentals of Statistics course at lunartech.ai; there you can learn all the required concepts and topics and practice them in order to get into machine learning and gain the statistical skills. The next skill set that you must know is the fundamentals of machine learning. This covers not only the basics of machine learning but also the most popular machine learning algorithms. You need to know the mathematical side of these algorithms, step by step how they work, what their benefits and drawbacks are, and which one to use for what type of application. You need to know the categorization of supervised versus unsupervised versus semi-supervised learning; you need to know the ideas of classification, regression, and clustering; and you need to know time series analysis. You also need to know the different popular algorithms, including linear regression, logistic regression, LDA (linear discriminant analysis), KNN, decision trees for both the classification and the regression case, random forest, bagging but also boosting, so popular boosting algorithms like LightGBM and GBM (gradient boosting models), and you need to know XGBoost. You also need to know some unsupervised learning algorithms such as K-means, usually used for clustering, DBSCAN, which is becoming more and more popular among clustering algorithms, and hierarchical clustering. For all these types of models you need to understand the idea behind them, their advantages and disadvantages, whether they can be applied in an unsupervised versus supervised versus semi-supervised setting, and whether they are for regression, classification, or clustering. Besides these popular algorithms and models, you also need to know the basics of training a machine learning model, so the process behind training, validating, and testing your machine learning algorithms. You need to know what it means to perform hyperparameter tuning, and the different optimization algorithms that can be used to optimize your parameters, such as GD, SGD, SGD with momentum, Adam, and AdamW. You also need to know the testing process and the idea of splitting the data into train, validation, and test sets; you need to know resampling techniques and why they are used, including bootstrapping and cross-validation, and the different sorts of cross-validation techniques such as leave-one-out cross-validation, k-fold cross-validation, and the validation set approach. You also need to know the idea of metrics and how you can use different metrics to evaluate your machine learning models, such as classification metrics like the F1 score, F-beta, precision, recall, and cross-entropy, and also metrics that can be used to evaluate regression type problems, like the mean squared error (MSE), the root mean squared error (RMSE), the MAE (the absolute-error version), or the residual sum of squares. For all these cases you not only need to know at a high level what those algorithms, topics, or concepts are doing, but you actually need to know the mathematics behind them, their benefits, and their disadvantages, because during interviews you can definitely expect questions that will test not only your high-level understanding but also this background knowledge.
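As a quick illustration of that train, validate, and test workflow, here is a minimal scikit-learn sketch; it uses a synthetic dataset generated with make_regression rather than any dataset from this course, so treat it purely as an outline:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

# Split once into training data and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)

# 5-fold cross-validation on the training data (a resampling technique).
cv_scores = cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_root_mean_squared_error")
print("Cross-validated RMSE:", -cv_scores.mean())

# Fit on all the training data, then report the test error on unseen data.
model.fit(X_train, y_train)
test_rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print("Test RMSE:", test_rmse)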
If you want to learn machine learning and you want to gain those skills, then feel free to check out my Fundamentals of Machine Learning course at lunartech.ai, or you can also check out and download for free the machine learning fundamentals handbook that I published with freeCodeCamp. The next skill set that you definitely need to gain is knowledge of Python. Python is actually one of the most popular programming languages out there, and it's being used by software engineers, AI engineers, machine learning engineers, and data scientists, so it is, I would say, the universal language when it comes to programming. If you're considering getting into machine learning in 2024, then Python will be your friend. Knowing the theory is one thing; implementing it in the actual job is another, and that's exactly where Python comes in handy. You need to know Python in order to perform descriptive statistics, in order to train machine learning models or more advanced machine learning models, so deep learning models; you can use it for training, validation, and testing of your models, and also for building different sorts of applications. Python is super powerful, and that's why it's gaining such high popularity across the globe: it has so many libraries. It has TensorFlow and PyTorch, both of which are a must if you want to get into not only machine learning but also the advanced levels of machine learning. So if you are considering AI engineering or machine learning engineering jobs and you want to train, for instance, deep learning models, or you want to build large language models or generative AI models, then you definitely need to learn PyTorch and TensorFlow, which are frameworks used to implement different deep learning models, which are advanced machine learning models. Here are a few libraries that you need to know in order to get into machine learning: you definitely need to know pandas and NumPy, you need to know scikit-learn and SciPy, you also need to know NLTK for text data, and you need to know TensorFlow and PyTorch for slightly more advanced machine learning. Besides these, there are also data visualization libraries that I would definitely suggest you practice with, which are Matplotlib, specifically pyplot, and also seaborn. When it comes to Python, besides knowing how to use libraries, you also need to know some basic data structures: what variables are and how you can create them, what matrices and arrays are, how indexing works, what lists are, what sets are (so collections of unique elements), the different operations you can perform on them, and how sorting, for instance, works. I would definitely suggest you learn some basic data structures and algorithms, such as efficient sorting algorithms, so the optimal ways to sort your arrays. You also need to know data processing in Python: you need to understand how to identify missing data, how to identify duplicates in your data, how to clean the data, and how to perform feature engineering, so how to combine multiple variables or perform operations to create new variables. You also need to know how you can aggregate your data, how you can filter your data, and how you can sort your data, and of course you also need to know how you can perform A/B testing in Python, how you can train machine learning models, how you can test and evaluate them, and how to visualize their performance.
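Here is a small pandas sketch of those data processing steps (missing values, duplicates, feature engineering, aggregation, filtering, sorting); the tiny DataFrame is invented purely for illustration:

import pandas as pd

df = pd.DataFrame({
    "city":  ["A", "A", "B", "B", "B", "A"],
    "price": [300_000, None, 250_000, 250_000, 410_000, 320_000],
    "rooms": [3, 2, 2, 2, 4, 3],
})

print(df.isna().sum())                                   # identify missing data
df["price"] = df["price"].fillna(df["price"].median())   # simple imputation
df = df.drop_duplicates()                                # remove duplicate rows

df["price_per_room"] = df["price"] / df["rooms"]         # feature engineering

print(df.groupby("city")["price"].mean())                # aggregate by group
print(df[df["price"] > 300_000])                         # filter rows
print(df.sort_values("price", ascending=False).head())   # sort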
If you want to learn Python, then the easiest thing you can do is just to Google for Python for data science or Python for machine learning tutorials or blogs, or you can even try out the Python for Data Science course at lunartech.ai in order to learn all these basics, the usage of these libraries, and some practical examples when it comes to Python for machine learning. The next skill set that you need to gain in order to get into machine learning is a basic introduction to NLP, natural language processing. You need to know how to work with text data, given that these days text data is the cornerstone of all the different advanced algorithms such as GPTs, Transformers, and the attention mechanisms; the applications that you see, like chatbots or other AI applications based on text data, are all based on NLP. Therefore you need to know the basics of NLP just to get started with machine learning. You need to know the idea of text data and what strings are, how you can clean text data, so how you can clean the dirty data that you get and what the steps involved are, such as lowercasing, removing punctuation, and tokenization, and also the ideas of stemming, lemmatization, and stop words, and how you can use NLTK in Python to perform this cleaning. You also need to know the idea of embeddings, and you can also learn the idea of TF-IDF, which is a basic NLP algorithm, as well as the ideas of word embeddings, subword embeddings, and character embeddings. If you want to learn the basics of NLP, you can check out those concepts and learn them through blogs, there are many tutorials on YouTube, or you can also try the Introduction to NLP course at lunartech.ai in order to learn the different basics that form NLP. If you want to go beyond this intro-to-medium level of machine learning and you also want to learn more advanced machine learning, which is something you would take on after you have gained all the previous skills that I mentioned, then you can gain this knowledge and skill set by learning deep learning, and you can also consider getting into generative AI topics. You can, for instance, learn what RNNs are, what ANNs are, what CNNs are; you can learn the autoencoder concept, variational autoencoders, and generative adversarial networks (GANs); you can understand the idea of the reconstruction error, the different sorts of neural networks, the idea of backpropagation, and the optimization of these algorithms using different optimization algorithms such as GD, SGD, SGD with momentum, Adam, AdamW, and RMSProp. You can also go one step beyond and get into generative AI topics such as the variational autoencoders I just mentioned, but also the large language models. If you want to move towards the NLP side of generative AI and you want to know how ChatGPT was invented, how the GPTs work, or how the BERT model works, then you will definitely need to get into the topic of language models: what n-grams are, what the attention mechanism is, what the difference between self-attention and attention is, what a single-head self-attention mechanism is, and what a multi-head self-attention mechanism is. You also need to know, at a high level, the encoder-decoder architecture of Transformers, so you need to know the architecture of Transformers and how they solve the different problems of recurrent neural networks (RNNs) and LSTMs.
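Going back to the basic NLP preprocessing steps mentioned earlier (lowercasing, punctuation removal, tokenization, stop words, lemmatization, TF-IDF), here is a minimal illustrative sketch with NLTK and scikit-learn; the two sentences are made up, and the nltk.download calls are included because those resources are needed on a fresh install:

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("punkt")       # newer NLTK versions may also require nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")

docs = ["Machine Learning is FUN, isn't it?",
        "Learning machines learn patterns from data."]

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean(text):
    text = text.lower()                                                 # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))    # remove punctuation
    tokens = nltk.word_tokenize(text)                                   # tokenization
    tokens = [t for t in tokens if t not in stop_words]                 # remove stop words
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)            # lemmatization

cleaned = [clean(d) for d in docs]
print(cleaned)

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(cleaned)            # TF-IDF matrix: documents by terms
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))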
You can also look into encoder-based or decoder-based models such as GPTs or the BERT model, and all of this will help you not only get into machine learning but also stand out from all the other candidates by having this advanced knowledge. Let's now talk about the different sorts of projects that you can complete in order to practice the machine learning skill set that you just learned. There are a few projects that I suggest you complete, and you can put these on your resume to start applying for machine learning roles. The first project that I would suggest is building a basic recommender system, whether it's a job recommender system or a movie recommender system. In this way you can showcase how you can use, for instance, text data from job advertisements, or numeric data such as movie ratings, in order to build a top-N recommender system. This will showcase your understanding of distance measures such as cosine similarity and of the KNN algorithm, and it will help you tackle this specific area of data science and machine learning. The next project I would suggest is to build a regression-based model. In this way you will showcase that you understand the idea of regression and how to work with predictive analytics and a predictive model whose dependent variable, the response variable, is in numeric format. Here you can, for instance, estimate the salaries of jobs based on the characteristics of the job, using data which you can get from open sources such as Kaggle, and you can then use different sorts of regression algorithms to predict the salaries, evaluate the models, and then compare the performance of the different regression-based machine learning algorithms. For instance, you can use linear regression, the regression version of decision trees, random forest, GBM, and XGBoost, and then in one graph compare the performance of these different algorithms using a single regression model metric, for instance the RMSE. This project will showcase that you understand how to train a regression model, how to test and validate it, and it will showcase your understanding of the optimization of these regression algorithms and of the concept of hyperparameter tuning. The next project that I would suggest, in order to showcase your classification knowledge, so predicting a class for an observation given the feature space, would be to build a classification model that classifies emails as spam or not spam. You can use publicly available data describing specific emails, where you have multiple emails, and the idea is to build a machine learning model that classifies each email into class zero or class one, where class zero can be, for instance, not spam and class one spam. With this binary classification you will showcase that you know how to train a machine learning model for classification purposes, and here you can use, for instance, logistic regression, the classification version of decision trees, random forest, XGBoost for classification, and GBM for classification. With all these models you can then obtain performance metrics such as the F1 score, or you can plot the ROC curve or use the area under the curve (AUC) metric.
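As a minimal sketch of evaluating such a binary spam-versus-not-spam style classifier with these metrics (a synthetic dataset stands in for real email features, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]    # predicted probabilities, used for the ROC/AUC

print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
print("ROC AUC:  ", roc_auc_score(y_test, y_prob))

You could train the other classifiers mentioned above on the same split and print the same metrics to build the comparison.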
You can also compare those different classification models, and in this way you will also tackle another area of expertise when it comes to machine learning. Then a final project that I would suggest would be from unsupervised learning, to showcase yet another area of expertise. Here you can, for instance, use data to segment your customers into good, better, and best customers based on their transaction history and the amount of money they are spending in the store. In this case you can use, for instance, K-means, DBSCAN, and hierarchical clustering, and then you can evaluate your clustering algorithms and select the one that performs the best. You will then cover yet another area of machine learning, which is super important to showcase, so that you can handle not only recommender systems or supervised learning but also unsupervised learning. The reason why I suggest you cover all these different areas and complete these four different projects is that in this way you will be covering different areas of expertise in machine learning, so you will be putting projects on your resume that cover different sorts of algorithms, metrics, and approaches, and it will showcase that you actually know a lot about machine learning. Now, if you want to go beyond the basic or medium level and you want to be considered for medium or advanced machine learning positions, you also need to know a bit more advanced material, which means that you need to complete slightly more advanced projects. For instance, if you want to apply for generative AI related or large language model related positions, I would suggest you complete a project where you are building a very basic large language model, and specifically the pre-training process, which is the most difficult part. In this case you can, for instance, build a baby GPT, and I'll put a link here that you can follow where I'm building a baby GPT, a basic pre-trained GPT algorithm, where I am using publicly available text data in order to process data in the same way GPT does and to implement the decoder part of the Transformer. In this way you will showcase to your hiring managers that you understand the architecture behind Transformers, behind the large language models and the GPTs, and that you understand how you can use PyTorch in Python to do this advanced NLP and generative AI task. And finally, let's now talk about the common career path and the business titles that you can expect from a career in machine learning. Assuming that you have gained all the skills that are a must for breaking into machine learning, there are different sorts of business titles that you can apply for in order to get into machine learning, and there are different fields that are covered as part of this. First we have the general machine learning researcher. A machine learning researcher is basically doing research, so training, testing, and evaluating different machine learning algorithms. They are usually people who come from an academic background, but it doesn't mean that you cannot get into machine learning research without getting a degree in statistics, mathematics, or machine learning specifically, not at all. So if you have this desire and passion for reading and doing research, and you don't mind reading research papers, then a machine learning
researcher job would be a good fit for you; machine learning combined with research sets you up for the machine learning researcher role. Then we have the machine learning engineer. The machine learning engineer is the engineering version of the machine learning expertise, which means that we are combining machine learning skills with engineering skills, such as productionizing pipelines, building end-to-end robust pipelines, and ensuring the scalability of the model, considering all the different aspects of the model, not only the performance side when it comes to the quality of the algorithm, but also its scalability when putting it in front of many users. So when you combine engineering with machine learning, you get machine learning engineering. If you are someone who is a software engineer and you want to get into machine learning, then machine learning engineering would be the best fit for you. For machine learning engineering you not only need to have all the different skills that I already mentioned, but you also need to have a good grasp of the scalability of algorithms, the data structures and algorithms type of skill set, the complexity of the model, and also system design. This role converges more towards, and is similar to, a software engineering position combined with machine learning, rather than a pure machine learning or AI role. Then we have the AI researcher versus AI engineer positions. The AI researcher position is similar to the machine learning researcher position, and the AI engineer position is similar to the machine learning engineer position, with a single difference: when it comes to machine learning we are specifically talking about traditional machine learning, so linear regression, logistic regression, and also random forest, XGBoost, and bagging, while for the AI researcher and AI engineer positions we are tackling more advanced machine learning. Here we are talking about deep learning models such as RNNs, LSTMs, GRUs, and CNNs, or computer vision applications, and we are also talking about generative AI models and large language models, so the Transformers, the implementation of Transformers, the GPTs, T5, all these different algorithms that come from more advanced AI topics rather than traditional machine learning. For those, you will be applying for AI researcher and AI engineer positions. And finally, you have different niches related to AI, for instance NLP researcher, NLP engineer, or even data science positions, for which you will need to know machine learning, and knowing machine learning will set you apart for those sorts of positions. So also for business titles such as data scientist or technical data science positions, NLP researcher, and NLP engineer, for all of these you will need to know machine learning, and knowing machine learning will help you break into those positions and career paths. If you want to prepare for your deep learning interviews, for instance, and you want to get into AI engineering or AI research, then I have recently published, for free, a full course with 100 interview questions with answers, spanning 7.5 hours, that will help you prepare for your deep learning interviews. For your machine learning interviews, you can check out my Fundamentals of Machine Learning course at lunartech.ai, or you can download the machine learning fundamentals handbook from freeCodeCamp and check out my blogs and also the free resources at lunartech.ai
in order to prepare for your interviews and in order to get into machine learning. Let's now talk about the list of resources that you can use in order to get into machine learning in 2024. To learn statistics and the fundamental concepts of statistics, you can check out the Fundamentals of Statistics course at lunartech.ai; there you can learn all the required concepts and topics and practice them in order to get into machine learning and gain the statistical skills. Then, when you want to learn machine learning, you can check the Fundamentals of Machine Learning course at lunartech.ai to get all the basic concepts, the fundamentals of machine learning, and the most comprehensive list of machine learning algorithms out there as part of this course. You can also check out the Introduction to NLP course at lunartech.ai in order to learn the basic concepts behind natural language processing. And finally, if you want to learn Python, and specifically Python for machine learning, you can check out the Python for Data Science course at lunartech.ai. If you want to get access to the different projects where you can practice the machine learning skills that you just learned, you can either check out the Ultimate Data Science Bootcamp, which covers a specific course, the data science project portfolio course, covering multiple of these projects that let you train your machine learning skills and put them on your resume, or you can also check my GitHub account or my LinkedIn account, where I cover many case studies including the baby GPT; I will also put the links to this course and to this case study below. And once you have gained all the skills, you are ready to get into machine learning in 2024. In this lecture we will go through the basic concepts in machine learning that are needed to understand and follow conversations and solve main problems using machine learning. A strong understanding of machine learning basics is an important step for anyone looking to learn more about or work with machine learning. We'll be looking at three concepts in this tutorial: we will define and look into the difference between supervised and unsupervised machine learning models; then we will look into the difference between the regression and classification types of machine learning models; and after this we will look into the process of training machine learning models from scratch and how to evaluate them by introducing the performance metrics you can use depending on the type of machine learning model or problem you are dealing with, so whether it's supervised or unsupervised, and whether it's a regression versus a classification type of problem. Machine learning methods are categorized into two types depending on the existence of labeled data in the training data set, which is especially important in the training process; we are talking about the so-called dependent variable that we saw in the fundamentals of statistics section. Supervised and unsupervised machine learning models are the two main types of machine learning algorithms. One key difference between the two is the level of supervision during the training phase: supervised machine learning algorithms are guided by labeled examples, while unsupervised algorithms are not. A supervised learning model is more reliable, but it also requires a larger amount of labeled data, which can be time-consuming and quite expensive to obtain. Examples of supervised machine learning models include regression and classification type models. On the other hand,
unsupervised machine learning algorithms are trained on unlabeled data; the model must find patterns and relationships in the data without the guidance of correct outputs, so we no longer have a dependent variable. Unsupervised ML models require training data that consists only of independent variables, or features, and there is no dependent variable or label data that can supervise the algorithm when learning from the data. Examples of unsupervised models are clustering models and outlier detection techniques. Supervised machine learning methods are categorized into two types depending on the type of dependent variable they are predicting: we have the regression type and we have the classification type. Some key differences between regression and classification include the output type, the evaluation metrics, and their applications. With regard to the output type, regression algorithms predict continuous values, while classification algorithms predict categorical values. With regard to the evaluation metrics, different evaluation metrics are used for regression and classification tasks; for example, mean squared error is commonly used to evaluate regression models, while accuracy is commonly used to evaluate classification models. When it comes to applications, regression and classification models are used in entirely different types of applications: regression models are often used for prediction tasks, while classification models are used for decision-making tasks. Regression algorithms are used to predict a continuous value such as a price or a probability; for example, a regression model might be used to predict the price of a house based on its size, location, or other features. Examples of regression type machine learning models are linear regression, fixed effects regression, XGBoost regression, etc. Classification algorithms, on the other hand, are used to predict a categorical value; these algorithms take an input and classify it into one of several predetermined categories. For example, a classification model might be used to classify emails as spam or not spam, or to identify the type of animal in an image. Examples of classification type machine learning models are logistic regression, XGBoost classification, and random forest classification. Let us now look into the different types of performance metrics we can use in order to evaluate different types of machine learning models. For regression models, common evaluation metrics include the residual sum of squares (RSS), the mean squared error (MSE), the root mean squared error (RMSE), and the mean absolute error (MAE). These metrics measure the difference between the predicted values and the true values, with a lower value indicating a better fit for the model. Let's go through these metrics one by one. The first one is the RSS, or residual sum of squares. This is a metric commonly used in the setting of linear regression when we are evaluating the performance of the model in estimating the different coefficients; here beta is a coefficient, y_i is our dependent variable value, and y_hat_i is the predicted value. The RSS, as a function of beta, is equal to the sum of the squared differences y_i minus y_hat_i over all i from 1 to n, that is RSS = sum over i of (y_i − y_hat_i)², where i is the index of each row, individual, or observation included in the data.
The second metric is the MSE, or mean squared error, which is the average of the squared differences between the predicted values and the true values: MSE = (1/n) times the sum over i of (y_i − y_hat_i)². As you can see, the RSS and the MSE are quite similar in terms of their formulas; the only difference is that we are adding a factor of 1/n, which makes it the average of the squared differences between the predicted values and the actual true values. A lower value of MSE indicates a better fit. The RMSE, or root mean squared error, is the square root of the MSE, so it has the same formula as the MSE, with the only difference that we are adding a square root on top of that formula; a lower value of RMSE indicates a better fit. And finally, the MAE, or mean absolute error, is the average absolute difference between the predicted values, the y_hat_i, and the true values, y_i; a lower value of this also indicates a better fit. The choice of regression metric depends on the specific problem you are trying to solve and the nature of your data. For instance, the MSE is commonly used when you want to penalize large errors more than small ones; MSE is sensitive to outliers, which means it may not be the best choice when your data contains many outliers or extreme values. The RMSE, on the other hand, which is the square root of the MSE, is easier to interpret because it's in the same units as the target variable; it is commonly used when you want to compare the performance of different models or when you want to report the error in a way that is easier to understand and explain. The MAE is commonly used when you want to penalize all errors equally regardless of their magnitude, and the MAE is less sensitive to outliers compared to the MSE. For classification models, common evaluation metrics include accuracy, precision, recall, and the F1 score. These metrics measure the ability of the machine learning model to correctly classify instances into the correct categories. Let's briefly look into these metrics individually. Accuracy is the proportion of correct predictions made by the model; it's calculated by taking the number of correct predictions and dividing it by the total number of predictions, which means correct predictions plus incorrect predictions. Next we will look into precision. Precision is the proportion of true positive predictions among all positive predictions made by the model, and it's equal to true positives divided by true positives plus false positives, so by the total number of predicted positives. True positives are cases where the model correctly predicts a positive outcome, while false positives are cases where the model incorrectly predicts a positive outcome. The next metric is recall. Recall is the proportion of true positive predictions among all actual positive instances; it's calculated as the number of true positive predictions divided by the total number of actual positive instances, which means dividing the true positives by true positives plus false negatives. For example, in a medical test, a true positive would be a case where the test correctly identifies a patient as having a disease, while a false positive would be a case where the test incorrectly identifies a healthy patient as having the disease. And the final metric is the F1 score. The F1 score is the harmonic mean of precision and recall, with a higher value indicating a better balance between precision and recall, and it's calculated as two times recall times precision divided by recall plus precision.
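As a toy illustration of these formulas, here is a sketch that computes the regression and classification metrics by hand with NumPy and checks them against scikit-learn; the small arrays of true and predicted values are invented for the example:

import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             accuracy_score, precision_score, recall_score, f1_score)

# Regression metrics
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

rss  = np.sum((y_true - y_pred) ** 2)       # residual sum of squares
mse  = np.mean((y_true - y_pred) ** 2)      # mean squared error
rmse = np.sqrt(mse)                         # root mean squared error
mae  = np.mean(np.abs(y_true - y_pred))     # mean absolute error
print(rss, mse, rmse, mae)
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred))

# Classification metrics
y_true_cls = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred_cls = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_true_cls == 1) & (y_pred_cls == 1))   # true positives
fp = np.sum((y_true_cls == 0) & (y_pred_cls == 1))   # false positives
fn = np.sum((y_true_cls == 1) & (y_pred_cls == 0))   # false negatives

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(precision, recall, f1, accuracy_score(y_true_cls, y_pred_cls))
print(precision_score(y_true_cls, y_pred_cls),
      recall_score(y_true_cls, y_pred_cls),
      f1_score(y_true_cls, y_pred_cls))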
For unsupervised models, such as clustering models, performance is typically evaluated using metrics that measure the similarity of the data points within a cluster and the dissimilarity of the data points between different clusters. We have three types of metrics that we can use. Homogeneity is a measure of the degree to which all of the data points within a single cluster belong to the same class; a higher value indicates a more homogeneous cluster. The homogeneity h is equal to one minus the conditional entropy of the classes given the cluster assignments, divided by the entropy of the classes. If you are wondering what this entropy is, then stay tuned, as we are going to discuss entropy when we discuss clustering as well as decision trees. The next metric is the silhouette score. The silhouette score is a measure of the similarity of a data point to its own cluster compared to the other clusters; a higher silhouette score indicates that the data point is well matched to its own cluster. This is usually used for DBSCAN or K-means. The silhouette score can be represented by this formula: s(o) is equal to b(o) minus a(o), divided by the maximum of a(o) and b(o), where s(o) is the silhouette coefficient of the data point o, a(o) is the average distance between o and all the other data points in the cluster to which o belongs, and b(o) is the minimum average distance from o to the clusters to which o does not belong. The final metric we look into is completeness. Completeness is another measure of the degree to which all of the data points that belong to a particular class are assigned to the same cluster; a higher value indicates a more complete cluster. Let's conclude this lecture by going through the step-by-step process of training and evaluating a machine learning model, in a very simplified version, since there are many additional considerations and techniques that may be needed depending on the specific task and the characteristics of the data. Knowing how to properly train a machine learning model is really important, since this defines the accuracy of the results and the conclusions you will make. The training process starts with preparing the data. This includes splitting the data into training and test sets, or, if you are using more advanced resampling techniques that we will talk about later, splitting your data into multiple sets. The training set of your data is used to feed the model; if you also have a validation set, then this validation set is used to optimize your hyperparameters and to pick the best model, while the test set is used to evaluate the model's performance. In the upcoming lectures in this section we will talk in detail about these different techniques, as well as what training, testing, and validation mean, and what hyperparameter tuning means. Secondly, we need to choose an algorithm or a set of algorithms, train the model on the training data, and save the fitted model. There are many different algorithms to choose from, and the appropriate algorithm will depend on the specific task and the characteristics of the data. As a third step, we need to adjust the model parameters to minimize the error on the training set by performing hyperparameter tuning. For this we need to use the validation data, and then we can select the best model that results in the least possible validation error rate. In this step we want to look for the optimal set of parameters that are included as part of our model, to end up with a model that has the least possible error, so that it performs in the best possible way.
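To make that tuning step concrete, here is a minimal sketch using scikit-learn's GridSearchCV, where cross-validation on the training data plays the role of the validation step; the dataset, model, and parameter grid are arbitrary choices for illustration:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=400, n_features=8, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}

search = GridSearchCV(RandomForestRegressor(random_state=1),
                      param_grid,
                      scoring="neg_root_mean_squared_error",
                      cv=5)
search.fit(X_train, y_train)                 # tries every parameter combination

print("Best parameters:", search.best_params_)
print("Best validation RMSE:", -search.best_score_)

# Final evaluation of the best fitted model on the held-out test set.
best_model = search.best_estimator_
test_rmse = mean_squared_error(y_test, best_model.predict(X_test)) ** 0.5
print("Test RMSE:", test_rmse)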
In the final two steps we need to evaluate the model. We are always interested in the test error rate, and not the training or validation error rates, because we have not used the test set yet, while we have already used the training and validation sets; this test error rate will give you an idea of how well the model will generalize to new, unseen data. We need to use the optimal set of parameters from the hyperparameter tuning stage and the training data to train the model again with these hyperparameters and the best model, so that we can use the best fitted model to get predictions on the test data, and this will help us to calculate our test error rate. Once we have calculated the test error rate and obtained our best model, we are ready to save the predictions. So once we are satisfied with the model's performance and we have tuned the parameters, we can use it to make predictions on new, unseen data, the test data, and compute the performance metrics for the model using the predictions and the real values of the target variable from the test data, and this completes this lecture. In this lecture we have spoken about the basics of machine learning; we have discussed the difference between unsupervised and supervised learning models, as well as regression versus classification; we have discussed in detail the different types of performance metrics we can use to evaluate different types of machine learning models; and we have looked into a simplified version of the step-by-step process of training a machine learning model. In this lecture, lecture number two, we will discuss very important concepts which you need to know before considering and applying any statistical or machine learning model. Here I'm talking about the bias of the model and the variance of the model, and the trade-off between the two, which we call the bias-variance tradeoff. Whenever you are using a statistical, econometric, or machine learning model, no matter how simple the model is, you should always evaluate your model and check its error rate. In all these cases it comes down to the trade-off you make between the variance of the model and the bias of your model, because there is always a catch when it comes to the model choice and the performance. Let us first define what the bias and the variance of a machine learning model are. The inability of the model to capture the true relationship in the data is called bias; hence, machine learning models that are able to detect the true relationship in the data have low bias. Usually, complex models or more flexible models tend to have a lower bias than simpler models. Mathematically, the bias of the model can be expressed as the expectation of the difference between the estimate and the true value. Let us also define the variance of the model. The variance of the model is the inconsistency level, or the variability, of the model's performance when applying the model to different data sets. When the same model that is trained on training data performs entirely differently on the test data, this means that there is a large variation, or variance, in the model. Complex models or more flexible models tend to have a higher variance than simpler models. In order to evaluate the performance of the model we need to look at the amount of error that the model is
making. For simplicity, let's assume we have the following simple regression model, which aims to use a single independent variable X to model the numeric dependent variable Y. That is, we fit our model on our training observations, where we have pairs of independent and dependent variables (x1, y1), (x2, y2), up to (xn, yn), and we obtain an estimate f_hat based on our training observations. We can then compute f_hat(x1), f_hat(x2), up to f_hat(xn), which are the estimates for our dependent variable values y1, y2, up to yn, and if these are approximately equal to the actual values, so y1_hat is approximately equal to y1, y2_hat is approximately equal to y2, and so on, then the training error rate will be small. However, if we are really interested in whether our model is predicting the dependent variable appropriately, then instead of looking at the training error rate we want to look at our test error rate. The error rate of the model is the expected squared difference between the real test values and their predictions, where the predictions are made using the machine learning model. We can rewrite this error rate as a sum of two quantities, where the first part is the amount (f(x) − f_hat(x))², and the second quantity is the variance of the error term. So the accuracy of y_hat as a prediction for y depends on two quantities, which we can call the reducible error and the irreducible error: the reducible error is equal to (f(x) − f_hat(x))², and the irreducible error is the variance of epsilon. In general, f_hat will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible, since we can potentially improve the accuracy of f_hat by using the most appropriate machine learning model and the best version of it to estimate f. However, even if it were possible to find a model that estimates f perfectly, so that the estimated response took the form y_hat = f(x), our prediction would still have some error in it. This happens because y is also a function of the error term epsilon, which by definition cannot be predicted using our feature X, so there will always be some error that is not predictable. The variability associated with the error epsilon also affects the accuracy of the predictions, and this is known as the irreducible error, because no matter how well we estimate f, we cannot reduce the error introduced by epsilon. This error contains all the features that are not included in our model, so all the unknown factors that have an influence on our dependent variable but are not included as part of our data. But we can reduce the reducible error rate, which is based on two values: the variance of the estimate and the bias of the model. If we simplify the mathematical expression describing the error rate that we got, then it is equal to the variance of our model, plus the squared bias of our model, plus the irreducible error. So even if we cannot reduce the irreducible error, we can reduce the reducible error rate, which is based on the two values, the variance and the squared bias. Though the mathematical derivation is out of the scope of this course, just keep in mind that the reducible error of the model can be described as the sum of the variance of the model and the squared bias of the model. So, mathematically, the error in the supervised machine learning model is equal to the squared bias of the model, plus the variance of the model, plus the irreducible error.
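For reference, the decomposition just described can be written compactly as follows; this is the standard statement of the bias-variance decomposition, added here only for clarity:

E[(y0 − f_hat(x0))²] = Var(f_hat(x0)) + [Bias(f_hat(x0))]² + Var(ε),   where   Bias(f_hat(x0)) = E[f_hat(x0)] − f(x0)

so the expected test error at a point x0 splits into the variance of the estimate, the squared bias of the estimate, and the irreducible error Var(ε).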
So mathematically, the error of a supervised machine learning model is equal to the squared bias of the model, plus the variance of the model, plus the irreducible error. Therefore, in order to minimize the expected test error rate, so the error on unseen data, we need to select a machine learning method that simultaneously achieves low variance and low bias, and that is exactly what we call the bias-variance trade-off. The problem is that there is a negative relationship between the variance and the bias of the model. Another thing that is highly related to the bias and the variance of the model is the flexibility of the machine learning model: the flexibility of the model has a direct impact on its variance and on its bias, so let's look at these relationships one by one. Complex or more flexible models tend to have a lower bias, but at the same time they tend to have a higher variance than simpler models. So as the flexibility of the model increases, the model finds the true patterns in the data more easily, which reduces the bias of the model, but at the same time the variance of such models increases; as the flexibility of the model decreases, the model finds it more difficult to find the true patterns in the data, which increases the bias of the model but also decreases its variance. Keep this topic in mind, and we will continue it in the next lecture, when we discuss overfitting and how to solve the overfitting problem by using regularization. In this lecture, lecture number three, we will talk about a very important concept called overfitting, and how we can solve overfitting by using different techniques, including regularization. This topic is related to the previous lecture and to the topics of model error, training error rate, test error rate, and the bias and variance of a machine learning model. Overfitting is important to know, and also how to solve it with regularization, because it can lead to inaccurate predictions and a lack of generalization of the model to new data; knowing how to detect and prevent overfitting is crucial for building effective machine learning models, and questions about this topic are almost guaranteed to appear during every single data science interview. In the previous lecture we discussed the relationship between model flexibility and the variance and bias of the model: as the flexibility of the model increases, the model finds the true patterns in the data more easily, which reduces the bias but increases the variance, and as the flexibility decreases, the model finds it harder to find the true patterns, which increases the bias and decreases the variance. Let's first formally define what the overfitting problem is, as well as what underfitting is. Overfitting occurs when the model performs well on the training data while performing worse on the test data, so you end up having a low training error rate but a high test error rate; in the ideal world we want our test error rate to be low, or at least close to the training error rate. Overfitting is a common problem in machine learning where a model learns the detail and noise in the training data to the point where it negatively impacts the performance of the model on new data, so the model follows the data too closely, closer than it should. This means that the noise, or random fluctuations, in the training data is picked up and learned as concepts by the model, which it should actually ignore.
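As a quick, hedged illustration of this flexibility effect (a sketch on made-up data, not from the course): fitting the same noisy data with polynomials of increasing degree shows the low-flexibility model underfitting and the high-flexibility model following the noise.

```python
# Sketch: how model flexibility affects training vs test error (illustrative only).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, size=120)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=120)   # true pattern plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

for degree in [1, 3, 15]:   # low, moderate, and high flexibility
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(degree, round(train_mse, 3), round(test_mse, 3))
# Typically: degree 1 has high train and test error (high bias), while
# degree 15 has a very low train error but a worse test error (high variance).
```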
The problem is that the noise, or random component, of the training data will be very different from the noise in the new data, so the model will be less effective in making predictions on new data. Overfitting is caused by having too many features, too complex a model, or too little data. When the model is overfitting, the model has high variance and low bias; usually, the higher the flexibility of the model, the higher the risk of overfitting, because there is a higher risk of the model following the data, including the noise, too closely. Underfitting is the other way around: it occurs when the model is too simple to capture the true pattern in the data, so the training error rate itself is high, and the test error rate is typically high as well. Given that overfitting is a much bigger problem, and we ideally want to fix the case where the test error rate is large, we will focus only on overfitting; this is also a topic you can expect during your data science interviews, as well as something you need to be aware of whenever you are training a machine learning model. All right, so now that we know what overfitting is, let's talk about how we can fix this problem. There are several ways of fixing or preventing overfitting. First, you can reduce the complexity of the model: we saw that the higher the complexity of the model, the higher the chance of following the data, including the noise, too closely and overfitting, so reducing the flexibility of the model will reduce the overfitting as well. This can be done by using a simpler model with fewer parameters, or by applying regularization techniques such as L1 or L2 regularization, which we will talk about in a bit. A second solution is to collect more data: the more data you have, the less likely your model will overfit. A third solution is to use resampling techniques, one of which is cross-validation; this is a technique that allows you to train and test your model on different subsets of your data, which can help you identify whether your model is overfitting (a small sketch follows below), and we will discuss cross-validation as well as other resampling techniques later in the section. Another solution is to apply early stopping: early stopping is a technique where you monitor the performance of the model on a validation set during the training process and stop the training when the performance starts to decrease. Another solution is to use ensemble methods: by combining multiple models, such as decision trees, overfitting can be reduced, and we will cover many popular ensemble techniques in this course as well. Finally, you can use what we call dropout: dropout is a regularization technique for reducing overfitting in neural networks by dropping out, or setting to zero, some of the neurons during the training process. Dropout-related questions do appear from time to time during data science interviews, even for people with no experience, so if someone asks you about dropout, you will at least remember that it is a technique used to solve overfitting in the setting of deep learning. It is worth noting that there is no single solution that works for all types of overfitting, and often a combination of the techniques we just talked about should be used to address the problem.
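Here is a minimal, hedged sketch of that cross-validation check with scikit-learn; the model and data are placeholders, not the course's. Comparing the training score with the cross-validated score gives a quick signal of overfitting.

```python
# Sketch: using cross-validation to spot overfitting (illustrative only).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=300)

# A deep, unconstrained tree is very flexible and prone to overfitting.
model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)
train_r2 = model.score(X, y)                       # R^2 on the data it was trained on
cv_r2 = cross_val_score(model, X, y, cv=5).mean()  # average R^2 on held-out folds

print(round(train_r2, 3), round(cv_r2, 3))
# A training score close to 1 combined with a much lower cross-validated
# score suggests the model is overfitting.
```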
We saw that when a model is overfitting, it has high variance and low bias. By definition, regularization, or what we also call shrinkage, is a method that shrinks some of the estimated coefficients towards zero in order to penalize unimportant variables for increasing the variance of the model; it is a technique used to solve the overfitting problem by introducing a little bias into the model while significantly decreasing its variance. There are three types of regularization techniques that are widely known in the industry: the first one is ridge regression, or L2 regularization; the second one is lasso regression, or L1 regularization; and the third one is dropout, which is a regularization technique used in deep learning. We will cover the first two types in this lecture. Let's now talk about ridge regression, or L2 regularization. Ridge regression is a shrinkage technique that aims to solve overfitting by shrinking some of the model coefficients towards zero; it introduces a little bias into the model while significantly reducing the model variance. Ridge regression is a variation of linear regression, but instead of minimizing the sum of squared residuals as linear regression does, it aims to minimize the sum of squared residuals plus the sum of the squared coefficients, what we call the L2 regularization term. Let's look at a multiple linear regression example with p independent variables, or predictors, that are used to model the dependent variable y. If you have followed the statistics section of this course, you might recall that the most popular technique for estimating the parameters of linear regression, assuming its assumptions are satisfied, is ordinary least squares, or OLS, which finds the optimal coefficients by minimizing the sum of squared residuals, or the RSS. Ridge regression is pretty similar to OLS, except that the coefficients are estimated by minimizing a slightly different cost, or loss, function. In the loss function of ridge regression, beta j is the coefficient of the model for variable j, beta zero is the intercept, x i j is the input value for variable j and observation i, y i is the target, or dependent, variable for observation i, n is the number of samples, and lambda is what we call the regularization parameter of ridge regression. So it is the loss function of OLS, what we call the RSS, with an added penalization term. If you check the very first lecture in this section, where we spoke about the different metrics that can be used to evaluate regression-type models, you can see the definition of the RSS, and if you compare the expressions, you can easily see that the left part is the exact formula for the RSS including the intercept, and the right term is what we call the penalty amount, which is lambda times the sum of the squared coefficients included in our model. Here lambda, which is always non-negative, so larger than or equal to zero, is the tuning parameter, or penalty parameter. The sum of squared coefficients is called the L2 norm, which is why we call this L2-penalty-based regression, or L2 regularization. In this way, ridge regression assigns a penalty by shrinking the coefficients towards zero, which reduces the overall model variance, but the coefficients will never become exactly zero, which means that all p predictors of the model remain intact; this is a key property of ridge regression to keep in mind, that it shrinks the parameters towards zero but never sets them exactly equal to zero. The L2 norm is a mathematical term coming from linear algebra, and it stands for the Euclidean norm. We spoke about the penalty parameter lambda, what we also call the tuning parameter, which serves to control the relative impact of the penalty on the regression coefficient estimates.
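Written out, the ridge loss function just described is (a sketch in standard notation):

```latex
% Ridge (L2) regression loss: RSS plus the L2 penalty on the coefficients
\mathcal{L}_{\text{ridge}}(\beta)
  = \underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2}_{\text{RSS}}
  + \underbrace{\lambda \sum_{j=1}^{p}\beta_j^{2}}_{\text{L2 penalty}},
  \qquad \lambda \ge 0
```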
When lambda is equal to zero, the penalty term has no effect and ridge regression will produce the ordinary least squares estimates, but as lambda increases, the impact of the shrinkage penalty grows and the ridge regression coefficient estimates approach zero. What is important to keep in mind, which you can also see from this graph, is that in ridge regression a large lambda will penalize some variables by shrinking their coefficients towards zero, but they will never become exactly zero, which becomes a problem when you are dealing with a model that has a large number of features, because then your model has low interpretability. Ridge regression's advantage over ordinary least squares comes from the bias-variance trade-off introduced earlier: as lambda, the penalty parameter, increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. The main advantages of ridge regression are that it helps solve overfitting, since ridge regression can shrink the regression coefficients of less important predictors towards zero; it can improve the prediction accuracy by reducing the variance and increasing the bias of the model; it is less sensitive to outliers in the data compared to linear regression; and it is computationally less expensive compared to lasso regression. The main disadvantage of ridge regression is the low model interpretability when p, the number of features in your model, is large. Let's now look into another regularization technique, called lasso regression or L1 regularization. By definition, lasso regression, or L1 regularization, is a shrinkage technique that aims to solve overfitting by shrinking some of the model coefficients towards zero and setting some of them to exactly zero. Lasso regression, like ridge regression, introduces a little bias into the model while significantly reducing the model variance. There is, however, a small difference between the two regression techniques that makes a huge difference in their results. We saw that one of the biggest disadvantages of ridge regression is that it will always include all p predictors in the final model; lasso overcomes this disadvantage. A large lambda, or penalty parameter, will penalize some variables by shrinking their coefficients towards zero, but in the case of ridge regression they will never become exactly zero, which becomes a problem when your model has a large number of features and low interpretability, and lasso regression overcomes this disadvantage of ridge regression. Let's have a look at the loss function of L1 regularization: it is again the loss function of OLS, the RSS, combined with a penalty amount, which this time is lambda times the sum of the absolute values of the coefficients beta j, where j goes from one to p and p is the number of predictors included in our model. Here, once again, lambda, which is always non-negative, is the tuning parameter, or penalty parameter. This sum of the absolute values of the coefficients is called the L1 norm, which is why we call this L1-penalty-based regression, or L1 regularization.
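In the same notation as before, the lasso loss just described is (only the penalty term changes, from squared coefficients to absolute values):

```latex
% Lasso (L1) regression loss: RSS plus the L1 penalty on the coefficients
\mathcal{L}_{\text{lasso}}(\beta)
  = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
  + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert,
  \qquad \lambda \ge 0
```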
In this way, lasso regression assigns a penalty to some of the variables by shrinking their coefficients towards zero and setting some of these parameters to exactly zero. This means that some of the coefficients will end up being exactly equal to zero, which is the key difference between lasso regression and ridge regression. The L1 norm is a mathematical term coming from linear algebra, and it stands for the Manhattan norm, or Manhattan distance. You might see the key difference when comparing the visual representation of lasso regression with the visual representation of ridge regression: if you look at this plot, you can see that there are cases where the coefficients will be set to exactly zero, which is where the constraint region intersects the axes, whereas in the case of ridge regression, you can recall, there was not a single such intersection; the circle could come close to the intersection points, but there was never a point where the coefficients were set to zero, and that is the key difference between the two regularization techniques. The main advantages of lasso regression are that it helps solve overfitting, since lasso can shrink the regression coefficients of less important predictors towards zero and set some of them to exactly zero; as the model filters some variables out, lasso indirectly also performs what we call feature selection, so the resulting model has fewer features and is much more interpretable compared to ridge regression; and lasso can also improve the prediction accuracy of the model by reducing the variance and increasing the bias, though not as much as ridge regression.
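As a small, hedged illustration of this difference (a sketch with synthetic data and scikit-learn, not the course's code): fitting ridge and lasso on the same data shows ridge shrinking all coefficients while lasso sets some of them exactly to zero.

```python
# Sketch: ridge shrinks coefficients, lasso can zero some of them out.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
# only the first two features truly matter; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 3))  # all six coefficients shrunk but non-zero
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant features typically exactly 0.0
```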
Earlier, when speaking about correlation, we also briefly discussed the concept of causation: we discussed that correlation is not causation, and we also briefly spoke about a method used to determine whether there is causation or not. That model is the famous linear regression, and even though this model is recognized as a simple approach, it is one of the few methods that allows you to identify features that have a statistically significant impact on a variable that you are interested in and want to explain, and it also helps you identify how, and by how much, the target variable changes when the independent variable values change. To understand the concept of linear regression, you should also know and understand the concepts of dependent variable, independent variable, linearity, and statistically significant effect. Dependent variables are often referred to as response variables or explained variables; by definition, the dependent variable is the variable that is being measured or tested, and it is called the dependent variable because it is thought to depend on the independent variables. You can have one or multiple independent variables, but you can have only one dependent variable, which is your target variable. Let's now look into the independent variable definition: independent variables are often referred to as regressors or explanatory variables, and by definition, an independent variable is a variable that is being manipulated or controlled in the experiment and is believed to have an effect on the dependent variable; put differently, the value of the dependent variable is said to depend on the value of the independent variable. For example, in an experiment to test the effect of having a degree on wages, the degree variable would be your independent variable and the wage would be your dependent variable. Finally, let's look into the very important concept of statistical significance: we call an effect statistically significant if it is unlikely to have occurred by random chance; in other words, a statistically significant effect is one that is likely to be real and not due to random chance. Let's now define the linear regression model formally, and then we will dive into the theoretical and practical details. By definition, linear regression is a statistical or machine learning method that can help to model the impact of a unit change in one variable, the independent variable, on the values of another target variable, the dependent variable, when the relationship between the two variables is assumed to be linear. When the linear regression model is based on a single independent variable, we call it simple linear regression; when the model is based on multiple independent variables, we call it multiple linear regression. Let's look at the mathematical expression describing linear regression. The expression you see here is the most common mathematical expression describing simple linear regression: we are saying that y i is equal to beta zero plus beta one times x i plus u i. In this expression, y i is the dependent variable, and the index i corresponds to the i-th row: whenever you get data that you want to analyze, you will have multiple rows, each row describing one observation in your data, so it can be a person or any other unit of observation, and y i is then the value of the dependent variable corresponding to that row. The same holds for x i: x i is the independent variable, the explanatory variable or regressor in our model, which is the variable we are testing; we want to see whether a unit change in x results in a specific change in y, and what kind of change that is. Beta zero is not a variable; it is called the intercept, or constant, and it is one of the parameters of linear regression, an unknown number which the linear regression model should estimate. Beta one, next to x i, is also not a variable; like beta zero, it is an unknown parameter of the linear regression model that needs to be estimated. Beta one is often referred to as the slope coefficient of variable x, and it is the number that quantifies how much the dependent variable y will change if the independent variable x changes by one unit; that is exactly what we are most interested in, because beta one is the coefficient that will help us answer the question of whether our independent variable x has a statistically significant impact on our dependent variable y. Finally, the u, or u i, that you see in the expression is the error term, the amount of mistake that
the model makes when explaining the target variable. We add this term because we know that we can never exactly and accurately estimate the target variable; we will always make some amount of estimation error and can never estimate the exact value of y, so we need to account for this mistake, which we know in advance we are going to make, by adding an error term to our model. Let's also have a brief look at how multiple linear regression is usually expressed in mathematical terms. You might recall that the difference between simple linear regression and multiple linear regression is that the former has a single independent variable, whereas the latter, as the name suggests, has multiple independent variables, so more than one. Knowing this type of expression is important, since these formulas not only appear a lot in interviews but also in data science blogs, presentations, books, and papers, so being able to quickly recognize them will help you follow the story line more easily. What you see here reads as: y i is equal to beta zero, plus beta one times x one i, plus beta two times x two i, plus beta three times x three i, plus u i. This is the most common mathematical expression describing multiple linear regression, in this case with three independent variables; if you had more independent variables, you would add them with their corresponding indices and coefficients. In this case the method will aim to estimate the model parameters, which are beta zero, beta one, beta two, and beta three. Like before, y i is our dependent variable, which is always a single one, so we only have one dependent variable; then we have beta zero, which is our intercept, or constant; then we have our first slope coefficient, beta one, corresponding to our first independent variable, and x one i, which stands for the first independent variable, where the first index identifies which independent variable we are referring to and the index i identifies the row. So whenever we have multiple linear regression we always need to specify two indices, not just one as in simple linear regression: the index that identifies which independent variable we are referring to, so whether it is independent variable one, two, or three, and the index i that identifies which row we are referring to. You might notice that in this expression the row index is the same everywhere, because we are looking at one specific row and representing that row using the independent variables, the error term, and the dependent variable. Then we add the term beta two times x two i, where beta two is the next unknown parameter in the model and the second slope coefficient, corresponding to our second independent variable; then we have our third independent variable with its corresponding slope coefficient beta three; and, as always, we add an error term to account for the error that we know we are going to make.
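For easier scanning, the two expressions just described can be written as:

```latex
% Simple linear regression (one independent variable)
y_i = \beta_0 + \beta_1 x_i + u_i

% Multiple linear regression (here with three independent variables)
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + u_i
```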
So now that we know what linear regression is and how to express it in mathematical terms, the next logical question is: how do we find those unknown parameters, so that we can find out how the independent variables impact the dependent variable? Finding these unknown parameters is called estimation, in data science and in general: we are interested in finding the values that best approximate the unknown parameters in our model, and we call this process estimation. One technique used to estimate the linear regression parameters is called OLS, or ordinary least squares. The main idea behind this approach, the OLS, is to find the best-fitting straight line, the regression line, through a set of paired x and y values, so our independent and dependent variable values, by minimizing the sum of squared errors, that is, by minimizing the sum of the squared differences between the observed dependent variable and the values predicted by our model. That is a lot of information at once, so let's go through it step by step. In linear regression, as we just saw when expressing our simple linear regression, we have this error term, and we can never know the actual error term; what we can do is estimate the value of the error term, and we call that estimate the residual. We want to minimize the sum of squared residuals, because we do not know the errors: we want to find a line that fits our data in such a way that the error we are making, or rather the sum of squared residuals, is as small as possible, and since we do not know the errors, we estimate them by each time looking at the value predicted by our model and the true value, subtracting one from the other, and seeing how well our model is estimating the values we have. We define the estimate of a parameter or variable by adding a hat on top of it, so you can see that y i hat is equal to beta zero hat plus beta one hat times x i; notice that we no longer have an error term in this expression. We say that y i hat is the estimated value of y i, beta zero hat is the estimated value of beta zero, and beta one hat is the estimated value of beta one, while x i is still our data, the values we have in our data, which is why it does not get a hat, since it does not need to be estimated. So what we want to do is estimate our dependent variable and compare the estimated value that we got using OLS with the actual, real value, so that we can calculate our errors, or rather the estimate of the error, which is represented by u i hat. So u i hat is equal to y i minus y i hat, where u i hat is simply the estimate of the error term, the residual. This predicted error is always referred to as the residual, so make sure you do not confuse the error with the residual: the error can never be observed, you can never calculate it and you will never know it, but what you can do is predict the error, and when you predict the error you get a residual. What OLS tries to do is minimize the amount of error it is making, so it looks at the sum of squared residuals across all the observations and tries to find the line that minimizes this value; therefore we say that OLS finds the best-fitting straight line by minimizing the sum of squared residuals.
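As a small, hedged sketch of what that minimization produces in the simple one-variable case, here are the standard closed-form OLS estimates computed with NumPy on made-up data:

```python
# Sketch: OLS estimates for simple linear regression from the standard
# closed-form formulas, plus the residuals and the RSS that OLS minimizes.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=50)   # "true" line plus noise

# beta1_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat = y_bar - beta1_hat * x_bar
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x     # fitted values
residuals = y - y_hat                 # estimates of the unobservable errors
rss = np.sum(residuals ** 2)          # the quantity OLS minimizes

print(round(beta0_hat, 3), round(beta1_hat, 3), round(rss, 3))
```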
We have discussed this model mainly from the perspective of causal analysis, in order to identify features that have a statistically significant impact on the response variable, but linear regression can also be used as a prediction model for modelling linear relationships. So let's refresh our memory with the definition of the linear regression model: by definition, linear regression is a statistical or machine learning method that can help to model the impact of a unit change in one variable, the independent variable, on the values of another target variable, the dependent variable, when the relationship between the two variables is linear. We also discussed how to mathematically express what we call simple linear regression and multiple linear regression. In the case of simple linear regression, you might recall, we are dealing with just a single independent variable, and we always have just one dependent variable, both in simple and in multiple linear regression. So here you can see that y i is equal to beta zero plus beta one times x i plus u i, where y is the dependent variable and i is the index of each observation, or row; beta zero is the intercept, also known as the constant; beta one is the slope coefficient, the parameter corresponding to the independent variable x, which is an unknown constant that we want to estimate along with beta zero; x i is the independent variable corresponding to observation i; and finally, u i is the error term corresponding to observation i. Do keep in mind that we add this error term because we know we are always going to make a mistake and can never perfectly estimate the dependent variable, so to account for this mistake we add the u i. Let's also recall the estimation technique we use to estimate the parameters of the linear regression model, so beta zero and beta one, and to predict the response variable: we call this technique OLS, or ordinary least squares. OLS is an estimation technique for estimating the unknown parameters in the linear regression model in order to predict the response, or dependent, variable. We need to estimate beta zero, so we need to get beta zero hat, and we need to estimate beta one, the beta one hat, in order to obtain y i hat, where y i hat is equal to beta zero hat plus beta one hat times x i; the difference between y i, the true value of the dependent variable, and y i hat, the predicted value, then produces our estimate of the error, what we also call the residual. The main idea behind this approach is to find the best-fitting straight line, the regression line, through the set of paired x and y values by minimizing the sum of squared residuals: we want to minimize our errors as much as possible, so we take their squared versions, sum them up, and minimize this total, that is, the sum of the squared differences between the observed dependent variable and the values predicted by the linear function of the independent variables, and to do this we use OLS. One of the most common questions related to linear regression that comes up time and time again in data science interviews is the topic of the assumptions of the linear regression model, so you need to know each of the five
fundamental assumptions of linear regression and OLS, and you also need to know how to test whether each of these assumptions is satisfied. The first assumption is the linearity assumption, which states that the relationship between the independent variables and the dependent variable is linear; we also say that the model is linear in parameters. You can check whether the linearity assumption is satisfied by plotting the residuals against the fitted values: if the pattern is not linear, then the estimates will be biased, and in this case we say that the linearity assumption is violated and we need to use more flexible models, such as the tree-based models that we will discuss in a bit, which are able to model these nonlinear relationships. The second assumption of linear regression is the assumption about the randomness of the sample, which means that the data is randomly sampled, and which basically means that the errors, or residuals, of the different observations in the data are independent of each other. You can check whether this random sample assumption is satisfied by plotting the residuals and checking whether their mean is around zero; if not, then the OLS estimates will be biased and the second assumption is violated, which means that you are systematically over- or under-predicting the dependent variable. The third assumption is the exogeneity assumption, which is a really important assumption often asked about during data science interviews. Exogeneity means that each independent variable is uncorrelated with the error term; it refers to the assumption that the independent variables are not affected by the error term in the model, in other words, that the independent variables are determined independently of the errors in the model. Exogeneity is a key assumption of the linear regression model, as it allows us to interpret the estimated coefficients as representing the true causal effect of the independent variables on the dependent variable. If the independent variables are not exogenous, then the estimated coefficients may be biased and the interpretation of the results may be invalid; in this case we call this an endogeneity problem, and we say that the independent variable is not exogenous but endogenous. It is important to carefully consider the exogeneity assumption when building a linear regression model, as a violation of this assumption can lead to invalid or misleading results. If this assumption is satisfied for an independent variable in the linear model, we call that independent variable exogenous; otherwise we call it endogenous and we say that we have a problem of endogeneity. Endogeneity refers to the situation in which the independent variables in the linear regression model are correlated with the error term in the model, in other words, the errors are not independent of the independent variables, which is a violation of one of the key assumptions of the linear regression model. Endogeneity can arise in a number of ways: for example, it can be caused by omitted variable bias, in which an important predictor of the dependent variable is not included in the model, and it can also be caused by reverse causality, in which the dependent variable affects the independent variable. Those two are very popular examples of cases where you can get an endogeneity problem, and they are things you should know whenever you
are interviewing for data science roles, especially when they are related to machine learning, because these questions are asked in order to test whether you understand the concept of exogeneity versus endogeneity, in which cases you can get endogeneity, and how you can solve it. In the case of omitted variable bias, let's say you are estimating a person's salary and you are using as independent variables their education, their number of years of experience, and some other factors, but you are not including in your model a feature that would describe the intelligence of the person, for instance their IQ. Given that this is a very important indicator of how a person performs in their field, and it can definitely have an indirect impact on their salary, not including this variable will result in omitted variable bias, because it will be incorporated in your error term, and it is also related to your other independent variables: IQ is related to education, since, usually, the higher the IQ, the higher the education. In this way you end up with an error term that contains an important omitted variable which is correlated with one or more of the independent variables included in your model. The other cause of the endogeneity problem is reverse causality, which basically means that not only does the independent variable have an impact on the dependent variable, but the dependent variable also has an impact on the independent variable, so there is a reverse relationship, which is something we want to avoid. We want the features included in our model to have an impact on the dependent variable, so they explain the dependent variable, but not the other way around, because if the dependent variable impacts your independent variable, then the error term will be related to this independent variable, since there are components that also define your dependent variable. Knowing a few examples like these that can cause endogeneity, so that can violate the exogeneity assumption, is really important. You can also check the exogeneity assumption by conducting a formal statistical test called the Hausman test; this is an econometric test that helps you understand whether you have an exogeneity violation or not, but it is out of the scope of this course. I will, however, include many resources related to exogeneity, endogeneity, omitted variable bias, and reverse causality, as well as how the Hausman test can be conducted, so for that check out the interview preparation guide, where you can also find the corresponding free resources. The fourth assumption of linear regression is the assumption about homoskedasticity: homoskedasticity refers to the assumption that the variance of the errors is constant across all predicted values, and this assumption is also known as the homogeneity of the variance. Homoskedasticity is an important assumption of the linear regression model, as it allows us to use certain statistical techniques and make inferences about the parameters of the model; if the errors are not homoskedastic, then the results of these techniques may be invalid or misleading. If this assumption is violated, we say that we have heteroskedasticity: heteroskedasticity refers to the situation in which the variance of the error terms in the
linear regression model is not constant across all the predicted values, so we have a varying variance; in other words, the homoskedasticity assumption is violated, and we say we have a problem of heteroskedasticity. Heteroskedasticity can be a real problem in linear regression analysis, because it can lead to invalid or misleading results: for example, the standard error estimates and the confidence intervals for the parameters may be incorrect, which means that the statistical tests may have incorrect type one error rates. You might recall that when we discussed linear regression as part of the fundamentals of statistics section of this course, we looked at the output that comes from Python, and we saw that we get the coefficient estimates as part of the output, as well as the standard errors, the Student's t-tests, the corresponding p-values, and the 95% confidence intervals. Whenever there is a heteroskedasticity problem, the coefficients might still be accurate, but the corresponding standard errors, the t-tests, which are based on the standard errors, the p-values, and the confidence intervals may not be accurate; you might get good and reasonable coefficients, but you will not know how to correctly evaluate them. You might end up stating that certain independent variables are statistically significant because their p-values are small, while in reality those p-values are misleading, because they are based on the wrong standard errors and the wrong statistical tests. You can check this assumption by plotting the residuals against the fitted values and looking at their spread: if the spread of the residuals is roughly constant, then you have constant variance, but if you see a funnel-like shape, where the spread grows or shrinks with the fitted values, then the variance is not constant and we say we have a problem of heteroskedasticity. If you have heteroskedasticity, you can no longer rely on OLS and standard linear regression inference, and instead you need to look for other, more advanced econometric regression techniques that do not make such a strong assumption about the variance of your residuals; you can, for instance, use GLS, FGLS, or GMM, and these types of solutions will help address the heteroskedasticity problem, since they do not make strong assumptions about the variance in your model.
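Here is a minimal, hedged sketch of the residuals-versus-fitted-values plot used for the linearity, zero-mean, and constant-variance checks described above, using statsmodels and matplotlib on made-up data that is deliberately heteroskedastic:

```python
# Sketch: residuals vs fitted values as a quick diagnostic plot.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5 + 0.2 * x)   # noise grows with x -> heteroskedastic

X = sm.add_constant(x)                 # add the intercept column
results = sm.OLS(y, X).fit()

plt.scatter(results.fittedvalues, results.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")   # residuals should hover around zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A funnel-like spread suggests heteroskedasticity")
plt.show()
```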
The fifth and final assumption of linear regression is the assumption of no perfect multicollinearity. This assumption states that there are no exact linear relationships between the independent variables. Multicollinearity refers to the case when two or more independent variables in your linear regression model are highly correlated with each other; this can be a problem because it can lead to unstable and unreliable estimates of the parameters in the model. Perfect multicollinearity happens when independent variables are perfectly correlated with each other, meaning that one variable can be perfectly predicted from the others; this can cause the estimated coefficients of your linear regression model to be undefined and can lead to entirely misleading results when making predictions using the model. If perfect multicollinearity is detected, it may be necessary to remove one or more of the problematic variables, so that you avoid having such correlated variables in your model. Even if perfect multicollinearity is not present, multicollinearity at a high level can still be a problem: if the correlations between the independent variables are high, the estimates of the parameters may be imprecise, the model may be misleading, and the predictions will be less reliable. To test for the multicollinearity assumption you have different options. The first is to use a formal statistical test, an econometric test that will help you identify which variables cause a problem and whether you have perfect multicollinearity in your linear regression model. You can also plot a heatmap based on the correlation matrix of your features: you will then have the correlation for each pair of independent variables plotted as part of your heatmap, and you can identify all the pairs of features that are highly correlated with each other; those are the problematic features, one of which should be removed from your model. By showing the heatmap, you can also explain to your stakeholders why you removed certain variables from your model, whereas explaining a formal multicollinearity test is much more complex, because it involves more advanced econometrics. If you are wondering how you can perform such a formal test, and you want to prepare for questions about perfect multicollinearity and how to solve the perfect multicollinearity problem in your linear regression model, then head to the interview preparation guide included in this part of the course, where you can also see the thirty most popular interview questions you can expect from this section.
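A minimal sketch of that heatmap check, assuming a pandas DataFrame of features called df (the column names and data here are invented; seaborn and matplotlib are the plotting libraries used later in the course):

```python
# Sketch: correlation heatmap to spot highly correlated feature pairs.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "rooms": rng.normal(5, 1, 200),
    "income": rng.normal(3, 0.5, 200),
})
df["rooms_per_person"] = df["rooms"] * 0.5 + rng.normal(0, 0.1, 200)  # nearly collinear with rooms

corr = df.corr()                                   # pairwise correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation heatmap")
plt.show()
# Pairs with correlations close to +1 or -1 signal multicollinearity;
# consider dropping one feature from each such pair.
```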
Now let's look into an example from linear regression, to see how all these pieces of the puzzle come together. Let's say we have collected data on class size and test scores for a group of students, and we want to model the linear relationship between class size and test score using a linear regression model. As we have just one independent variable, we are dealing with a simple linear regression, and the model equation would be as follows: the test score is equal to beta zero, plus beta one multiplied by the class size, plus epsilon. Here the class size is the single independent variable in our model, the test score is the dependent variable, beta zero is the intercept, or constant, and beta one is the coefficient of interest, as it is the coefficient corresponding to our independent variable and it will help us understand the impact of a one-unit change in the class size on the test score; finally, we include the error term in our model to account for the mistakes we are definitely going to make when estimating the dependent variable, the test score. The goal is to estimate the coefficients beta zero and beta one from the data and to use the estimated model to predict the test score based on the class size. Once we have the estimates, we can interpret them as follows: the intercept, beta zero, represents the expected test score when the class size is zero, so the base score a student would have obtained if the class size were zero; the coefficient for the class size, beta one, represents the change in the test score associated with a one-unit change in the class size, where a positive coefficient would imply that a one-unit increase in class size increases the test score, and a negative coefficient would imply that a one-unit increase in class size decreases the test score. We can then use this model, with the OLS estimates, to predict the test score for any given class size. So let's go ahead and implement that in Python; if you are wondering how this can be done, head to the resources section as well as the Python for data science part of the course, where you can learn more about how to work with pandas DataFrames, how to import data, and how to fit a linear regression model. The problem is as follows: we have collected data on the class size, so, as you can see here, we have the students_data, with the class size as our feature, and we want to estimate y, which is the test score. Here is a sample of code that fits a linear regression model; we are keeping everything very simple, so we are not splitting our data into training and test sets, fitting the model on the training data, and making predictions on the test set, because we just want to see how we can interpret the coefficients. You can see that we get an intercept equal to 63.7, and the coefficient corresponding to our single independent variable, class size, is equal to minus 0.40. What this means is that each increase of the class size by one unit will result in a decrease of the test score by 0.4, so there is a negative relationship between the two. The next question is whether there is statistical significance: whether the coefficient is actually significant and whether the class size actually has a statistically significant impact on the dependent variable. Those are all things we have discussed as part of the fundamentals of statistics section of this course, and we will also look into a linear regression example when we discuss hypothesis testing, so I would highly suggest you stop here to revisit the fundamentals of statistics section to refresh your memory on linear regression, and then also check the hypothesis testing section of the course to see a specific linear regression example where we discuss the standard errors, how you can evaluate your OLS estimation results, and how you can use and estimate the Student's t-test, the p-value, and the confidence intervals. In this way you learn, for now, only the theory related to the coefficients, and you can then build on top of this theory once you have covered the other sections and topics in this course.
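The on-screen code itself is not part of this transcript, so here is a hedged sketch of a fit like the one just described, assuming a students_data DataFrame with class_size and test_score columns; the data below is made up, so the exact numbers will differ from the intercept of 63.7 and the coefficient of -0.40 reported above.

```python
# Sketch: fitting the simple class-size -> test-score regression with scikit-learn.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
students_data = pd.DataFrame({"class_size": rng.integers(15, 40, size=100)})
students_data["test_score"] = (
    65 - 0.4 * students_data["class_size"] + rng.normal(scale=3, size=100)
)

X = students_data[["class_size"]]     # feature matrix (2-D)
y = students_data["test_score"]       # target

model = LinearRegression().fit(X, y)
print("intercept:", round(model.intercept_, 2))
print("class_size coefficient:", round(model.coef_[0], 2))
# A negative coefficient means larger classes are associated with lower test scores.
```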
Let's finally discuss the advantages and disadvantages of the linear regression model. Some of the advantages of linear regression are the following: linear regression is relatively simple and easy to understand and to implement; linear regression models are well suited for understanding the relationship between a single independent variable and a dependent variable; linear regression can also handle multiple independent variables and estimate the unique relationship between each independent variable and the dependent variable; the linear regression model can be extended to handle more complex relationships, for instance with polynomials or interaction terms, allowing for more flexibility when modelling the data; the linear regression model can be easily regularized to prevent overfitting, which is a common problem in modelling, as we saw at the beginning of this section, so you can use, for instance, ridge regression or lasso regression, which are extensions of the linear regression model; and finally, linear regression models are widely supported by software packages and libraries, making them easy to implement and to analyze. Some of the disadvantages of linear regression are the following: linear regression makes a lot of strong assumptions, for instance about the linearity between the independent variables and the dependent variable, while the true relationship can actually be nonlinear, in which case the model will not be able to capture the complexity of the data and the predictions will be inaccurate, so it is really important to have data with a linear relationship for linear regression to work; linear regression also assumes that the error terms are normally distributed, homoskedastic, and independent across observations, and violations of these strong assumptions will lead to biased and inefficient estimates; linear regression is sensitive to outliers, which can have a disproportionate effect on the estimates of the regression coefficients; linear regression does not easily handle categorical independent variables, which often require additional data preparation, the use of indicator variables, or encodings; finally, linear regression also assumes that the independent variables are exogenous, so not affected by the error terms, and if this assumption is violated, the results of the model may be misleading. In this lecture, lecture number five, we will discuss another simple machine learning technique called logistic regression, which is a simple but very important classification model, useful when dealing with a problem where the output should be a probability. The name "regression" in logistic regression might be confusing, since this is actually a classification model. Logistic regression is widely used in a variety of fields, such as the social sciences, medicine, and engineering. Let us first define the logistic regression model: logistic regression is a supervised classification technique that models the conditional probability of an event occurring, or of an observation belonging to a certain class, given a data set of independent variables, so our features. The class can have two categories or more, but later on we will learn that logistic regression works ideally when we have just two classes. This is another very important and very popular machine learning technique which, though named regression, is actually a supervised classification technique. So when the relationship between the variables is linear, the dependent variable is a categorical variable, and you want to predict the outcome in the form of a probability, so a number between zero and one, then logistic regression comes in very handy. This is because during the prediction process, the logistic regression classifier predicts the probability, a value between zero and one, of each observation belonging to a certain class. For instance, if you want to predict the probability, or likelihood, of a candidate being elected or not elected during an election, given the set of characteristics that you have about the candidate, let's say the popularity score, past successes, and other descriptive variables about this candidate, then logistic regression comes in very handy to model this probability. So rather than predicting the response variable directly, logistic
regression models the probability that y belongs to a particular category, similar to linear regression, with the difference that instead of y it predicts the log odds; we will come to the definitions of odds and log odds in a bit. In statistical terminology, what we are trying to do is model the conditional distribution of the response y given the predictors X, so logistic regression helps to predict the probability of y belonging to a certain class given the feature space, what we call the probability of y given X. If you are wondering what probability is, or what conditional probability is, make sure to head to the fundamentals of statistics section, where we go into detail about these concepts and look at different examples; those definitions will help you better follow this lecture. Here we see the probability p of X, which is what we are interested in modelling, and it is equal to e to the power of beta zero plus beta one times X, divided by one plus e to the power of beta zero plus beta one times X. Let's now look into the formulas for the odds and the log odds. Both of these formulas are really important, because you can expect them during your data science interviews: sometimes you will be asked to explicitly write down the odds and log odds formulas, and they are highly related to the likelihood and log-likelihood functions, which are the basis for the estimation technique, MLE or maximum likelihood estimation, used to estimate the unknown parameters in logistic regression. The log odds and the odds are highly related to each other, and in logistic regression we use the odds and log odds to describe the probability of an event occurring. The odds is the ratio of the probability of an event occurring to the probability of the event not occurring: as you can see, the odds is equal to p of X divided by one minus p of X, where p of X is the probability of the event occurring and one minus p of X is the probability of the event not occurring. This ratio is equal to e to the power of beta zero plus beta one times X in our formula, where we have only one independent variable, and e is simply Euler's number, approximately 2.72, which is a constant. We will not derive this formula ourselves, because that is out of the scope of this course, but feel free to take the p of X formula we just saw in the previous slide, divide it by one minus that same expression, and verify that you end up with the expression you see here. For example, if the probability of a person having a heart attack is 0.2, then the odds of having a heart attack will be 0.2 divided by 1 minus 0.2, which is equal to 0.25. The log odds, also known as the logit function, is the natural logarithm of the odds: as you can see here, the log of p of X divided by one minus p of X is equal to beta zero plus beta one times X. You can see that we get rid of the e, and this is simply because of the mathematical property that says that if we take the log of e raised to some power, we end up with only the exponent. Though the mathematical derivation of this formula is out of the scope of this course, I will include resources about logarithms, these transformations, and the mathematics behind them, in case you want to look into the details and do some extra learning. So logistic regression uses the log odds as the dependent variable, and the independent variables are used to predict this log odds; the coefficients of the independent variables then represent the change in the log odds for a one-unit change in the independent variable.
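Written out, the three expressions just described are (for the single-predictor case):

```latex
% Logistic regression with one predictor: probability, odds, and log odds (logit)
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \qquad
\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X}, \qquad
\log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X
```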
So you might recall that in linear regression we were modelling the actual dependent variable; in the case of logistic regression, the difference is that we are modelling the log odds. Another important concept in logistic regression is the likelihood function: the likelihood function is used to estimate the parameters of the model given the observed data. Sometimes during interviews you might be asked to write down the exact likelihood or log-likelihood formula, so I would definitely suggest memorizing it and understanding all the components included in it. The likelihood function describes the probability of the observed data given the parameters of the model, and if you followed the lecture on probability density functions in the fundamentals of statistics section, you might recognize the Bernoulli distribution here, since the likelihood function is based on the probability mass function of a Bernoulli distribution, which is the distribution of a binary outcome. So this is highly applicable to the case where we have only two categories in our dependent variable and we are trying to estimate the probability of an observation belonging to one of those two classes. We start with the likelihood function, where the capital letter L stands for the likelihood: the likelihood L is equal to the product, across all observations, of p of x i to the power y i, multiplied by one minus p of x i to the power one minus y i, where p of x i is the p of X we just saw, only for observation i, and y i is simply the class, so y i will be either zero or one; if y i is equal to one, then one minus y i is equal to zero. So each time we are looking at the probability of the observation belonging to the first class multiplied by the probability of the observation not belonging to that class, and we take these multiplications across all the observations included in our data; the product symbol comes from mathematics and simply stands for multiplying all these terms together. Given that it is harder to work with products than with sums, we then apply a log transformation in order to obtain the log-likelihood function instead of the likelihood function. When we apply this log transformation, so when we take the logarithm of this expression, we end up with the log-likelihood expression, and here we once again make use of a mathematical property which says that the logarithm of a product is the sum of the logarithms, so we go from products to sums; I will also include resources about these transformations so you can learn the mathematics behind them. So the log-likelihood, with a lowercase l, is equal to the logarithm of the product of p of x i to the power y i times one minus p of x i to the power one minus y i, and when we apply that mathematical transformation, the log-likelihood is equal to the sum, across all observations i from one to n, of y i times the logarithm of p of x i, plus one minus y i times the logarithm of one minus p of x i, so the exponents come down to the front. While for linear regression we use OLS as the estimation technique, for logistic regression another estimation technique should be used: the reason we cannot use OLS in logistic regression to find the best-fitting line is that the errors can become very large, very small, and sometimes even negative, while for logistic regression we aim for predicted values between zero and one.
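In symbols, the likelihood and log-likelihood just described are:

```latex
% Likelihood and log-likelihood for logistic regression (Bernoulli outcomes)
L(\beta_0, \beta_1) = \prod_{i=1}^{n} p(x_i)^{\,y_i} \,\big(1 - p(x_i)\big)^{\,1 - y_i}

\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \Big[\, y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big]
```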
In MLE, the likelihood function calculates the probability of observing the data outcomes given the input data and the model; we just saw that likelihood function in the previous slide. This function is then optimized to find the set of parameters that results in the largest likelihood over the training data set, the maximum likelihood. The logistic function will always produce an S-shaped curve regardless of the value of the independent variable X, resulting in a sensible estimate, a value between 0 and 1, most of the time. This S-shaped curve is what characterizes logistic regression, and it will always produce an outcome between zero and one. The idea behind maximum likelihood estimation is then to find the set of estimates that maximizes the likelihood function, so let's go through it step by step. First, we need to define the likelihood function for the model. Second, we write down the log-likelihood function: we take the natural logarithm of the likelihood function, because the log-likelihood is a more convenient and computationally efficient function to work with. Third, we find the maximum of this log-likelihood function: this step consists of finding the values of the parameters beta 0 and beta 1 that maximize the log-likelihood. There are many optimization algorithms that can be used to find that maximum, but they are out of the scope of this course and you don't need to know them to become a data scientist and enter the data science field. In the fourth step we estimate the parameters beta 0 and beta 1: once the maximum of the log-likelihood function is found, the parameter values that correspond to that maximum are considered the maximum likelihood estimates of the parameters. In the next step we check the model fit: once the maximum likelihood estimates are obtained, we can check the goodness of fit of the model by calculating information criteria such as AIC, BIC, or R-squared, where AIC stands for the Akaike information criterion, BIC stands for the Bayesian information criterion, and R-squared refers to the same evaluation metric we used for evaluating linear regression. In the final step we make predictions and evaluate the model: using the maximum likelihood estimates, the model can be used to make predictions on new, unseen data, and its performance can then be evaluated with metrics such as accuracy, precision, and recall. Those are metrics that we revisited in the very first lecture of this section, and they are metrics you need to know. So unlike AIC and BIC, which evaluate the goodness of fit of the initial estimates coming from maximum likelihood, accuracy, precision, and recall evaluate the final model, that is, the class predictions we get for the unseen data. If you are wondering what accuracy, precision, recall, or the F1 score are, make sure to head back to the very first lecture in this section, where we covered the exact definitions of these metrics.
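To make those steps concrete, here is a minimal sketch (not the lecturer's code) that fits a simple one-feature logistic regression by maximizing the log-likelihood numerically; the synthetic data, scipy dependency, and variable names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic binary-outcome data, for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=200)
true_p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.random(200) < true_p).astype(float)

def neg_log_likelihood(params):
    beta0, beta1 = params
    p = 1 / (1 + np.exp(-(beta0 + beta1 * x)))   # logistic (S-shaped) function
    p = np.clip(p, 1e-12, 1 - 1e-12)             # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Steps 3-4: maximize the log-likelihood (i.e. minimize its negative)
result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
beta0_hat, beta1_hat = result.x                  # maximum likelihood estimates
print(beta0_hat, beta1_hat)
```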
Let's finally discuss the advantages and disadvantages of logistic regression. Some of the advantages of logistic regression are that it is a simple model, it has low variance, it has low bias, and it provides probabilities. Some of the disadvantages are that logistic regression is unable to model nonlinear relationships: one of its key assumptions is that there is a linear relationship between the independent variables and the log odds of the dependent variable. Logistic regression is also unstable when your classes are well separated, and it becomes very unstable when you have more than two classes. So whenever you have more than two categories in your dependent variable, or whenever your classes are well separated, using logistic regression for classification will not be a smart choice; instead you should look for other models, and one of them is linear discriminant analysis, the LDA that we will introduce in the next lecture. This is all for this lecture, where we looked into logistic regression and maximum likelihood estimation. In the next lecture we will look into LDA, so stay tuned and I will see you there.

Looking to step into machine learning or data science? It's about starting somewhere practical yet powerful, and linear regression, the simple yet most popular machine learning algorithm, isn't just jargon: it's a tool used both for finding out which features in your data matter most and for forecasting the future. That's your starting point in the journey of data science and hands-on machine learning. Embark on a hands-on data science and machine learning project where we are going to find the drivers of Californian house prices. You will clean the data, visualize the key trends, learn how to process your data, and use different Python libraries to understand what drives Californian house values. You are going to learn how to implement linear regression in Python and all the fundamental steps you need to conduct a proper hands-on data science project. At the end of this project you will not only know the different Python libraries used in data science and machine learning, such as pandas, scikit-learn, statsmodels, matplotlib, and seaborn, but you will also be able to put this project on your personal website and on your resume: a concise, step-by-step case study to build your confidence and expertise in machine learning and data science.

In this part we are going to work through a case study in the field of predictive analytics and causal analysis. We are going to use this simple yet powerful technique called linear regression to perform causal analysis and predictive analytics. By causal analysis I mean that we are going to look into the relationships in the data and try to figure out which features have an impact on the housing price, the house value: which features describing a house define and cause the variation in house prices. The goal of this case study is to practice the linear regression model and to get a first feeling of how you can use a simple machine learning model to perform model training and model evaluation, and also use it for causal analysis, where you try to identify features that have a statistically significant impact on your response variable, your dependent variable.

Here is the step-by-step process we are going to follow in order to find out which features define Californian house values. First, we are going to understand the set of independent variables and the response variable for our multiple linear regression model, as well as which techniques and which Python libraries we need to load to conduct this case study, so we will load those libraries and understand why we need each of them. Then we are going to do data loading and data preprocessing. This is a very important step, and I deliberately did not want to skip it or hand you clean data, because in a real hands-on data science job you won't get clean data; you will get dirty data containing missing values and outliers, and those are things you need to handle before you proceed to the fun part, the modeling and the analysis. Therefore we will do missing data analysis and remove the missing data from our Californian house price data, and we will conduct outlier detection, learning different visualization techniques in Python that you can use to identify outliers and remove them. Then we will do data visualization: we will explore the data and produce different plots to learn more about the data, about those outliers, and about different statistical techniques combined with Python. Then we will do correlation analysis to identify potentially problematic features, which is something I suggest you do regardless of the nature of your case study, to understand what kind of variables you have, what the relationships between them are, and whether you are dealing with potentially problematic variables. Then we move to the fun part: performing multiple linear regression to carry out the causal analysis, that is, identifying the features of the Californian house blocks that define the value of the Californian houses. Finally, we will very quickly do another implementation of the same multiple linear regression, to give you not one but two ways of conducting it, because linear regression can be used not only for causal analysis but also as a standalone, common machine learning regression model. So I will also show you how to use scikit-learn as a second way of training the model and then predicting Californian house values. Without further ado, let's get started.

Once you become a data scientist, machine learning researcher, or machine learning engineer, there will be hands-on data science projects where the business comes to you and says: here is our data, and we want to understand which features have the biggest influence on this outcome factor. In our case study, let's assume we have a client who is interested in identifying the features that define the house price: maybe someone who wants to invest in houses, someone interested in buying houses, perhaps renovating them, and then reselling them at a profit, or a long-term investor who buys real estate, holds it for a long time, and sells it later, or buys it for some other purpose. The end goal for this person is to identify which features of a house cause it to be priced at a certain level: which features of the house are causing the price and the value of the house. We are going to use a very popular dataset that is available on Kaggle and originally comes from scikit-learn, called California Housing Prices. I will put the link to this dataset in my GitHub account, under the repository dedicated to this case study, together with additional links you can use to learn more about it. This dataset is derived from the 1990 US Census, using one row per census block group. A block group is the smallest geographical unit for which the US Census Bureau publishes sample data; a block group typically has a population of 600 to 3,000 people. A household is a group of people residing within a single home. Since the average numbers of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts. Let's now look at the variables available in this dataset. We have MedInc, the median income in the block group, which captures the financial level of the block of households. Then we have HouseAge, the median house age in the block group; AveRooms, the average number of rooms per household; AveBedrms, the average number of bedrooms per household; Population, the block group population, so the number of people who live in that block; AveOccup, the average number of household members; and Latitude and Longitude, the coordinates of the block group. As you can see, we are dealing with aggregated data: we don't have data per household, but rather data calculated and aggregated per block. This is very common in data science when we want to reduce the dimension of the data, keep the numbers sensible, and create cross-sectional data, where cross-sectional means we have multiple observations covering a single time period, in this case with the block as the aggregation unit. We have already learned, as part of the theory lectures, the idea of the median: there are different descriptive measures we can use to aggregate data, one of them is the mean and another is the median, and especially when we are dealing with a skewed distribution, one that is not symmetric but right skewed or left skewed, the median is a better representation of the scale of the data than the mean.
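Since the dataset originally ships with scikit-learn, you can also peek at it directly from there; a quick sketch (assuming a reasonably recent scikit-learn version) that loads it with the documentation-style column names listed above:

```python
from sklearn.datasets import fetch_california_housing

# Load the 1990 census block-group data as a pandas DataFrame
housing = fetch_california_housing(as_frame=True)
print(housing.frame.head())     # MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, ...
print(housing.DESCR[:1000])     # built-in description of the variables
```

In the rest of the tutorial, though, we will work with the CSV version downloaded from Kaggle, which uses slightly different column names.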
data um compared to the mean and um in this case we will soon see when representing and visualizing this data that we are indeed dealing with a skewed data so um this basically a very simple a very basic data set with not too many features so great um way to uh get your hands uh uh on with actual machine learning use case uh we will be keeping it simple but yet we will be learning the basics and the fundamentals uh in a very good way such that uh learning more um difficult and more advanced machine learning models will be much more easier for you so let's now get into the actual coding part so uh here I will be using the Google clap so I will be sharing the link to this notebook uh combined with the data in my python for data science repository and you can make use of it in order to uh follow this uh tutorial uh with me so uh we always start with importing uh libraries we can run a l regression uh manually without using libraries by using matrix multiplication uh but I would suggest you not to do that you can do it for fun or to understand this metrix multiplication the linear algebra behind the linear regression but uh if you want to um get handson and uh understand how you can use the new regression like you expect to do it on your day-to-day job then you expect to use um instead libraries such as psychic learn or you can also use the statsmodels.api libraries in order to understand uh this topic and also to get handson I decided to uh showcase this example not only in one library in Cy thir but also the starts models and uh the reason for this is because many people use linear regression uh just for Predictive Analytics and for that using psyit learn this is the go-to option but um if you want to use linear regression for causal analysis so to identify and interpret this uh features the independent variables that have a statistically significant impact on your response variable and then you will need to uh use another Library a very handy one for linear regression which is called uh stats models. 
API and from there you need to import the SM uh functionality and this will help you to do exactly that so later on we will see how nicely this Library will provide you the outcome exactly like you will learn on your uh traditional econometrics or introduction to linear regression uh class so I'm going to give you all this background information like no one before and we're going to interpret and learn everything such that um you start your machine Learning Journey in a very proper and uh in a very um uh high quality way so uh in this case uh first thing we are going to import is the pendence library so we are importing pendis Library as PD and then non pile Library as NP we are going to need pendes uh just to uh create a pendis data frame to read the data and then to perform data wrangling to identify the missing data outliers so common data wrangling and data prosessing steps and then we are going to use npy and npy is a common way to uh use whenever you are visualizing data or whenever you are dealing with metrices or with arrays so pandas and nonp are being used interchangeably so then we are going to use meth plot lip and specifically the PIP plat from it uh and this library is very important um when you want to visualize a data uh then we have cburn um which uh is another handy data visualization library in Python so whenever you want to visualize data in Python then methot leip and Cy uh cburn there are two uh very handy data visualization techniques that you must know if you like this um cooler undertone of colors the Seaburn will be your go-to option because then the visualizations that you are creating are much more appealing compared to the med plot Le but the underlying way of working so plotting scatter plot or lines or um heat map they are the same so then we have the STS mods. 
API uh which is the library from which we will be importing the uh as uh that is the temple uh linear regression model that we will be using uh for our caal analysis uh here I'm also importing the uh from Psychic learn um linear model and specifically the linear regression model and um this one uh is basically similar to this one you can uh use both of them but um it is a common um way of working with machine learning model so whenever you are dealing with Predictive Analytics so we you are using the data not for uh identifying features that have a statistically significant impact on the response variable so features that have an influence and are causing the dependent variable but rather you are just interested to use the data to train the model on this data and then um test it on an unseen data then uh you can use pyit learn so psyit learn will uh will be something that you will be using not only for linear regression but also for a machine learning model I think of uh Canon um logistic regression um random Forest decision trees um boosting techniques such as light GBM GBM um also clustering techniques like K means DB scan anything that you can think of uh that fits in in this category of traditional machine learning model you will be able to find Ayler therefore I didn't want you to limit this tutorial only to the S models which we could do uh if we wanted to use um if we wanted to have this case study for uh specifically for linear regression which we are doing but instead I wanted to Showcase also this usage of psychic learn because pyic learn is something that you can use Beyond linear regression so for all these added type of machine learning models and given that this course is designed to introduce you to the world of machine learning I thought that we will combine this uh also with psychic learning something that you are going to see time and time again when you are uh using python combined with machine learning so then I'm also uh importing the uh training test plate uh from the psychic learn model selection such that we can uh split our data into train and test now uh before we move into uh the uh actual training and testing we need to first load our data so so therefore uh what I did was to uh here uh in this sample data so in a folder in Google collab I uh put it this housing. CSV data that's the data that you can download uh when you go to this specific uh page so uh when you go here um then uh you can also uh download here that data so download 49 kab of this uh housing data and that's exactly what I'm uh downloading and then uploading here in Google clap so this housing. CSV in this folder so I'm copying the path and I'm putting it here and I'm creating a variable that holds this um name so the path of the data so the file uncore path is the variable string variable that holds the path of the data and then what I need to do is that I need to uh take this file uncore path and and I need to put it in the pd. 
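Putting those imports together, the first cell of the notebook looks roughly like this (the aliases are the usual conventions):

```python
import pandas as pd                                    # data loading and wrangling
import numpy as np                                     # arrays and numeric helpers
import matplotlib.pyplot as plt                        # base plotting
import seaborn as sns                                  # nicer statistical plots
import statsmodels.api as sm                           # OLS with full statistical output (causal analysis)
from sklearn.linear_model import LinearRegression      # regression for predictive analytics
from sklearn.model_selection import train_test_split   # train/test splitting
```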
Before we move into the actual training and testing, we first need to load our data. What I did was upload the housing.csv file into a folder in Google Colab; that is the data you can download from the dataset page (about 49 KB). After uploading it, I copy its path and create a string variable called file_path that holds the path of the data. Then I take file_path and pass it to pd.read_csv, the function we use to load data: pd stands for pandas, the short alias, read_csv is the function we take from the pandas library, and within the parentheses we put file_path. If you want to learn more about these basics, variables, data structures, and other basic Python for data science, I will not cover that here in order to keep this tutorial structured, but feel free to check the Python for Data Science course; I will put the link in the description so you can learn that first and then come back to this tutorial on using Python for linear regression. The first thing I tend to do, before moving on to the actual execution stage, is to explore the data. I start by looking at the data fields, the names of the variables available in the data, which you can get with data.columns. We see that we have longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population (basically the number of people living in those houses), households, median_income, median_house_value, and ocean_proximity. You might notice that these variable names are a bit different from the official documentation of the California housing data: the naming is different, but the underlying meaning is the same, and it is common in Python to see underscores and abbreviations in column names; housing_median_age is still the median house age in the block group. One thing you can also notice is that the official documentation does not include the one extra variable we have here, ocean_proximity, which describes how close the house block is to the ocean, something that for some people can definitely mean an increase or a decrease in the house price. The next thing I tend to do is look at the actual data, and one option is to look at the top 10 rows instead of printing the entire DataFrame. When we execute that command we see the top 10 rows of our data: the longitude, the latitude, the housing median age with values like 41, 21, and 52 years (the median age of the houses per block), and the total number of rooms, where for one block the houses have 7,099 rooms in total. So we are already seeing data with large numbers, which is something to take into account when dealing with machine learning models, and especially with linear regression.
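The loading and first-look commands described so far, in one cell; the file path is an assumption, so adjust it to wherever you uploaded housing.csv in Colab:

```python
# Path to the uploaded CSV in Google Colab (adjust to your own location)
file_path = "/content/sample_data/housing.csv"

data = pd.read_csv(file_path)   # load the CSV into a pandas DataFrame

print(data.columns)             # names of the data fields
print(data.head(10))            # first 10 rows of the data
```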
In the remaining columns of the preview we have population, households, median income, median house value, and ocean proximity. One thing you can see right off the bat is that longitude and latitude have some unique characteristics: longitude comes with minuses and latitude with pluses, but that is fine for linear regression, because what it is basically looking at is whether variation in a certain independent variable, in this case longitude or latitude, causes a change in the dependent variable. Just to refresh our memory on what linear regression will do in this case: we are dealing with multiple linear regression, because we have more than one independent variable. Our independent variables are the different features that describe the house, all except median_house_value, which is the dependent variable. That is what we are trying to figure out: which features of the house cause, or define, the house price. We want to identify the features that cause a change in our dependent variable, and specifically what the change in median house value is if we apply a one-unit change to an independent feature. As we learned in the theory lecture, when multiple linear regression is used for causal analysis it keeps all the other independent variables constant and then investigates, for a specific independent variable, what change a one-unit increase in that variable produces in the dependent variable. So if we change housing_median_age by one unit, what is the corresponding change in median house value, keeping everything else constant? That is the idea behind multiple linear regression and behind using it for this specific use case.
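In formula form, this is the standard way to write what was just described, with the betas being the coefficients the model estimates and y_i the median house value of block i:

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i,
\qquad
\frac{\partial\, \mathbb{E}[y_i]}{\partial x_{ij}} = \beta_j
```

Each coefficient beta_j is the expected change in the median house value when the j-th feature increases by one unit, holding all the other features constant.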
To find out the data types and learn a bit more about our data before proceeding to the next step, I tend to use the info function in pandas: given that the data is a pandas DataFrame, I just call data.info(), which shows the data type and the number of non-null values per variable. As we already noticed from the header, and as is confirmed here, ocean_proximity is not a numeric variable: you can see a value like NEAR BAY, which, unlike all the other values, is a string. This is something we need to take into account, because later on, when we do the data preprocessing and actually run the model, we will need to process this specific variable. For the rest, we are dealing with numeric variables: longitude, latitude, and all the other variables, including our dependent variable, are numeric (float64). The only variable that needs to be taken care of is ocean_proximity, which, as we will see, is a categorical string variable, meaning it has different categories. Let's actually check that quickly: if we take the variable name, copied from the overview, and call unique(), we get the unique values of this categorical variable. There are five different unique values, so ocean_proximity can be near the bay, less than one hour from the ocean, inland, near the ocean, or on an island. This means we are dealing with a feature that describes the distance of the block from the ocean, and the underlying idea is that maybe this feature has a statistically significant impact on the house value: it might be that for some people, in certain areas or countries, living near the ocean increases the value of the house. If there is high demand for houses near the ocean, because people prefer to live near the ocean, then there will most likely be a positive relationship; if people in that area of California do not prefer to live near the ocean, we will see a negative relationship, meaning houses farther from the ocean will have higher values. This is something we want to figure out with linear regression: we want to understand which features define the value of the house, so that we can say that if a house has certain characteristics, its price will most likely be higher or lower. Linear regression helps us understand not only which those features are, but also how much higher or lower the value of the house will be if a certain characteristic increases by one unit.
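The two inspection calls just described look like this (the exact category labels shown in the comment are how they appear in the Kaggle CSV):

```python
data.info()                                  # dtypes and non-null counts per column
print(data["ocean_proximity"].unique())      # e.g. ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']
```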
Next we are going to look into the missing data. In order to have a proper machine learning model we need to do some data processing, so we check for missing values in our data and we need to understand the amount of null values per data field. This will help us decide whether we can simply remove those missing values or whether we need to do imputation; depending on the amount of missing data, we can choose which of those solutions to take. Here we can see that we don't have any null values for longitude, latitude, housing median age, or any of the other variables, except for one independent variable: total_bedrooms. Out of all the observations we have, the total_bedrooms variable has 207 cases where the corresponding information is missing. When we express these numbers as percentages, which is something you should do as your next step, we see that out of the entire dataset only about 1% of the total_bedrooms values are missing. This is really important, because simply looking at the raw count of missing observations per data field will not help you understand, relatively, how much of the data is missing. If a variable is 50% or 80% missing, it means that for the majority of your house blocks you don't have that information, and including it will not be beneficial for your model, nor accurate: it will result in a biased model, because if you have no information for most observations and information for only some, you will automatically skew your results. Therefore, if a variable is missing for the majority of your dataset, I would suggest dropping that independent variable. In our case only about 1% of the house blocks are missing this information, which gives me confidence to keep the independent variable and instead drop the observations that do not have total_bedrooms information. Another solution would be, instead of dropping, to use some kind of imputation technique: finding a way to systematically replace the missing values, for example mean imputation, median imputation, or more advanced model-based statistical or econometric approaches. That is out of the scope of this problem, but as a rule of thumb: look at the percentage of observations for which the independent variable is missing; if it is low, say less than 10%, and you have a large dataset, you should be comfortable dropping those observations, but if you have a small dataset, say only 100 observations of which 20% or 40% are missing, then consider imputation and try to find values that can replace the missing ones. Once we have this information and have identified the missing values, the next thing is to clean the data. Here I take the data and use the dropna function, which drops the observations where a value is missing, so I drop all the observations for which total_bedrooms has a null value and get rid of my missing observations.
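A minimal version of that check-and-drop step, using the column name as it appears in the Kaggle CSV:

```python
# Count missing values per column, then express them as percentages
print(data.isnull().sum())            # total_bedrooms should show 207 missing values
print(data.isnull().mean() * 100)     # the same information in percent

# Only about 1% of rows are affected, so we simply drop them
data = data.dropna(subset=["total_bedrooms"])
```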
After doing that, I check whether I really got rid of the missing observations: when I print data.isna().sum(), summing up the number of missing values per variable, I no longer have any missing observations, so I successfully deleted them all. The next step is to describe the data through descriptive statistics and data visualization. Before moving on to causal analysis or predictive analysis with any traditional machine learning approach, try to first look into the data and understand it: see whether you spot patterns, what the mean of the different numeric data fields is, whether you have certain categorical values that cause unbalanced data. Those are things you can discover early on, before model training and testing, instead of blindly believing the numbers; data visualization and data exploration are great ways to understand the data you have before using it to train and test a machine learning model. Here I use the traditional describe function of pandas, data.describe(), which gives me the descriptive statistics of my data. We can see that in total we have 20,640 observations, and per variable we have the same count, which means all variables have the same number of rows. Then we have the mean of each variable, their standard deviation (the square root of the variance), the minimum, the maximum, and also the 25th, 50th, and 75th percentiles. Percentiles and quartiles are statistical terms we use often: the 25th percentile is the first quartile, the 50th percentile is the second quartile, or the median, and the 75th percentile is the third quartile. These percentiles help us understand the thresholds that split the observations, for example the value below which 25% of the observations fall and above which the remaining 75% lie. The standard deviation helps us interpret the variation in the data on the scale of that variable. In this case, for median_house_value the mean is approximately 206,000, so roughly 206K, and the standard deviation is about 115K. This means that in the dataset we will find blocks where the median house value is around 206K plus 115K, so around 321K, and there will also be blocks where the median house value is around 206K minus 115K, so around 91K. That is the idea behind the standard deviation: the variation in your data. Next we can interpret the minimum and maximum of the numeric data fields: the minimum tells you the smallest value per data field and the maximum the largest, so the range of values you are looking at. For median_house_value this means the smallest and the largest median house value per block, which helps you understand, in this aggregated data, which blocks have the cheapest houses in terms of valuation and which blocks are the most expensive. We can see that in the cheapest block the median house value is about 15K (14,999), and the block with the highest valuation has a median house value of $500,001, which means that in the most expensive blocks the median house value tops out around 500K. The next thing I tend to do is visualize the data, and I start with the dependent variable, the variable of interest, the target or response variable, which in our case is median_house_value. I want to plot a histogram to understand the distribution of median house values: I want to see which median house values appear most frequently in the data and which blocks have unique, less frequently appearing median house values. By plotting this type of plot you can spot outliers, frequently appearing values, and values lying outside the usual range, which helps you learn more about your data and identify outliers in it. Here I use the seaborn library; given that I already imported the libraries earlier, there is no need to import them again. First I set the style, which basically says the background should be white with grid lines behind the plot. Then I initialize the size of the figure: plt, which comes from matplotlib's pyplot, and a figure size of 10 by 6. Then comes the main plot: I use the histplot function from seaborn, I take from the cleaned data (from which we removed the missing values) the variable of interest, median_house_value, and I plot its histogram using the forest green color. Then I set the title of the figure, "Distribution of Median House Values", the x label, the name of the variable on the x-axis, which is the median house value, the y label, the name on the y-axis, and finally plt.show(), which means: show me the figure.
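Reconstructed from the description above, the plotting cell looks roughly like this; the color, title, and figure size are the ones mentioned, while "Frequency" as the y-label is an assumption:

```python
sns.set_style("whitegrid")                       # white background with light grid lines
plt.figure(figsize=(10, 6))                      # initialise the figure size
sns.histplot(data["median_house_value"], color="forestgreen")
plt.title("Distribution of Median House Values")
plt.xlabel("Median House Value")
plt.ylabel("Frequency")
plt.show()
```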
That is basically how visualization works in Python: we first set the figure size, then call the plotting function on the right variable, providing the data to the visualization, then set the title, the x label and y label, and then ask it to show the visualization. If you want to learn more about these visualization techniques, make sure to check the Python for Data Science course, which will help you understand, slowly and in detail, how to visualize your data. What we are visualizing here is the frequency of the median house values in the entire dataset: the number of times each median house value appears in the data. We want to understand whether there are certain median house values that appear very often and certain ones that do not appear often; the latter might be considered outliers, because we want to keep in our data only the most relevant and representative data points. We want to derive conclusions that hold for the majority of our observations, not for outliers, and we will then use that representative data to run our linear regression and draw conclusions. Looking at this graph, we can see a cluster of median house values that appear quite often, the cases where the frequency is high: for instance, a median house value of about 160K to 170K appears very frequently, with a frequency above 1,000. Those are the most frequently appearing median house values. There are also blocks whose median house value does not appear often, where the frequency is very low; roughly speaking, those are unusual blocks and can be considered outliers. In our population of California house blocks you will most likely see blocks whose median value is between, say, 70K and 300K to 350K, but anything below or above that is unusual: you don't often see house blocks with a median house value of less than 60K or 70K, or above 370K to 400K. Do keep in mind that we are dealing with data from 1990, not current prices; nowadays Californian houses are much more expensive, so take that into account when interpreting these visualizations. What we can then do is use the idea of the interquartile range to remove these outliers. This means we look at the lowest 25 percent, the first quartile or 25th percentile, and at the upper 25 percent, the third quartile or 75th percentile, and we use them to identify the observations, the blocks, whose median house value falls below the lower threshold or above the upper threshold. Basically we want to keep the middle part of our data: the blocks where the median house value is above the smallest 25 percent and below the largest 25 percent, the so-called normal and representative blocks. The statistical term we are using is the interquartile range; you don't need to memorize the name, but I think it is worth understanding, because it is a very popular way of making a data-driven removal of outliers. I select the 25th percentile using the quantile function from pandas: I ask for the value that splits my block observations into those whose median house value is below it and those whose median house value is above it, which gives me Q1, and I will similarly use Q3 to handle the very large median house values, the upper 25 percent. To compute the interquartile range we take Q3 and subtract Q1. To understand Q1 and Q3, the quartiles, a bit better, let's actually print them. We find that Q1, the 25th percentile or first quartile, is about $119,500, which means the smallest 25 percent of observations have a median house value below $119,500 and the remaining 75 percent of observations have a median house value above it. Q3, the third quartile or 75th percentile, is the threshold separating the lowest 75 percent of median house values from the most expensive 25 percent: that value is $264,700, which means the top 25 percent of blocks by median house value are above $264,700. Those extremes are what we want to remove: the observations with the smallest and the largest median house values. It is common practice in the interquartile range approach to multiply the IQR by 1.5 to obtain the lower bound and the upper bound, the thresholds we use to remove blocks where the median house value is very small or very large. So we multiply the IQR by 1.5; subtracting this value from Q1 gives us the lower bound, and adding it to Q3 gives us the upper bound.
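A sketch of that filter on the dependent variable; the quantile values in the comments are the approximate figures quoted above:

```python
# Interquartile-range (IQR) filter on the dependent variable
Q1 = data["median_house_value"].quantile(0.25)   # roughly 119,500 in this data
Q3 = data["median_house_value"].quantile(0.75)   # roughly 264,700 in this data
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only the "normal", representative blocks
data = data[(data["median_house_value"] >= lower_bound) &
            (data["median_house_value"] <= upper_bound)]
print(len(data))                                  # number of remaining observations
```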
After we clean these outliers from our data, we end up with a smaller dataset: previously we had about 20,433 observations and now we have about 19,369, so we removed roughly a thousand, or a bit over a thousand, observations. Next, let's look at some other variables, for instance the median income. Another technique we can use to identify outliers in the data is the box plot; I wanted to showcase several approaches for visualizing data and identifying outliers so that you become familiar with different techniques. So let's go ahead and plot the box plot. A box plot is a statistical way to represent your data: the central box represents the interquartile range, the IQR, with the bottom and top edges indicating the 25th percentile (the first quartile) and the 75th percentile (the third quartile) respectively, so the dark box you see covers the middle 50% of the data for median income. The line inside the box, the one in a contrasting color, represents the median of the dataset, the middle value when the data is sorted in ascending order. Then we have the whiskers in our box plot, the lines extending from the top and bottom of the box, which indicate the range of the rest of the data excluding outliers; they typically extend 1.5 times the IQR above Q3 and 1.5 times the IQR below Q1, just like the bounds we used when removing outliers from the median house value. To identify the outliers, you can quickly see all these points lying more than 1.5 times the IQR above the third quartile, the 75th percentile: those are blocks of houses with an unusually high median income, which is something we want to remove from our data. Therefore we use exactly the same approach as before: identify Q1, the 25th percentile or first quartile, and Q3, the third quartile or 75th percentile, compute the IQR, obtain the lower bound and the upper bound using 1.5 as the scale, and then use those bounds as filters to remove from the data all the observations whose median income falls below the lower bound or above the upper bound. We use the lower bound and the upper bound to perform double filtering, two filters in the same row: with parentheses and the & operator we tell Python that, first, the observation must have a median income above the lower bound and, at the same time, the block must have a median income below the upper bound. If a block satisfies both criteria, we are dealing with a good, normal point and we keep it, and the result becomes our new data. Let's go ahead and execute this code; in this case, as we can see, all our outliers lie in the upper part of the box plot, and after filtering we end up with the cleaned data.
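A sketch of the box plot and the same IQR filter applied to median income; the plot title is an assumption, the column names follow the Kaggle CSV:

```python
# Box plot to spot outliers in median income
plt.figure(figsize=(10, 6))
sns.boxplot(x=data["median_income"])
plt.title("Box Plot of Median Income")
plt.show()

# Same interquartile-range filter, this time on median_income
Q1 = data["median_income"].quantile(0.25)
Q3 = data["median_income"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep rows that satisfy both conditions at once (note the parentheses and the &)
data_cleaned = data[(data["median_income"] >= lower_bound) &
                    (data["median_income"] <= upper_bound)]
```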
I take this cleaned data and, for simplicity, assign it back to the variable data. This data is now much cleaner and a better representation of the population, which is what we want, because we want to find the features that describe and define the house value not based on unique and rare houses that are too expensive or that sit in blocks with very high-income residents, but based on the true representation, the most frequently appearing data: the features that define the house value for common houses, common areas, and people with average or normal income. That is what we want to find. The next thing I tend to do, especially for regression analysis and causal analysis, is to plot the correlation heat map. This means computing the correlation matrix, the pairwise correlation score for each pair of variables in our data. One of the assumptions of linear regression that we learned in the theory part is that we should not have perfect multicollinearity, meaning there should not be a high correlation between pairs of independent variables: knowing one should not automatically tell us the value of another independent variable. If the correlation between two independent variables is very high, we might be dealing with multicollinearity, which is something we want to avoid, and a heat map is a great way to identify whether we have such problematic independent variables and whether we need to drop one, or several, of them to ensure we have a proper linear regression model whose assumptions are satisfied. Looking at this correlation heat map, which we plot with seaborn, the colors range from very light, close to white, to very dark green, where light means a strong negative correlation and very dark green means a very strong positive correlation. We know that the Pearson correlation takes values between minus one and one: minus one means a very strong negative correlation and one a very strong positive correlation. The correlation of a variable with itself, for example longitude with longitude, is equal to one, which is why the diagonal consists of ones: those are the pairwise correlations of the variables with themselves. The values below the diagonal mirror the ones above it, because the correlation between the same two variables is the same regardless of which one you put first: the correlation between longitude and latitude is the same as the correlation between latitude and longitude.
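The heat map itself can be produced roughly like this; the green palette and the annotation format are assumptions chosen to match the colors described, and numeric_only keeps the still-present string column out of the correlation matrix:

```python
data = data_cleaned                                   # continue with the outlier-free data

# Pairwise Pearson correlations between the numeric columns
corr_matrix = data.corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap="Greens", fmt=".2f")
plt.title("Correlation Heat Map")
plt.show()
```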
Now that we have refreshed our memory on correlations, let's look at the actual numbers in this heat map. There is a section where independent variables have a low positive correlation with the remaining independent variables: you can see light green values indicating a weak positive relationship between those pairs. What is very interesting is the middle part of the heat map, where we have the dark cells; the numbers below the diagonal are the ones we interpret, and remember that below and above the diagonal are mirror images. Here we already see a problem, because we are dealing with variables, which are going to be independent variables in our model, that are highly correlated. Why is this a problem? Because one of the assumptions of linear regression, as we saw during the theory section, is that we should not have multicollinearity. When we have perfect multicollinearity, it means we are dealing with independent variables whose correlation is so high that knowing the value of one lets us almost automatically know the value of the other, and when the correlation is 0.93, which is very high, or 0.98, those two independent variables have an extremely strong positive relationship. This is a problem because it can cause our model to produce very large standard errors and an inaccurate, poorly generalizing model, which we want to avoid; we want the assumptions of our model to be satisfied. The variables in question are total_bedrooms and households, meaning the total number of bedrooms per block and the number of households are highly positively correlated, and that is a problem. Ideally we want to drop one of those two independent variables, and the reason we can do that is that, given they are highly correlated, they already explain similar information, they contain similar variation, so including both doesn't make sense: on the one hand it potentially violates the model assumptions, and on the other hand it doesn't add much value, because the other variable already shows similar variation. Since total_bedrooms basically contains similar information to households, we might as well drop one of the two. The question is which one, and we can decide by also looking at the other correlations here: total_bedrooms is highly correlated with households, but total_rooms also has a very high correlation with households, so there is yet another independent variable highly correlated with households, and total_rooms is also highly correlated with total_bedrooms. So we can ask which variable most frequently shows high correlations with the rest of the independent variables, and it looks like the two largest numbers involve total_bedrooms: it has a correlation of 0.93 with total_rooms and at the same time a very high correlation of 0.98 with households, which means total_bedrooms has the highest correlations with the remaining independent variables. So we might as well drop this independent variable, but before doing that I suggest one more quick visual check: look at the correlation of total_bedrooms with the dependent variable, to understand how strong a relationship it has with the response variable.
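That check is a one-liner on the correlation matrix we already computed (reusing the corr_matrix variable from the heat map sketch above):

```python
# How strongly does each numeric feature correlate with the dependent variable?
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```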
variable we are modeling. We see that total_bedrooms has only about a 0.05 correlation with the response variable, the median house value, while total_rooms has a much higher one, so I already feel comfortable excluding total_bedrooms from our data in order to make sure we are not dealing with perfect multicollinearity. That is exactly what I am doing here: I am dropping total_bedrooms, and after doing that we no longer have total_bedrooms as a column.

Before moving on to the actual causal analysis there is one more step I want to show you, which is super important for causal analysis and is standard introductory econometrics. When you have a string categorical variable there are a few ways to deal with it. One quick way you will see on the web is to simply encode the categories as integers, which means transforming all these string values, NEAR BAY, <1H OCEAN, INLAND, NEAR OCEAN, ISLAND, into numbers such as 1, 2, 3, 4, 5 for the ocean_proximity variable. That can be done, but a better way to use this type of variable in linear regression is to transform the string categorical variable into what we call dummy variables. A dummy variable takes two possible values; it is usually a binary, Boolean variable that can be zero or one, where one means the condition is satisfied and zero means it is not. Let me give you an example. In this specific case ocean_proximity is a single variable with five different values. What we will do is use the get_dummies function from pandas to go from this one variable to five different variables, one per category: a separate binary dummy variable for whether the block is NEAR BAY, whether it is less than one hour from the ocean, whether it is INLAND, whether it is NEAR OCEAN, and whether it is ISLAND, each taking the value zero or one. So we go from one string categorical variable to five dummy variables, one for each of those five categories; we combine them with the rest of the data and then drop the original ocean_proximity column. On one hand we are getting rid of a string variable, which is problematic for linear regression when combined with scikit-learn, because scikit-learn cannot handle this type of data directly in a linear regression, and on the other hand we are making our job easier when it comes to interpreting the results, because interpreting linear regression for causal analysis is much easier with dummy variables than with a single string categorical variable.
To give an example: if we create five dummy variables from this one string variable, those are the five dummy columns you can see here. Take one category, say ocean_proximity_INLAND. For all the rows where its value is equal to zero, the criterion is not satisfied, which means the block of houses we are dealing with is not inland; and whenever the value is equal to one, the criterion is satisfied and we are dealing with blocks of houses that are indeed inland.

One thing to keep in mind when transforming a string categorical variable into a set of dummies is that you always need to drop at least one of the categories. The reason is the assumption we learned in the theory: there should be no perfect multicollinearity. We cannot include five dummy variables that are perfectly collinear; if we include all of them, then as soon as we know that a block of houses is not NEAR BAY, not less than one hour from the ocean, not INLAND and not NEAR OCEAN, we automatically know it must belong to the remaining category, so for all those blocks ocean_proximity_ISLAND will be equal to one. That is exactly the definition of perfect multicollinearity, so to avoid violating one of the OLS assumptions we need to drop one of the categories, and that is exactly what I am doing here. Let's first look at the full set of dummy columns we got: <1H OCEAN, INLAND, ISLAND, NEAR BAY and NEAR OCEAN. Let's drop one of them, say ISLAND. The notebook is not letting me add a code cell right here, so the call is simply data = data.drop with the name of the column within quotation marks and axis=1. In this way I drop one of the dummy variables I created, so that the no-perfect-multicollinearity assumption is not violated. Once I print the columns we should see that column disappear, and there we go, we successfully deleted that variable. Let's also get the head: you can see that we no longer have a string in our data; instead, out of a string categorical variable with five categories we got four additional binary variables.
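A minimal sketch of this dummy-variable step in pandas; the exact category spellings (<1H OCEAN, INLAND, ISLAND, NEAR BAY, NEAR OCEAN) are the standard ones in the California housing data, so adjust them if yours differ. pd.get_dummies also offers a drop_first argument that drops a reference category for you, but here the category to drop is chosen explicitly:

```python
import pandas as pd

# One binary (dummy) column per category of the string variable
dummies = pd.get_dummies(data["ocean_proximity"], prefix="ocean_proximity", dtype=int)

# Replace the original string column with the dummy columns
data = pd.concat([data.drop(columns=["ocean_proximity"]), dummies], axis=1)

# Drop one category (ISLAND here) so the remaining dummies are not perfectly collinear
data = data.drop("ocean_proximity_ISLAND", axis=1)

print(data.columns)
print(data.head())
```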
All right, now we are ready to do the actual work. When it comes to training a machine learning or statistical model, we learned during the theory that we always need to split the data into a train and a test set; that is the minimum. In some cases we also split into train, validation and test, so that we can train the model on the training data, optimize it on the validation data to find the optimal set of hyperparameters, and then apply the fitted and optimized model to unseen test data. We are going to skip the validation set for simplicity, especially given that we are dealing with a very simple machine learning model like linear regression, and we will split our data into train and test only.

First I create the list of names of the variables we are going to use to train our machine learning model: we have a set of independent variables and a dependent variable. In our multiple linear regression the set of independent variables is longitude, latitude, housing median age, total rooms, population, households, median income, and the four dummy variables we built from the categorical variable. Then I specify that the target, the response or dependent variable, is the median house value. This is the variable we want to target, because we want to see which features, which independent variables out of the set of all features, have a statistically significant impact on the dependent variable, the median house value: we want to find out which of these features describing the houses in a block cause a change, a variation, in the target variable. So here X is equal to the data with all the features with those names, and the target is the median house value, the column we select from the data, so we are simply subsetting and selecting columns.

What I use here is the train_test_split function from scikit-learn. You might recall that in the beginning we imported the model_selection module, and from scikit-learn's model_selection we imported the train_test_split function. This is a function you are going to need a lot in machine learning, because it is a very easy way to split your data. The arguments of this function are: first, the matrix or data frame that contains the independent variables, in our case X; the second argument is the dependent variable, y; then we have test_size, which specifies what proportion of the observations you want to put in the test set and, implicitly, what proportion stays in the training set. If you pass 0.2 it means you want your test set to be 20% of your entire data, and the remaining 80% will be your training data; the function automatically understands that you want this 80/20 division, 80% training and 20% test. Finally you can also add the random_state: the split is random, the observations are selected randomly from the entire dataset, and to make sure your results are reproducible, that the next time you run this notebook you get the same results, and that you and I get the same results, we use a random state; 111 is just a number I liked and decided to use here. When we run this command you can see that the training set has roughly 15K observations and the test set roughly 3.8K, and looking at those numbers confirms that we are indeed dealing with the 80/20 split.
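A sketch of this split; the dummy column names below assume pd.get_dummies was run with the ocean_proximity prefix as in the earlier sketch, so rename them if your columns differ:

```python
from sklearn.model_selection import train_test_split

# Features and target for the multiple linear regression
features = ["longitude", "latitude", "housing_median_age", "total_rooms",
            "population", "households", "median_income",
            "ocean_proximity_<1H OCEAN", "ocean_proximity_INLAND",
            "ocean_proximity_NEAR BAY", "ocean_proximity_NEAR OCEAN"]
X = data[features]
y = data["median_house_value"]

# 80/20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=111)

print(X_train.shape, X_test.shape)
```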
Then we move on to the training. One thing to keep in mind is that here we are using the sm alias that we imported from statsmodels.api; this is one library we can use to conduct our causal analysis and train a linear regression model. When we use this library, it does not automatically add the first column of ones to your set of independent variables: it only looks at the features you have provided, and those are all the independent variables. But we learned from the theory that in linear regression we always add an intercept, the beta-zero; if you go back to the theory lectures you can see this beta-zero added both to the simple linear regression and to the multiple linear regression. It ensures that we look at the intercept and see what the average, in this case the average median house value, is when all the other features are equal to zero. Therefore, given that statsmodels.api does not add this constant column for the intercept, we need to add it manually, so we call sm.add_constant on X_train, which means our X table, our X data frame, now has a column of ones added to the features. Let me actually show you this before doing the training, because it is something you should be aware of. If we pause here, I print X_train_const and I also print the same feature data frame before adding the constant, so you can see what I mean: before, it is just the set of all columns that form the independent variables, the features; after adding the constant, we have this initial column of ones. This is so that we end up with a beta-zero, the intercept, and can perform a valid multiple linear regression; otherwise you don't have an intercept, and that is just not what you are looking for. The scikit-learn library does this automatically, so when you are using statsmodels.api you should add this constant, and later I use scikit-learn without adding it. If you are wondering why we use this specific library at all, as we already discussed, just to refresh your memory: we are using statsmodels.api because it has the nice property of displaying a summary of your results, your p-values, your t-tests, your standard errors, which is exactly what you are looking for when you are performing a proper causal analysis and want to identify the features that have a statistically significant impact on your dependent variable. If you are using a machine learning model, including linear regression, only for predictive analytics, then you can use scikit-learn without worrying about statsmodels.api.
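A minimal sketch of this step, assuming X_train from the split above:

```python
import statsmodels.api as sm

# statsmodels does not add an intercept column by itself, so add it explicitly
X_train_const = sm.add_constant(X_train)

print(X_train.head())        # features only
print(X_train_const.head())  # same features plus a leading column of ones ("const")
```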
So much for adding the constant; now we are ready to actually fit, or train, our model. For that we use sm.OLS, where OLS is the ordinary least squares estimation technique that we also discussed as part of the theory. We first provide the dependent variable, y_train, and then the feature set, X_train_const. Then we call .fit(), which means: take the OLS model, use y_train as my dependent variable and X_train_const as my set of independent variables, and fit the OLS algorithm, the linear regression, on this specific data. If you are wondering why y_train and X_train, and what the difference between train and test is, make sure to revisit the theory lectures, because there I go in detail into this concept of training and testing and how we divide the data into train and test. The Y and X, as we have already discussed during this tutorial, are simply the distinction between the independent variables, denoted by X, and the dependent variable, denoted by y. So y_train and y_test are the dependent variable for the training data and the test data, and X_train and X_test are the training features and the test features. We use X_train and y_train to fit our model, to learn from the data, and then when it comes to evaluating the model we take the fitted model, which has learned from both the dependent variable and the independent variables, apply it to the unseen data X_test, obtain the predictions, and compare them to the true values y_test, to see how different y_test is from the predictions for this unseen data and to evaluate how well the model performs, how well it manages to predict median house values based on the fitted model and unseen data. This is just background information and a refresher; in this case we fit the model on the training dependent variable and the training independent variables with the constant added, and then we are ready to print the summary.
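A compact sketch of this fit, assuming y_train and the constant-augmented X_train_const created above:

```python
import statsmodels.api as sm

# Fit ordinary least squares on the training data and show the full summary
model_fitted = sm.OLS(y_train, X_train_const).fit()
print(model_fitted.summary())   # coefficients, std errors, t-stats, p-values, R-squared
```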
Now let's interpret those results. The first thing we can see is that all the coefficients, all the independent variables, are statistically significant. How can I say this? If we look at the column of p-values, which is the first thing you need to look at when you get the results of a causal analysis with linear regression, we see that the p-values are very small. Just to refresh our memory: the p-value is the probability of obtaining a test statistic this large purely by random chance, that is, of seeing a seemingly significant result just by chance rather than because the null hypothesis is actually false and should be rejected. Beyond that, the summary gives us much more. The first thing you can do is verify that you used the correct dependent variable: you can see that the dependent variable is the median house value, the model used to estimate the coefficients is OLS, and the method is least squares, which is simply the underlying technique of minimizing the sum of squared residuals. The date we are running this analysis is the 26th of January 2024. We have the number of observations, which is the number of training observations, the 80% of our original data.

We have the R squared, which is the metric that showcases the goodness of fit of your model. R squared is commonly used in linear regression specifically to identify how well your model is able to fit the data with this linear regression line; its maximum is one and its minimum is zero. An R squared of about 0.59 means that all the independent variables you have included are able to explain 59% of the entire variation in your response variable, the median house value. What does this mean? On one hand it means you have reasonably good information: anything above 0.5 is quite good, because you are able to explain more than half of the variation in the median house value. On the other hand it means there is roughly 40% of the variation, information about the house values, that you don't have in your data, so you might consider looking for additional independent variables to add on top of the existing ones, to increase the amount of information and variation your model can explain. R squared is the go-to way to describe the quality of your regression model.

Another thing we have is the adjusted R squared. In this specific case the adjusted R squared and the R squared are the same, about 0.59, which usually means you are fine in terms of the number of features you are using. Once you overwhelm your model with too many features you will notice the adjusted R squared diverging from the R squared: the adjusted R squared helps you understand whether your model performs well only because you keep adding variables or because they really contain useful information, since the R squared automatically increases just because you add more independent variables, even when they are not useful and only add complexity and possibly overfitting without providing any extra information.

Then we have the F-statistic, which corresponds to the F-test. The F-test comes from statistics; you don't strictly need to know it, but check out the Fundamentals of Statistics course if you want to. It tests whether all the independent variables jointly help explain your dependent variable, the median house value. If the F-statistic is very large, or the p-value of the F-statistic is very small, 0.00, it means your independent variables are jointly statistically significant: together they help explain the median house value and have a statistically significant impact on it, which means you have a good set of independent variables. Then we have the log-likelihood, not super relevant here, and the AIC and BIC, which stand for the Akaike information criterion and the Bayesian information criterion. Those are not necessary to know for now, but once you advance in your career in machine learning it may be useful to know them at a high level; for now, think of them as values that help quantify the information gained by adding this set of independent variables to your model. This is optional, so feel free to ignore it for now.
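If you prefer to pull these headline numbers out of the fitted results object programmatically instead of reading them off the summary table, a minimal sketch (assuming the model_fitted object from the fit above):

```python
# Headline statistics exposed as attributes of the statsmodels results object
print("R-squared:          ", round(model_fitted.rsquared, 3))
print("Adjusted R-squared: ", round(model_fitted.rsquared_adj, 3))
print("F-statistic:        ", round(model_fitted.fvalue, 1))
print("F-test p-value:     ", model_fitted.f_pvalue)
print("AIC / BIC:          ", round(model_fitted.aic, 1), "/", round(model_fitted.bic, 1))
```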
Okay, let's now go to the fun part. In the middle part of the summary table we first get the set of independent variables: the constant, which is the intercept, then longitude, latitude, housing median age, total rooms, population, households, median income, and the four dummy variables we created. Then we have the coefficients corresponding to those independent variables; these are the beta-zero-hat, beta-one-hat, beta-two-hat and so on, the parameters of the linear regression model that our OLS method has estimated from the data we provided. Before interpreting these independent variables, the first thing to do, as I mentioned in the beginning, is to look at the p-value column. It showcases which independent variables are statistically significant, and the table you get from statsmodels.api is usually at the 5% significance level, so alpha, the threshold of statistical significance, equals 0.05, and any p-value smaller than 0.05 means you are dealing with a statistically significant independent variable. The next thing you can see, to the left, is the t-statistic; this p-value is based on a t-test. The t-test, as we learned during the theory (and you can check the Fundamentals of Statistics course from LunarTech for a more detailed understanding of this test), states a hypothesis about whether each individual independent variable has a statistically significant impact on the dependent variable, and whenever this t-test has a p-value smaller than 0.05 you are dealing with a statistically significant independent variable. In this case we are super lucky: all our independent variables are statistically significant.
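A small sketch for extracting the pieces of the coefficient table discussed here from the fitted results object (again assuming model_fitted from above):

```python
import pandas as pd

# Coefficients, standard errors and p-values side by side
coef_table = pd.DataFrame({
    "coefficient": model_fitted.params,
    "std_error": model_fitted.bse,
    "p_value": model_fitted.pvalues,
})
print(coef_table)

# Variables significant at the 5% level
print(coef_table[coef_table["p_value"] < 0.05].index.tolist())
```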
The next question is whether we have a positive or a negative statistically significant effect, and that is something you can see from the signs of the coefficients: longitude has a negative coefficient, latitude a negative coefficient, housing median age a positive coefficient, and so on. A negative coefficient means that the independent variable causes a negative change in the dependent variable. More specifically, let's look at total_rooms, whose coefficient is about -2.67. It means that if we increase the total number of rooms by one unit, one more room added to total_rooms, then the median house value decreases by about 2.67. You might wonder how this is possible. First of all the coefficient is quite small, so the relationship is not very strong; but you can also explain it: at some point adding more rooms simply doesn't add value, and in some cases it even decreases the value of the house. That might be the case here; at least it is what this data shows. In general, a negative coefficient means that a one-unit increase in that specific independent variable, everything else constant, results in a decrease in the median house value, in this case of about 2.67 for total rooms. This "everything else constant" is what econometrics calls ceteris paribus. One more time, to make sure we are clear: if we add one more room to the total number of rooms, the median house value decreases by about 2.67 dollars, provided the longitude, latitude, housing median age, population, households, median income and all the other characteristics stay the same.

Now let's look at the opposite case, where the coefficient is positive and large, which is the housing median age. It means that if we have two blocks of houses with exactly the same characteristics, the same longitude and latitude, the same total number of rooms, population, households, median income, the same distance from the ocean, and one of them has one additional year of median age, so it is one year older, then the median house value of that block is higher by about 846 dollars compared to the block that has all the same characteristics except a median age one year lower. So one additional year of median age results in an 846 dollar increase in the median house value, everything else constant. That is the idea behind negative and positive coefficients and their magnitude.

Now let's look at one dummy variable and explain how to interpret it; it is a good way to understand how dummy variables are read in the context of linear regression. One of the independent variables is ocean_proximity_INLAND, and its coefficient is -2.108e+05, so approximately -210K. What this means is the following: suppose we have two blocks of houses with exactly the same characteristics, the same longitude and latitude, the same housing median age, the same total number of rooms, population, households and median income, with the single difference that one block is inland in terms of ocean proximity and the other block is not. Remember that the reference category we removed was ISLAND. Then the block that is inland has, on average, a median house value that is about 210K lower than the block with exactly the same characteristics that is not inland, for instance one that is on an island. With dummy variables there is always an underlying reference category, the one you deleted when you created the dummies from the string categorical variable, and you need to interpret each dummy variable relative to that reference category. This might sound complex, but it really isn't; it is a matter of practice and of understanding the idea of a dummy variable: either the criterion is satisfied or it is not.
In this specific case it means that if we have two blocks of houses with exactly the same characteristics, and one block is inland while the other one is not, for instance it is on an island, then the inland block will on average have a median house value about 210,000 dollars lower than the other block when it comes to ocean proximity. This kind of makes sense, because in California people might prefer living in island locations, so houses in ISLAND locations may be in higher demand than houses in INLAND locations.

So the longitude has a statistically significant impact on the median house value, the latitude as well, the housing median age causes a statistically significant difference in the median house value when it changes, the total number of rooms has an impact, the population has an impact, and so do households, median income and the proximity to the ocean; their p-values are essentially zero, smaller than 0.05, which means they all have a statistically significant impact on the median house value in the Californian housing market. As for the interpretation, we only walked through a few coefficients for the sake of simplicity and to keep this case study from running too long, but what I would suggest you do is interpret all of the coefficients: we interpreted the housing median age and the total number of rooms, but you can also interpret the population and the median income, and we interpreted one of the dummy variables, so feel free to interpret the other ones too. By doing this you could even build a short case study paper, one or two pages explaining the results you obtained, which would showcase that you understand how to interpret linear regression results.

Another thing I would suggest is to comment on the standard errors. Looking at them now, we can see that the standard errors are huge, and this is a direct result of the fourth assumption being violated. This case study is important and useful precisely because it showcases what happens when some of your assumptions are satisfied and others are violated. In this specific case the assumption that the errors have constant variance is violated: we have a heteroscedasticity issue, and we see it reflected in our results. It is a good example of a case where, even without formally checking the assumptions, the very large standard errors already hint that heteroscedasticity is most likely present and the homoscedasticity assumption is violated. Keep this idea of large standard errors in mind, because we will see that it also becomes a problem for the performance of the model and that we obtain a large prediction error because of it. One more comment on the total rooms and the housing median age: in some cases linear regression results might not seem logical, but sometimes there is an underlying explanation, or maybe your model is simply overfitting or
biased; that is also possible, and it is something you can investigate by checking your OLS assumptions. Before going to that stage I want to briefly show you the idea of predictions. We have now fitted our model on the training data and we are ready to perform predictions: we use the fitted model together with the test data, X_test, to predict new median house values for the blocks of houses for which we are not providing the corresponding median house value. On this unseen data we apply the model we already fitted, look at the predicted median house values, and then compare those predictions with the true median house values, which we have but have not yet exposed to the model, to see how good a job the model does at estimating these unknown median house values for the test data, that is, for all the blocks of houses whose characteristics we provide in X_test while holding back y_test. As in training, we add a constant with this library, and then we take model_fitted, the fitted model, call predict and provide the test data; those are the test predictions. Once we do this we can print the test predictions, and you can see we get a list of predicted house values for the blocks of houses included in the testing data, the 20% of our entire dataset.

Like I mentioned just before, to make sure your model performs well you need to check the OLS assumptions. During the theory section we learned that there are a few assumptions your model and your data should satisfy for OLS to provide unbiased and efficient estimates, which means the estimates are accurate and their standard errors are low, something we also see in the summary results. The standard error is a measure of how efficient your estimates are: if the coefficients shown in the table could vary a lot from sample to sample, the range is very large, the standard error is very large, and that is a bad sign; if the estimation is accurate and precise, the standard error is low. An unbiased estimate means your estimates are a true representation of the pattern between each independent variable and the response variable. If you want to learn more about this idea of bias, unbiasedness and efficiency, check the Fundamentals of Statistics course at LunarTech, which explains these concepts clearly and in detail; here I assume you know them, at least at a high level.

Now let's quickly check the OLS assumptions. The first assumption is linearity, which means your model is linear in parameters. One way of checking it is to use your fitted model and its predictions: take y_test, the true median house values for your test data, and the test predictions, the predicted median house values for the unseen data, plot them against each other together with the best-fit line you would get in the ideal situation where the model makes no error and returns the exact true values, and then see how linear the relationship actually is. If the observed-versus-predicted pattern, where observed means y_test and predicted means the test predictions, is roughly linear and matches that perfect line, then assumption one is satisfied: your linearity assumption holds and you can say that your data and your model are indeed linear in parameters.
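A sketch of the prediction step and the observed-versus-predicted linearity check, assuming model_fitted, X_test and y_test from above:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Predict median house values for the unseen test blocks
X_test_const = sm.add_constant(X_test)          # same constant column as in training
test_predictions = model_fitted.predict(X_test_const)
print(test_predictions.head())

# Linearity check: observed vs predicted values with the perfect-prediction line
plt.scatter(y_test, test_predictions, alpha=0.3)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color="red")
plt.xlabel("Observed median house value (y_test)")
plt.ylabel("Predicted median house value")
plt.title("Observed vs predicted (linearity check)")
plt.show()
```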
Then we have the second assumption, which states that your sample should be random; this basically translates into the expectation of your error terms being equal to zero. One way of checking it is simply to take the residuals from your fitted model, model_fitted.resid, and compute their average, which is a good estimate of the expectation of the errors. The residuals are the estimates of the true error terms; here I just round the mean to two decimals. We get the average of the residuals, and if this number is equal to zero, which is the case, so the mean of the residuals in our model is zero, it means that the expectation of the error terms, or at least its estimate, the expectation of the residuals, is indeed equal to zero. Another way of checking this second assumption, that the model is based on a random sample, which implies the expectation of the error terms equals zero, is to plot the residuals versus the fitted values: we take the residuals from the fitted model, compare them to the fitted values from the model, and look at the scatter plot you can see here, checking whether the pattern is symmetric around zero. You can see that the zero line goes right through the middle of the pattern, which means that on average the residuals are centered around zero, so the mean of the residuals is zero, exactly what we calculated before; therefore we can say that we are indeed dealing with a random sample. This plot is also super useful for the fourth assumption, which we will come to a bit later.
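A sketch of both residual checks, assuming the fitted statsmodels results object model_fitted from above:

```python
import matplotlib.pyplot as plt

residuals = model_fitted.resid            # residuals: estimates of the error terms
fitted_values = model_fitted.fittedvalues

# Assumption 2: the mean of the residuals should be (close to) zero
print("Mean of residuals:", round(residuals.mean(), 2))

# Residuals vs fitted values: the cloud should be roughly symmetric around zero;
# a funnel shape here also hints at heteroscedasticity (assumption 4)
plt.scatter(fitted_values, residuals, alpha=0.3)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs fitted values")
plt.show()
```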
For now, let's check the third assumption, which is the assumption of exogeneity. Exogeneity means that each of our independent variables should be uncorrelated with the error terms: there is no omitted variable bias and there is no reverse causality, which means the independent variable has an impact on the dependent variable but not the other way around; the dependent variable should not cause the independent variable. There are a few ways to look at this. One straightforward way is to compute the correlation coefficient between each of the independent variables and the residuals you obtained from your fitted model; it is a simple, quick technique to get a sense of the correlation between each independent variable and the residuals, which are the best estimates of your error terms, and in this way you can check whether your independent variables are correlated with your error terms. Another way, more advanced and more on the econometric side, is to use the Durbin-Wu-Hausman test. This is a more formal, more professional econometric test to find out whether exogeneity is satisfied or whether you have endogeneity, which means that one or more of your independent variables is potentially correlated with the error terms. I won't go into the details of this test; I'll put some explanation here, and feel free to check any introductory econometrics course to learn more about the Durbin-Wu-Hausman test for the exogeneity assumption.

The fourth assumption we will talk about is homoscedasticity. The homoscedasticity assumption states that the error terms should have a constant variance, which means that when we look at the errors the model makes across different observations, the spread should be roughly constant. When for some observations the residuals are quite small and for others quite large, as in this figure, we have what we call heteroscedasticity, which means the homoscedasticity assumption is violated: the error terms do not have a constant variance across the observations, we see different variances for different observations. Since we have this heteroscedasticity issue, we should consider more flexible approaches such as GLS, FGLS or GMM, all somewhat more advanced econometric methods.
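A quick sketch of the simple correlation check mentioned for the exogeneity assumption. One caveat: in-sample OLS residuals are orthogonal to the included regressors by construction, so these correlations will come out near zero; treat this as a rough sanity check, with the Durbin-Wu-Hausman test as the more formal route:

```python
# Correlation between each independent variable and the residuals
for col in X_train.columns:
    corr_with_resid = X_train[col].corr(model_fitted.resid)
    print(f"{col:35s} correlation with residuals: {corr_with_resid: .4f}")
```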
The final part of this case study shows you how to do all of this on the traditional machine learning side, using scikit-learn. Here I use the StandardScaler to scale my data, because we saw in the summary table from statsmodels.api that our data is on a very large scale: the median house values are large numbers, the median age of the houses is on its own scale, and so on. That is something you want to avoid when you use linear regression as a predictive analytics model. When you use it for interpretation purposes you should keep the original scales, because it is easier to interpret the values and understand the difference in the median price of a house when you compare different characteristics of the blocks of houses; but when you use it for predictive analytics, where you really care about the accuracy of your predictions, you need to scale your data and make sure it is standardized. One way of doing that is with the StandardScaler in sklearn.preprocessing. The way I do it is: I initialize the scaler with StandardScaler(), imported from the scikit-learn library, and then I call fit_transform on X_train, which takes the independent variables and scales and standardizes them. Standardization simply means putting the features on a common scale, so that some large values do not wrongly influence the predictive power of the model: the model is not confused by large numbers and does not pick up a spurious variation, but instead focuses on the true variation in the data, on how much a change in one independent variable causes a change in the dependent variable. Given that we are dealing with a supervised learning algorithm, X_train_scaled will contain our standardized training features, the independent variables, and X_test_scaled will contain our standardized test features, the unseen data that the model will not see during training but only during prediction. Then we also use y_train, the dependent variable in our supervised model, corresponding to the training data.

We first initialize the linear regression, the LinearRegression model from scikit-learn; this is just an empty linear regression model. Then we fit it on the training data, X_train_scaled, the training features, together with y_train, the dependent variable from the training data. Note that I am not scaling the dependent variable; this is common practice, because you don't want to standardize your target, you want your features to be standardized, since what you care about is the variation in your features and making sure the model doesn't get confused when learning from those features, rather than the scale of the target. So I fit the model on the training features and the dependent variable, and then I use the fitted model, lr, which has already learned from these features and the dependent variable during supervised training, together with X_test_scaled, the standardized test data, to perform the prediction: to predict the median house values for the unseen test data. Notice that nowhere here am I using y_test; I keep y_test, the true values of the dependent variable, to myself, so that I can later compare it to the predicted values and see how well my model was able to get the predictions. Let's do one more step: I import from scikit-learn metrics such as mean_squared_error, and I use the mean squared error to find out how well my model was able to predict those house prices. Expressed in dollar terms, this comes down to an average error of roughly 59,000 dollars on the median house prices; whether we consider that large or small depends on the context, and it is something we can look into.
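A minimal sketch of this scikit-learn workflow, assuming the test features are transformed with the scaler fitted on the training features (the standard practice) and that X_train, X_test, y_train, y_test come from the earlier split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Standardize the features (not the target): fit on train, reuse on test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Plain scikit-learn linear regression for the predictive-analytics side
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred = lr.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
print("MSE :", round(mse))
print("RMSE:", round(np.sqrt(mse)))   # the error expressed back in dollars
```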
Like I mentioned in the beginning, the idea behind using linear regression in this specific course is not to use it purely as traditional machine learning, but rather to perform causal analysis and see how we can interpret it. When it comes to the quality of the predictive power of the model, improving it can be considered the next step: you can check whether your model is overfitting and then, for instance, apply Lasso regularization, so Lasso regression, which addresses overfitting; you can also consider going back and removing more outliers from the data, maybe the outliers we removed were not enough; another thing you can do is consider somewhat more advanced machine learning algorithms, because even when the linear regression assumptions are satisfied, more flexible models like random forests, decision trees or boosting techniques can be more appropriate and give you higher predictive power; and you can also keep working with the scaled or normalized version of your data.

As the next step in your machine learning journey, consider learning the more advanced machine learning models. Now that you know in detail what linear regression is, how to use it, how to train and test a machine learning model, a simple yet very popular one, and you also know what logistic regression is and all these basics, you are ready for the next step: learning the other popular traditional machine learning models. Think about learning decision trees for modeling non-linear relationships; think about learning bagging, boosting, random forests, and the different sorts of optimization algorithms like gradient descent, SGD, SGD with momentum, Adam, AdamW and RMSProp, what the differences between them are and how you can implement them; and also consider learning clustering approaches like k-means, DBSCAN and hierarchical clustering. Doing this will help you get more hands-on and take the next step in machine learning, and once you have covered all these fundamentals you are ready to go one step further, which is getting into deep learning.

Thank you for watching this video. If you like this content, make sure to check out all the other videos available on this channel, and don't forget to subscribe, like and comment to help the algorithm make this content more accessible to everyone across the world. If you want free resources, check the free resources section at LunarTech, and if you want to become a job-ready data scientist and you are looking for an accessible bootcamp that will get you there, consider enrolling in the Ultimate Data Science Bootcamp at LunarTech: you will learn all the theory and fundamentals to become a job-ready data scientist, you will implement that theory in multiple real-world data science projects, and after learning the theory and practicing it with real-world case studies you will also prepare for your data science interviews. And if you want to stay up to date with the recent developments in tech, the headlines you may have missed in the last week, the open positions currently on the market across the globe, and the tech startups that are making waves, make sure to subscribe to the Data Science and AI newsletter from LunarTech.