Thank you. Let's kick it off with the introduction to machine learning. On the next few slides we're going to have a look at how machine learning differs from traditional rule-based programming. What we have here at the top of the slide is a traditional rule-based programming approach, where on the input side we have data and a set of hard-coded rules. Those rules would generally be formulated by domain experts, and then we can take the rules and apply them to an existing data set to generate some answers, or a summary of what's in the data. This is not new; in fact, people have been doing this for many, many years. What we have here is a quote from Arthur Samuel, who was an IBM researcher. He was living in this world where traditional rule-based programming already existed, and he found it quite cumbersome to come up with all these hard-coded rules. So he was envisioning a world where computers would learn from experience rather than having to be explicitly programmed to perform a certain task, which in this case was applying the rules.

So how does this domain shift lead to the emergence of machine learning? We have data on one side, and now, instead of having the domain expert create the rules, we have a machine learning algorithm, a mathematical construct that helps us find these rules. When we apply the algorithm to the data, what comes out is a learned model, and this model applies the rules to a new and unseen set of data. So it actually generalizes beyond the data that it has seen as input, and with that we can now generate answers, or, as we like to call them, predictions.

Another way to think about machine learning would be the quote that we have here from Tom Mitchell, where he's saying: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. So this very much captures the same nature of things, where we
have some sort of performance measure, we have a certain task that we want the machine to learn, which could be classifying things or translating text, and by providing experience, by providing sample data points, eventually we would hope that the computer would learn from those sample data points and come up with a good model.

The important ingredients to machine learning are therefore the data on one side, and then on the other side the algorithm, or the tasks that we want to perform. We're going to have a look at both of these things, and we're going to start with data, also called features, first. Features are usually denoted as lowercase x if we're referring to a single feature, or uppercase X if we're talking about a collection of features. Very simply put, you can think about features as the columns of a table, where every column refers to a particular attribute. What we have here is an example tabular data set, where the first column refers to experience in years, the second column is the age bracket, and the third column is gender in this example. As we can see, every row here describes one individual person, and the columns are the various attributes. In the case of a tabular data set, the raw values inside the table are either strings or numbers; both are possible, and this would be a typical tabular data set. Usually, if we want to differentiate between the individual features, we put a little subscript with the number of the column that we want to refer to. So now we have x subscript 1 for the first column and x subscript 2 for the second column, and as you can imagine, there could be arbitrarily many attributes, so we can just keep counting as we go.

Obviously, tabular data is not the only data format; we could also be working with image data. In the case of image data, the raw values are actually the light intensities in the individual pixels. So when working with images, we need to break the image down into the pixel values and restructure that
image to actually end up working with it. But it's certainly possible to do; nowadays we have the computational power, we have enough storage, and machines that are powerful enough to work with image data as well. What you're seeing here is just a screenshot of an example image data set, where you usually have a folder with all the images that you want to learn something from. What would be image tasks that we could learn? Well, we could classify what's on the image, or we could detect different scenes or objects on the image. Those are just examples; obviously there are many more computer vision or image-related tasks as well.

So if we can work with images, you might be wondering: can we also work with language? And the answer is yes, we certainly can. Now we have an example language data set here. This is an example of a translation task, where you would have a source language and then a target language that you want to translate into, and here you can see we're trying to learn how to translate from English to German. There could be additional annotations or tags associated, depending on what kind of information you want to translate, or whether you want to pay particular attention to how the translation should work. In the case of language data, the raw values are now going to be strings: text snippets, words, paragraphs, or whole documents could be considered one source that we want to translate. And if we have all of these different data types, the question usually becomes: can we combine them? And yes, certainly, that's also possible. That would be called multimodal data, and that's just a combination of everything that we've seen so far.

A special type of feature is the so-called label, or the label column. Labels are usually denoted as lowercase y, and the labels are usually the answer that we want to generate using a trained model. The whole point of providing data to an algorithm and learning the model is usually to predict a certain target
or a certain outcome, something that we're interested in, and we need historical labels, information about what the connection between the features and the labels is, to ultimately learn those rules. There is a bit of a caveat in that labels are not always provided. One of two ways this could go: either we go out and label the data that we already have, or we perform a completely different type of machine learning that actually works without labels, which is something that we'll look at a little bit later in this lecture as well. In the case where we do have labels available, we need to distinguish further between numerical and categorical labels. Numerical labels are labels where we're trying to predict a continuous numeric value, such as insurance price prediction. In the case of categorical labels, as the name suggests, we are trying to predict different categories; those can either be binary, yes or no, or there could be multiple classes, such as would be the case for disease type prediction.

Now we're going to move on to a machine learning algorithm example. Again, to remind you, the two key ingredients to a machine learning model are on one side the data, and then on the other side the algorithm. The algorithm example that we have here is: we want to generate a prediction that is a weighted combination of the features that we have. To make this a bit more concrete, let's say our goal is to predict a healthcare score that could then be used to set the insurance price of any given individual. We're going to introduce a little bit of mathematical notation now. On the left-hand side we have a y hat; this indicates that this is going to be a prediction that we're going to make, and y hat is a function of x, of the features that we have in our data set. We said we're going to use a weighted combination of our features, so we have the addition of all the individual features, each feature multiplied with an individual weight component. So the features here in
this equation are going to be our measurable pieces of information, and those generally need to be in a numerical representation. That means even if we have text values, or maybe categorical entries in our table, we first need to convert them into a numerical format, and we're going to look at that when we get to the data prep and data processing before the model training stage. So this is just to let you know that if you do have a feature that is maybe a categorical or text value, we do need to convert it into a numerical format first, and we'll see an example of that in just a moment.

Let's say feature x1 corresponds to the number of hospital admissions for any given individual. We would have a historical record of different individuals with potentially different numbers of hospital admissions, and now we're going to look at one particular example where x1 is equal to five. If we wanted to make a prediction, this is now going to be an unseen, new patient, and we're going to predict what the health score is for that particular individual. We ask them: how many times have you been admitted to the hospital in the last X years? If they give us the answer five, we can use that and plug it into the equation that we have at the top.

Obviously, there are still quite a few missing pieces. First of all, we need to figure out what these W values are. The x we can just go and ask the patient that's coming in; the W's, well, those are actually learned during what is called the model training stage. That uses the historical records, where we have our table of X, our features, and the corresponding outcomes from other patients, where we already know what the healthcare score is. We can use those to learn the relationship between X and y, and that ultimately gives us the weights, which determine how much influence a given feature has on the output, or the prediction. These weights are also called model parameters in a parametric model, and as I just mentioned
they're learned during the model's training stage. So we don't have those W values up front; we actually need to learn them by looking at many data examples, and this is what the learning in machine learning is all about: we look at many examples, we look at the features, we look at the labels if we have access to them, and then we find out what those W's are. There are going to be different techniques for finding those W's; we're going to look at one of them, but just to let you know, there are many different methods for how you can actually find those W's.

So let's say we performed the training, we looked at all the historical records, and we found that w1, the weight for the number of hospital admissions, is indeed a very high positive number: w1 was found to be 100. We ask our patient, the individual, again: what does your lifestyle look like, do you have a healthy lifestyle, yes or no? And now you're going to see one example of this categorical answer, yes or no. That's not intrinsically a numerical value, but we can easily represent it as ones and zeros, and there exist methods to do that if you have many more categories as well; for now we just have a yes/no answer, so we can easily convert that into one and zero. This particular person is indicating that yes, they do have a healthy lifestyle. Again, we have already learned in this example what the W values are, and it so happens that the value for w4, the weight for healthy lifestyle, is a medium-large number, but keep in mind that this one is going to have a negative sign. These weights, these W's, can be positive or negative, and they can be very large or very small. What this would mean for our prediction is that the healthcare score would actually be brought down if somebody has a healthy lifestyle, and that is something good: if we think about the insurance price maybe going up depending on how high the score is, then if somebody indicates that they do have a healthy lifestyle, that would bring the score, or the price,
down. For now we're going to stop here with the example. Obviously there would be many more features about the individual that we could ask them, like their age or maybe other history that they have, and as I said, we would have learned the W's from looking at historical examples; for now we're just going to leave it as is so you can see what the predicted score would be. We have a base value of 200, that is the offset, and we'll see that again in just a moment; then a hundred, which was the weight for our first feature, times five, because that is how many hospital admissions the individual indicated they had in recent years; and then minus 25, which was the weight for x4, so w4 was -25, and they did have a healthy lifestyle, yes, so we multiply by one. What we get is 675, plus or minus an error term, because obviously we didn't consider all the features here, and in any case we do need to keep in mind that these predictions always come with a certain error associated with them as well.

This equation that we looked at in this example is actually a very famous one: it's the so-called linear regression, and it's a building block that is very essential to all of machine learning. Even when you start talking about more advanced models like neural networks, they are actually composed of many linear regressions put together in very advanced ways to build a full neural network.

So let's come back to the different types of machine learning. We actually already introduced the key factor that distinguishes between the different families, or types, of machine learning, because it all comes down ultimately to the labels. If the data is indeed coming with labels, either numerical or categorical, then we say it's supervised machine learning, and the model learns by looking at the examples, the features in combination with the labels. So this is supervised machine learning. On the other hand, if we only have X, a collection of features, available and there are
no labels, well then this constitutes unsupervised machine learning, and here the goal, or the idea, is to find patterns or similarities in the data that we do have access to. So there's not going to be an external label; it's just about finding patterns, commonalities, and similar characteristics in the data sets that we have. As we said earlier, the labels, if we do have them, can either be numerical or categorical values, and that then forms a further split between what is called regression and classification in the case of supervised learning. In the unsupervised case, obviously, we don't have any labels, so one example of a type of learning would be so-called clustering algorithms, or clustering as a problem type.

On this slide you also have examples of what each of those would be. For regression, let's say it's an insurance price prediction that we want to conduct; for classification, it could be a disease type prediction; and for clustering, where it's about finding similarities and similar characteristics, it could be something like skill and experience grouping in a hiring or admission scenario. Each of these different problem types now comes with a whole list of algorithms that we can use to solve, or tackle, that problem. We already had a look at linear regression as one machine learning algorithm example for regression, but there exist many more, as you can see in the list here: we have k-nearest neighbors, neural nets, decision trees. In the case of classification there are support vector machines, and again you would actually see some repetition here, k-nearest neighbors, neural nets, and decision trees, because some of these algorithms you can modify slightly and either make them predict a continuous numerical value or make them predict a class or a category. What you should also notice, though, is that when we look at clustering, there is a completely different list of algorithms: we have principal component analysis, or PCA, collaborative
filtering, or k-means clustering. As you can see, there's a long list of algorithms, and in fact there are many more; if you were to look up regression, classification, or clustering algorithms, you would find a very, very long list of algorithms nowadays.

What we're going to have a look at now is a deeper dive into each of the different problem types, and we're going to start with regression. In fact, this is the one that we already looked at in our machine learning algorithm example, and this is just a visual and a table again. In the bottom right corner you have an example data set, where we've also highlighted the label, the healthcare score, the thing that we want to learn how to predict; and then we have a set of features that describe the different individuals, like age, BMI, smoker status, and ethnicity. At the top you have a visualization of one feature, x1, age, versus the healthcare score. Obviously you could add more features as well, but then it becomes difficult to visualize. The reason why we have this chart is just to illustrate how to generate a prediction: in the case of regression, it would be the line of best fit. Everything on the orange line, those are predicted values, and if you have a new patient, a new individual, coming in, you just ask them for their age, and corresponding to that age you can look up what the respective healthcare score is, and that would be your prediction.

In the case of classification, we once again have a line, but this is a different line now, because it's actually a decision boundary, which separates the different outcomes, in this case positive and negative outcomes: yeses or noes, approved or not approved. So this line that you see here is actually a separation, a boundary, between the two different predicted classes. Once again we have an example data set here, with the labels plus and minus, approved or not approved, and then it could be a similar set of features. So you can see here the
difference between the classification and regression case is really just what the label is all about.

The final example that we have now is the clustering example, and you should note that in the data table there is no indication of any label whatsoever; we only have the features. Now we're plotting two features, age versus BMI, and visually you can already see that there are certain groups emerging: there is a cluster, which also eventually leads to the name clustering, of data points that sit more closely together than the other data points. I would argue that in this particular example we can see three clusters: one in the top left, one at the bottom centre, and one to the right. An algorithm that finds those clusters programmatically becomes especially important when it comes to having more dimensions. Here we can do it visually because it's a 2D plot and we can look at it, but keep in mind that usually you would have 50, 100, maybe hundreds of features, and then you cannot visualize what's going on anymore, so you need the help of an algorithm to eventually find these clusters. One kind of algorithm that could help us with that task would be k-means clustering, and we can have a look at the outcomes that k-means would produce. In fact, it is confirming the hypothesis that we had from visually looking at the plot: indeed, there are three clusters to be found in this particular data set. A little caveat here: the k in k-means actually stands for how many clusters we want to have as an output, so we can actually tune how many clusters we want; if I set k to 5, then k-means would find five groups in that particular data set. So finding the correct k is actually a bit of an art and science on its own, and we can have a look at that later on as well.
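The weighted-sum prediction from the healthcare-score example can be sketched in a few lines of Python. The function name and the weights (offset 200, weight 100 for hospital admissions, weight -25 for healthy lifestyle) are just the illustrative numbers used in the lecture; in practice the weights would be learned during training.

```python
# Healthcare score as a weighted combination of features (linear regression).
# Weights here are the lecture's illustrative values, not learned ones.
def predict_health_score(admissions, healthy_lifestyle):
    w0 = 200   # base offset (intercept)
    w1 = 100   # weight for number of hospital admissions
    w4 = -25   # weight for healthy lifestyle (1 = yes, 0 = no)
    return w0 + w1 * admissions + w4 * healthy_lifestyle

score = predict_health_score(admissions=5, healthy_lifestyle=1)
print(score)  # 200 + 100*5 - 25*1 = 675
```

For the patient in the example (five admissions, healthy lifestyle), this reproduces the predicted score of 675, before the error term.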
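The yes/no-to-one/zero conversion mentioned for the healthy-lifestyle feature, and its generalization to more categories (one-hot encoding), can be sketched like this; `encode_yes_no` and `one_hot` are hypothetical helper names, not functions from any particular library.

```python
# Turning categorical values into numbers before training.
# A binary yes/no answer maps directly to 1/0; a feature with more
# categories gets one 0/1 indicator per category (one-hot encoding).
def encode_yes_no(answer):
    return 1 if answer == "yes" else 0

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

print(encode_yes_no("yes"))           # -> 1
print(one_hot("B", ["A", "B", "C"]))  # -> [0, 1, 0]
```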
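The training step, finding the W's from historical records, can be sketched with ordinary least squares, which is one of the many methods alluded to in the lecture for learning the weights. The historical records below are synthetic, constructed so that the true weights match the lecture's example values.

```python
import numpy as np

# Historical records: each row is (hospital admissions, healthy lifestyle 1/0),
# with the known healthcare score as the label. These synthetic rows were
# generated from the weights (200, 100, -25) used in the lecture example.
X = np.array([[1, 0], [2, 1], [5, 0], [3, 1]], dtype=float)
y = np.array([300.0, 375.0, 700.0, 475.0])

# Prepend a column of ones so the first weight acts as the base offset
# (intercept), then solve the least-squares problem for the weight vector w.
X1 = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

print(w.round(2))  # learned [offset, w_admissions, w_lifestyle]
print(X1 @ w)      # predictions on the training data
```

Because the synthetic labels were generated exactly from those weights, the solver recovers offset 200, admission weight 100, and lifestyle weight -25; with real, noisy data the fit would only be approximate and the error term from the lecture comes back into play.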
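Finally, the clustering step can be illustrated with a bare-bones version of k-means (Lloyd's algorithm) on a toy age-versus-BMI-style data set with three obvious groups. This is a didactic sketch, not the implementation behind the lecture's plots; library versions such as scikit-learn's KMeans add smarter initialization and multiple restarts. The farthest-first initialization used here is just one simple deterministic choice.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    # Farthest-first initialization: start from the first point, then
    # repeatedly add the point farthest from all centers chosen so far.
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(points[:, None] - np.array(centers)[None, :], axis=2), axis=1)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then move each center
        # to the mean of the points assigned to it (Lloyd's algorithm).
        labels = np.linalg.norm(points[:, None] - centers[None, :], axis=2).argmin(axis=1)
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

pts = np.array([[1.0, 9.0], [2.0, 8.0], [1.5, 9.5],    # top-left group
                [5.0, 1.0], [6.0, 2.0], [5.5, 1.5],    # bottom-centre group
                [9.0, 6.0], [10.0, 5.0], [9.5, 6.5]])  # right group
labels, centers = kmeans(pts, k=3)
print(labels)  # each of the three visual groups gets its own cluster id
```

Just as in the lecture's example, setting k to 3 recovers the three visually obvious groups, while a different k would force the algorithm to carve the same data into a different number of clusters.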