Yes — today's lab is about how to implement a classification algorithm, and to do that we'll specifically be using a decision tree. I think you have already had a theory primer on decision trees, and that's what we'll be using in today's lab. So let's get on with it; I'll share my screen, just give me one minute.

Student: "Doctor, I have a question — do we need a Windows machine for today's session, or will any machine do?" No, any machine is fine, because we are going to be using Microsoft Azure ML Studio Classic. "Okay, thank you." It wouldn't matter.

Student: "Sorry, Doctor, you're muted." ... "We can hear you now."

Okay. So I have already signed into studio.azureml.net. Once you sign in, you'll see the landing page that says Experiments; in your case, depending on how many experiments you have created, they will be showing here — I've created a bunch of them, so there are many showing for me. What I want you to do is go to the lower left-hand corner of the screen and click on the +NEW button, and then click on the Blank Experiment type. You'll see several experiment types, the first one being Blank Experiment; that's the one I want you to click. Once you do, it will open the ML Studio workspace on the right-hand side — a blank canvas — with the panel on your left, which should be quite familiar to you by this point.

I'd like to see some quick yeses or noes in the chat window: please let me know whether you've been able to open this interface and land on the blank experiment.

As I said earlier, today's experiment is about implementing a classification algorithm, specifically using a decision tree. We'll take a particular use case and apply a binary classification to it by utilizing a decision tree — more specifically, a decision tree ensemble, a collection of decision trees. You have already attended the lecture on decision trees for classification, which is why we'll be using them today; the aim is to understand how to formulate the entire pipeline end to end for our use case. I see many yeses here, which is really good — the majority of you have been able to open it.

Student: "Is there a dataset that has already been given for this class, sir?" No. What I want you to do is go to Saved Datasets on the left-hand panel, and under Samples you'll find the sample datasets. Today we'll be using two datasets — two related datasets. One of them is the Airport Codes dataset; if you look, it's the second one in my list here. And I'll also ask you to
bring in the other one, which is called Flight On-Time Performance. Under Samples, all of the datasets are listed alphabetically, and Airport Codes starts with an A. For now I would just like you to drag the Airport Codes dataset onto your workspace area; we'll bring in the other one later. The other dataset we are going to use is called the Flight On-Time Performance dataset, and these two datasets are related to each other: if this were a database, they would be almost like two tables of data with related information between them.

So what is the specific use case we are going to study today, for which we are going to implement the decision tree algorithm? Let's look at the dataset. This dataset is about on-time flight performance across multiple airlines, across different states in the United States — the US has roughly 50 states, around 48 of them in the continental United States — and it was compiled by the Federal Aviation Administration, which is the regulatory authority overseeing the operations of the multiple carriers, the airlines, across the continental United States.

Why is this kind of dataset important? Airlines and regulatory authorities like to compile this kind of data because they can then analyze it, perform different types of analyses as well as predictions on it, and find out, for example, which carriers are not maintaining their schedules — who the defaulters are — so that appropriate actions or measures can be taken against those airlines and they start maintaining their on-time schedules.

Now, if you look at this dataset, it has about 18 columns in total, so let's go through it to understand what we are dealing with. It is a fairly large dataset — roughly half a million rows — and all of this data relates to airlines and their mode of operation: their departure times and so on, and how closely they have been able to adhere to their on-time flight schedules. To start with, note that this is only a subset of a much larger dataset.
It has been collected for only one year, 2011, and for quarter number four: if you click on the Year column you will see there is only one unique value, and likewise the unique-value count is one for the Quarter column, which means this data pertains only to quarter four of that specific year, 2011, and similarly to a specific month number.

Then you have the day of the month and the day of the week. Every single row of data that you see here pertains to one particular flight — each record corresponds to one specific flight. The day of the week is coded as a number: say, if zero is Sunday, then Monday is one, Tuesday is two, Wednesday is three, Thursday is four, and so on. So day of the month and day of the week are essentially categorical features — they represent categories.

The next column we look at is Carrier. Carrier here refers to the airline: different airlines use different two-letter carrier codes, so this column is nothing but the carrier code identifying a particular airline.

Then we come to the column that says OriginAirportID. This is an identification number that identifies a particular airport in the United States — if there are a thousand airports in the United States, there are roughly that many codes — and it is unique, in the sense that each airport ID represents one specific airport. DestAirportID is the same kind of identifier, and the difference between the origin airport ID and the destination airport ID is with respect to a particular flight: a specific flight departed from the airport with this origin airport ID and landed at the airport with that destination airport ID. So origin and destination are always in the context of a specific flight.

Next, there is something called the CRS departure time, which is basically the scheduled departure time of the flight, expressed in hours — 1435 hours, 1330 hours, and so on; 1435 hours means 2:35 p.m.
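If you prefer to poke at the data in code rather than through the Visualize pane, a minimal pandas sketch of the same checks (single year, single quarter, row and column counts) might look like the following. The file path and the column names (Year, Quarter, Carrier) are assumptions for illustration, not something exported from the Studio workspace.

```python
import pandas as pd

# Hypothetical local export of the "Flight on-time performance (Raw)" sample.
flights = pd.read_csv("flight_on_time_performance_raw.csv")

print(flights.shape)                # (rows, columns): roughly half a million rows, 18 columns
print(flights["Year"].unique())     # expect a single value, 2011
print(flights["Quarter"].unique())  # expect a single value, 4
print(flights["Carrier"].value_counts().head())  # two-letter carrier codes per airline
```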
Now, this scheduled departure time column contains a multitude of different departure times, so they have all been clubbed into unique categories, or bins: the departure time block column is just a binning of the departure time column into 19 unique bins. By which I mean: if the departure time is 1435, it is binned into the interval 1400 to 1459; if a particular departure time shows as 1330 hours, it falls into the 1300 to 1359 bin; and so on, for a total of 19 bins. That is what is referred to as the departure time block — in essence, one block of multiple departure times.

Then there is the column that says departure delay, which is very important for solving this use case. Before we go further: as I already told you, we are solving a classification problem today, which is essentially a supervised learning problem, and that means there must be a target variable, a label column. I will show you shortly which one is the label column here; for now let me continue with the description of each of these attributes.

Departure delay is a numeric value, and it represents the number of minutes by which a flight either departed on time, early, or late. A positive value represents a delay: 2 here means the flight experienced a delay of 2 minutes. A negative value represents just the opposite: minus 4 means the flight actually departed 4 minutes ahead of its scheduled departure time, and minus 1 means it departed 1 minute early. Zero means no delay at all — it left exactly on time.
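To make the binning just described concrete, here is a small pandas sketch — not the Studio pipeline itself — of how a scheduled departure time such as 1435 ends up in the 1400-1459 block. The real column groups the small hours together, which is how it ends up with 19 blocks rather than 24, but the idea is the same; the column names are assumptions.

```python
import pandas as pd

# Toy scheduled departure times in HHMM form.
flights = pd.DataFrame({"CRSDepTime": [1435, 1330, 905, 2359]})

hour = flights["CRSDepTime"] // 100   # 1435 -> 14, 1330 -> 13, 905 -> 9, ...
flights["DepTimeBlk"] = (
    (hour * 100).astype(str).str.zfill(4)
    + "-"
    + (hour * 100 + 59).astype(str).str.zfill(4)
)
print(flights)
# 1435 -> 1400-1459, 1330 -> 1300-1359, 905 -> 0900-0959, 2359 -> 2300-2359
```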
Now, when it comes to the regulatory authorities — in the United States, or in other countries; in India, for example, it is the DGCA, the Directorate General of Civil Aviation, that regulates all the airlines — what they want to see is adherence to the scheduled departure and arrival times across all the different airlines, and whether the airlines are operating as laid down in the guidelines. They are not interested if a flight is just 2 minutes late; they wouldn't mind that, and they're not really interested even if the flight is 6 minutes late. But it raises a red flag when a flight is 17 minutes late — in other words, when it departs 17 minutes beyond its scheduled departure time. The threshold here is 15 minutes.

So this dataset carries a flag column for exactly that, called DepDel15: if any flight has experienced a departure delay of more than 15 minutes, the flag changes to one. The flight here was delayed 17 minutes, exceeding 15 minutes, so it is flagged as one; this one departed 26 minutes beyond its scheduled departure time, again raising a flag of one; and so on.

Student: "The negative values in departure delay — that means departed early?" Exactly, just the opposite: as I said earlier, the negative values indicate that a particular flight departed, say, 4 minutes earlier than its scheduled departure time. If a flight departs ahead of its schedule there are no red flags, of course — that's ideal, and that's what these authorities would like to see. But when there is a significant departure delay beyond 15 minutes — like the 17, 26 or 23 minutes here — it raises a red flag and is marked as one, and that is what they're concerned about. So DepDel15 essentially becomes a categorical column with only two categories, 0 and 1, where zero is benign — there is no problem with the zeros — and the problems are with the ones, which indicate a delay of more than 15 minutes.

That was the departure delay, the delay experienced by a particular flight at departure. The same thing happens for the arrival side. You have the scheduled arrival times of the flights, and again they are binned into 19 unique categories: if 1550 is the scheduled arrival time of a particular flight, it is binned into the interval 1500 to 1559, and so on. And just as departure delay has DepDel15 as its flag column, you have the arrival delay in number of minutes: it signifies the total number of minutes by which a flight was delayed compared to its scheduled arrival time. Minus 6 is fine — it means the flight actually arrived 6 minutes earlier than its scheduled arrival time, which is good, which is desired; minus 12 means 12 minutes ahead of the scheduled arrival time, minus 14 means 14 minutes ahead. The problem, again, comes when there are significant delays beyond 15 minutes: plus 29 means this particular flight arrived 29 minutes beyond its scheduled arrival time, raising a flag of one — a significant delay, almost half an hour.
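The 15-minute flag logic can be sketched the same way: positive delay minutes mean late, negative mean early, and anything beyond the 15-minute threshold raises the flag to one. The column names DepDelay, DepDel15, ArrDelay and ArrDel15 follow the discussion above and should be treated as assumptions.

```python
import pandas as pd

flights = pd.DataFrame({
    "DepDelay": [2, -4, 0, 17, 26],    # minutes late (+) or early (-) at departure
    "ArrDelay": [-6, -12, 29, 88, 3],  # minutes late (+) or early (-) at arrival
})

# Flag columns: 1 when the delay exceeds the 15-minute threshold, else 0.
flights["DepDel15"] = (flights["DepDelay"] > 15).astype(int)
flights["ArrDel15"] = (flights["ArrDelay"] > 15).astype(int)
print(flights)  # 17 and 26 -> DepDel15 = 1; 29 and 88 -> ArrDel15 = 1
```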
And then there is another flight delayed by 88 minutes. Regulatory authorities are interested in this kind of data because they would like to go into more intrusive detail, see what caused these delays, remove those bottlenecks, and come up with process enhancements so that the delays can be minimized.

So that column is arrival delay, and then you have ArrDel15, which again is the flag column: a categorical column with only two categories, where zero is always okay, and one represents a red flag — a particular flight experienced an arrival delay of more than 15 minutes. That's why the 15 appears in the name: it marks the list of flights which have experienced a delay of more than 15 minutes.

On the right-hand side there are two more categorical columns, Cancelled and Diverted. These are binary columns again, giving you information about whether a flight was cancelled or not, and whether it was diverted or not.

So the question is: which is going to be our label column for implementing this use case? In our case we are going to select ArrDel15. As I said, it is a binary categorical variable with only two values, zero or one, so it becomes our class column, our target variable. For this specific use case we are interested in analyzing and predicting which flights experience an arrival delay of more than 15 minutes; that is the specific use case we are solving today, and therefore ArrDel15 becomes our target variable. It could also have been DepDel15, but that would change the use case: with DepDel15 as the target we would instead be trying to find out and analyze which flights depart more than 15 minutes later than their scheduled departure times. So ArrDel15 will be our target variable, our label column — also called the class column — and it is a binary class with only two values, 0 and 1.

Of course, you will also notice this dataset has missing values: there are quite a few missing values in these columns, and certain other columns may have missing values as well. In fact this dataset is a mixture of numeric and categorical attributes, many of which have missing values, if not all. And again, if we were solving a use case where we were trying to understand — and predict — which flights are likely to get cancelled, then Cancelled would become our target variable; likewise, if diversions were what we wanted to study, Diverted would be the target. But the specific use case we are solving is, given the data, to predict which flights are likely to cause an arrival delay of more than 15 minutes.
So ArrDel15 becomes our target variable, our class column.

Now that you've seen the dataset and understood what it is about — still, as I said earlier, we have two datasets here that are closely related, and this dataset does not contain all of the fields, all of the attributes, that we would like to work with. We would therefore like to add a bunch of columns to this specific dataset, Flight On-Time Performance (you'll find it in the list where the names start with F). The columns we would like to add pertain to the airport codes. If you look here, there is a column called OriginAirportID and another called DestAirportID; this ID doesn't mean anything to us by itself — we cannot directly interpret the name of the city, the state, or the name of the airport just by looking at an origin airport ID. We need a reference, a lookup, to find the city, state and airport name against a particular airport code, and that information is given in the Airport Codes dataset. Therefore we would like to merge these two datasets so that we can get those columns from the Airport Codes dataset.

If you look at the Airport Codes dataset, you'll see exactly that: against any given airport ID there is the corresponding name of the city where the airport with that ID is located, the corresponding state, and the name of the airport. For Anchorage, which is in Alaska (AK means Alaska), it is Ted Stevens Anchorage International Airport. Those three columns — the city, the state and the name of the airport — are what we would like to include in our table as well, against each of the airport code columns. Against the origin airport ID we want to know the city, the state and the name of that origin airport, so we would add something like origin city, origin state and origin name — three columns. And against the destination airport ID we would likewise add an additional three columns: destination airport 12191, for example, is located somewhere, and we want to know the name of that city, the state, and the name of the airport.
So again: three columns to add for the origin airport ID and three columns for the destination airport ID — in total we will be adding six columns to this, our main, primary dataset. The specific means by which we are going to merge that data into this table is called a join — a join operation. If you're familiar with databases, the join typically comes from structured query language; it's a database concept where you have different tables of data, with different kinds of information grouped under different heads. One table may have employee names, another the employment details, another the professional details, and another the salary details. It's not a good idea to store everything in one single monolithic table — in SQL, in databases, a single monolithic design doesn't work, because then you have a humongous amount of information stored in one table, and retrieving records from it becomes a humongous effort, scrambling through all that data. That is why you tend to keep different heads of information in different tables and then join those tables as and when required. Having said that, join operations are fairly expensive — but there's really no other way. So here, too, we will eventually need to perform a join between the Airport Codes dataset and Flight On-Time Performance, so that we can retrieve those six columns from the airport codes data — three for the origin airport ID and three for the destination airport ID. That is our eventual goal.
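Since the join idea keeps coming up, here is the same normalization-and-join picture in miniature, sketched in pandas rather than SQL, with made-up employee tables like the ones just mentioned: related information lives in separate tables that share a key column, and they are combined only when needed.

```python
import pandas as pd

# Two small related tables sharing the employee_id key (all values invented).
employees = pd.DataFrame({"employee_id": [1, 2, 3],
                          "name": ["Asha", "Ravi", "Meera"]})
salaries = pd.DataFrame({"employee_id": [1, 2, 3],
                         "salary": [50000, 62000, 58000]})

# The join matches rows on the common key column and pulls the columns together.
combined = employees.merge(salaries, on="employee_id")
print(combined)
```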
Now, to do that, let's go ahead. First, you will already have dragged in the Airport Codes dataset; let's have a quick look at that table again — it has only the airport ID, the city where the airport with that ID is located, the state, and the actual name of the airport. What I want you to do now is drag in Edit Metadata. Normally, in prior classes, we have often used Edit Metadata to separate categorical variables from numeric ones; here we are going to do something else with it — this is not for separating categorical from numeric — and we'll understand what as we keep doing it. You can search for Edit Metadata, or go to Data Transformation > Manipulation on the left-hand panel, where you will see it; please drag Edit Metadata onto the workspace. Once you've done that, connect the Airport Codes dataset to the Edit Metadata.

Now, on Edit Metadata, launch the column selector. Click on "By name" first — you know that by now — and you will see all four columns available. From these we will select only city, state and name, because those are the three columns we eventually want to import into the other table. The airport ID is already present — where? — in Flight On-Time Performance: if you look at it, the airport ID is already there as OriginAirportID and DestAirportID. What we would like to import into that table is the city, the state and the name of the airport, which are present in this table here — and we want to import those three columns once for the origin airport ID and once again for the destination airport ID.

So, coming back to Edit Metadata: launch the column selector, drag city, state and name onto the right-hand side — meaning the city, state and name of the place where the airport with that particular airport ID is located — and confirm your selection. There is something else we have to do here as well: the New column names field. What are these new column names, and why are we doing this? Because, as I said earlier, we are introducing three additional columns — city, state and name — into our main, primary table, and we need to make clear whether those city, state and name values are for the origin airport ID or for the destination airport ID. So first we will make one join, importing the three columns from the Airport Codes dataset for the origin airport ID — the city, state and name corresponding to the origin airport. Once that join is done, those three columns get added and 18 columns become 21. Then we repeat the same process for the destination airport ID, adding its city, state and name, which takes the column count to 21 + 3 = 24. All in all we are actually adding six columns, so 18 + 6 = 24: eventually we should land up with a total of 24 columns.

So let us first make the first join, which is only for adding the city, state and name for the origin airport ID. For that we have to generate new column names, so in the New column names box — I'll enlarge this a little bit — please type origin_city,
though it doesn't have to be exactly this way; you can type, for example, something like originCity, or anything of your choice, as long as it is representative of "origin city" — and likewise origin state for the state, and so on; the Java way of naming things, if you like. It doesn't matter; you can stick with what I have here. So let's rename the three columns. The order has to be preserved: since the selected columns are city, state and name, the first new name will be substituted for city, the second for state, and the third for name — origin_city for city, origin_state for state, origin_airport for name. So I'd like all of you to type the three new column names here: origin_city, comma, origin_state, comma, origin_airport. And that's it — we have completed configuring this Edit Metadata.

Student: "Does the new column name need to be saved somewhere, or is that it?" No, that's it: once you have named it here, that's all — then you can run it. You can save the experiment as well, but that's optional.

Student: "Professor, I have a question regarding this joining of tables. Yesterday Professor F was teaching us about principal component analysis, where we reduce the dimensionality of the variables by joining or merging multiple dimensions — is this that?" No, this is not PCA; PCA is something entirely different. In fact this is doing the reverse of that — we are adding more dimensions to the dataset — because we need that information. And this is another thing to learn: how to apply this kind of data transformation, because it is also a data pre-processing step. Sometimes your entire data is not in one particular table, so what do you do? You have to import data from elsewhere, and you use this kind of transformation, a join.

The PCA you're talking about is something entirely different. Say we have a bunch of features — a hundred, or say a total of a thousand features. PCA is typically called a dimensionality reduction method: it is mainly used for reducing the number of dimensions, the number of features. We project the data along principal coordinate axes — they're called PC axes: PC1, PC2, PC3 — so that in lower dimensions we can look at the data and the data gets separated. It's essentially an extension of what is called SVD, singular value decomposition, and you can think of it a bit like feature selection, in that we are reducing a high-dimensional dataset of a thousand features to maybe a subset of only 30 or 50 components. That gives us a way to shrink the total number of features into a smaller subset that is easily handleable, and in the process perhaps to eliminate features that are redundant or multicollinear — features that are correlated with each other, derivatives of each other — which are not likely to make a substantial contribution to the classification.
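As an aside to that question, here is a minimal scikit-learn sketch of the dimensionality-reduction idea: projecting a wide, synthetic feature matrix onto a handful of principal components. It is purely illustrative and not part of today's Studio pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(500, 1000)   # 500 samples, 1000 features (made-up data)

pca = PCA(n_components=30)      # keep only 30 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                    # (500, 30)
print(pca.explained_variance_ratio_[:5])  # variance captured by the first few components
```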
Student: "So the idea here, in this table — is the dimension we could reduce the airport codes ID?" Well, some features we simply drop by direct visual inspection if they don't make sense as features. Something like an employee ID, or a student ID, can be dropped directly: it exists just to keep track of the record number; it's just an index into the record, into the row — something like a row ID acting purely for identification purposes. It's unique, but all it is doing is indexing into the row; it has no implication as a feature, so there is no point including it — you can drop it and leave it out of the rest of the pipeline. The airport ID columns are indeed something we can exclude; I'll come to that later. But for the more subtle cases — where you don't understand whether a feature is really correlated with another feature, or whether it is really going to make a substantial contribution, and you are in doubt — for those kinds of things we use PCA, or more advanced feature selection techniques: sequential forward search, sequential backward search, those kinds of elimination procedures, and feature-scoring methods. Those are the feature selection methods we use in machine learning to call out features which the algorithms say are not going to be really useful for classification.

Okay, so back to our pipeline. The point of doing this in Edit Metadata is just to rename our original columns: in the Airport Codes dataset they are named city, state and name, but when we import these columns into Flight On-Time Performance — we're doing it for the origin first — we'd like to call them origin_city, origin_state, origin_airport. So, in New column names: origin_city — please don't forget the comma, this specific syntax — then origin_state, comma, then origin_airport. Right, now let's right-click on the module and say Run Selected; if all goes well, it will run.

Now let's introduce another Edit Metadata — a second one. Again, you can go to Data Transformation and drag another Edit Metadata onto the canvas, over here. This one is for the destination airport ID: the first Edit Metadata was for importing the three columns against the origin airport ID, and this one is for importing the city, state and name against the destination airport ID. So we need to configure this Edit Metadata as well: launch the column selector and select city, state and name. These are the four columns you can see — and where are they coming from?
From the Airport Codes dataset. So connect the Airport Codes dataset to this Edit Metadata as well. How many columns does the Airport Codes dataset have, if you visualize it? Four: airport ID, and then city, state and name. So again, for this Edit Metadata, launch the column selector and move just city, state and name — the three columns we would like to import for the destination airport ID — to the right, under Selected Columns, and confirm your selection.

Student: "When I ran the first Edit Metadata, it was asking for a window called inner join — a join type, or match case?" Are you doing this in Azure ML Studio Classic? The only options here are the ones I have shown; there is no such option. Another student: "Professor, he's talking about Join Data — when you go into Join Data..." I have not gone into Join Data. Guys, please follow me — have I asked you to do anything with Join Data yet? No. We are only on the second Edit Metadata component here; please don't get ahead, otherwise it may confuse you even more. We should not be joining right now; I am only discussing the second Edit Metadata, and I'll come to Join Data. "Thank you." Please just move along with me here.

For this second Edit Metadata component, in New column names, type destination_city, comma, destination_state, comma, destination_airport — or, if "destination" feels too long, something shorter like dest_city, dest_state, dest_airport. As I said, this naming is up to you; you can use a Java-style convention or any other naming system you like, but make sure you put a comma between the names, and the order has to be maintained: if the selected columns are city, state and name, the new names must follow that same order. Once that is done, right-click and say Run Selected; if all goes well it will run without any errors.

Student: "When we are running this Edit Metadata, the Flight On-Time Performance block is getting removed." No — that is exactly why I haven't asked you to introduce Flight On-Time Performance yet. Do you remember, initially you were asked to bring in only the Airport Codes dataset; if I said anything else, I'm sorry, but it's only the Airport Codes dataset for now, because the moment you start executing the Edit Metadata, that block will vanish. "Yes, it's vanished." Never mind.

Now let's put in a note here, guys. Double-click on a component and it will let you put in a comment saying what that component is doing — usually we are more familiar with Edit Metadata as a way of handling categorical variables, but that is not the case here. So for the first one, we'll write something like: "renames the city, state and name columns for the origin airport ID" — that is, this module only renames city, state and name against the origin airport ID. And similarly, for the second one: "renames the city, state and name columns for the destination airport ID."
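If it helps to see what the two Edit Metadata modules amount to, here is a pandas sketch: two renamed copies of the airport codes table, one with origin_* column names and one with dest_* names, ready to be joined in. The lower-case column names and the sample values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Airport Codes dataset (IDs and values illustrative).
airport_codes = pd.DataFrame({
    "airport_id": [13495, 12345],
    "city": ["New Orleans", "Anchorage"],
    "state": ["LA", "AK"],
    "name": ["Louis Armstrong New Orleans International",
             "Ted Stevens Anchorage International Airport"],
})

# Equivalent of the first Edit Metadata: rename city/state/name for the origin side.
origin_codes = airport_codes.rename(columns={
    "city": "origin_city", "state": "origin_state", "name": "origin_airport"})

# Equivalent of the second Edit Metadata: rename for the destination side.
dest_codes = airport_codes.rename(columns={
    "city": "dest_city", "state": "dest_state", "name": "dest_airport"})

print(origin_codes.columns.tolist())
print(dest_codes.columns.tolist())
```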
And if you click on the right arrow on the comment, it will enlarge it. All right. So we have now executed these two Edit Metadata modules, and we are in a position to perform our first join operation. For that we now need to drag in the Flight On-Time Performance dataset — please drag it and put it here. Again, you'll find this dataset under Samples; everything is alphabetically sorted, so look through A, B, C, D, E, F and you will see it on the left: Flight On-Time Performance (Raw). That's the one — please drag it in.

So now we would like to import those three columns, city, state and name. Where are those three columns residing? In the Airport Codes dataset — but we have already renamed them. If you visualize this first Edit Metadata now, what do you see? Originally, in the Airport Codes dataset, the column names were city, state and name against the airport ID; but since we renamed them, the result of the Edit Metadata — you can confirm this by visualizing it — shows the columns renamed to origin_city, origin_state and origin_airport, which is exactly what we wanted.

So now we are going to perform a join operation between the table on the left and the table on the right-hand side. To that end we need to drag in the Join Data component: you'll find it again under Data Transformation > Manipulation, or, if you want to do it even faster, just type "join data" directly into the search box and it will give you the component — drag it here. Please do that, and connect the two tables to it.

Now we need to configure the Join Data component. Which columns do we want to join on, for the left-hand table and for the right-hand table? This component demands two things: the join key column for the left-hand table and the join key column for the right-hand table. What is the join key column? If you know SQL, it is something like a primary key — a column with unique values in it — and based on that key you can make a join with another table that carries the same values, where it acts as a foreign key against this primary key. The essential idea is that there is a common column in both tables. What is the common column here? In the left-hand table — I'll show you — it is the origin airport ID: those are airport IDs. And in the right-hand side table, if you visualize it, you will see the airport IDs again, in a column called airport_ID. So in the left-hand table it is called OriginAirportID and in the right-hand table it is called airport_ID — different names, but essentially the same column, with the
same values — the airport codes cannot be different; they have to be the same. So that becomes our common column, the column on which we are going to perform the join, and that common column is referred to as the key column, the join key.

So let us launch the column selector. At the top it says "Join key columns for L". What do L and R refer to here? Whatever table you connect to the left input of Join Data becomes L, and whatever you connect to the right input becomes R — L for the left-hand table, R for the right-hand table. So "Join key columns for L" refers to the left table; when you launch the column selector, see how it immediately pulls up all of these columns — they are all coming from the left-hand table, which is L here. In Join Data we first need to select the key column of this table, and as I said earlier, that key column is the airport ID, which here goes by the name OriginAirportID. So launch the column selector, find OriginAirportID, select it and drag it to the right: this is the common key column present in both tables, based on which the join will be formed. Confirm the selection.

Next we need the join key column for R. R means the Airport Codes dataset table, and there, if you look, the airport code is called airport_ID. So, back in Join Data, under "Join key columns for R", launch the column selector and you will see four columns. Where are these four columns coming from? Can anyone tell me? The airport codes table — the Airport Codes dataset, exactly right. Drag airport_ID to the right-hand side — that is the common key column that is also there in the other, left-hand table — make sure it is under Selected Columns, and confirm your selection. So now we have specified the key column for both L and R.

Next, we need to figure out what the join type is going to be. What do we want here? This — the Flight On-Time Performance (Raw) dataset — is basically our primary dataset; we are just adding some extra information to it from the Airport Codes dataset. This being our original, primary dataset, what we are going to perform is what is called a left outer join. We want that because our intention is to preserve all of the records, all of the rows, in this table, and only add those available columns for which data is available in the right-hand table — we don't want to lose any data
on the left-hand side. We want to still maintain the same number of records as we have originally in this dataset and only add the available data from those three columns — city, state and name — which is available in the Airport Codes dataset; we don't want to lose any information from the left-hand side table. To do that, we perform what is called a left outer join.

There is also something called an inner join, and we don't want to do an inner join here. If you perform an inner join, it only takes the intersection of the data of the two tables: it will take only the matching airport codes from the left-hand and right-hand tables and show only that data. But there may be some airport codes for which the data is not available in the Airport Codes dataset, and we would be losing that information — we don't want that. So instead we perform a left outer join: the moment we perform a left outer join on the left-hand and right-hand tables, it displays all the records of the left-hand table; against the airport codes that are matched in the right-hand table, it displays the data for those three columns; and if there are some extra non-matched rows on the left-hand side — where it could not find a match on the right and therefore doesn't know what the city, state and name for those airport codes are — it will leave those values as nulls, but it will still include those rows in the result set. Doing an inner join, as I said, is something different. So here we will perform a left outer join. All of these are very basic join concepts stemming from databases.

What is a full outer join, then? A full outer join is where all the records of the left-hand table will be displayed in the resulting data, the result set, as well as all the records of the right-hand table. Both are displayed: the matched records get combined, while the records that are unmatched on the left-hand side get nulls for the right-hand columns, and the records that are unmatched on the right-hand side get nulls as well.

Student: "Professor, could you please click on the first Join Data?" This one — okay? Fine, good, thank you.
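Here is the same join-type choice sketched in pandas with tiny made-up tables: how="left" keeps every flight row and leaves the airport details as nulls where an ID has no match, while how="inner" silently drops those rows. The column names and IDs are assumptions.

```python
import pandas as pd

# Toy flight rows (L) and a toy renamed airport-codes table (R); ID 999 has no match.
flights = pd.DataFrame({"OriginAirportID": [101, 102, 999],
                        "DepDelay": [17, -4, 2]})
origin_codes = pd.DataFrame({"airport_id": [101, 102],
                             "origin_city": ["City A", "City B"],
                             "origin_state": ["SA", "SB"],
                             "origin_airport": ["Airport A", "Airport B"]})

left_joined = flights.merge(origin_codes,
                            left_on="OriginAirportID",  # join key column for L
                            right_on="airport_id",      # join key column for R
                            how="left")                 # left outer join
inner_joined = flights.merge(origin_codes,
                             left_on="OriginAirportID", right_on="airport_id",
                             how="inner")

print(len(left_joined))   # 3: every flight row kept; the unmatched one gets nulls
print(len(inner_joined))  # 2: the unmatched flight row is dropped
```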
Now, there is one more setting here: it says "Keep right key columns in joined table", and we don't want that. What it is asking is whether we want to keep the right-hand table's key column in the result. We already have the origin airport ID — if you remember, OriginAirportID is already in this dataset — so we don't need to add the airport_ID column from the right-hand table again; that would just be a repetition, a redundancy we don't want. We just want to include city, state and name. So uncheck that box.

Now let's run the join: right-click on it and say Run Selected. As a result of this join operation, what should happen? The three columns — origin_city, origin_state and origin_airport — should get added to the dataset. Originally, if you remember, this dataset had 18 columns; so it should now increase to 21 columns. If all goes well — let's see: right-click on Join Data, click on Results dataset, Visualize, and you'll see the resulting dataset. Yes, it is 21 columns, and the reason is that, if you scroll to the right-hand side, you will see the three columns that have been added. From which table have these three columns — origin_city, origin_state, origin_airport — been added? From the Airport Codes dataset. So now, for all the origin airport IDs, we have their city, state and name — the city, state and name corresponding to a particular origin airport ID. For example, 13495 represents New Orleans, from the state of LA, and the airport is called Louis Armstrong New Orleans International.

Now we want to repeat the same process to include the destination city, state and airport — this time for the destination airport IDs; we don't know what destination airport ID 12191 is, right? So we want to include the city, state and name for the destination airport, which means including an extra three columns: destination_city, destination_state and destination_airport, or however you named them — essentially three extra columns. So we have to perform another join here, and this time we join the already-joined data — that becomes the table on the left, our L — with the other Edit Metadata, the second one, which we used to rename city, state and name for the destination airport. Connect those two to a new Join Data component. Where will you find Join Data? Again under Data Transformation > Manipulation. Please connect them: the first Join Data here is the table on the left, and the Edit Metadata is the table on the right. This is where we perform the second join, by means of which we introduce another three columns — destination city, destination state and destination name — into this already-joined table of data.

For this Join Data we again need to configure it, so please click on it. "Join key columns for L" becomes DestAirportID: launch the column selector, and these are all the columns from the joined data — you will see the origin city, state and airport among them, because we are using the already-joined data, so those three columns have already been added. From here you need to select DestAirportID, because this time, remember, we are repeating the
same step to extract the city, state and name for the destination airport ID. So please select DestAirportID, move it to the right-hand side, and confirm the selection.

Then "Join key columns for R" — what is R? The Airport Codes dataset again, through the second Edit Metadata. So launch the column selector; again you will see a set of four columns — airport ID plus the renamed destination city, state and airport. Where are these columns coming from? Again from the Airport Codes dataset. Click on airport_ID and move it to the right-hand side. Why? Because this, again, is our common column for creating the join: the same set of codes, essentially the same column, present in both tables under different names. And what does the join do with it? When we perform the join operation, it essentially tries to make a direct match on the values of that column: it matches the DestAirportID from the left-hand table against the airport_ID on the right-hand side, from the Airport Codes dataset, and for all the matches it then imports the three columns. So add that, confirm the selection, and again, instead of an inner join, let's do a left outer join.

Once you've done that, right-click and Run Selected — run the second join. If you remember, after the first join we already had 21 columns; after the second join we should end up with a total of 24 columns.

Student: "On the second Join Data, do we still have to uncheck "Keep right key columns"?" Yes — sorry, I forgot to mention that: you'll also have to keep "Keep right key columns in joined table" unchecked. Otherwise, what will happen? Let me show you: if I keep this checked and run the join, it will add airport_ID once again, which is redundant — you already have that information; it's not required. Look at the result dataset in that case: it has added airport_ID from the Airport Codes dataset. You have origin city, state and airport, and destination city, state and airport — that is okay, those are the three columns we actually wanted to add — but in addition, because we kept that box checked, it has also added airport_ID, which is redundant, because the same values are already there in DestAirportID; we need not add it again. So, because it's redundant, uncheck the box and run Join Data again. And now let's check: it should have removed that airport_ID column from the result set — yes, you see origin city, state, airport and destination city, state, airport, and the airport_ID column has been removed.

So now we have arrived at the point where we have been able to successfully merge the related data between two different tables, and to successfully perform this kind of data transformation, or data pre-processing, step in Azure.
So now we are ready to move ahead, and we can introduce Select Columns in Dataset. Instead of browsing, just type "select columns in dataset" in the search box; you will see it appear under Data Transformation. Drag and drop it onto the workspace, connect the two modules, and then we will launch the column selector.

"Question, Professor: in the flight on-time performance dataset, if we do a Visualize, there is a specific column with a set of missing values." Yes, for example our target variable, ArrDel15, has about 4,717 missing values. As of now we are not bothered about the missing values; we are still continuing. All we have done so far is one type of data transformation: we have merged the data from two different tables, because in the end it is one table of data that we want to use. "So eventually we are going to remove or handle the missing data?" Of course. This is just the first of the transformations; the missing values will be dealt with in later operations, and the rest of the feature engineering steps will follow. So yes, as was said, we still have missing values, which we will have to deal with eventually.

Now let's configure Select Columns in Dataset. Launch the column selector, and we will leave out the four columns that are not required for solving this specific use case. The origin airport ID and the destination airport ID can be dropped directly, because they make no sense as features; they cannot become features in any sense of the word. You can also drop Cancelled and Diverted, because we are not using them as features, and they cannot be used as a target column either. So we drop these four columns and include all the remaining columns on the right-hand side. Please do that. Again, the only things we are leaving out are origin airport ID, destination airport ID, Cancelled and Diverted. Remember, Cancelled and Diverted are binary categorical variables, and they are not relevant to the use case we are solving, which is to predict which flights are likely to incur a delay in their arrival times. "So if we had to do an analysis on diverted flights, then we should be using this column?" Yes, but then the use case changes: if you are trying to predict which flights are likely to get diverted, then that would become your target variable. So leave these four out, include the rest, confirm the selection, and then Run Selected.
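In pandas terms, the Select Columns in Dataset step is simply dropping the four unusable columns and keeping everything else. A small sketch, with assumed column names, as a stand-in for the Azure module:

import pandas as pd

df = pd.DataFrame({
    "Carrier": ["DL", "AA"], "OriginAirportID": [11433, 14869],
    "DestAirportID": [13303, 12478], "Cancelled": [0, 0],
    "Diverted": [0, 1], "DepDelay": [-3, 0], "ArrDel15": [0, 1],
})

# Drop the columns that can serve neither as features nor as the target here.
selected = df.drop(columns=["OriginAirportID", "DestAirportID",
                            "Cancelled", "Diverted"])
print(selected.columns.tolist())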
Once that has run, look at the dataset again; I would like to call your attention to the fact that we now have a mix of column types. The majority of the attributes are categorical, but some are numeric. Departure delay, for example, is numeric (it is in minutes), whereas DepDel15 is again categorical: zero or one, where zero is good and one means the flight was flagged as delayed. Columns like departure delay and arrival delay are numeric, and they are the only two numeric columns we have here; all of the remaining columns are categorical descriptors, categorical variables. So now we reach the point where we can separate the categorical attributes from their numeric counterparts. For that we introduce Edit Metadata: just type "edit metadata" in the search box on the left, drag it onto the workspace, and connect Select Columns in Dataset to Edit Metadata. As always, we need to configure it, so launch the column selector, and we need to decide which attributes are numeric and which are categorical. All of the categorical variables go to the right-hand side; the only ones you leave out are the numeric attributes, CRS departure time and so on. Those four numeric attributes need to be excluded, and everything else needs to be included. Let's also add a comment here, something like "Segregating the categorical attributes", because the purpose of this Edit Metadata is different from the one we used earlier: this time it is there to segregate the categorical attributes.

"Could you show the launch column selector again?" Yes: launch the column selector, exclude only those four numerical attributes, and move everything else to the right-hand side, because everything else is categorical. "I am not seeing the columns after I connected Select Columns in Dataset to it." You have to click on By Name. And if you are still not seeing them, you probably have not run the previous module: you have to run it, and you have to connect the two, otherwise the columns will not appear.

"One question: in that column selector there are still some numeric columns, like year, quarter, month, day of month." We are talking about Edit Metadata, correct? All of those are categorical. And do not filter by type; I will tell you why, and I am glad you raised the question. Go to the results dataset and visualize it: any column you click on will show, on the right-hand side, something called a feature type, and it will say numeric if Azure thinks it is numeric. So if a column contains numbers, even if in the true sense it is really a categorical variable, Azure will treat it as numeric. "Got it: Azure does not understand the interpretation, it only looks at the data type, so whatever looks numeric is numeric for Azure." Exactly, and that is why we cannot rely on the filter: when you launch the column selector for Edit Metadata, do not filter by numeric versus string, or you will get a completely different set of columns. "That is an important difference, thank you." You are very welcome. So we only want to exclude those four numeric descriptors on the left-hand side and include everything else on the right-hand side, because they are all categorical attributes.

"One more question: after segregating those four attributes, the number of columns selected on the right-hand side for me is 15." Then check your Select Columns in Dataset: it should have 20 columns. If it does, and you leave out the four numeric ones, you will end up with 16 columns in Edit Metadata as categorical descriptors; that is, we have 16 categorical descriptors. "I see, mine had 19 in the selection." Right, so you probably left one out; check whether your left-hand side has five columns instead of four.

All right, but we are not done yet. We have launched the column selector and made the segregation of numeric versus categorical; now we still need to set the Categorical option to Make Categorical. We need to explicitly instruct Azure to convert these columns to the corresponding categorical encoding. In the background, everything that Azure ultimately passes to the machine is numeric; computers do not work with strings directly. So internally Azure will take the list of categorical descriptors and produce their equivalent binary encodings, whether by one-hot encoding or by dummy encoding, converting the data into binary column vectors. Those conversions from categorical to the equivalent numeric or binary encoding are all abstracted away from us; they happen at the back end, orchestrated by Azure. So for the categorical columns, all we need to do is explicitly select Make Categorical, in effect saying: these are the categorical variables I have identified, and that's it.
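As a rough sketch of what Make Categorical amounts to outside Azure: mark the numeric-looking columns as categories so that they get encoded as binary indicator (one-hot or dummy) vectors rather than treated as magnitudes. The column names below are assumptions for illustration:

import pandas as pd

df = pd.DataFrame({
    "Month":     [1, 2, 1],          # looks numeric, but is really a category
    "DayOfWeek": [3, 5, 3],
    "Carrier":   ["DL", "AA", "DL"],
    "DepDelay":  [-3.0, 12.0, 0.0],  # genuinely numeric, leave as-is
})

categorical_cols = ["Month", "DayOfWeek", "Carrier"]
df[categorical_cols] = df[categorical_cols].astype("category")

# One possible downstream encoding (dummy / one-hot), which Azure handles for us:
encoded = pd.get_dummies(df, columns=categorical_cols)
print(encoded.columns.tolist())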
Once we are done with that: "Can you please show the launch column selector on the Edit Metadata once more?" Yes, there it is. "Okay, thank you, sir." So now we are done with this; right-click on it and Run Selected. If all goes well, it will run without any errors.

All right. Now, remember, we still have not done any data imputation, and we have seen that our dataset contains missing values, so it is time to clean the dataset. We need to introduce Clean Missing Data: type "clean missing data" on the left, and under Data Transformation, Manipulation, you will see Clean Missing Data. Drag it here and connect Edit Metadata to Clean Missing Data. This is the component in Azure that helps us impute missing data for both numeric and categorical features. Click on Clean Missing Data; we need to configure it, so launch the column selector, By Name. "This Clean Missing Data is for what exactly: not-applicable values, zeros, or something else?" For any missing, unfilled data in a column, whatever it may be, month, year and so on. "And do we need to do it for all the columns? Should we move everything to the right-hand side, since any of those 24 columns (the 21 plus the 3 we added) may have missing values and need imputation?" Actually, that is not quite what we want to do here: in this first Clean Missing Data we only want to perform the imputation for the categorical columns. Departure delay and arrival delay are, as I said earlier, the only two numeric columns we have; excluding those two, move everything else, all of the remaining categorical columns, to the right-hand side. "Sorry, Professor, I am not following: what are we cleaning here?" We are cleaning missing data for the categorical columns only. We have left out departure delay and arrival delay, the only two numeric descriptors; excluding those two, everything else is categorical (Carrier and so on, all categorical features), so move everything else to the right-hand side and confirm the selection.

Because we are imputing categorical variables, for the cleaning mode we choose Replace with mode, which is the commonplace choice for categorical variables, whether nominal or ordinal. For ordinal categorical variables you could also impute with the median, since their values have an ordering, but the mode works for all types of categorical variables. So set the cleaning mode to Replace with mode, then right-click and Run Selected.

After that we have to inspect the dataset to confirm whether all the missing values have been taken care of. Once it has run (hopefully for you too), go to the cleaned dataset, Visualize, and click on each categorical variable to check that the missing-value count now says zero. You will also notice that where the feature type earlier said string, it now says categorical feature, which means the encoding has already been handled internally by Azure. Because we have a small set of features we can inspect them visually; if there were a thousand-odd features, we would need some other method. "Day of month looks fine, but arrival delay still has missing values." That is because arrival delay is a numeric descriptor for which we have not performed the imputation yet; remember, we have not imputed the numeric columns. Departure delay, likewise, still has about 4,000 missing values. These two, being numeric, will be imputed separately; all the others should have no missing values at all. I have gone through all of them, and it looks like the categorical descriptors have been successfully imputed.

Now we do the same thing using Clean Missing Data once more. Break this connection, add a second Clean Missing Data module (again, just type "clean missing data"), and, if you like, put a comment on the first one saying "Impute categorical attributes". "Couldn't this have been done in the previous step itself, including the two columns we ignored?" Yes: you could have included departure delay and arrival delay in the previous Clean Missing Data and replaced them with the mode as well, and that would have been perfectly okay. I chose to do it separately purely for illustration, to show you that it can also be done this way: you can apply one type of imputation to the categorical columns and a different type of imputation to the numeric ones. So the first Clean Missing Data gets the comment "Impute categorical attributes" and the second gets "Impute numeric attributes", just to make clear what each Clean Missing Data module is doing, so there is no confusion when you revisit the experiment. Also, save your work at this point; we have done quite a bit of work so far. Give it a meaningful name, something like "Flight on-time performance classification".

So the second Clean Missing Data is where we impute the numeric attributes. Launch the column selector and select just those two numeric attributes, departure delay and arrival delay, and confirm the selection. This time, for the cleaning mode, we can use Replace with median. The median and the mode are both robust to outliers: they do not vary significantly in the presence of outliers, and only in rare boundary cases do they get affected. So do that; I have selected only departure delay and arrival delay this time, and imputed them using the median. Right-click and Run Selected, and then we again need to inspect: there are only two numeric attributes, so we can check visually whether any missing values remain after the imputation. Right-click, go to the cleaned dataset, Visualize. Here you can see departure delay: the missing-value count is now zero; the values have been imputed using the median. The same for arrival delay, the other numeric column: it has also been successfully imputed. So that's that: we have taken care of the missing values and cleaned our dataset, and at this point the data is of reasonably good quality.
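Put together, the two Clean Missing Data steps look roughly like this in pandas: mode for the categorical columns, median for the two numeric delay columns. A minimal sketch with assumed column names and toy values:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Carrier":  ["DL", None, "AA", "DL"],
    "DepDelay": [-3.0, 12.0, np.nan, 0.0],
    "ArrDelay": [5.0, np.nan, 40.0, -2.0],
})

categorical_cols = ["Carrier"]
numeric_cols = ["DepDelay", "ArrDelay"]

# Impute categorical attributes with the mode (most frequent value).
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Impute numeric attributes with the median (robust to outliers).
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

print(df.isna().sum())   # every missing-value count should now be zero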
Now we still need to introduce some feature engineering steps, and a major one is normalization. Remember that normalization is only applicable to numeric data, and we still have two numeric attributes. So go ahead and search for Normalize Data; you will find it under Data Transformation, Scale and Reduce, and you can drag it onto the workspace. Normalization, as I told you earlier, is also called feature scaling, because you are bringing all of the features down to a common scale: you constrain the entire dynamic range of the data to certain bounds, between 0 and 1, or between minus 1 and 1, so that the disparities of scale between the features no longer exist. Without it, a feature can appear significantly important to the model, when it starts assigning weights, purely by virtue of its scale, even though it may not actually be important; the model can be fooled into thinking such features are substantially more important than features with smaller scales, when those smaller-scale features may in fact contribute more towards the classification or regression. That is one reason to perform normalization. The other point is that normalization removes the disparities across columns: different features have different units, different ranges, different scales, which is again why it is called feature scaling. One column may be in the thousands while another sits around 2 or 3, or 23; 10,000 versus 23 is a huge disparity, and we want to bring that down so the data becomes homogeneous. We homogenize the entire dataset onto a common scale where everything is bounded within a fixed range, 0 to 1 or minus 1 to 1, depending on the type of normalization used. And what are the major advantages? One major advantage is that it helps the model perform significantly better and reach convergence much faster; it helps the model converge much more quickly. That is why this is a major feature engineering step.

Among the different types of normalization offered (logistic, log-normal, tanh, min-max, z-score): tanh constrains the data between minus 1 and 1, logistic constrains the dynamic range between 0 and 1, and min-max likewise maps everything to between 0 and 1. Something slightly different is z-scoring, which is a method of standardization: it converts a regular normal distribution into the equivalent standard normal distribution. If you have no prior assumptions about the data, if you do not know anything about its underlying distribution, then it is recommended to use something more generic, which is min-max, since it makes no assumptions about the underlying distribution. Z-score assumes the underlying data is normally distributed; if that assumption is not valid, or you know nothing about the distribution, it is preferable to go with min-max. There is also the option "Use 0 for constant columns when checked", which I think we have gone through before.

To perform the normalization you need to launch the column selector, so please do that, and choose the numeric columns: departure delay and arrival delay, the only two numeric descriptors, need to be selected and moved to the right-hand side. ("One thing, sorry, I wanted to ask about something before this, after the data imputation... no, I'm good, thank you.") So launch the column selector, include departure delay and arrival delay on the right-hand side, confirm your selection, then right-click and Run Selected. This will start the process of normalization.
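For reference, here is a small numeric sketch of min-max scaling versus z-scoring, the two options discussed above; the Normalize Data module is doing the min-max version for us here. The numbers are toy values for illustration:

import numpy as np

dep_delay = np.array([-3.0, 12.0, 250.0, 0.0])

# Min-max: (x - min) / (max - min), which squeezes every value into [0, 1].
minmax = (dep_delay - dep_delay.min()) / (dep_delay.max() - dep_delay.min())

# Z-score: (x - mean) / std, which gives mean 0 and standard deviation 1 instead
# of a fixed [0, 1] range (it implicitly assumes a roughly normal distribution).
zscore = (dep_delay - dep_delay.mean()) / dep_delay.std()

print(minmax.round(3))
print(zscore.round(3))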
It works feature by feature: it takes each attribute, normalizes it, and brings all of its values down to between 0 and 1, then takes the next feature and does the same, until the entire dynamic range of the dataset lies between 0 and 1. It is a homogenization: we get rid of the heterogeneities in the data, all the disparities, by bringing everything onto a common scale. That is normalization.

Once it has run, right-click and look at the Transformed Dataset; let's check whether the columns have actually been normalized. Go to Visualize and click on the two numeric columns. Take departure delay: you can see the values have been normalized; the range is now entirely new, from a minimum of 0 to a maximum of 1 (by range I mean the maximum minus the minimum), and the standard deviation is very small, about 0.018. This is min-max normalization. If you had done z-score normalization instead, there would be no fixed minimum and maximum; what z-scoring does is shift the data to a mean of zero and a standard deviation of one. Anyway, we have done min-max, so the departure delay attribute, and likewise the arrival delay attribute, the other numeric feature, are now constrained between 0 and 1.

Now the time has come to split the data into its two partitions, the training set and the test set. To that end, introduce the Split Data component, which you will find under Data Transformation, Sample and Split; drag it to the workspace and connect Normalize Data to Split Data. How do we configure it? For the fraction of rows in the first output dataset we will put 0.95, so that 95% of the data comes out of node number one; that, so to speak, is mainly our training data. The remaining 1 minus 0.95, that is 0.05 or 5% of the data, comes out of node number two and becomes our test set (as you can see, that is the output that will eventually go to Score Model). So put 0.95, put in some random seed, 12345 or 4567, whatever you like, and set stratified split to True.

What does a stratified split do? "It distributes the output values evenly across training and test?" Almost. It is called stratified sampling. If the original dataset is imbalanced, as it is here, because we have a large number of good cases versus a very small number of bad cases (zero versus one in our target variable), then we have what is called an imbalanced dataset. Say 80% of your data belongs to class zero and 20% belongs to class one: we want that same 80/20 proportion to be maintained across the training and the test sets, so that the assessment of the model is fair and does not get skewed towards the majority class (or the minority class, but mainly the majority class, since it is 80/20). So you maintain the same proportion of samples as in the original dataset and replicate it across the training and test sets; that is stratification, and that is why it is called stratified sampling. Azure will ask us to choose a stratification key column, and for classification problems you can usually select the label column, so launch the column selector and select ArrDel15. ArrDel15 is our target variable, as I said earlier; it is our class column. It is always good to select a stratification column whose values segregate the dataset into distinct categories, and for classification problems the target variable, being categorical, does exactly that. So select ArrDel15, confirm the selection, and that's it: right-click and Run Selected for Split Data.

"For Split Data, the target variable is arrival delay and departure delay?" No, the target variable is ArrDel15, that categorical column. "But initially, somewhere, we selected arrival delay and departure delay." Those are the numeric attributes, which we normalized in the previous step; here, for the stratified split, the stratification key column is ArrDel15, the binary categorical column that takes only zero and one. "Right, so we are working out how many flights are delayed by 15 minutes and how many are not, and stratifying against that column." Exactly.

Now, as I said earlier, we are going to use a decision tree classifier, a decision tree ensemble model, for the classification. Are you familiar with decision trees; have you had the theory class on them? "Yes." Very good. So we want to put what you have learned into action and create a decision tree classification model; however, we would also like to tune its hyperparameters. Because we want to perform hyperparameter tuning, we cannot just leave Split Data as it is: we take the training output of the first Split Data and split it once more, so that we get one, two, and then a third partition. The third one, the one I am pointing to here, is the test set, and the second Split Data gives us a further split, so altogether we will have three partitions: the training set, what we call the validation set, and the test set. The training set will come out of node number one of the second split, the validation set out of node number two, and the test set is the one left at the second node of the first split. So we are taking the training data out of the first Split Data and bifurcating it again into two parts, the training set and the validation set.

To do that, introduce another Split Data component, and again put 0.95 as the fraction of rows. So what do we end up with? The first split sends 95% of the entire dataset out of node one, and 95% of that 95% is roughly 0.95 times 0.95, which is about 0.90, so roughly 90% of the entire data becomes our training set. The other 5% of that 95%, roughly 5% of the whole, becomes our validation set, and the 5% held back by the first split remains our test set. So roughly 90% training, about 5% validation, and 5% test. Why do we need this validation set? In order to perform hyperparameter tuning of our model; that is what it is for. So we have essentially bifurcated the data again with this second Split Data; please make sure you use a randomized split, put in some random seed, set stratified split to True, and use the same target variable, ArrDel15, as your stratification key column, and then Run Selected. With that we are ready, with three different datasets: the training set, the (you could say optional) validation set, and the test set. The validation set, as I said earlier, is used only to perform hyperparameter tuning of the model, nothing else.

"Professor, here we have used ArrDel15 as the stratification key column, but initially there was also DepDel15 in our original dataset." Yes, but the specific use case we are solving here is to predict which flights are likely to incur a delay in their arrival times, only the arrival time. "So if we had to extrapolate this to DepDel15 as well?" Then the use case changes, and DepDel15 would become your target. "If we had to analyse both the arrival and the departure delay of 15 minutes, how would the model look? Can it be a single model, since both columns are available in the dataset?" You cannot do that here. "Why not? If we select those two variables as target variables, will that not work?" You could encode the two into a third, combined variable and use that, but here it is not possible to have two target variables: ultimately the model compares its prediction against one target variable; it has to predict the outcome against one single target variable.
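Outside Azure, the two Split Data modules correspond to a two-stage stratified split, roughly 90% / 5% / 5% overall. A minimal scikit-learn sketch, with a toy imbalanced dataset standing in for the flight data:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"DepDelay": range(1000),
                   "ArrDel15": [0] * 800 + [1] * 200})   # imbalanced target

# First split: 95% train+validation, 5% test, stratified on the target.
train_val, test = train_test_split(df, test_size=0.05, random_state=123,
                                   stratify=df["ArrDel15"])

# Second split: 95% of the 95% for training, the rest for validation.
train, val = train_test_split(train_val, test_size=0.05, random_state=123,
                              stratify=train_val["ArrDel15"])

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(name, len(part), round(part["ArrDel15"].mean(), 3))  # class ratio preserved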
"For our assignment there are two target variables, for example customer churn and..." Your customer churn assignment? Customer churn is only one target variable. What is the target variable in the customer churn case? "Total charges as well, for the telecommunication company." Then you have to do it two different times. "Two different times, that is what I thought." Yes; if you want to, you can do both, but the churn column, the categorical column that says yes or no, whether there is customer churn or not, should be your target variable for the one, and as I said, if you want to solve the other you have to solve it separately. "So there will be two models we have to create; only the initial feature engineering will be the same, and then we bifurcate the Split Data between the two target variables?" No, you will have to build a completely new pipeline end to end with the other target variable. "Got it, okay."

So, back to where we were: I was discussing the importance of the validation set in hyperparameter tuning. We use a validation set because we do not want to bring the test set into the picture while performing hyperparameter tuning: if we did, the test data would already be available to the model and there would be nothing left for it to generalize to. That is why the test set is kept separate and we keep a separate validation set, which is the only data we show the model during hyperparameter tuning; the actual test set is kept insulated from the model until the very end, until the model is actually scored, actually tested.

"Question, Professor: is this kind of model hyperparameter tuning required in all cases, or only specific ones?" For any model, any prediction task, it is always recommended. The reason is that you can pick a certain collection of parameter values, but you do not know whether those values are really the optimal ones. So how do you select them; how do you eventually arrive at the optimal hyperparameters? "The validation set helps to confirm whether it is done properly; you can compute metrics on the validation set as well." Exactly. For hyperparameter tuning we have to train the model and test it: take one configuration, train and test; take another configuration, train and test; take a third, train and test; then compare the results and see which one is best, which one produces the lowest average log loss, the lowest value of the loss or error function. That one becomes what Azure calls the "best trained model". I would not call it the best; I would say it is an optimal model, the one we should use. But the point I am trying to arrive at is that the data we use to test the model during tuning is not the actual test set that you see coming out of the first Split Data; we simulate it by creating a separate validation set, and that is why we call it the validation set. "Correct." We train the model on the training set, but we validate it on the validation set, so that the actual test set is kept isolated from the model: the model knows nothing about it, it has never seen it, because we want to genuinely check the performance of the model later on; we do not want to show it to the model yet. "To avoid bias." Yes, to avoid bias; that is the entire intention. If the model has already seen it, how will it generalize? It can truly generalize only if you test it on samples it has never seen before.

"A question before you proceed: does this validation set support k-fold validation?" This is just a validation set; k-fold cross-validation is something different. Here, each time you train and test, you give the model one fixed training set and one fixed validation set to validate against. In k-fold, you take the dataset and divide it into folds: if the value of k is, say, three, then in each iteration one third of the data is the test set and the remaining two thirds is the training set, and there will be three iterations because k is three. In the next iteration a different one-third partition becomes the test set and the remaining two thirds the training set, and in the third iteration the last one-third partition becomes the test set and the rest the training set. So every segment of the data is used both for training and for testing; that is cross-validation. "Sure, thank you. My question was whether Azure ML supports k-fold validation." I have not come across an instance where it supports k-fold cross-validation in this flow, but let me email you about that. There is, actually, one component, Cross Validate Model. Keep in mind that cross-validation is mainly recommended when your dataset is very small: if the dataset is already very small, it is not advisable to split it further into training and test, but you can cross-validate it, and that is how you train and test your model in that situation. Any more questions at this point?

So now we can introduce the decision tree. Type "decision tree" in the search box and you will see the Two-Class Boosted Decision Tree; please drag the Two-Class Boosted Decision Tree onto the workspace.
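To make the k-fold idea from the question above concrete, here is a small scikit-learn sketch: each of the k folds takes a turn as the test partition while the remaining folds are used for training, which is different from the single held-out validation set we are using in this experiment. The data and model are toy stand-ins:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A small imbalanced toy dataset (80% / 20%), similar in spirit to ArrDel15.
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # k = 3 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=cv, scoring="f1")
print(scores, scores.mean())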
We will be performing the classification using this Two-Class Boosted Decision Tree. First, a little history: the decision tree algorithm was originally developed by the scientist Ross Quinlan, and there are several variations of it, such as CART (C&RT), C4.5, C4.8 and so on. A decision tree can be used for both classification and regression. The way it works is that the tree first calculates what is called an attribute selection measure. There are basically two or three different types of attribute selection measure: examples are information gain, the gain ratio, and the Gini index. By means of these measures the decision tree decides which attribute becomes the root, the starting node: the starting node is the one with maximum entropy, maximum information (the more classes are mixed together, the more information, the more entropy). In a decision tree you start at maximum entropy and keep splitting until you reach a leaf node, where the entropy becomes zero. At every point you select the best, most optimal attribute, the one with the greatest capacity to separate, to stratify, the samples at that point; and to find which attribute that is, from among all the available attributes, you compute the attribute selection measure (information gain or gain ratio), and the feature with the maximum value is selected each time. In theory, that is how decision trees work.

Now look at the module's parameters. If you click on the Two-Class Boosted Decision Tree you will see "Maximum number of leaves per tree". This is a hyperparameter of the tree that sets the maximum number of leaves allowed in each tree; for example, if you set it to 10, each tree can have at most 10 leaves. If you increase this value you allow more complex trees, which can potentially improve accuracy, but there is a risk of overfitting, and it can also increase the training time. Then you have the "Minimum number of samples per leaf node". The leaf nodes are the terminal nodes of the tree, and this is the minimum number of samples required to create a leaf; for example, if it is set to 5, you are saying a leaf node must contain at least 5 samples. You can set it to higher values: higher values can promote generalization, but they may also cause the tree to miss certain patterns, so you have to be judicious in selecting them.

Then there is the learning rate. The learning rate matters when we talk about optimization, where optimization here means the minimization of the loss function. If you keep the learning rate at a reasonable, smaller value, the algorithm may take more time to converge, but in all likelihood it will reach the global minimum; if you set it to a very high value it may converge in minimal time, but there is a possibility it will overshoot the global minimum. So you have to select the learning rate judiciously as well; a typical value to start with, such as 0.01, usually works well. The other hyperparameter is the "Number of trees constructed", which refers to the total number of trees we are going to use in this boosted ensemble. In boosting, the trees are added sequentially, and every decision tree that you add works on correcting the errors of the previous trees. The total number of trees is the size of the ensemble: you can have as many trees as you want, and increasing the number of trees can improve performance, but it may also increase the training time and the computational cost, so those are things to consider. Finally, you can put in a random number seed so that any randomization is reproducible. So configure the parameters, right-click, and Run Selected.

Then we move on to Tune Model Hyperparameters. This is where we introduce a new component called Tune Model Hyperparameters: just type "hyper" in the search box, it will appear, and drag it onto the workspace. This is the component with which we perform the hyperparameter tuning, and it offers three different parameter sweeping modes. What do they mean? "Entire grid" refers to the entire grid of the hyperparameter space: it performs an exhaustive search over all the values each hyperparameter can take. A very intensive computation takes place, taking every value of every hyperparameter and exhaustively searching the whole space, so it increases the computational complexity and it may also increase the training time. That is the entire grid.
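To make the four knobs above concrete, here is a small scikit-learn sketch of a boosted decision-tree ensemble with roughly equivalent parameters. GradientBoostingClassifier is only a stand-in for Azure's Two-Class Boosted Decision Tree, so the mapping is approximate, and the dataset is a toy one:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

model = GradientBoostingClassifier(
    max_leaf_nodes=20,      # maximum number of leaves per tree
    min_samples_leaf=10,    # minimum number of samples per leaf node
    learning_rate=0.1,      # smaller values converge more slowly but more safely
    n_estimators=100,       # total number of trees constructed in the ensemble
    random_state=123,       # random number seed for reproducibility
)
model.fit(X, y)
print(model.score(X, y))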
On the other hand, there is the "random grid" mode, and what the random grid does is not search the entire hyperparameter space exhaustively. Let me give an example to make this easy to understand. Say you are in your house, searching for a fountain pen that you left somewhere. Only you know the likely places in the house where you may have left that pen, so you go to just those locations and search there, and in one of those locations you find it. What this saves you from is searching the entire house: you do not have to go to every single room and look inside every single cabinet and drawer. Searching everything is what happens with the entire grid. What happens with the random grid is that you select a limited number of points that are likely to give you a result and search only those, which reduces the computational and time complexity of the search and makes it much faster; but the chance of finding the most optimal set of hyperparameters is higher with the entire grid, because it does a fully exhaustive search. Then there is "random sweep", where, instead of searching specific values, you take ranges of values for the hyperparameters and search over those ranges; that is what a random sweep does.

I selected the entire grid here; it will not take too long, because there are not too many hyperparameters to search, so it is fairly quick. You can also try the random grid if you want; it is up to you. Next, we need to specify the label column: launch the column selector and include ArrDel15, our target variable, as the label column. Then we are asked for the metric for measuring performance for classification. Set it to F-score. I could also have chosen accuracy, but the F-score is better because, remember, we are dealing with an imbalanced dataset, and for an imbalanced dataset the F-score is the more balanced choice: it gives a more realistic view of the model's performance than accuracy. Accuracy can be skewed, because if the majority of the correct classifications come from the majority class, accuracy will look good regardless. So we use the F-score. The metric for measuring performance for regression does not matter here, because we are not solving a regression problem.
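As a sketch of what the sweep modes and the F-score choice amount to outside Azure, here is the same idea in scikit-learn: an exhaustive grid search versus a random search over the same parameter space, both scored with F1 because the classes are imbalanced. The parameter values and data are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

param_grid = {
    "max_leaf_nodes": [10, 20, 40],
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [50, 100],
}

# Entire grid: tries every combination (3 x 3 x 2 = 18 candidates).
grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    param_grid, scoring="f1", cv=3)

# Random grid/sweep: tries only a fixed number of randomly chosen combinations.
rand = RandomizedSearchCV(GradientBoostingClassifier(random_state=0),
                          param_grid, n_iter=5, scoring="f1", cv=3,
                          random_state=0)

grid.fit(X, y)
rand.fit(X, y)
print(grid.best_params_)
print(rand.best_params_)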
The regression metric setting is not going to get evaluated here; the important one is the metric for measuring performance for classification. So set that, and that's it: right-click on it and let it train.

All right, at this point let's take a break. "We are almost at 9 p.m.; maybe five minutes, sir, and then we'll come back?" We will overshoot by another ten minutes at the most, so let's make it a ten-minute break, and until then please start the run on Tune Model Hyperparameters, because it is going to take a while for the model to perform its hyperparameter tuning, and maybe a little while for Train Model as well, though not much. If you are back before it finishes you can also run Train Model, but let's stop at Tune Model Hyperparameters, take the ten-minute break, and reconvene at around 9:08 p.m. "Okay, thank you."

Okay, let's resume from where we left off. As you can see, for my Tune Model Hyperparameters I started with the entire grid but changed it to the random grid, and I limited the maximum number of runs on the random grid to only five, so it will try five different configurations, see which one produces the lowest loss, and select that as the most optimal model. It is still taking a while for me; if it has already completed for you, good. "It is taking a lot of time, about 15 minutes, and it is not completing." If you are running the entire grid, please stop it and select the random grid. "How do you stop it, Professor?" Just click the Stop button at the bottom of the screen; stop it and then specify the parameters you want for the random grid. "Meanwhile, Professor, can you click on Train Model so we can look at it?" Yes. Meanwhile, if the tuning is already done for you, you can set up the rest of the pipeline. For Train Model you only need to give it, using the launch column selector, the name of the target variable, which is ArrDel15; the Train Model component only expects you to enter the label column, so please do that. Out of the second output node of Tune Model Hyperparameters you get the most optimal model, so connect that to the Train Model component; the training data still comes from the Split Data output here, so select that as well and use it to train the model. "Sorry, Professor, how do I stop this?" There is a Stop button, you can see it at the bottom; that will stop the execution, and then you can change the parameter sweeping mode to random grid, which should take less time than the entire grid. After you are done with Train Model, the next thing to select is Score Model, again from the left-hand side. "Sorry, Professor, I am lagging a little; I am still at Tune Model Hyperparameters; I stopped it." That is fine; just execute it. I am going over this for the benefit of those who are already done.

So after Train Model, the next item is Score Model, and this is where we test the already-trained model coming out of Train Model. "Is there any parameter tuning in Train Model?" No; the best set of hyperparameters comes in from Tune Model Hyperparameters, and using that best set the model gets trained on the training dataset. That trained model is then input to the Score Model component, which applies it to the test dataset, to see not only how well the model has been trained but whether it can really generalize to these unseen test samples.

"One question: I pressed some other tab, like Datasets; how do I go back to the experiment?" You can always click on Experiments; you will see the experiment you are working on, provided you saved it; click on it and it will take you back. "Thank you, I am back." "Professor, could you click on Train Model once again, please?" Yes, there. So, as I was saying, with Score Model we test the model. "Professor, I missed how to bring back the left panel from which we drag and drop." Click on the Experiments tab, then on the experiment you are working on, and it will take you back. "I am on my model, but I am not able to see the left panel where we search; I think it is still executing; that is why I cannot drag and drop Train Model." That is correct: if it is still executing, you have to stop it, and then the panel will appear and you can make the remaining connections. "Professor, can we look at the parameters for Train, Score and Evaluate?" There are no parameters for Score Model and Evaluate Model; only for Train Model do you need to input the label column, which is ArrDel15, the target. "Last time I ran it, it did not take this long." Maybe my internet speed is not up to the mark today; earlier it completed quite quickly.

Okay, so here is what I will do. Last time I checked, I had a score of 100%: the accuracy was 100%, the F1 score was also 100%, and the precision and recall were also 100%. I will send you a screenshot of my results once this run has concluded, and in the meantime you can also keep running your own model. But let's stop here, because I do not know how long this is going to take. When I send you the screenshot of the results, if you have any questions please do not hesitate to email me; I am sending my email in the chat window here, so please write to my GGU email ID.
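For reference, the Train Model, Score Model and Evaluate Model chain corresponds to the familiar fit / predict / report pattern. A minimal scikit-learn sketch with toy data standing in for the flight dataset (this illustrates the pattern, not Azure's internals):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)  # Train Model
scored = model.predict(X_test)                                            # Score Model
print(classification_report(y_test, scored))                              # Evaluate Model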
Sir, if you can add audio to that, it will be very good — just five minutes, like you explained here, telling us what the Score Model shows; that will make it very easy for us. Okay, sure, I'll make a small video and ask for that video file to be shared with you. That would be helpful, thank you so much, Professor. You're very welcome, I'll definitely share that small video with you. And should you have any questions related to the results that you see, please email me on my GGU email; I've shared it in the chat window. Sir, one quick question about outliers: does data cleaning automatically remove them, or is there a specific step where you remove outlier data? It's a general question, not necessarily about this experiment. Okay, so for outlier analysis you have to look at the data: you can do something like a box plot to see which data points lie beyond plus or minus 1.5 times the interquartile range. If a point is beyond that range, technically it is an outlier, and if those are outliers you can remove them. (A small pandas sketch of this rule follows below.) Oh okay, thank you. Now, about the assignment, questions two and three on customer churn and total charges: is it expected that we build two models, one for each target variable? Sorry, I lost your audio for a second, can you please repeat that? For the assignment where customer churn and total charges for a telecommunications company is the scenario, is it expected that for all four questions we give details for both output variables? I didn't quite understand the question. So there are two scenarios, predicting customer churn and — Parag, you may have to direct that question to Dr S. No, no, please let me know, I'll answer. Okay, so the scenarios are predicting customer churn and total charges for a telecommunications company, and the dataset given is about that telecom company, yes. There are a few questions, four questions, around that; is it expected that for each of the questions we give the details for both of these target variables? No, it will be specifically mentioned what is expected where; I think for one or two it is screenshots, and the others are written, so only for those which are written do you need to specifically explain whatever is asked. Is that assignment one? It's assignment questions two and three — assignment one, questions two and three. Okay, assignment one; just a moment, I'll pull it up and share the screen, and then tell me what question you have. One quick minute... yes, I found it, let me share it now. Do you see it? Yes, this is the assignment, right? Yes. So what is your question, related to which part specifically? You need to answer these questions; what is expected is given, for example "provide screenshots and a summary of your findings." So there are two target variables here, it looks like: customer churn and the charges. Where exactly? Which part? "Build and train..." Yes, here is what it is asking, right?
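As a concrete illustration of the ±1.5 × IQR rule mentioned above, here is a small pandas sketch. The file name and the "MonthlyCharges" column are assumed names for illustration, not something provided in the lab.

```python
# Minimal sketch of the 1.5 x IQR outlier rule discussed above, using pandas.
import pandas as pd

df = pd.read_csv("telco_customer_churn.csv")     # hypothetical local file

q1 = df["MonthlyCharges"].quantile(0.25)         # first quartile
q3 = df["MonthlyCharges"].quantile(0.75)         # third quartile
iqr = q3 - q1                                    # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers and dropped.
mask = df["MonthlyCharges"].between(lower, upper)
print("outliers removed:", (~mask).sum())
df_clean = df[mask]
```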
So basically, build and train both a logistic regression model — meaning this one is for classification — to predict churn, yes or no. Your dataset is here; let me pull it up very quickly. See, this is your dataset, and the column that says Churn is basically your target label, your target variable. Then there is also Total Charges as well. For classification, Churn is the target variable, but when it comes to regression, Total Charges is your target variable: Total Charges is the target for regression, Churn is the target for classification. That's right. In regression you are trying to predict total charges as a continuously valued variable, and for churn you are predicting a discrete variable which has two discrete values, yes and no — a binary classification. So Churn is your target variable for logistic regression, Total Charges is the target variable for linear regression, and the rest are your features, depending on whatever turns out to be important. Coming back to the question: it says build a linear regression model to predict total charges, and detail the steps taken for data preparation, which means all of the cleaning and handling of missing data, segregating the categorical columns from the numeric ones, and any data imputation and normalization — all the feature engineering steps, including the categorical variables. Then evaluate the model using appropriate metrics. For classification, the Evaluate Model component is going to give you the classification metrics, which are accuracy, precision, recall, AUC and the F1 score. If you use the same Evaluate Model with your linear regression model — the Linear Regression component is here, this is the one you'll be using — and you do the hyperparameter tuning, then Evaluate Model will give you a set of regression metrics which are based on errors: mean squared error, root mean squared error, relative absolute error, relative squared error, and finally what is called the coefficient of determination, or R squared.
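To make the two model types and their metrics concrete, here is a hedged scikit-learn sketch: logistic regression with classification metrics for churn, and linear regression with error-based metrics plus R² for total charges. The file name and the "Churn" / "TotalCharges" column names are assumptions about the Telco dataset, and the data is assumed to be already cleaned and numerically encoded.

```python
# Sketch of the two models and metric families discussed above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, mean_absolute_error, mean_squared_error, r2_score)

df = pd.read_csv("telco_customer_churn_clean.csv")   # hypothetical cleaned, encoded data

# --- Classification: churn (assumed encoded as 0/1) is the target label ---
X_cls = df.drop(columns=["Churn", "TotalCharges"])
y_cls = df["Churn"]
Xtr, Xte, ytr, yte = train_test_split(X_cls, y_cls, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
pred = clf.predict(Xte)
proba = clf.predict_proba(Xte)[:, 1]
print("accuracy :", accuracy_score(yte, pred))
print("precision:", precision_score(yte, pred))
print("recall   :", recall_score(yte, pred))
print("F1 score :", f1_score(yte, pred))
print("AUC      :", roc_auc_score(yte, proba))

# --- Regression: total charges is a continuous target ---
X_reg = df.drop(columns=["TotalCharges"])
y_reg = df["TotalCharges"]
Xtr, Xte, ytr, yte = train_test_split(X_reg, y_reg, test_size=0.3, random_state=0)
reg = LinearRegression().fit(Xtr, ytr)
pred = reg.predict(Xte)
mse = mean_squared_error(yte, pred)
print("MAE :", mean_absolute_error(yte, pred))
print("MSE :", mse)
print("RMSE:", mse ** 0.5)
print("R^2 :", r2_score(yte, pred))   # coefficient of determination
```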
R squared is a measure of the goodness of fit of the model — how closely the model fits the given data. Think of the dependent and independent variables: how much of the variation, or variance, in the dependent variable is explained by the independent variables is what the coefficient of determination, the R squared value, shows. In other words, it is the percentage of the variance in the dependent variable that is explained by those independent variables; whatever is not explained is not captured by those independent variables, it is coming from somewhere else. So if the coefficient of determination is 0.85, it means 85% of the variance in the dependent variable is actually being explained by the independent variables, and the remaining 15% cannot be explained by changes in the independent variables; it is unknown what is causing that part of the variation in the dependent variable, but certainly it is not being explained by the independent variables — it comes from something else we don't know. That is what R squared means, and coefficient of determination is the same thing as R squared. My Tune Model Hyperparameters just completed successfully, by the way. Good — so that is what it means. Any other question you would like to ask? About number six: given the dataset with customer information, perform an exploratory data analysis. The best way you could do this is by using Power BI; you've had sessions with me on Power BI, haven't you? Yes, we have. Right, so take some cues from there, and it will be very effortless and simple to do this using Power BI: just import the dataset into Power BI. Do you want me to show you how? Yes, some of it, especially how to see the outliers, the part that would answer the question. Okay, I'll walk you through one or two steps of how to open Power BI and import the data; let me open Power BI. Hi Professor, one question while you open Power BI: for the regression, do we need to include churn as an independent variable? Are you asking about regression? Yes, for linear regression: do we need to include the churn column as one of our independent variables, or can we skip it? You mean the churn column here in the dataset? Yes, that's correct, for our linear regression problem. It's quite up to you, but it could serve as additional information for you, as a categorical descriptor. Okay, thank you. Professor, slightly digressing: the decision tree run completed and I saw the result, 100%, like you mentioned. Right — which means the model is generalizing very well to the test data, and it means the model is likely to perform very well on unseen new data: if you deploy this model on the web and give it completely new data, it is likely to perform extremely well on that data as well. One question, Professor: when we say we give it new data, where do we insert that data? For that, you deploy it: you set up the web service here and then deploy it, and once you set it up as a web service there will be a web page. First you'll have to set up the web service — you see the Set Up Web Service button down below? You'll have to run the experiment again; once you run it again, Set Up Web Service will become active, then you set up the web service and deploy the model. It will deploy it on a different website and take you there.
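Once a model has been deployed as a web service, scoring new data from code usually amounts to an authenticated HTTP POST. The sketch below only illustrates the idea: the URL, API key, column names and JSON shape are all placeholders, and the exact request format should be copied from the API help page that Azure ML Studio generates for your own deployed service.

```python
# Illustrative only: scoring one new sample against a deployed web service.
# URL, key, column names and JSON schema below are placeholders -- use the exact
# request format shown on your own service's API help page.
import json
import urllib.request

url = "https://<region>.services.azureml.net/workspaces/<id>/services/<id>/execute"  # placeholder
api_key = "<your-api-key>"                                                            # placeholder

body = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Carrier", "OriginAirportID", "DepDelay"],  # example feature columns
            "Values": [["DL", "10397", "12"]]                            # one new, unseen sample
        }
    },
    "GlobalParameters": {}
}

req = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + api_key},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))   # scored label / probability for the new sample
```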
There it will show all the inputs — basically boxes where you can enter values against all the different features. But if I have to upload a file, just like we did now, can I do that as well? Yes, you can either type in single samples or submit a batch of samples as an Excel file through that website, and it will give you the output, because by then the model is fully trained. So it can work either on single sample data or on batch data, but you need to deploy it on the web. Yes, you need to set up the web service and deploy it. Sir, are you showing us Power BI? Yes, sorry. In particular I would like to see how you do data transformation, because I think we missed that during the class. That much will not be possible, but a little bit related to the assignment, because that's the first part — yes, question six. Okay, so I have launched Power BI. You can open a blank report, and you will land on this interface. What you do is click Get Data — and I think your Telecom customer churn file is... is it an Excel file? No, it's a CSV file. Okay, so you click on Text/CSV here and then import that file. I'll go to the exact location where I kept that file; here it is, Telco customer churn, and Open. Once you open the data it will automatically use the connectors to import the dataset; it's going to give you this dialog box and ask you to load it. You can use Get Data and load the CSV directly here, or, before you load it, you can use Transform Data. If you click Transform Data, you can perform some cleaning directly here, but that kind of data imputation will be easier, far easier, to do in Azure itself. Here you can choose columns, remove a column you don't want, things like that, using what is called the Power Query Editor, but that will be a little more complicated for you, so the easier thing is to perform all the data imputations in Azure first. Actually, I had a doubt: just for visualization, do we need to do the transformation? No, no. What you're going to do is, in your Azure workspace, you will have the Telecom dataset, you will perform some feature engineering, and finally, after you have cleaned the missing data and just before normalization, you download that cleaned dataset; you can save it as a dataset. Sorry — you're not seeing my screen? We are seeing the Power BI screen. Okay, do you see my Azure screen now? Yes, we can. So you're going to do some data transformations on your Telecom customer churn data, and after you clean the missing data you can download that data: convert it to CSV here, download it, and then upload that clean dataset into Power BI. So let's assume that here I'm uploading the cleaned dataset.
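If you prefer to script the cleaning instead of doing it in the Azure or Power BI interfaces, a rough pandas equivalent of the steps just discussed (coercing charges to numeric, imputing missing values, separating and encoding the categorical columns, exporting a clean CSV) might look like this. The file path and column names are assumptions about the Telco churn data.

```python
# Sketch of scripted data preparation: cleaning, imputation, categorical encoding.
import pandas as pd

df = pd.read_csv("Telco-Customer-Churn.csv")   # hypothetical local path

# "TotalCharges" sometimes arrives as text; coerce it to numeric, turning bad values into NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Simple imputation: fill missing numeric values with the column median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Separate the categorical columns and one-hot encode them for modelling.
categorical_cols = df.select_dtypes(include="object").columns.drop("customerID", errors="ignore")
df_encoded = pd.get_dummies(df, columns=list(categorical_cols), drop_first=True)

# Save the cleaned dataset so it can be loaded into Power BI (or any other tool).
df_encoded.to_csv("telco_customer_churn_clean.csv", index=False)
```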
This is my cleaned dataset already, so I can just click on Load here; I don't need to transform it, it's already transformed, I can directly load it and it will appear on the right-hand side, and then I can start creating visualizations with it. It is still loading the dataset using the connectors in Power BI... there, now you see the Data pane on the right-hand side — the data has loaded. Now what I can do is create some kind of visual by clicking on a template here to do my EDA, to make sense of the data. I can create this visual and say, let me look at something like monthly charges on the Y axis, and on my X axis something like gender, maybe; you can see you can create a visualization like this, and maybe I'll add churn as my legend here, so it shows churn or no churn, and I can create visualizations like this to do my EDA. I can change the sort of visualization I want and experiment with all kinds of visualizations available here; this is the way you can create your visualizations directly, and this makes EDA easy, which Azure does not support so well. Somebody was also asking about outliers, how we identify them in this. Right, outliers are a little more complicated: you will have to do a box plot; only a box plot can show you that. If you want a box plot, go to Get More Visuals here, and you can import a box plot from there. Professor, I used KNIME to solve this problem. Okay, fine, KNIME also has good support for visualization; if you are familiar with KNIME you can also use that. I'm more comfortable with KNIME. Yes, that's fine too. So here you can use what is called a box plot — you can use any tool you want, I'm just showing you — something like this box plot visual; you can import this and use it. So that means we need to import certain visuals, only for the box plot? Yes, if you want, you can import particular types of visualizations from here by going to Get More Visuals; the majority of them are available there. Can we use any tool of our choice, like Tableau, for the visualization, if a certain tool is not already mentioned? I think it's mentioned that you can use anything — I'm not sure it was mentioned. If it is not specifically mentioned, and Dr S is okay with you using any tool of your choice, then you can use any tool of your choice. It is specifically mentioned, Professor, that we can choose any tool of our choice other than those explained in the session. Okay, then you can definitely use any tool of your choice. But can we use these tools — Power BI, Azure ML and KNIME? You can use Azure, KNIME, RapidMiner, Tableau, tools other than Power BI, any tool that you're comfortable with, as long as it gives you the results. Whether the visualization comes from Power BI, RapidMiner or Tableau hardly matters; if it gives you the same visualization in all three, that is okay. Power BI is a fairly easy tool to use.
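For those more comfortable scripting their EDA, here is a small matplotlib/seaborn sketch of the same kind of visuals built in Power BI above: average monthly charges by gender split by churn, plus a box plot that doubles as an outlier check. The file name and column names are assumptions about the Telco dataset.

```python
# Sketch of the EDA visuals discussed above, done in code instead of Power BI.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("telco_customer_churn_clean.csv")   # hypothetical cleaned file

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: monthly charges on the y-axis, gender on the x-axis, churn as the legend.
sns.barplot(data=df, x="gender", y="MonthlyCharges", hue="Churn", ax=axes[0])
axes[0].set_title("Monthly charges by gender and churn")

# Box plot: a quick visual check for outliers in monthly charges.
sns.boxplot(data=df, x="Churn", y="MonthlyCharges", ax=axes[1])
axes[1].set_title("Monthly charges distribution by churn")

plt.tight_layout()
plt.show()
```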
Power BI is also very versatile and very extensive, with a huge number of visualizations, especially if you use Get More Visuals, so you can experiment and see. Only one challenge in Azure, Professor: we found that we were not able to obtain p-values. P-values are easy to obtain if you can use R Studio; there is statistical software called R Studio, and if you can use the R programming language it is easy to find the p-values and do any such kind of hypothesis testing. It is primarily intended as statistical software, it's free, completely free, and you can install R Studio. It may take a day or two to learn, but you need not learn everything — just whatever is related to the assignment, just enough to get the result; a search on the internet will give you the commands to use and you can get the results from there.
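R Studio is the tool suggested above for p-values; as an alternative sketch, the same kind of hypothesis test can also be run in Python with scipy — for example a Welch t-test on whether mean monthly charges differ between churned and retained customers. The file and column names are assumptions (churn assumed encoded as 0/1), and this is only one of many tests you might run.

```python
# Sketch of a hypothesis test producing a p-value, as a Python alternative to R Studio.
import pandas as pd
from scipy import stats

df = pd.read_csv("telco_customer_churn_clean.csv")   # hypothetical cleaned file

churned = df.loc[df["Churn"] == 1, "MonthlyCharges"]
retained = df.loc[df["Churn"] == 0, "MonthlyCharges"]

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(churned, retained, equal_var=False)
print("t statistic:", t_stat)
print("p-value    :", p_value)   # a small p-value suggests the difference is unlikely to be chance
```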
Is the assignment due at the end of next week or in the middle of next week? I think just three days more — I don't see a date here. It's on the 10th, Professor. The 10th, okay; initially it was the 7th and it has been extended by three days. Professor, is it possible for us to extend it, because a lot of us are very new? Please request Dr S if you like; he can extend it, I cannot. It is indeed a complex assignment this time. Yes, please request Dr S, and if he allows it, certainly — I have no issues, no problems; especially those of you who don't have data science experience may need more time, I understand, and all of you are working professionals. Professor, just one more request: how do you change the colours in a visual? Today I was trying very hard and couldn't — the colour of each column, so you can see them properly with different colours. For what you have plotted just now? Yes. You have to go to Format your visual and go to Legend here — no, sorry, not Legend, let me see. That's the option I couldn't find, the colour selection. You can select colours for the columns in the Columns section — yes, here, you have Color here, so you can change the colour from here, and Border as well. It depends on the visual type: if you put a category column on the x-axis, then you will see the column names; if you have multiple columns, you go to Format your visual, and for any of these types of visuals — pie charts and so on — you always need to go to Format your visual, and then, see, it lists the series names here. Something like this, very simple, right? Go here to Slices and change the colours, to yellow for instance. So for every visual you have to go to Format your visual and check the settings. Okay, sure, thank you. Even the scatter plot — the scatter plot is also sometimes used, right? So if you put something like monthly charges on the x-axis and total charges on the y-axis — not summarized, set it to Don't Summarize — something like this, right? If you have a scatter plot like that, then you can again go to Format your visual; these points are called markers, and if I want to change my markers I can go to Markers and change the marker style — I can say something like triangle, to make them triangular — and go to Color, and this time use some other colour. Okay, sure, thank you. I think that wraps it up, thank you so much. You're welcome. Thank you everyone — shall we stop here, any more questions? Thank you, Professor. Yes, so if you need more time, please make that request to Dr S and it will be passed on; I'm sure he will be generous and extend it for you. Thank you, and thanks for the extra time, bye-bye. Thank you very much, your time is very much appreciated, everyone. Thank you sir, thank you Professor, bye, thank you very much. And yes, please put in the feedback before you leave, it will be of great help; thank you, and we're going to see each other again in one of the future sessions. Sure, thanks. Good night, thank you so much, have a good night everyone, good night.