Do you want to build a machine learning model without having to code? If you answered yes, then this video is for you, because today I'm going to be showing you a very simple machine learning software that will allow you to build a machine learning model with a simple point and click. And so, without further ado, we're starting right now.

So the software that we're going to be using today is called Weka, W-E-K-A. Let's Google that, "weka", and it's the first link here: Weka 3, data mining with open source machine learning software. This software is built, or coded, in Java, and it's point-and-click software, so we can simply download it by clicking on Download. And for those of you who are interested, it will also allow you to use scikit-learn, R, and Deeplearning4j; I'll probably be covering those in a future video, but in this video I'm going to be showing you how you can build a very simple machine learning model using Weka.

So the first thing that you want to do is click on Download, and then, depending on which operating system you are using, download one of these files and install it onto your local computer. Okay, so pause the video for a moment while you download and install it.

Okay, and now that you have installed Weka, let's open it up. So this is the Weka GUI Chooser, and the one that we're going to be using is the Explorer, so you want to click on this one. Weka stands for Waikato Environment for Knowledge Analysis, and coincidentally, the weka is also the name of a bird found in New Zealand. The software is developed by the University of Waikato. A point to note is that this software was the first machine learning software that I personally used in my early days venturing into machine learning and data science, back in 2005; at the time all of this was pretty much known as data mining, and this was my favorite
tool, which I used for several years while developing my early machine learning models; I could use neural networks, support vector machines, decision trees, and random forests. In this video, I'll be showing you how you can build a simple model.

So let's get familiar with the general GUI interface. You're going to see that there are a total of six tabs here, and by default it starts at the Preprocess tab, where you're able to import your data with Open file. Okay, so let me first find a data set for you to play around with. Why don't you click on Open file, navigate to Program Files, find the Weka-3-8 folder, and then click on the data folder. In this folder there are a lot of data sets, so let's try the CPU data set; click Open.

Right here you're going to see that this area tells you the general features of the data set. Here, Instances is 209, meaning that there are a total of 209 rows, or 209 data samples, and there are seven attributes, which are the columns, so there are seven variables. The seven variables are shown below, where the first six are the independent variables and the seventh is the dependent variable, or class variable. You're going to be making use of the six independent variables in order to predict the dependent variable. On the right-hand side, you'll see the general statistics of each of the variables; for variable number seven, the minimum value is 6 and the maximum value is 1150.
The mean value is 105.622, with a standard deviation of 160.831, and if you click on each of the variables, you'll see the minimum, maximum, mean, and standard deviation values changing. Normally, in order to build a model, like a regression model or a classification model, we need to do some data scaling, because each variable has a different minimum and maximum value, and even a different mean and standard deviation, which can be problematic for building the model. So what we need to do is perform a data scaling approach whereby we put each of the independent variables on a similar scale. You can see here that the minimum and maximum values are all different, so upon doing a min-max normalization, the minimum value will be scaled to zero for all six variables, and the maximum value will be one.

So let me show you how to do that. In the Filter panel, click on Choose, click on the triangle, find unsupervised, then attribute, and then find Normalize. Click on Normalize, and then click Apply. After clicking Apply, you're going to see that each of the six variables now has a minimum of 0 and a maximum of 1, so all of them are now comparable.
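Outside of Weka, the same two scaling filters are easy to reproduce in Python. Here is a minimal sketch of what Normalize (min-max scaling) and Standardize (mean centering plus unit variance, used later in this video) compute; the column values are illustrative, not taken from cpu.arff:

```python
import numpy as np

def min_max_normalize(x):
    # Like Weka's unsupervised attribute filter "Normalize":
    # rescale so the minimum maps to 0 and the maximum maps to 1.
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    # Like Weka's "Standardize" filter: mean-center to 0 (mean centering)
    # and divide by the standard deviation (unit variance).
    return (x - x.mean()) / x.std()

# Illustrative values only; one attribute in the cpu data spans 6 to 1150.
col = np.array([6.0, 105.0, 520.0, 1150.0])
print(min_max_normalize(col))   # minimum becomes 0.0, maximum becomes 1.0
print(standardize(col).mean())  # approximately 0.0
```

Applied to every independent variable, this puts them all on a comparable scale, exactly as the filter does in the Explorer.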
So all six are now comparable, and note that only the class variable was not subjected to min-max normalization, because this is the value that we're going to be predicting, so we're not modifying it in any way. All right, and now we're ready to build a model. As you can see, this is fairly simple: you click on Open file to import the data, click Choose in the Filter panel, click on the Normalize function, click Apply, and you have essentially done min-max normalization. When you click on each of these variables, you'll see that the minimum and maximum values are now zero and one, and on the right-hand side, in the bottom panel, you'll see the distribution in the form of a histogram plot, which lets you see the general distribution.

All right, and so now we're ready to build the model. Click on Classify, and under Classifier, click Choose and then, in functions, find LinearRegression. Set the cross-validation number of folds to 10; this corresponds to 10-fold cross-validation. In 10-fold cross-validation, you're essentially splitting the data into 10 segments, using nine segments to build a model, and once you finish building the model, you take it and apply it for prediction on the left-out segment. You then do the same thing over and over, 10 times, so that each of the 10 segments is left out one time and used as the testing set, while each of the 10 iterations uses a different combination of nine segments to build the training model. I'll provide more detail on that in one of the Medium articles that I have written, and I'll be sharing the link in the video description. All right, so let's click Start in order to make the prediction.
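As an aside, the 10-fold procedure just described can be sketched in a few lines of scikit-learn (which, as mentioned earlier, Weka can also drive). The data here is synthetic, standing in for the 209-row, six-predictor CPU set:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in: 209 rows and 6 numeric predictors, like the cpu data.
X, y = make_regression(n_samples=209, n_features=6, noise=10.0, random_state=1)

# 10-fold CV: fit on 9 folds, predict the held-out fold, repeat 10 times
# so that every fold serves as the testing set exactly once.
folds = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")
print(len(scores), scores.mean())  # 10 R-squared values, one per held-out fold
```

Weka's Start button runs the equivalent of this loop for you and reports the pooled cross-validated statistics.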
You can see that you have already created a simple linear regression model using the CPU data set, with a correlation coefficient of 0.9 and a root mean squared error of 69.556, and this is the equation of the multiple linear regression model that we have built. Okay, so it's very simple. If you want a prediction model for your training set, you can just click on that option and then click Start, and now you see the prediction for the training set; and this is the prediction for the cross-validation set, meaning that you split the data into the 10 folds that I already mentioned. If you want to do a data split, you can also click here; normally we do something like an 80/20 split, so I'm going to enter 80. Then you'll see the resulting predictions for your 80/20 split, and more details are here: 80% were used as the training set, whereas the remaining 20% were used as the testing set.

All right, so this is fairly simple. Let's say that you want to use a different algorithm. Let's click on Choose; the MultilayerPerceptron is the backpropagation neural network, so let's click on that. Okay, so the backpropagation neural network got 0.96 for the prediction. SMOreg is the support vector machine; click on that and you get 0.93. It should be noted that for the support vector machine the performance was not so great, because we haven't done any form of parameter optimization; this is using only default values. Let's try other algorithms: under trees, let's try RandomForest, which is my favorite. You can see that random forest got the best performance here, 0.9737, without doing anything, just using the default values. Okay, and let's try something else; this one gave pretty poor results, 0.86, producing a series of rules for the prediction. All right, let's have a look at the Visualize section next.
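As an aside, this whole comparison maps onto scikit-learn equivalents: LinearRegression, MLPRegressor for the multilayer perceptron, SVR for SMOreg, and RandomForestRegressor. A hedged sketch on synthetic data, all with default (untuned) parameters, using an 80/20 split:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the cpu data.
X, y = make_regression(n_samples=209, n_features=6, noise=10.0, random_state=1)

# 80/20 split: 80% for the training set, 20% held out as the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

models = {
    "linear regression": LinearRegression(),
    "support vector machine": SVR(),               # default parameters, untuned
    "random forest": RandomForestRegressor(random_state=1),
}
for name, model in models.items():
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: R-squared = {score:.4f}")
```

As in the video, the untuned support vector machine tends to lag badly here, which is the same lesson: default parameters are only a starting point.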
The Visualize section shows the scatter plots of all of your variables. It's quite small, but you get to see the general distribution of each pair of your variables. Okay, so why don't we try creating our own data set and then use it for prediction? Let me go to the GitHub of Data Professor and download the data on solubility prediction; it is the Delaney solubility data set. I'm going to be showing you how you can prepare an input data file for your Weka prediction. So here, let's download this Delaney descriptors CSV file, and I'm putting it onto the Desktop. All right, that's all we need.

Then I'm going to open up Atom, which is a text editor, and import the data there. Now I'm going to save it onto the Desktop, and I'm going to call it delaney_solubility with the .arff extension; ARFF is the input file format for Weka. Actually, I haven't built a model in Weka for quite some time, so let me have a look at an example data set. Wrong folder; all right, this one, data, and let's take a look at the CPU one. So we're going to give it a name with @relation, and we're calling it delaney; then we define the variables with @attribute lines, stating the data type for each, and afterward we use @data. Fairly simple, and the rest is just simple comma-separated-value format. So I have MolLogP; let me do it like this. They're all numeric, so we go like this, and then @data. Okay, I'm adding an empty line for better readability of the file, and I'm going to save it. That's it; that's all I have to do in order to create the input file for Weka. It's essentially a CSV file where you use the following as the header: Weka will read this as the name of the data set, then it will detect the data types, and the data will be found below. All right, so let's close it.
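The finished file should look something like this. The attribute names follow the Delaney descriptor set used in this video, and the two data rows are illustrative values, not the real ones:

```
% delaney_solubility.arff -- ARFF input file for Weka (illustrative sketch)
@relation delaney

@attribute MolLogP numeric
@attribute MolWt numeric
@attribute NumRotatableBonds numeric
@attribute AromaticProportion numeric
@attribute logS numeric

@data
2.60,167.9,0,0.00,-2.18
1.32,133.4,2,0.25,-1.50
```

The last @attribute listed (here logS) is what Weka treats as the class variable by default, and `%` lines are comments.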
Now let's open the file in Weka. It's on the Desktop, delaney_solubility; click Open. All right, and you can see the histograms of MolLogP, molecular weight, number of rotatable bonds, aromatic proportion, and logS. So let me apply the Normalize filter; or actually, I could also use another form of data scaling, which is Standardize. By standardizing and clicking Apply, the mean and the standard deviation of each variable will be adjusted: the mean will become zero, which is essentially called mean centering, and the standard deviation will become one, which is called unit variance. Okay, let's do that: Apply. And now you see that the mean is zero and the standard deviation is one, while logS, the dependent variable that we're going to be predicting, is left untouched.

Now let's perform the prediction. Let's use linear regression; Start. Okay, and we have 0.873. Let's adjust the settings here a bit and use no form of attribute selection, so we're using all of the features: 0.8722. Let's try cross-validation: 0.8759. Okay, let's try random forest: 0.9405. Let's try the neural network: 0.8577. So random forest is performing the best so far. And you'll see that the panel here lists the timestamp and the machine learning algorithm used to build each model, for quick access, so you can see that random forest is giving the best performance so far using the default parameters. Okay, Visualize, and you see the general distribution of the data set.

And so, if you're finding value in this video, please give it a thumbs up, subscribe if you haven't yet done so, and hit the notification bell in order to be notified of the next video. As always, the best way to learn data science is to do data science, and please enjoy the journey. Thank you for watching; please like, subscribe, and share,
and I'll see you in the next one. But in the meantime, please check out these videos.