Transcript for:
Machine Learning Lecture: Clustering and Applications

In this machine learning class, we will look at clustering from the 5th unit, unsupervised learning. In today's class, we will see what clustering is, the applications of clustering, and clustering as a machine learning task.

After that, we will see one example: identifying the professors who can handle the machine learning subject in a university. Then we will see the different types of clustering; there are three of them.

We will see all of those things in today's class. First, let us see what clustering is. Clustering is used for finding the subgroups, or clusters, in a data set on the basis of the characteristics of the objects within the data set.

So, this is our data set before clustering. In this data there are no labels; it is the raw data.

This raw data is given to the clustering algorithm, and the algorithm will find the subgroups, or clusters. So, this is one subgroup, and this is another subgroup.

It finds the subgroups based on the characteristics of the objects and groups the data accordingly, so that it identifies the different clusters in the given raw data set. Hence, the objects within a group are similar.

If it is one group, that means the objects inside the group are similar to each other, but they are different from the objects in other groups.

Suppose we take one data point from this group and one from that group: these two data points are different. That is, objects from different groups are different, but objects in the same group are similar.

So, this defines the effectiveness of clustering. Clustering is effective when the objects inside a group are similar to each other, but objects from different groups, say one object from this group and one object from that group, are different.

That is what defines the effectiveness of clustering. So, after clustering, we identify the label of each group.
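The idea of effectiveness described above can be sketched numerically. This is only an illustration under assumptions: the 2-D points are made up, "similarity" is taken to be Euclidean distance, and the helper names `mean_within_distance` and `mean_between_distance` are hypothetical. A good clustering should give a small within-group distance and a large between-group distance.

```python
from itertools import combinations
from math import dist


def mean_within_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(a, b) for pts in clusters for a, b in combinations(pts, 2)]
    return sum(pairs) / len(pairs)


def mean_between_distance(clusters):
    """Average distance between pairs of points from different clusters."""
    pairs = [
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(pairs) / len(pairs)


# Hypothetical 2-D data that has already been split into two groups.
clusters = [
    [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4)],
    [(8.0, 8.0), (8.3, 7.6), (7.7, 8.4)],
]

within = mean_within_distance(clusters)
between = mean_between_distance(clusters)
print(f"within-group: {within:.2f}, between-group: {between:.2f}")
```

Here the within-group average is much smaller than the between-group average, which is what "similar inside a group, different across groups" means in distance terms.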

Let us see one example of clustering. We have an advertising company that promotes movies.

Countrywide promotional data is available. The data has the age, location, financial condition, and political stability of the people in different parts of the country. We have the data of almost all the people in this country, and we want to run different types of promotion for different groups of people, so that the movie reaches the right people and the promotion is more effective. We cannot blindly promote the movie to all the people in the country; instead, we select a group of people. Suppose it is a sports-related movie; then we have to target only the youngsters, right?

So, cluster analysis helps this activity by analyzing different sets of people and arriving at different types of clusters. How are we going to identify the people? By their browsing patterns; that is the first step: which sites they visit, their likes and dislikes, their frequently visited sites, and so on. By using all of those things, we can identify people based on their patterns of likes.

Accordingly, we select a group of people to promote our movie to, so that we get more profit. Next, let us see some of the applications of clustering. The first one is text data mining. Text categorization, text clustering, document summarization, concept extraction, sentiment analysis, and entity relation modeling are some of the tasks that come under text data mining.

The second one is customer segmentation. Here we need to create clusters of customers based on their parameters: demographics, financial condition, buying habits, likes and dislikes, and so on. Based on those characteristics, we can group the customers, and these groups will be used by retailers and advertisers to promote their products to the correct segment of people.

The next one is anomaly checking, that is, checking for anomalous behavior in patterns or people, such as fraudulent bank transactions, unauthorized computer intrusions, and suspicious movements on a radar scanner. The next one is data mining. Data mining is mining the required data from a huge volume of data; here clustering simplifies the data mining task by grouping a large number of features from an extremely large data set. The data set size may be a few terabytes, and from such a huge volume of data it is highly difficult to select only the required data or to get the required knowledge, isn't it? For that purpose, we can use data mining. So, these are some of the applications of clustering.

Next, let us see clustering as a machine learning task. Clustering is used to discover rather than to predict an output. That is, it is used to discover structure in existing, unlabeled data instead of predicting a single output. So, clustering is defined as an unsupervised machine learning task that automatically divides the data into clusters, or groups of similar items.

So, here the unlabeled data is given to the unsupervised machine learning model, that is, the clustering model. The clustering model will analyze this data and identify the internal pattern of the data based on its characteristics. Then it will group, or cluster, the data. Here the triangles become one group, the rectangles another group, the circles another group, and these ellipses another group.

And this analysis is achieved without prior knowledge of the data. This is the most important characteristic of clustering. Also, clustering can create new data. For example, these are three different clusters, and these are the data inside each cluster; the clustering algorithm can create new data for a particular cluster. So, these are the new data of these particular clusters.

And clustering labels the objects with a class label. For every object it can produce a label, and that label is called the class label.

In classification we also produce labels, but those are predefined labels. When it comes to clustering, the labels are not predefined. The clustering itself produces a new label after analyzing the internal pattern of the data.

Here, the unlabeled objects are given cluster labels which are inferred entirely from the relationships of the attributes within the data. So, based on the internal relationships in the data, the corresponding label is given to each data point. And the effectiveness of clustering is measured by the homogeneity within a group as well as the heterogeneity between distinct groups.

That is, the data within a cluster are similar to each other, and data from different clusters are different from each other. So, this is the effectiveness of clustering. Next, let us see one example of clustering. The problem is identifying the professors who can handle the machine learning subject in a university. Machine learning is the intersection of statistics and computer science.

So, machine learning is the combination of statistics and computer science, and the professor should have knowledge of both subjects. Only then can he or she handle machine learning. So, we search the internet for the research publications of these professors and see under which subject they have published their papers.

By using a machine learning algorithm, those papers are grouped together to infer the expertise of the professors in three buckets. So, based on the papers, the clustering will form three different groups: statistics, computer science, and machine learning.

That means a professor may publish papers in the statistics discipline, in pure computer science, or in the combination of statistics and computer science. We plot the number of publications of these professors in the university along the two core areas: say the x-axis is statistics-related publications and the y-axis is CSE, that is, computer-science-related publications.

So, these are the papers published by the professors in the university. Now, the clustering algorithm will analyze the patterns in those data and form three different clusters. The first one is the pure statistics professors, who have few computer-science-related papers.

This first cluster has less computer science knowledge. The second one is the pure computer science professors, who have few statistics-related papers, and so less knowledge of statistics. And the third cluster is the professors who have published papers in both areas, computer science as well as statistics.
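The three buckets described above could be sketched as follows. This is a minimal illustration, not the method used in the lecture: the professor names and publication counts are invented, a plain k-means loop is assumed as the clustering algorithm, and the three starting centers are picked by hand so the run is repeatable.

```python
from math import dist


def kmeans(points, seeds, iterations=10):
    """Plain k-means: assign points to the nearest center, then re-average."""
    centers = list(seeds)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Hypothetical data: professor -> (statistics papers, computer science papers).
professors = {
    "A": (12, 1), "B": (10, 2), "C": (11, 0),    # mostly statistics
    "D": (1, 13), "E": (2, 10), "F": (0, 12),    # mostly computer science
    "G": (9, 10), "H": (11, 12), "I": (10, 9),   # both areas
}
names = list(professors)
points = list(professors.values())

# Hand-picked seeds (an assumption): one point from each visible group.
labels, centers = kmeans(points, seeds=[points[0], points[3], points[6]])

# The machine learning bucket is the cluster whose center is high on BOTH axes.
ml_cluster = max(range(len(centers)), key=lambda k: min(centers[k]))
candidates = [n for n, lab in zip(names, labels) if lab == ml_cluster]
print(candidates)
```

Picking the cluster with the largest smaller coordinate is one simple way to say "many papers in both areas"; other rules would work too.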

Thus we can assume that the professors in the third cluster are knowledgeable in machine learning concepts, because they can handle both computer science and statistics papers and have knowledge of both. So, we can identify that only those professors can handle the machine learning subject in the university. Next, let us see the different types of clustering techniques. There are three major techniques: the first is the partitioning method, the second is the hierarchical method, and the third is the density-based method. In these approaches, the way of creating the clusters, the way of measuring the quality of the clusters, and the applicability are entirely different. The three methods are completely different from each other.

The first one is the partitioning method. The partitioning method uses the k-means algorithm or the k-medoids algorithm, where the mean or the medoid represents the cluster center. Based on the cluster center, it measures the distance between each data point and the cluster center, which determines the size of the cluster and the number of data points inside it.

And this partitioning method adopts a distance-based approach to refine the clusters, that is, how far the new data point is from the center of the cluster. Accordingly, it constructs the clusters, and it finds mutually exclusive clusters of spherical or nearly spherical shape, that is, a shape like this. It is effective for data sets of small or medium size.
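The two kinds of cluster center, and the distance from a new data point to each center, can be sketched like this. The points and the helper names `mean_center`, `medoid_center`, and `nearest_center` are assumptions for illustration; Euclidean distance is assumed as the distance measure.

```python
from math import dist


def mean_center(points):
    """Mean: the coordinate-wise average (need not be an actual data point)."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def medoid_center(points):
    """Medoid: the actual data point with the smallest total distance to the rest."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


def nearest_center(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda k: dist(point, centers[k]))


# Hypothetical clusters; the last point of cluster_a is a far-away value.
cluster_a = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (9.0, 9.0)]
cluster_b = [(10.0, 0.0), (11.0, 1.0), (10.0, 1.0)]

print(mean_center(cluster_a))    # pulled toward the far-away point
print(medoid_center(cluster_a))  # stays on one of the tightly packed points

centers = [mean_center(cluster_a), mean_center(cluster_b)]
new_point = (3.0, 2.0)
print(nearest_center(new_point, centers))  # index of the cluster it would join
```

The mean moves toward the far-away point while the medoid stays on a real data point, which is the practical difference between the two choices of center.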

If the data size is small or medium, then this partitioning method will be very effective. The second one is the hierarchical method. It creates a hierarchical, or tree-like, structure through decomposition or merger. The entire data set is there, and it is either split, or the pieces are merged with each other, to create the hierarchical structure. It also uses the distance between the nearest or farthest points in neighboring clusters as a guideline for refinement; that is, depending on how near or far the data points are, it constructs the neighboring clusters. And the third point is that erroneous merges or splits cannot be corrected at subsequent levels.
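The merge direction of the hierarchical method can be sketched as below. This is one possible variant under assumptions: it builds the hierarchy bottom-up by merging, it uses the distance between the nearest points of two clusters as the guideline, and the points and the function names `single_linkage` and `agglomerate` are hypothetical.

```python
from math import dist


def single_linkage(c1, c2):
    """Distance between the nearest pair of points in two clusters."""
    return min(dist(a, b) for a in c1 for b in c2)


def agglomerate(points, target_clusters):
    """Start with one cluster per point and keep merging the two closest clusters."""
    clusters = [[p] for p in points]
    merges = []  # history of merges, lowest level of the tree first
    while len(clusters) > target_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        # A merge is final: later steps only build on it, they never undo it.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


# Hypothetical 2-D points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerate(points, target_clusters=3)
print(clusters)
```

Because each merge is recorded and never revisited, a bad merge at a low level stays in every level above it, which is the limitation mentioned in the lecture.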

So, an error cannot be corrected at the subsequent levels of the hierarchy, that is, from parent node to child cluster. The third one is the density-based method. This is useful for identifying arbitrarily shaped clusters. Density-based means there is no fixed shape for the clusters; the shape may be anything, and it differs based on the density of the data. The guiding principle of cluster creation is the identification of dense regions of objects, that is, the areas in which a larger number of objects are located.

Each such dense region forms a new cluster, and it is separated from the low-density regions.

For example, in the low-density regions only a very small number of data points are placed, so the method picks out only the dense regions, where a larger number of data points are located. It may also filter out outliers: if there is any outlier, the density-based method easily identifies it and omits it while creating the clusters.
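The density idea above can be sketched in the style of the DBSCAN algorithm, which the lecture does not name; it is assumed here as a representative density-based method. The parameters `eps` (neighborhood radius) and `min_pts` (how many points make a region dense), the sample points, and the function name `density_cluster` are all illustrative assumptions.

```python
from math import dist

NOISE = -1  # label used for outliers


def density_cluster(points, eps, min_pts):
    """Grow clusters outward from points whose eps-neighborhood is dense."""

    def neighbours(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # low-density point; may be claimed as a border later
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is itself dense, so keep expanding
                queue.extend(nbrs)
        cluster_id += 1
    return labels


# Hypothetical data: two dense regions and one isolated point.
points = [
    (1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
    (5.0, 5.0), (5.2, 5.1), (4.9, 5.3), (5.1, 4.8),
    (9.0, 1.0),
]
labels = density_cluster(points, eps=0.7, min_pts=3)
print(labels)
```

The isolated point ends up with the noise label instead of being forced into a cluster, which is how a density-based method filters out outliers.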

So, these are the three different clustering methods: the partitioning method, the hierarchical method, and the density-based method. Up to this point, we have seen clustering from unsupervised learning. In this class, we have seen the definition of clustering, the applications of clustering, and clustering as a machine learning task.

After that, we saw one example, identifying the professors who can handle the machine learning subject in a university, and then we saw the three different types of clustering: the partitioning method, the hierarchical method, and the density-based method. In the next class, we will see the partitioning method in detail. Thank you.