Transcript for:
Understanding Text Mining and Analytics

Here we give an overview of text mining and analytics. First, let's define the term text mining and the term text analytics. The title of this course is Text Mining and Analytics, but the two terms, text mining and text analytics, are actually roughly the same, so we are not really going to distinguish them, and we're going to use them interchangeably. The reason we have chosen to use both terms in the title is that there is also a subtle difference if you look at the two phrases literally. Mining emphasizes the process, so it gives us an algorithmic view of the problem. Analytics, on the other hand, emphasizes the result, or having a problem in mind: we are going to look at text data to help us solve a problem. But again, as I said, we can treat these two terms as roughly the same, and I think in the literature you will probably find the same, so we are not really going to distinguish them in the course.

Both text mining and text analytics mean that we want to turn text data into high-quality information or actionable knowledge. So in both cases we have the problem of dealing with a lot of text data, and we hope to turn this text data into something more useful to us than the raw text data. Here we distinguish two different results: one is high-quality information, the other is actionable knowledge. Sometimes the boundary between the two is not so clear, but I also want to say a little bit about these two different angles on the result of text mining.

In the case of high-quality information, we refer to more concise information about a topic, which might be much easier for humans to digest than the raw text data. For example, you might face a lot of reviews of a product. A more concise form of the information would be a very concise summary of the major opinions about the features of the product, positive about, let's say, the battery life of a laptop. Results of this kind are very useful in helping people digest text data, and so
this is to minimize the human effort in consuming text data, in some sense.

The other kind of output is actionable knowledge. Here we emphasize the utility of the information or knowledge we discover from text data. It's actionable knowledge for some decision problem or some actions to take. For example, we might be able to determine which product is more appealing to us, or a better choice for a shopping decision. Such an outcome could be called actionable knowledge, because a consumer can take the knowledge, make a decision, and act on it. So in this case text mining supplies knowledge for optimal decision making. But again, the two are not so clearly distinguishable, so we don't necessarily have to make a distinction.

Text mining is also related to text retrieval, which is an essential component of any text mining system. Text retrieval refers to finding relevant information in a large amount of text data. I've taught another, separate MOOC on text retrieval and search engines, where we discuss various techniques for text retrieval. If you have taken that MOOC, you will find some overlap, and it will be useful to know the background of text retrieval for understanding some of the topics in text mining. But if you have not taken that MOOC, that's also fine, because in this MOOC on text mining and analytics we're going to repeat some of the key concepts that are relevant for text mining.

At a high level, let me also explain the relation between text retrieval and text mining. Text retrieval is very useful for text mining in two ways. First, text retrieval can be a preprocessor for text mining, meaning that it can help us turn big text data into a relatively small amount of the most relevant text data, which is often what's needed for solving a particular problem. In this sense, text retrieval also helps minimize human effort. Second, text retrieval is also needed for knowledge provenance, and this roughly corresponds to the interpretation of text mining as turning
text data into actionable knowledge. Once we find patterns in text data, or actionable knowledge, we generally have to verify the knowledge by looking at the original text data. So users need some text retrieval support to go back to the original text data to interpret a pattern, to better understand the knowledge, or to verify whether the pattern is really reliable. So this is a high-level introduction to the concept of text mining and the relation between text mining and retrieval.

Next, let's talk about text data as a special kind of data. It's interesting to view text data as data generated by humans as subjective sensors. This slide shows an analogy between text data and non-text data, and between humans as subjective sensors and physical sensors, such as a network sensor or a thermometer. In general, a sensor monitors the real world in some way: it senses some signal from the real world and then reports the signal as data in various forms. For example, a thermometer watches the temperature of the real world and then reports the temperature in a particular format. Similarly, a geo sensor senses the location and then reports the location specification, for example in the form of a longitude value and a latitude value. A network sensor monitors network traffic or activities in the network and then reports the data in some digital format.

Similarly, we can think of humans as subjective sensors that observe the real world from some perspective, and then humans express what they have observed in the form of text data. In this sense, a human is actually a subjective sensor that also senses what's happening in the world and then expresses what's observed in the form of data, in this case text data. Looking at text data in this way has the advantage of being able to integrate all kinds of data together, and that's indeed needed in most data mining problems. So here we are looking at
the general problem of data mining. In general, we would be dealing with a lot of data about our world that is related to a problem, and in general we'll be dealing with both non-text data and text data. Of course, the non-text data is usually produced by physical sensors, and non-text data can also come in different formats: numerical data, categorical or relational data, or multimedia data like video or speech. Non-text data is often very important in some problems, but text data is also very important, mostly because it contains a lot of semantic content and often contains knowledge about users, especially their preferences and opinions. By treating text data as data observed from human sensors, we can treat all this data together in the same framework.

So the data mining problem is basically to turn all the data into actionable knowledge that we can take advantage of to change the real world, of course for the better. This means the data mining problem is basically taking a lot of data as input and giving actionable knowledge as output. Inside the data mining module, you can also see we have a number of different kinds of mining algorithms. This is because for different kinds of data we generally need different algorithms for mining the data. For example, video data might require computer vision to understand the video content, and that would facilitate more effective mining. We also have a lot of general algorithms that are applicable to all kinds of data, and those algorithms are of course very useful, although for a particular kind of data we generally also want to develop special algorithms. This course will cover specialized algorithms that are particularly useful for mining text data.
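
As a rough illustration of the two roles of text retrieval mentioned earlier in the lecture (a preprocessor that narrows big text data down to the relevant part, and support for going back to the original text to verify a pattern), here is a minimal Python sketch. It is not a method from the lecture: the review collection, the tiny opinion lexicon, and the function names retrieve and mine_opinions are all hypothetical, and the "mining" is only a crude document-level count of opinion words.

```python
from collections import Counter

# Hypothetical toy collection: document id -> review text.
REVIEWS = {
    "r1": "The battery life is great and the screen is sharp",
    "r2": "Terrible keyboard but great battery",
    "r3": "Shipping was slow",
    "r4": "Battery died quickly which is terrible",
}

# Hand-made opinion lexicon, an assumption made only for this illustration.
OPINION_WORDS = {
    "great": "positive",
    "sharp": "positive",
    "terrible": "negative",
    "slow": "negative",
}


def retrieve(docs, query):
    """Retrieval as a preprocessor: keep only documents containing the query word."""
    q = query.lower()
    return {doc_id: text for doc_id, text in docs.items()
            if q in text.lower().split()}


def mine_opinions(docs):
    """Toy mining step: count opinion words by polarity and record which
    documents support each polarity, so the result can be traced back."""
    counts = Counter()
    provenance = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            polarity = OPINION_WORDS.get(word)
            if polarity is None:
                continue
            counts[polarity] += 1
            supporting = provenance.setdefault(polarity, [])
            if doc_id not in supporting:
                supporting.append(doc_id)
    return counts, provenance


# Step 1: retrieval shrinks the collection to reviews that mention "battery".
relevant = retrieve(REVIEWS, "battery")

# Step 2: mining runs only on the smaller, relevant subset.
counts, provenance = mine_opinions(relevant)

print(dict(counts))
# Step 3: provenance lets a user go back and read the original reviews.
for polarity, doc_ids in provenance.items():
    for doc_id in doc_ids:
        print(polarity, doc_id, REVIEWS[doc_id])
```

The point of the sketch is only the shape of the pipeline: retrieval first reduces the data, the mining step produces a concise summary, and the stored document ids let a person check the summary against the raw text.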