Transcript for:
Module 3 - Lecture - Natural Language Processing 3: Text Classification

Many of the tasks within natural language processing involve some form of text classification. Whether it's sentiment analysis, topic analysis, or semantic analysis, what we're trying to do is look at a piece of text and classify it into one of a number of categories. So I'm going to take some time in this lecture to look at the basics of this task, and then I'm going to show you a couple of decision support systems that use text classification.

The basic problem of text classification, or auto-categorization as it's sometimes called in industry, is this: we build some kind of model that takes in a collection of documents and determines which of a set of non-overlapping categories, mutually exclusive and collectively exhaustive, each document belongs to. In the Costco example, the documents are online reviews: are they talking about clothing, are they talking about cameras, are they talking about home appliances? So that's the task, classification.

If you think about it, a lot of tasks are at heart text classification tasks. Think about determining the type of chatbot inquiry in order to select the appropriate answer. If somebody asks a question on a chatbot, the chatbot might say: okay, are they asking about a loan? Are they asking about opening a new account? That's a text classification problem. Or classifying social media posts to determine appropriate action: you might have a text classifier that says, is this somebody who is asking a question? Is this somebody who wants to see a manager? Is this somebody who is spreading a rumor? So you have a classification scheme, and you need some AI that's going to classify those posts so that you can take the appropriate action. You might categorize emails to sort them into
some kind of data store. There are a lot of emails, and a convenient tool would be something that can read through the content of an email and determine what it's talking about: okay, this one is asking about a contract, this one is about a human resources issue, and so on. So it might be a business-specific or even person-specific categorization scheme. And even sentiment analysis is at its heart a text classification task, because we are classifying the text as positive or negative.

So how would we classify? If we look at the naive sentiment analysis, we had a list of negative words and a list of positive words, and we just counted up how many of each there were in the text, and this gave us the answer we were looking for. If we extend this slightly, we could say: okay, if we want a news document classifier, we're going to have a list of words that are about sports, so team, coach, win, loss, game, and a list of words that have to do with business, so business, company, stock, sell, profit. We simply loop through the words in the document. Each time a word is in the sports list we add one to the sports score, and each time it's in the business list we add one to the business score. Then we figure out for which of these the document had the highest score. It might score five on business and only three on sports, so therefore it's about business. So this is another way of extending this idea of just counting words: starting off with a lexicon and using that as our classification scheme. This is one approach to text classification.

That's one where we just came up with a dictionary of terms a priori, based on instinct. Another way of doing this is to use statistics on the
presence of words. Let's say we have two sets of documents, one that is about dogs and one that is about cats. Now, rather than coming up with a list of terms about dogs beforehand, we simply do some analysis of these existing documents, and we can come up with some kind of statistical weight for the terms that are in there. So after we analyze this training set of documents, we might say: okay, every time you come across the word "dog" you should add 10. Not just adding one like we did in our naive case; we actually came up with a value for the word "dog". And "bark", add seven; "leash", add five; "bone", add five. If you come across the word "cat", subtract four. This will also help us classify cat documents, because they'll be large negatives: if you come across "scratch", minus five; "purr", minus eight; and "meow", minus ten.

Okay, so we've got our training documents about dogs and cats, and we come up with what's called a prevalence score scheme, and that means we can score any new document coming in. This particular document had a score of negative 27, therefore it is very likely in the cat category. This prevalence score uses not just individual words: you use one-word phrases, two-word phrases, and three-word phrases, so there might be three-word phrases that also have a score associated with them. This method of using a prevalence score has the beauty of being very simple and very interpretable, and we have used it in a lot of our own research. We've been able to use prevalence scores to differentiate documents that indicate product defects, and, well, in a number of other applications. So that's one way of doing it.

Another approach is our old friend machine learning. Rather than starting off by doing a statistical analysis of the words that are in here and coming up with the weights that way, we simply pass in the labels. Let's say we want to have a classifier
that will classify news items. We have our documents that are labeled as politics, sports, or business, and this is our training data. Notice we're not actually running any numbers ourselves; we're not doing any statistical analysis. All we're doing is saying, this is the document, this is the label, and we're going to use supervised machine learning and have the computer notice the pattern. Once we have our trained machine learning algorithm, then for any new documents that come in it'll just be able to output whatever label is most appropriate. The various algorithms have various levels of transparency and interpretability, but we do have a tool here that was trained without us having to figure out the statistics or the word distributions ourselves.

So, two examples of uses of text classification. One example is what we saw earlier when we were looking at sentiment analysis: the tool that generated this interface here, which did the competitive analysis and ranked all of our competitors in terms of sentiment on all these different categories. Well, this was actually a two-step process. First they had to take the unstructured documents and classify them on the basis of what they were talking about, and then within each of these categories the tool determined the sentiment. So the first step of this process was a classification task, and then sentiment was applied within the documents as classified. This led to the ability to create these non-overlapping rows and to have a sentiment assigned to each row. Now, this is very widely used in industry because it is fairly mature, it is fairly easy to understand, and it provides a very quick and easy way of understanding what customers are saying about which aspects of your offerings.

Another use case of text classification is one that demonstrates how you can use text classification to improve operations.
In this case, Florida State University's IT support wanted to improve, well, their IT support. So they analyzed 100,000 service tickets. These were written in free-text form, and they wanted to figure out: is there a way to determine what people are asking about? That way we can come up with a classification scheme and then track different problems over time. And they were able to do this. They came up with a classification scheme and determined that there were, I guess, six different topics that people tended to talk about, so password reset or data loss, for example. Once they were able to determine what the topics were, they were able to create a machine learning classifier to reliably classify any documents that were coming in. What this meant was that they could not only route tickets to the correct person but also look at trends over time.

So they were able to create this nice interface that showed you: okay, what is the basic problem, this is the prevalence of the problem over time, and these are all the subtasks, or the words that tended to appear underneath that particular problem. They're calling them topics here, but really they're just talking about a particular service request, and these were individual words that allowed them to show what was associated with it. The impact of this interface was to allow for better employee training, because they were able to spot trends in terms of what tended to go up and down at particular times of year, and they were able to use this as a way of allocating employees and doing employee training.
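
As a rough illustration of the first two approaches described in the lecture (counting lexicon words, and summing per-word prevalence weights), here is a minimal Python sketch. It is not the lecturer's implementation. The word lists and weights are the ones spoken in the lecture, but the function names, the tokenizer, the tie handling, and the "unknown" label for a score of zero are all assumptions made for this example. It also scores single words only, whereas the lecture notes that prevalence scores can cover two-word and three-word phrases as well.

```python
import re

# Word lists from the lecture's sports/business example.
LEXICONS = {
    "sports": {"team", "coach", "win", "loss", "game"},
    "business": {"business", "company", "stock", "sell", "profit"},
}

# Weights from the lecture's dog/cat example: positive values point
# toward "dog", negative values toward "cat".
PREVALENCE_WEIGHTS = {
    "dog": 10, "bark": 7, "leash": 5, "bone": 5,
    "cat": -4, "scratch": -5, "purr": -8, "meow": -10,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    This simple tokenizer is an assumption, not something the lecture specifies.
    """
    return re.findall(r"[a-z]+", text.lower())


def classify_by_lexicon(text, lexicons=LEXICONS):
    """Add one to a category's score for each word found in its list."""
    scores = {category: 0 for category in lexicons}
    for word in tokenize(text):
        for category, words in lexicons.items():
            if word in words:
                scores[category] += 1
    # Assumption: on a tie, the category listed first in the dict wins.
    best = max(scores, key=scores.get)
    return best, scores


def prevalence_score(text, weights=PREVALENCE_WEIGHTS):
    """Sum the weight of every known word; unknown words contribute zero."""
    return sum(weights.get(word, 0) for word in tokenize(text))


def classify_dog_or_cat(text):
    """Label by the sign of the prevalence score (zero is left undecided)."""
    score = prevalence_score(text)
    if score > 0:
        return "dog", score
    if score < 0:
        return "cat", score
    return "unknown", score


if __name__ == "__main__":
    print(classify_by_lexicon("The company said its stock rose after the team reported a profit"))
    print(classify_dog_or_cat("The cat will purr and meow, then scratch the door"))
```

In a real prevalence-score scheme the weights would be estimated from labeled training documents rather than typed in by hand. The hard-coded values here only mirror the numbers used in the lecture's example.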