Transcript for:
Module 3 - Lecture - Text Classification With Python 5

One of the fundamental subtasks within natural language processing is text classification, which is also sometimes called auto-categorization. This is the process where you take some piece of text (we call it a document, but it could be something as short as a phrase) and put it into categories. This picture summarizes what we're trying to do: after we build some kind of software tool, we'd like to be able to feed it documents and have it automatically put each one into one of several non-overlapping slots. In this picture, for example, we have some online product reviews, and we're automatically taking the text and deciding what the main focus of each document is.

There are a number of use cases where text classification is a fundamental part of a natural language processing implementation. For example, if you have a chatbot that is going to handle inquiries from customers, you might decide there are ten different kinds of inquiries, and you need to determine from the text of an inquiry (whether it's captured through speech-to-text or written down) which of those ten things the customer is actually asking about. You might also want to do voice-of-the-customer or social media monitoring, where you go through a batch of social media posts and put them into categories based on what kind of action you want to take. And you might want to categorize emails automatically; in fact, I think that was one of the first applications: separating the email you might actually want to read from the spam.

So what I'm going to do in this video is walk
you through a fairly simple text classification program in Python. The actual code file is on D2L under the current module: a Python file called auto-categorization. If you'd like to download it and follow along, that's fine; if you'd prefer to just watch and see what goes into making a classifier, I think that's valuable too, even for those of you who don't have much interest in the technical details, because it shows how low the barrier to entry is for a basic text classifier. Classifiers get much more complicated and much more powerful than what I'm going to build here today, but it's fairly painless to get in on the ground level.

If you are following along, the prerequisites for getting this running on your own computer are: Python installed, along with PyCharm (which I think some of you already have); a library called NLTK, the Natural Language Toolkit, which provides tools for working with text; and finally scikit-learn, which is a Python machine learning library. So yes, we are going to be using machine learning to do our natural language processing.

Let's go ahead and dive in. First of all, the data set I'm working with is a couple of thousand documents that came from BBC News; I downloaded this data set, which is publicly available. In this folder I've got five different subfolders, and within each folder are a few hundred text documents. If I double-click on one, I see "Industrial revival hope for Japan", so these are about business; we can just browse through and eyeball some of the content.
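If you're setting up those prerequisites yourself, the installs would look roughly like this. This is a sketch: the package names are the standard PyPI ones, and the two NLTK downloads are the tokenizer models and stop word lists that a classifier like this typically relies on.

```shell
pip install nltk scikit-learn

# NLTK ships its tokenizer models and stop word lists separately;
# download them once before running the classifier script
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
```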
And then we can go over to the tech folder and see that we've got a bunch of text documents in there too. This is going to be our data set, the one we use to train our machine learning algorithm. This is supervised machine learning, which means I've got a bunch of training examples, and the folder name is the label. The idea is that the program will look at all of these documents and then, given a new document, determine which of these categories it best belongs in. That's the basic overview of the task.

So let's take a look at this file, and I'm going to bring in pieces of it one at a time. This first part is just something you put in a Python file to mark the code that is automatically executed when you run the file. The first thing I want to do is take this tree of files and convert it into something I can browse a little more easily. Having 2,200 files spread around a bunch of directories is fine, but you might want to pull them into a single spreadsheet, so I made a little function which I call create_dataset; let me pull it over here. What it does is open an output file called data.txt, go through each of the labels, go into the corresponding subdirectory, and open each of the files in there. The data.txt file that results is just a big spreadsheet that includes the name of the directory as the label and the text of the document as the data. Now I see I have a couple of variables that I need (that's what the red underline means): the labels, which are the different categories, and my base directory, which
is defined right there at the top; let me bring this back over. So again, what it's doing is saying: for each of the labels in this list, there's a subdirectory with that name; open each file in there and write out the name of that directory as the label, then in the next column the name of the file, and in the third column the text of the document. After I run this create_dataset function I end up with a document that looks like this: I've got the label, then the file name (just for reference, we're not really going to use it), and then this column has the actual text.

I did a quick pivot table to count the number of documents I have of each category: 510 business, 386 entertainment, 417 politics, 511 sport, and 401 tech. This is what I would call a fairly balanced data set. If you have something like 2,000 business documents but only 50 in sport, that tends to cause problems when you're trying to build a classifier. This one is fairly balanced, so we're not going to worry about it; we won't do any resampling, oversampling, or subsampling. We'll just proceed with a nice balanced data set.

The next thing I need is some way of taking these documents out of the spreadsheet and pulling them into memory for Python, and a convenient way to store them in memory is what's called a tuple. Remember that in Python we use the hash sign for a comment, so this line is not executable; it just shows the shape of the data. With parentheses you can store something like a record with multiple pieces separated by commas. We would like to have a label and then the text, and then we're going to have a list of these: there will be one document in the list, then another, and so on.
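A minimal sketch of what a create_dataset function like the one described above might look like. The tab-separated layout (label, filename, text) and the data.txt name follow the lecture's description; the exact code in the D2L file may differ, and the labels list mirrors the five BBC folder names.

```python
import os

LABELS = ["business", "entertainment", "politics", "sport", "tech"]

def create_data_set(base_dir, labels=LABELS, out_path="data.txt"):
    """Flatten a tree of <base_dir>/<label>/*.txt files into one file
    with a row per document: label <TAB> filename <TAB> document text."""
    with open(out_path, "w", encoding="utf8") as outfile:
        for label in labels:
            dir_path = os.path.join(base_dir, label)
            for filename in sorted(os.listdir(dir_path)):
                with open(os.path.join(dir_path, filename),
                          encoding="utf8", errors="ignore") as f:
                    # keep one document per row: newlines or tabs inside
                    # a document would break the spreadsheet-style layout
                    text = f.read().replace("\n", " ").replace("\t", " ")
                outfile.write(f"{label}\t{filename}\t{text}\n")
```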
So we're just going to have a list of tuples, and each tuple consists of the label (business, entertainment, whatever) and then the actual text. We're going to read that data.txt file into this data structure so that it's accessible in memory whenever we want to work with it. This is what the function I call setup_docs is doing; let me pull it over. It creates a list, opens up that data.txt file, goes through it row by row, and splits each row out into a tuple. These parentheses demarcate the tuple: the first part is the label, which is in the first column, and the second part is the text, which is in the third column, with the newline stripped off the end. Then it appends the tuple to the list. After we do this, we have a list of documents consisting of these (label, text) tuples, so anything we do from now on just needs to be passed a reference to that list of documents.

One thing I might want to know right off the bat is what kind of word distributions we have. My guess is that the categories will have different occurrences of a word like "market": it might appear a lot more often in documents about business than in documents about entertainment. A convenient place to start is to just look at these counts, so let's do that. I've got another function here called print_frequency_distribution (let's comment this other part out for now). It's going to go through the documents and count all of the words, and for each label it's going to give me a running count of how many times each word occurs in documents with that label.
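The setup_docs step described above might be sketched like this; it assumes the tab-separated data.txt layout from the create_dataset step (column 0 the label, column 2 the text), and the real file on D2L may differ in details.

```python
def setup_docs(data_path="data.txt"):
    """Read the flattened data file back into a list of (label, text)
    tuples; column 0 is the label, column 2 the document text."""
    docs = []
    with open(data_path, encoding="utf8") as datafile:
        for row in datafile:
            parts = row.split("\t")
            # strip() removes the trailing newline from the text column
            docs.append((parts[0], parts[2].strip()))
    return docs
```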
Basically, it's going to take all of the business documents, put them together, count how many times each word appears in there, and then print out the top 20 in each category. Let me first do it in a way that's not very nice. I'm going to iterate through that docs data structure, getting the label from the first element of each tuple and the text from the second. Then I need to get all of the individual words, which basically means splitting the string on whitespace down to individual words; that's called tokenizing, and word_tokenize is a function that comes from NLTK (right there: from nltk import word_tokenize). So I'm going to tokenize the text of each document and put the tokens into a big dictionary that maps from the label of the category to a big list of tokens. Then, for each of those category labels, I print out the 20 most common tokens.

Let's see what this looks like; you can probably guess that it's not going to be pretty at first. It has to go through 2,200 documents and count up all the words. Here's what it prints out: there's the category, and each of these tuples consists of a word and the number of occurrences of that word. You see that "the" is the most frequently occurring word in every category, whether it's tech, sport, politics, entertainment, or business, and the comma and the period are the next most frequently occurring so-called words. Clearly this is not going to be very useful, so first of all let's get rid of the punctuation. I have a function here called clean_text.
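The counting step above might be sketched as follows. One deliberate substitution: the lecture tokenizes with NLTK's word_tokenize, but this sketch uses a plain split() so it has no external dependencies, which is a looser notion of "word".

```python
from collections import Counter, defaultdict

def frequency_dist(docs):
    """Pool the tokens of all documents sharing a label and return a
    Counter of word occurrences per label."""
    counts = defaultdict(Counter)
    for label, text in docs:
        counts[label].update(text.split())
    return counts

def print_frequency_dist(docs, top_n=20):
    # print the top_n most common words for each category label
    for label, counter in frequency_dist(docs).items():
        print(label, counter.most_common(top_n))
```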
I'll just paste it in here. So now, instead of just taking the document text, I call clean_text on it. What this line does is replace anything that falls under punctuation with nothing, so it strips out all punctuation, and then, just to normalize everything, it converts everything to lowercase. Now my document text is nice and clean: no punctuation, everything in lowercase.

Let's see how far this gets us. The punctuation is gone, but I still have a bunch of these useless words up at the top. Those aren't going to be very useful for differentiating the classes, because they're all just junk filler words. These words that appear so often in every single category are what we call stop words, and we don't want to include stop words in any kind of classification we're doing. NLTK provides a list of stop words in English that we can simply choose not to include; basically, these words were found to appear so often that they don't have much power for differentiation, so we're going to strip those out. And because I noticed that the word "said" appears an awful lot in news reporting, and "mr" also appears an awful lot, we're just going to add those two to the stop word list.
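A clean_text function matching that description (strip all punctuation, then lowercase) might look like this; Python's string.punctuation provides the punctuation set, though the lecture's file may implement it differently.

```python
import string

def clean_text(text):
    # replace every punctuation character with nothing ...
    text = text.translate(str.maketrans("", "", string.punctuation))
    # ... then normalize everything to lowercase
    return text.lower()
```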
So now, instead of getting the doc tokens the way we did before, I have a function called get_tokens; I'm going to copy it over. It does the same tokenization, but then this line, which is called a list comprehension, says: only include a token in this list if it is not a stop word (a token is basically synonymous with a word in this context). So by cleaning the text we got rid of punctuation and case differences, and by using get_tokens we get rid of all the junk words, leaving only meaningful tokens. Maybe now we're getting close to words that will actually differ across the class labels.

And I see that it's starting to look better. In business there's "us", "year", "would", "also", and then "market", "growth", "company", "economy", "firm"; we see that these appear hundreds of times across these documents. We have a very different set of words in politics: "government", "labour" (this is a British data set), "people", "election", "Blair", "party", "minister". In tech we've got "technology", "mobile", "music", "games", and in entertainment "film" and "music". So we do have different word distributions, and these are going to be key in helping us classify.

We printed this frequency distribution really just to eyeball the data set and tell whether there are words that will help us differentiate. Now that we know there are, we know we can build a classifier. At this point, if we liked, we could just go the manual route: we could say, if "government" is in the document, then add one to a politics score, and score each new document based on whether it contains these words. That would be a very basic way of doing it.
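A get_tokens sketch in the spirit described above. The lecture builds its stop list from nltk.corpus.stopwords.words("english") plus the two domain-specific additions "said" and "mr"; the small hand-rolled set here is an assumption that keeps the sketch self-contained.

```python
# Stand-in for NLTK's English stop word list plus the lecture's two
# additions ("said", "mr"); swap in the real NLTK list in practice.
STOP_WORDS = set(
    "the a an and of to in for on is was it that said mr".split()
)

def get_tokens(text):
    tokens = text.split()
    # list comprehension: keep only tokens that are not stop words
    return [t for t in tokens if t not in STOP_WORDS]
```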
But it would be nicer, and certainly more dynamic, to use a machine learning algorithm, so that's what we're going to do. Remember that with supervised machine learning you provide labeled examples and it learns how to classify new data. That consists of two steps: we have to train it, and also test it along the way. Then, once you've trained it and determined that it works to a sufficient degree, you can use it to classify new data. So we want to be able to pass those documents to something that trains a classifier, and then we'd like to do something like this: here is a new document (I pasted this out of today's news; it's something about Google, so obviously it should classify as tech), so train the classifier on the documents, and then, given the new document, classify it. These two steps are what you'd run in production deployment: some new customer query comes in, you run the trained classifier, and you classify it. We can even mark this part of the code as the deployment step. But we don't have that yet; first we need to train the classifier.

This gets a little bit complicated, but let me put the function there, and then I'll add what we need little by little. The first line of my train_classifier function (it takes a reference to those documents again) reflects the fact that we want to split the data into a set that we train on and a set that we test on. It's kind of cheating to train and test on the same set, because you
want to see how well your classifier works on unseen, held-out data. So we're going to take those documents and split them into training and testing sets, and I have a function here called get_splits which does that; let me bring it over. Don't worry if you're not following the complete syntax here; thankfully, Python is a fairly readable language, so you can mostly see what's happening. This get_splits function takes a reference to those documents and shuffles them, because we want a random order: we don't want the order to give the classifier some kind of clue, and we don't want 80% of our documents coming from the first categories so that we never see the last ones. So we shuffle them up.

By convention, the X's are your data (the text) and the y's are your labels. X_train is a list of training documents, and y_train holds the corresponding training labels: if document number one in X_train is a business document, then the first entry of y_train should say "business". X_test and y_test are the corresponding test lists. The training lists hold 80% of our data and the test lists hold 20%. There are shorter ways to do this, but I find this way very readable: I establish the dividing point, the pivot, at 80% of the length of the documents, then go through and add the documents to the X and y lists accordingly. The function returns those training and testing splits, and that's what that first line of train_classifier is doing.

Now we have to get into how to represent a text document in a way that has meaning to a computer.
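The get_splits step described above (shuffle, then an 80/20 pivot into text lists and label lists) might be sketched like this; the exact code in the lecture's file may differ.

```python
import random

def get_splits(docs):
    """Shuffle, then split 80/20: X lists hold document text, y lists
    hold the corresponding labels, index for index."""
    random.shuffle(docs)  # randomize order so no split is category-biased
    X_train, y_train, X_test, y_test = [], [], [], []
    pivot = int(0.80 * len(docs))  # dividing point between train and test
    for label, text in docs[:pivot]:
        X_train.append(text)
        y_train.append(label)
    for label, text in docs[pivot:]:
        X_test.append(text)
        y_test.append(label)
    return X_train, X_test, y_train, y_test
```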
We have to convert something unstructured into a vector of numbers, and here is the most sensible way to do that. Each document will be a vector, and a vector is just a one-dimensional matrix. What we'd like is something like this: document one has some number here, some number there, so each document is really just a row of numbers. And what are these numbers? A sensible choice is counts of words. Say the columns are the words "car", "tv", "film", and "sport": each document then becomes a vector of counts under this scheme. This document has the word "car" twice, "tv" once, "film" zero times, and "sport" zero times; this other document has "car" once, "tv" zero, "film" zero, and "sport" three times. So each document is now a vector of numbers corresponding to counts of the words it contains. That conversion is done by something called a vectorizer; this line makes a vectorizer, and this one actually builds that matrix (don't worry about the exact syntax).

Once I have my training documents converted into these numerical vectors, I can call this single line right here, which is the function that makes a naive Bayes classifier. You can Google what that is, but basically it uses Bayes' rule to weigh evidence about which class a document belongs to. I pass in the matrix of document vectors and the corresponding training labels, and this line is what actually creates my classifier.

Now that I have this classifier, I want to know how well it does, so I have a function here called evaluate_classifier. It's going to evaluate the classifier and print out a title along with the classification metrics that we talked about.
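The vectorize-and-train step might look roughly like this with scikit-learn, which the lecture names. CountVectorizer and MultinomialNB are the standard scikit-learn classes for a count-based naive Bayes text classifier; the evaluation helper and its weighted averaging are my assumptions about how the metrics get computed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

def train_classifier(X_train, y_train):
    # learn the vocabulary and turn each document into a count vector
    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(X_train)  # document-term matrix
    # fit a multinomial naive Bayes model on the counts and labels
    naive_bayes = MultinomialNB().fit(dtm, y_train)
    return vectorizer, naive_bayes

def evaluate_classifier(title, vectorizer, classifier, X, y):
    y_pred = classifier.predict(vectorizer.transform(X))
    p, r, f1, _ = precision_recall_fscore_support(
        y, y_pred, average="weighted", zero_division=0)
    print(f"{title}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```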
Those metrics are precision, recall, and F1: it prints out the title and then those three numbers. So after we create the classifier, we're going to do two things: first call evaluate_classifier on the training data, and then call evaluate_classifier on the test data, and obviously it's going to do better on the training data than on the test data. Let's run that part first (I'll comment out these last lines). Here are the metrics for the naive Bayes classifier: on the training data, 99% precision, 99% recall, 99% F1, so pretty good; and even on the test data it's 97% precision, 97% recall, 97% F1. This is a really good classifier, so I've decided it's good enough and I want to keep it as it is.

Now, I don't want to train the classifier every time I need to classify something. Once it understands how to classify, we want something we can pull in and run in a line or two, so we train only once and then store the result. These lines down here create a file in the file system that stores that classifier. There's one other thing: the vectorizer that we used to convert a document into its vector representation also has to be stored, because when a new piece of text comes in we need to turn it into the same kind of vector, with the same vocabulary, as before. So we're going to store those two things. Python uses the word "pickle" to describe an object that is serialized to the file system, so we'll call train_classifier one more time, and this time it will train the classifier and pickle both the classifier and the vectorizer. And there they are in the file system: the count vectorizer and the naive Bayes classifier.
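The store-and-reload step might be sketched with Python's pickle module like this. The helper names and the .pkl filenames are illustrative, not the lecture's actual ones; the point is that both the fitted vectorizer and the trained classifier get serialized.

```python
import pickle

def store_model(vectorizer, classifier,
                vec_path="count_vectorizer.pkl",
                clf_path="naive_bayes_classifier.pkl"):
    """Serialize both the fitted vectorizer and the trained classifier;
    both are needed again at classification time."""
    with open(vec_path, "wb") as f:
        pickle.dump(vectorizer, f)
    with open(clf_path, "wb") as f:
        pickle.dump(classifier, f)

def load_model(vec_path="count_vectorizer.pkl",
               clf_path="naive_bayes_classifier.pkl"):
    with open(vec_path, "rb") as f:
        vectorizer = pickle.load(f)
    with open(clf_path, "rb") as f:
        classifier = pickle.load(f)
    return vectorizer, classifier
```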
Those are both stored in the file system now, so I don't need to train the classifier anymore; I'm ready to just classify new data as it comes in. I've got this example here, the text I pasted earlier, and now I want to be able to say: classify it. Let me get the classify function. It takes the text as a parameter, opens up the classifier that's stored in the file system, opens up the vectorizer that's stored in the file system, and then this part right here uses the vectorizer to create the word vector from the text, makes a prediction, and prints out the prediction.

So now I don't actually need those docs anymore either, because they were just for training. I can have this two-line program that takes whatever text (say, in the original chatbot use case, the text as the customer typed it in) and classifies it: it opens the trained classifier, puts the text into one of the categories, and prints out that category. Let's see how it does with that one document. It says it's tech, and that makes sense, because it is tech; it probably has a lot of the words that are distinctive for tech. Let's try something from entertainment news; let's just grab something at random. "Scandals, engagements and divorces", that should be interesting... that's taking too long. I'll just get some other text to see if it can recognize an entertainment document; something is slow about TMZ, or maybe it's just my network. Okay, here's what I'll do: I have something that I pasted earlier, when my network wasn't quite so slow.
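The deployment-time classify function described above might look roughly like this; the .pkl filenames are illustrative placeholders for wherever the training step stored the two objects.

```python
import pickle

def classify(text,
             vec_path="count_vectorizer.pkl",
             clf_path="naive_bayes_classifier.pkl"):
    """Load the stored vectorizer and classifier, turn the new text into
    the same kind of count vector, and predict its category."""
    with open(vec_path, "rb") as f:
        vectorizer = pickle.load(f)
    with open(clf_path, "rb") as f:
        classifier = pickle.load(f)
    prediction = classifier.predict(vectorizer.transform([text]))
    print(prediction[0])
    return prediction[0]
```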
It's about Scarlett Johansson starring in a biopic, so we'll use this as our test document, and what we want to see is that it gets classified as entertainment. And it is. So that's the whole process. We had to do some pre-processing, but basically we just opened up those text documents and trained a machine learning classifier to put them into different categories. Python is free, NLTK is free, scikit-learn is free, so for a software expenditure of zero and what amounts to a couple of hours of work, we have something that does a pretty good job of auto-categorizing. Now, news, because of the way it's categorized, tends to be fairly easy. It might be that in a particular domain it's harder to differentiate between the classes, so you might only get sixty or seventy percent precision and recall, but that might still be sufficient for your classification needs. And once the classifier is trained, you can deploy it with something very simple like that nice short script.

Now, this is the entry level. There are a number of different classifiers available; naive Bayes works pretty well with text, and several more come right out of the box. There are also different vectorizers. This vectorizer, as I said, represents each document as counts of the words in the vocabulary, but other vectorizers can put different things in those cells. There's one called TF-IDF, which includes not just the count of the word but also a measure of how rare the word is across the whole corpus: if something has a high TF-IDF, it occurs frequently in this
document but not quite so frequently in other documents, so it's a highly differentiating term. In the case of these category classifications, the count vectorizer works just fine; you can also use a TF-IDF vectorizer, but you don't get any more accuracy with it here than you do with the count vectorizer.

All right, I hope that gives you some insight into text categorization and the kinds of things you can do. Go ahead and ask me any questions that you have, start with this Python file on D2L and play around with it, and I'm happy to help anybody who would like to get it working and maybe runs into some bugs. Enjoy, and we'll talk about some more natural language processing topics.
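As a footnote to the TF-IDF remark above: in scikit-learn, swapping vectorizers is a one-class change, since TfidfVectorizer has the same fit_transform/transform interface as CountVectorizer. A sketch, using the same naive Bayes model as before:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_tfidf_classifier(X_train, y_train):
    # identical pipeline to the count-based version; only the
    # vectorizer changes: counts weighted by inverse document frequency
    vectorizer = TfidfVectorizer()
    dtm = vectorizer.fit_transform(X_train)
    return vectorizer, MultinomialNB().fit(dtm, y_train)
```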