Transcript for:
Text Processing Basics and Challenges

Hello everyone, welcome back to the final lecture of the first week. In the last lecture we were discussing various empirical laws, in particular Zipf's law and Heaps' law: how the words in the vocabulary are distributed in the corpus. We saw that the distribution is not very uniform; there are certain words that are very, very common. Roughly a hundred words in the vocabulary make up 50 percent of the corpus, by which I mean 50 percent of the tokens, and on the other hand there are 50 percent of the words in the vocabulary that occur only once. We also discussed the various relationships between the vocabulary size and the number of tokens observed in a corpus, and how they grow with respect to each other, and Zipf's law gave us a relation between the frequency and the rank of a word.

Today we will start with basic text processing in language. We will cover the basic concepts and the challenges one might face while doing the processing.

We'll start with the problem of tokenization. As the name suggests, a token is an individual word in my corpus. When I am pre-processing text in any language, what I face is a string of characters, a sequence of characters, and I need to identify all the different words in this sequence. Word tokenization is the process by which I convert the string of characters into a sequence of words: I am trying to segment it into the various words I observe.

Before going into tokenization, I will talk about a slightly different problem: sentence segmentation. You may or may not have to do this; it depends on your application. For example, suppose you are classifying a whole document into certain classes; you might not have to go down to individual sentences, and you can just work with the various words present in the document. On the other hand, suppose you are trying to find the important sentences in a document; in that application you will have to go to the individual sentences, and the first task you will face is: how do I segment this whole document into a sequence of sentences, sentence 1, sentence 2, and so on? This task is called sentence segmentation.

You might feel this is a very trivial task, but let us see whether it is. Sentence segmentation is the problem of deciding where a sentence begins and ends, so that I have a complete unit of words that I call a sentence. Do you think there might be challenges involved? Suppose I am working with English: can I always say that wherever I have a dot, it is the end of a sentence? There are many ways in which a sentence can end. I can have an exclamation mark or a question mark, and these are mostly unambiguous: whenever I see one of them, I can say this is the end of the sentence. But is the same true of a dot? Can you think of a scenario where I have a dot in English but it is not the end of a sentence? You can find all sorts of abbreviations that end with a period, like Dr., Mr., m.p.h.
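As a quick illustration of this ambiguity, here is a minimal Python sketch (the example sentence is invented) showing that naively splitting on sentence-final punctuation breaks on exactly these abbreviations:

```python
# Naive sentence segmentation: split whenever '.', '!' or '?' is followed
# by whitespace.
import re

text = "Dr. Smith lives in the U.S.A. He drives at 60 m.p.h. on highways."

naive_sentences = re.split(r'(?<=[.!?])\s+', text)
for s in naive_sentences:
    print(repr(s))
# The dots in 'Dr.', 'U.S.A.' and 'm.p.h.' are wrongly treated as sentence
# boundaries, so we get four "sentences" instead of the actual two; a dot
# alone is not a reliable end-of-sentence marker.
```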
So you have a dot in each of these, and you cannot call each of them the end of a sentence. Again, you have numbers like 2.4, 4.3, and so on. That means the problem of deciding whether a particular dot is the end of a sentence is not entirely trivial, and I need to build some algorithm for deciding it. In text processing we face this kind of problem in nearly every task we do: even something that looks trivial raises the question, can I always call a dot the end of a sentence?

So how do we go about solving this? If you think about it, whenever I see a dot, a question mark or an exclamation mark, I always have to decide one of two things: is it the end of the sentence, or is it not? Every data point I see has to be assigned to one of these two classes, end-of-sentence or not-end-of-sentence, and this problem in general is called a classification problem: you are classifying into one of two classes. The idea is very simple: you have two classes, each data point has to be assigned to one of them, and that means you have to build some rule or algorithm for doing so. In this case I have to build a binary classifier. What do I mean by a binary classifier? There are two classes, end-of-sentence or not-end-of-sentence; in general there can be multiple classes. For each dot, or in general for every word, I need to decide whether it is the end of the sentence or not. The classifier I build can be a set of rules that I write by hand, some simple if-then-else rules; or it can be regular expressions, where I say that if an example matches a particular regular expression it belongs to one class, and otherwise to the other; or I can build a machine learning classifier.

In this particular scenario, what is the simplest thing to do? Let us see if we can build a simple rule-based classifier, starting with an example of a simple decision tree. By decision tree I mean a set of if-then-else statements. I am at a particular word and I want to decide whether this is the end of the sentence or not. The first thing I check is: are there lots of blank lines after me? This would happen in a text at the end of a paragraph. If there are a lot of blank lines after this word, I can say with good confidence that this is the end of the sentence; that is why that branch says yes, end of sentence. Suppose there are not a lot of blank lines; then I check whether the final punctuation is a question mark or exclamation mark or equivalent; these are quite unambiguous, so in that case I say it is the end of the sentence. Now suppose it is not; then I check whether the final punctuation is a period. If it is not a period, it is easy to say this is not the end of the sentence. But suppose it is a period; again I cannot say for certain that it is the end of the sentence, so I check further. For simplicity, I might keep a list of abbreviations and check whether the word I am currently facing is one of the abbreviations in my list.
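A minimal sketch of this decision tree in Python follows, including the final abbreviation check described next; the toy abbreviation list and the blank-line threshold are my own illustrative assumptions, not rules given in the lecture:

```python
# Hand-written decision tree for end-of-sentence detection (a sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "m.p.h.", "u.s.a."}  # toy list

def is_end_of_sentence(word, blank_lines_after=0):
    """Decide whether `word` ends a sentence, following the lecture's tree."""
    if blank_lines_after >= 2:        # lots of blank lines: paragraph end
        return True
    if word.endswith(("?", "!")):     # mostly unambiguous terminators
        return True
    if not word.endswith("."):        # no final period: not a boundary
        return False
    # Final punctuation is a period: fall back to the abbreviation list.
    return word.lower() not in ABBREVIATIONS

print(is_end_of_sentence("highways."))  # True
print(is_end_of_sentence("Dr."))        # False: a known abbreviation
```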
If the word is in the list, say it is etc. or some other abbreviation, then the answer is yes and I say this is not the end of the sentence; if the answer is no, that is, the word is not an abbreviation, then this is the end of the sentence. These are very, very simple if-then-else rules. They may not always be correct, but this is one particular way in which the problem can be solved.

In general you might want to use other sorts of indications; we call them features. They are various observations you make from your corpus. What are some examples? Suppose I see a word ending with a dot: can I use as a feature whether the word is in upper case, lower case, all caps, or is a number? How will that help? Let us say my word is 4.3. I am at the dot and want to find out if it is the end of the sentence. If I can see that the current word is a number, there is a high probability that the dot is part of the number and not the end of the sentence, so this can be used as a feature. By a feature you can think of a simple observation, such as whether the word I am currently at is a number. I can also use the case of the word containing the dot. What happens generally with abbreviations is that they mostly start in upper case; so if I have Dr. starting with an upper case letter, I can say this might be an abbreviation, while lower case gives me more probability that it is not an abbreviation. Similarly, I can use the case of the word after the dot: is it upper case, lower case, all caps, or a number? Whenever I have the end of a sentence, the next word generally starts with a capital, so this can be used as well.

What can be some other features? I can have numerical features, for which I will need certain thresholds: for instance, the length of the word ending with the dot. If the length is small, it might be an abbreviation; if it is larger, it might not be. I can also use probabilities: what is the probability that the word ending with the dot occurs at the end of a sentence? If it really ends sentences, then in a large corpus it will end sentences quite often. The same for the next word after the dot: what is the probability that it occurs at the start of a sentence in a large corpus? You can use any of these features to decide, given a word, whether it is at the end of a sentence or not.

Now suppose I ask: do you have the same problem in other languages, like Hindi? In Hindi you will see that in general there is only the danda that indicates the end of a sentence, and it is not used for any other purpose. So this problem is language-dependent: it exists for English but not so much for Hindi. But there are other problems that do not exist for English and are there for other Indian languages; we will see some examples later in this same lecture.

Now, how do we implement this practically? As you have seen, this is a simple if-then-else arrangement, and what is important is choosing the correct set of features. How do you choose them? You look at your data and ask: what are some observations that can separate my two classes, end-of-sentence and not-end-of-sentence? The observations we had, such as whether the word is an abbreviation and whether the case of the word before the dot is upper or lower, are the features: one value might indicate one class, the other value the other class.
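To make this concrete, here is a small sketch of encoding such features and learning a classifier with scikit-learn; the feature set and the tiny labelled examples are invented for illustration:

```python
# Sketch: turn each candidate boundary into a feature vector, then learn
# a decision tree from labelled examples instead of writing rules by hand.
from sklearn.tree import DecisionTreeClassifier

def features(word, next_word):
    """Features discussed above, encoded numerically."""
    return [
        word[:1].isupper(),                    # case of the word with the dot
        word[:-1].replace(".", "").isdigit(),  # is it a number like 4.3?
        len(word),                             # length of the dotted word
        next_word[:1].isupper(),               # case of the following word
    ]

# Toy labelled data: (word ending with '.', next word) -> end of sentence?
train = [(("Dr.", "Smith"), 0), (("4.3.", "mph"), 0),
         (("highways.", "He"), 1), (("done.", "Then"), 1)]
X = [features(w, nxt) for (w, nxt), _ in train]
y = [label for _, label in train]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([features("Mr.", "Das")]))   # likely not end of sentence
```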
Whenever I use a numerical feature, like the length of the word before the dot, I need to pick some threshold: whether the length is between two and three, or more than three, or between five and seven, and so on. My tree can then be: if the length of the word is between five and seven I go to one class, otherwise to another.

Now, here is one problem: suppose I keep increasing my features, which can be both numerical and non-numerical; it might become difficult to set up the if-then-else rules by hand. In that scenario I can use a machine learning technique to learn the decision tree. In the literature there are many algorithms available that, given data and a set of features, will construct a decision tree for you. I will just give you the names of some of them. The basic idea they work on is that at every point you have to choose a particular split: you choose a feature value that splits your data into parts, and you have certain criteria for finding the best split. One possible criterion is the information gain of the split. The algorithms mentioned here, ID3, C4.5 and CART, all use one of these criteria. In general, once you have identified the interesting features for the task, you are not limited to one classifier like a decision tree; you can also try other classifiers like support vector machines, logistic regression, and neural networks. All of these are quite popular classifiers for various NLP applications, and we will talk about some of them as we get to advanced topics in this course.

Now, coming back to the problem of tokenization. We said tokenization is the process of segmenting a string of characters into words, finding out what the different words in the string are. Remember the token-type distinction we talked about. Suppose I give you a simple sentence: "I have a can opener; but I can't open these cans." How many tokens are there? If you count, there are 11 different occurrences of words, so you have 11 word tokens. But how many unique words are there? You will find only 10 unique words. Which word repeats? The word I occurs twice. So there are 10 types and 11 tokens, and the tokenization task is to find each of the 11 word tokens in the sentence.

In practice, at least for English, you can use toolkits that are available, like NLTK in Python or CoreNLP in Java, and you can also use some Unix commands. In this course we will mainly use the NLTK toolkit for these pre-processing tasks and some other tasks as well, but in general you can use any of these three possibilities. For English, most of the problems we will see are taken care of by the tokenizers just mentioned, but it is still good to know the challenges involved when you try to design a tokenization algorithm.
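As a quick illustration of the toolkit route, a minimal NLTK session might look like this (assuming the punkt tokenizer models are installed; note the token counts come out differently from the hand count because NLTK also emits punctuation as tokens):

```python
# Sketch: sentence segmentation and word tokenization with NLTK.
import nltk
nltk.download("punkt", quiet=True)   # one-time download of the tokenizer models

text = "I have a can opener; but I can't open these cans."
print(nltk.sent_tokenize(text))      # sentence segmentation
tokens = nltk.word_tokenize(text)
print(tokens)                        # "can't" is split into "ca" and "n't";
                                     # ';' and '.' also come out as tokens
```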
For example, if I encounter a word like Finland's in my data, one question is whether I keep it as it is, Finland's, or convert it to Finland or Finlands by removing the apostrophe. This question you might also defer to a later processing step that we will see, but sometimes you might want to tackle it in this same step. Similarly, if you see what're, do I treat it as a single token or as two tokens, what are? This you might have to decide in the same step: single token or multiple tokens. The same goes for I'm, shouldn't, and so on. Similarly, whenever you have a named entity like San Francisco, do you treat it as a single token or as two separate tokens? Remember, when we were talking about some of the cases of why NLP is hard, you might have to find that a particular sequence of tokens is a single entity and treat it as one, not as multiple different tokens; this problem is related. Similarly, if you find m.p.h., do you call it a single token or multiple tokens? There are no fixed answers to these questions, and some of them depend on the application for which you are doing the pre-processing. But one thing you can always keep in mind: suppose you are doing it for information retrieval; the same steps that you apply to your documents should be applied to your query as well, otherwise you will not be able to match them properly. So for information retrieval I should use the same convention for both the documents and the query.

Another problem is how to handle hyphens in the data. This again looks like a simple problem, but we will see it is not that simple. Let us look at the various sorts of hyphens that can occur in a corpus. Here I have a sentence from a research paper abstract: "This paper describes MIMIC, an adaptive mixed-initia-tive spoken dialogue system that provides movie show-time information." In this sentence you see two different hyphens, one in initia-tive and another in show-time, and these two are different. The first hyphen is not one I would normally use in my text; the second I might well use, writing show-time with a hyphen. How did the hyphen in initia-tive get into the corpus? We call this an end-of-line hyphen. In research papers, for example, the text is justified, so a line may end even in the middle of a word, and hyphenation leaves you with a hyphen. When you are pre-processing and you encounter such hyphens, you might have to join the pieces together and say that this is the single word initiative, not initia-tive. But again, this is not trivial, because for show-time you would not do the same; you might want to keep it as it is. Then there are other kinds of hyphens: lexically determined hyphens, which occur with various prefixes like co-, pre-, multi-, etc.; and sententially determined hyphens, which you insert so that the sentence becomes easier to interpret, as in case-based or hand-delivered, where they are optional. Similarly, in the next sentence, a three-to-five-year direct marketing plan: three-to-five-year could be written perfectly well without the hyphens, but here you put them in so that it becomes easier to interpret that phrase as a unit.
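For the end-of-line case specifically, one simple heuristic, sketched here under the assumption that we have a lexicon of known words, is to rejoin a line-final hyphenated pair only when the joined form is itself a word:

```python
# Sketch: undo end-of-line hyphenation using a word list (a heuristic only).
KNOWN_WORDS = {"initiative", "information"}      # stand-in for a real lexicon

def join_line_break(first_half, second_half):
    """Rejoin 'initia-' + 'tive' if the joined form is a known word;
    otherwise keep the hyphen, since it may be a true hyphenated compound."""
    joined = first_half.rstrip("-") + second_half
    if joined.lower() in KNOWN_WORDS:
        return joined
    return first_half + second_half              # keep hyphen: 'show-time'

print(join_line_break("initia-", "tive"))   # -> initiative
print(join_line_break("show-", "time"))     # -> show-time
```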
So again, when you are doing tokenization, you have the problem of how to handle all these hyphens. Further, there are various issues that you might face for certain languages but not others. One example is French: if you have a token like l'ensemble, you might want to match it with un ensemble; that is similar to the clitic problem we just saw in English. But let us take something in German. I have this big word here, and the problem is that it is not really a single word: it is a compound composed of four different words, and the corresponding English meaning is expressed by four separate words. Where English uses four words, German makes a compound. So what problem will you face when you are processing German text and trying to tokenize it? You might want to find the individual words inside a compound, so you need some sort of compound splitter. This problem is there for German, not so much for English.

Now, what happens in a language like Chinese or Japanese? Here is a sentence in Chinese, and what do you see? In Chinese, words are written without any spaces in between. So when you are doing the pre-processing, your task is to find the individual word tokens in the Chinese sentence. This problem is also difficult because, in general, for a given sequence of characters there may be more than one possible way of breaking it into a sequence of words, and several may be perfectly valid. So in Chinese we have no spaces between words and have to find the places where to break; this problem is called word tokenization or word segmentation. The same problem occurs in Japanese, with further complications because Japanese uses multiple scripts, katakana, hiragana, kanji and also romaji, so the problem becomes a bit more severe.

The same problem exists even for Sanskrit. If some of you took a Sanskrit course in class 8 or 10, you may be familiar with the rules of sandhi in Sanskrit. Here is a single sentence in Sanskrit: it looks like one word, but it is not; it is composed of multiple words combined by the sandhi relation. It is in fact a nice Sanskrit proverb, whose English translation is roughly: one should tell the truth, one should say kind words; one should neither tell harsh truths nor flattering lies; this is the rule for all times. It is a proverb in a single sentence, but all the words are combined by sandhi. If you try to undo the sandhi, this is what you find as the segmented text: there are multiple words in the sentence, combined so that it looks like a single word.

So this problem exists in Chinese, Japanese and Sanskrit, but in Sanskrit it is slightly more complicated. Why? In Japanese and Chinese, when various words are combined, they are simply concatenated, one after another, without any changes at the boundary. That is not what happens in Sanskrit: when you combine two words, you also make certain changes at the boundary, and this is called the sandhi operation.
In this particular case, see, here I have the word brihat and the word na, but when I combine them I write brihan-na: the letter t gets changed to n. That means when I am trying to analyze such a Sanskrit sentence, I need to find not only where the breaks are but also the corresponding words from which the surface form arose: from the surface form I have to recover that the actual words are brihat plus na. And this is very, very common in Sanskrit; you are always combining words by the sandhi operation, which further complicates the problem of Sanskrit tokenization, or segmentation.

Here is a list from Wikipedia of the longest words in various languages. You see that in Sanskrit the longest word is composed of 431 characters; it is a compound. Then you have Greek, Afrikaans and other languages; in English the longest word is 45 characters, a non-scientific coinage. The particular 431-letter word in Sanskrit comes from the Varadambika Parinaya Campu by Tirumalamba; it is a single compound from that work.

Now, this problem of tokenization in Sanskrit or in Chinese is also called word segmentation: you have a sequence of characters and you segment it to find individual words. What is the simplest algorithm you can think of? Take the case of Chinese. The simplest algorithm that works is a greedy algorithm called maximum matching. Given a string, you start a pointer at the beginning. Assume you have a dictionary, and the words you find should all be in the dictionary. You find the longest word in the dictionary that matches the string starting at the pointer, break there, move the pointer to the next character, and repeat. So it greedily decides the actual words by taking maximum matches, and this works nicely for most cases in Chinese.

Now, with this segmentation in mind, can you think of cases where segmentation is also required for English text? In English we do not, in general, combine words into a single word. But what is one scenario where we are doing exactly that right now? Do hashtags come to mind? For example, with hashtags like #ThankYouSachin or #MusicMonday, different words are combined together without any boundary in between, so if you are given a hashtag and have to analyze it, you must actually segment it into its various words.
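A minimal sketch of this greedy maximum matching algorithm in Python, applied to hashtag segmentation; the toy dictionary is invented for illustration:

```python
# Sketch: greedy maximum matching segmentation with a toy dictionary.
DICTIONARY = {"thank", "you", "sachin", "music", "monday", "thankyou"}

def max_match(text, dictionary):
    """Repeatedly take the longest dictionary word starting at the pointer."""
    text = text.lower()
    words, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:                   # no match: fall back to one char
            match = text[i]
        words.append(match)
        i += len(match)
    return words

print(max_match("ThankYouSachin", DICTIONARY))  # ['thankyou', 'sachin']
print(max_match("MusicMonday", DICTIONARY))     # ['music', 'monday']
```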
Now, coming to Sanskrit, we have a segmenter available at the site sanskrit.inria.fr. Let us briefly see the design principle behind building it. First there is a generative model that says how a sentence in Sanskrit is generated. I have a finite alphabet Σ, that is, a set of the various characters of Sanskrit. From this finite alphabet I can generate a lot of words, each composed of some number of phonemes or letters from the alphabet. When I have a set of words W, I can combine them together with the operation of sandhi; that is what W* means here: I take the Kleene closure of the set of words, meaning I can combine any number of words together, but whenever I combine words I do it by the sandhi operation, which is a relation between the words. So I have my set of inflected words, also called padas in Sanskrit, and the sandhi relation between them, and that is how sentences are generated. But my problem is to analyze them, which is the inverse problem: given a sentence w, I have to analyze it by inverting the sandhi relations so that I can produce a finite sequence of word forms w1 to wn, together with a proof. That is a formal way of saying that when w1 to wn are combined by the sandhi operation, they give back the initial sentence. That is how the segmenter is built.

Here is a snapshot from the segmenter. I gave it the same sentence, and it produced all the possible ways of analyzing it, reporting 120 different solutions. For instance, wherever I have brihan-na, you see there are two possible splits. In this way it gives all the ways the sentence can be broken into individual word tokens. Then there is a further problem: finding the most likely word sequence among all these 120 possibilities. For that we can use many different models, which we will not talk about in this lecture, but probably in some later lectures.

Coming back to normalization: we talked about the problem that the same word might be written in multiple different ways, like U.S.A. versus USA, and I should be able to match them together. This matters especially in information retrieval, where you are given a query and return documents: if your query contains U.S.A. and the document contains USA, a purely surface-level match will not map them to each other. So you have to consider this problem in advance and pre-process accordingly, with the same settings for both your documents and the query. What we are doing by this is defining equivalence classes: we are saying U.S.A. and USA should go into one class, and they are the same type.

We also do some case folding, that is, reducing all letters to lower case. Whenever I have a word like Word, I will always write it as word, so that even if it occurs capitalized at the start of a sentence, I know it is the same word. But this is not universally right; depending on the application you might have certain exceptions. For example, you might have to treat named entities separately: for an entity like General Motors you might want to keep it as it is, without case folding. Similarly, you might want to keep US, for United States, in upper case and not do the case folding. This is important for machine translation as well: if you case-fold it, you get us in lower case, which means something else entirely, versus US, the United States.

We also have the problem of lemmatization: you have individual words like am, are and is, and you want to convert them to their lemma, that is, the base form from which they are derived, here be. Similarly car, cars, car's and cars' are all derived from car. So this is also a sort of normalization: you are saying all these form an equivalence class because they come from the same base word form.
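As a small illustration, here is a sketch using NLTK's WordNet lemmatizer (assuming the WordNet data has been downloaded); note that it needs a part-of-speech hint to map am/are/is to be:

```python
# Sketch: lemmatization with NLTK's WordNet lemmatizer.
import nltk
nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # -> 'car' (default POS is noun)
print(lemmatizer.lemmatize("are", pos="v"))   # -> 'be' (needs the verb tag)
print(lemmatizer.lemmatize("is", pos="v"))    # -> 'be'
```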
The difficulty with lemmatization is that you have to find the actual dictionary headword from which each word is derived, and for that we use morphology. What is morphology? I am trying to find the structure of a word by seeing what the stem, the headword, is and what affixes are applied to it. These individual units are called morphemes. So you have stems, which are usually the dictionary headwords, and affixes, the additional units, like the plural -s, that you apply to them to make the individual word. Some examples: for prefixes you have un-, anti-, etc. in English, with corresponding prefixes in Hindi and Sanskrit; for suffixes you have -ity, -ation, etc. in English, with corresponding suffixes in Hindi. In general you can also have infixes, inserted in the middle of a word, as happens in Sanskrit. We will discuss morphological processing in detail later.

Besides lemmatization, where you find the actual dictionary headword, there is another concept called stemming, where you do not try to find the dictionary headword but just try to remove certain suffixes, and whatever you obtain is called a stem. This is a crude chopping of affixes from the word, and it is again language-dependent. What we are doing here: words like automate, automatic and automation will all be reduced to a single stem, automat, by stemming. You know the actual lemma might be automate, with an e, but here I am just chopping off the affixes, removing -e, -ic and -ion and reducing everything to automat. If you run a stemmer over a piece of text you will see, for example, endings like -ly removed, compressed reduced to compress, and so on.

So what algorithm is used for this stemming? We have Porter's algorithm, which is very, very famous, and it is again a simple set of if-then-else rules applied in steps. What are some examples? In the first step I take a word: if it ends with sses, I replace that with ss, so caresses goes to caress. If not, I check whether the word ends with ies and replace it with i, so ponies goes to poni. If not, I check whether the word ends with ss and keep it as ss. If not, I check whether the word ends with s and remove that s, so cats goes to cat; but caress does not lose its final s, because the ss rule comes first: a double s at the end is retained, a single s is removed. Like that there are some other steps. If there is a vowel in the word and the word ends with ing, I remove the ing, so walking goes to walk; but what about king? In k there is no vowel, so king is retained as it is. Similarly, if there is a vowel and the word ends with ed, I remove the ed, so played goes to play. You can see the use of the vowel condition in this heuristic: without it, you would have converted king to k. And there are further rules: if the word ends with ational, I replace it with ate, so relational goes to relate; if the word ends with izer, I remove the r, so digitizer goes to digitize; if the word ends with al, I remove the al; if the word ends with able, I remove the able; and if the word ends with ate, I remove the ate. These are some of the steps applied to each word in the corpus to convert it to its stem. It does not give me the correct dictionary headword, but it is still a good heuristic in practice for information retrieval, if you want to match queries with documents.
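NLTK ships an implementation of the Porter stemmer; a minimal sketch follows (note the full algorithm applies its rule steps in sequence, so some outputs can differ from what one rule alone would suggest):

```python
# Sketch: stemming with NLTK's implementation of the Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cats", "walking", "king", "played"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, cats -> cat,
# walking -> walk, king -> king, played -> play
```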
So this is it for this week. Next week we will start with another text processing task, that is, spelling correction.