Transcript for:
Understanding CRFs for Named Entity Recognition

So hi everyone, today we will be discussing how a conditional random field, that is, a CRF, works for extracting named entities from a given text. Let's set up the basics first: what is information extraction, what is named entity recognition, what is a CRF, and then, using an example, we will see how conditional random fields work.

So what is information extraction, first of all? Information extraction refers to extracting important information from a given long text. This can be the name of a person, the name of an industry, sometimes a particular venue; this is the information we might wish to extract, and doing so is called information extraction.

Now, named entity recognition is a part of information extraction. When we say named entities, we wish to pick out proper nouns, that is, the name of a person, the name of a country, the name of some organization, or anything that can be given a specific name. For example, take "Ram works at Google." Here there are two entities present: Ram, which is a person's name, and Google, which is an organization name. So you can see that we can assign tags accordingly: a person name can be given the tag PER, an organization ORG, a location LOC, a geopolitical entity GPE, and so on.
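Just to make the tagging concrete, here is a tiny sketch of what such labeled output looks like; the token/tag pairs are only an illustration of the labeling scheme, not output from any particular library:

```python
# Each token of the example sentence paired with its named entity tag.
# PER = person, ORG = organization, O = not a named entity.
tagged = [("Ram", "PER"), ("works", "O"), ("at", "O"), ("Google", "ORG")]
for token, tag in tagged:
    print(f"{token:>8} -> {tag}")
```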

Now in NER there are a couple of major problems that we may face. One is segmentation ambiguity. Let me give you an example. Assume that we have the sentence "New York is the best city to live in." If you notice, there are two words here that combine to form a single entity, the name of a city: New and York.

Now, training a system to consider these two words as a single entity is a bit difficult. The second ambiguity is tag assignment ambiguity, that is, when the same name can have different meanings. For example, in India, Nirma can be the name of a girl as well as a very famous washing brand. Similarly, Apple can be the name of a fruit as well as a tech organization. Moving ahead, there are multiple approaches we can use for named entity recognition, the most common being linear CRFs,

Maximum Entropy Markov Models, or BiLSTMs. For now, we will be discussing how a CRF works for Named Entity Recognition. As the definition says, a linear chain CRF corresponds to a labeler in which the tag assignment for the present word depends only on the tag of the previous word.

So for example, suppose we have this sentence: "CRF is amongst the most prominent approaches used for NER," and we wish to get a tag for "used". The CRF considers the tag of the previous word, alongside some feature functions, to assign a tag to the current word. So let's understand the different steps required in a CRF. The first step is deciding on the feature functions. So what are feature functions? A feature function is a function built over a particular word that generates word-level features. For example, feature functions can be: Is the first letter capital? What is the character count?

Is a vowel present? Is the given word present in some gazetteer? What is the word embedding of the word? Is a hyphen present in the word?

And so on. So these are small functions, you can see, that take in a word and give you an output.

And that output helps us to build features over words.

If we input a word, we get a set of different features for it: whether the first letter is capital (true or false), whether vowels are present (true or false), whether a particular prefix like "in" or "un" is present in the word, whether a suffix like "ed" or "ing" is present, and so on. These are the features that we wish to build over the different words that we are getting, as in the sketch below. We will skip a detailed explanation of the individual feature functions, since they are up to the developer: he can choose as many feature functions as he wishes to have.
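To make this concrete, here is a minimal sketch of such feature functions in Python; the function name and the exact feature set are illustrative choices on my part, not a fixed standard:

```python
def word_features(word):
    """Compute a small set of word-level features for one token."""
    return {
        "is_first_capital": word[0].isupper(),                  # Is the first letter capital?
        "has_vowel": any(c in "aeiou" for c in word.lower()),   # Is a vowel present?
        "has_hyphen": "-" in word,                              # Presence of a hyphen
        "prefix_in_un": word.lower().startswith(("in", "un")),  # Prefix like "in" or "un"
        "suffix_ed_ing": word.lower().endswith(("ed", "ing")),  # Suffix like "ed" or "ing"
    }

for token in ["Ram", "is", "cool"]:
    print(token, word_features(token))
```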

Now let's take an example and understand how a CRF works: how feature functions are calculated over a given sentence and how the CRF calculates the probability of a given tag sequence. For example, we have the sentence "Ram is cool", and the named entity labels for this particular sentence are PER, O, and O. So let's first understand the output.

PER refers to a person entity that we have found, that is, the name of a person: Ram is the name of a person. "is" and "cool" have got the tag O. So what is O? O means that the word is not a named entity.

I would suggest you go through my last video about IOB tagging, where you can understand what O stands for. So, in this given sentence, we have one entity, Ram, and two words that are not entities. That is why the correct sequence that we wish to predict is PER, O, O. So, let's understand how it will work out.

So, each feature function has this particular signature: $f_j(x, y_i, y_{i-1}, i)$. Let's understand what this signature means.

$x$ refers to the particular sentence that we have, that is, "Ram is cool". $y_i$ refers to the tag of the current word, the one whose probability we are calculating, and $i$ is the index of that word. If we start counting from 1, 2, 3 as indices, then for "Ram is cool" an index of $i = 2$ means "is" is the word we are considering. So assume that we are trying to calculate some feature function for the word "is". The signature would then look like this: $x$ is the sentence "Ram is cool", $y_i$ is the current word's tag, the one we are trying to predict the probability for, $y_{i-1}$ is the tag we assume for the previous word in the sequence, and 2 is the index.

So let's now understand the main equation that is followed in a CRF:

$$P(y \mid x) = \frac{\exp\left(\sum_j w_j F_j(x, y)\right)}{\sum_{\hat{y}} \exp\left(\sum_j w_j F_j(x, \hat{y})\right)}$$

This is a very dangerous-looking equation, and I guess we might take some time to understand it. Starting off, $F_j(x, y)$ is built from the function signature I already told you about:

$$F_j(x, y) = \sum_i f_j(x, y_i, y_{i-1}, i)$$

So let's try to understand, very simply, what this entire $F_j(x, y)$ means. It means that we are calculating a particular feature function for all the words in the given sentence.

Right. For example, say we have the feature function "is a vowel present", which returns a value of 0 or 1 for each word. By $F_j(x, y)$ we are calculating the summation of that particular function's value over all the words. For example, if we have the sentence "Ram is cool" and the feature function "is the first letter capital", this value would be 1, because "Ram" has a capital first letter and gives an output of 1, while the other two words, "is" and "cool", give an output of 0; so 1 + 0 + 0 = 1. Just understand this as the summation of the values of a given feature function over all the words in the given sentence.

Now what is $w_j$? $w_j$ is the weight assigned to each feature function. In any system we may wish to give more importance to a particular feature as compared to some other features, and $w_j$ captures that. The outer summation over $j$ runs over the different feature functions. So assume that we have 10 feature functions: is a vowel present, is a consonant present, is the first letter capital, and so on. For each of the 10 we calculate $F_j(x, y)$, that is, the summation of that feature function over all the words, weight it by $w_j$, and add them all together.

Now what does the denominator mean? In the denominator you can see there are two summations going on. The inner summation is similar to the one in the numerator, but the outer summation runs over $\hat{y}$; it refers to the value that we get for every possible tag sequence. If we look at the left-hand side, we are calculating the probability of a given tag sequence given the sentence. This $\hat{y}$ can be PER O O, it can be O O O, it can be ORG O O, and so on. So it makes sense: we are calculating the value for every possible tag sequence given the sentence $x$. $\hat{y}$ thus ranges over all the possible sequences, and for each one we run this $F_j$ summation.
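As a sanity check on this equation, here is a minimal brute-force sketch that computes $P(y \mid x)$ by enumerating every possible tag sequence. The two toy feature functions and their weights are made-up assumptions for illustration; a real CRF would never enumerate sequences like this, it would use dynamic programming:

```python
import itertools
import math

TAGS = ["PER", "ORG", "LOC", "O"]
sentence = ["Ram", "is", "cool"]

# Toy feature functions with the signature f_j(x, y_i, y_prev, i).
# Both the functions and the weights below are made-up assumptions.
def f_cap_is_per(x, y_i, y_prev, i):
    # Fires when a capitalized word is tagged PER.
    return 1.0 if x[i][0].isupper() and y_i == "PER" else 0.0

def f_lower_is_o(x, y_i, y_prev, i):
    # Fires when a lowercase word is tagged O.
    return 1.0 if x[i].islower() and y_i == "O" else 0.0

feature_functions = [f_cap_is_per, f_lower_is_o]
weights = [2.0, 1.5]  # w_j: one weight per feature function

def score(x, y):
    # sum_j w_j * F_j(x, y), where F_j(x, y) = sum_i f_j(x, y_i, y_{i-1}, i).
    return sum(
        w * sum(f(x, y[i], y[i - 1] if i > 0 else "START", i)
                for i in range(len(x)))
        for w, f in zip(weights, feature_functions)
    )

# Denominator: sum of exp(score) over every possible tag sequence y_hat.
Z = sum(math.exp(score(sentence, y_hat))
        for y_hat in itertools.product(TAGS, repeat=len(sentence)))

# Numerator for the sequence we actually want: PER O O.
p = math.exp(score(sentence, ("PER", "O", "O"))) / Z
print(f"P(PER O O | 'Ram is cool') = {p:.4f}")
```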

So, in the numerator, we are considering just the one sequence for which we wish to calculate the probability, and in the denominator, we are considering all the possible sequences and their summation. This is a bit difficult to understand, but I guess you will get it soon, as I will show you with the example. Suppose we are trying to calculate the probability for the particular sequence PER PER LOC; as I said, there are many possible sequences, and assume this is the one for which we are calculating the probability given "Ram is cool". The numerator would then look something like this:

$$\exp\left(\sum_j w_j F_j(\text{"Ram is cool"}, \text{PER PER LOC})\right)$$

Now, this capital $F_j$, if you remember, is the summation of some feature function over all the words.

And that is why the summation runs over all the feature functions. Now if you look at the denominator, it is a bit easier to understand: we are adding up the exponentiated summation values for each and every possible sequence, such as "Ram is cool" with O O O, "Ram is cool" with PER ORG O, "Ram is cool" with PER ORG ORG, and so on. So in the numerator only the one sequence whose probability we are calculating is considered, while in the denominator we consider all the possible sequences.

Now, as you know, the actual tag sequence should be PER O O, as we discussed earlier. So out of the probabilities we calculate for all the sequences, this particular one, the probability of PER O O given "Ram is cool", should come out the highest.

Now, one thing that we have missed so far is how the weights $w_j$, mentioned earlier in the equation alongside the feature functions, are calculated. They are actually trained: we do some training of the CRF using some training data, and the weight updates are done using gradient descent over this particular equation. We won't be deep diving into the derivatives and all; it's just for an understanding of how CRFs work.
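Before closing, if you wish to see these weights being learned in practice, here is a minimal sketch using the sklearn-crfsuite library; the library choice and the tiny two-sentence training set are my own assumptions for illustration, with pip install sklearn-crfsuite assumed:

```python
import sklearn_crfsuite

def word_features(word):
    # The same kind of word-level features we discussed above.
    return {
        "is_first_capital": word[0].isupper(),
        "has_vowel": any(c in "aeiou" for c in word.lower()),
        "suffix_ed_ing": word.lower().endswith(("ed", "ing")),
    }

# Tiny made-up training set: one feature dict per token, one tag per token.
X_train = [[word_features(w) for w in ["Ram", "is", "cool"]],
           [word_features(w) for w in ["Ram", "works", "at", "Google"]]]
y_train = [["PER", "O", "O"],
           ["PER", "O", "O", "ORG"]]

# A gradient-based optimizer (L-BFGS here) fits the weights w_j by
# maximizing the log-likelihood of the same P(y|x) equation above.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

print(crf.predict([[word_features(w) for w in ["Ram", "is", "cool"]]]))
```

Thank you so much.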