Understanding CRFs for Named Entity Recognition

Aug 2, 2024

Conditional Random Fields (CRF) for Named Entity Recognition (NER)

Introduction

  • Discussion of how Conditional Random Fields (CRFs) work for extracting named entities from text.
  • Overview of topics:
    • Named Entity Recognition (NER)
    • Information Extraction (IE)
    • CRF basics

Information Extraction

  • Definition: Extracting structured, important information (such as entity names) from unstructured text.
  • Examples of information extracted:
    • Names of persons, organizations, locations.
    • E.g., "Ram works at Google" (Entities: Ram - person, Google - organization).

Named Entity Recognition (NER)

  • A subtask of Information Extraction focused on identifying and classifying named entities, which are typically proper nouns.
  • Tags for entities:
    • Person: PER
    • Organization: ORG
    • Location: LOC
    • Geopolitical Entity: GPE

Challenges in NER

  1. Segmentation Ambiguity:
    • Example: "New York" as a city name.
    • Difficulty in training models to recognize multi-word (compound) entities as a single unit.
  2. Tag Assignment Ambiguity:
    • Example: "Nirma" can refer to a person or a brand.
    • Example: "Apple" can refer to a fruit or a tech organization.

Approaches to NER

  • Common methods include:
    • Linear CRFs
    • Maximum Entropy Markov Models
    • BiLSTMs

Conditional Random Fields (CRF)

  • Definition: A linear-chain CRF assigns each word a tag based on the previous word's tag and on features of the sentence.
  • Example sentence: "CRF is among the most prominent approaches used for NER."
  • To assign a tag, the CRF considers the tag of the previous word in conjunction with feature functions.

Feature Functions

  • Definition: Functions that generate useful features for individual words.
  • Examples of feature functions:
    • Is the first letter capital?
    • Is a vowel present?
    • Is the word in a gazetteer?
    • Word embeddings, presence of hyphens, etc.
  • Output of feature functions helps in building word-level features.
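The checks listed above can be sketched as a small feature extractor. This is a minimal illustration: the gazetteer contents and feature names below are made up for the example, not taken from any particular library.

```python
# A sketch of word-level feature extraction; the gazetteer and feature
# names are illustrative.
GAZETTEER = {"Ram", "Google"}  # toy lookup of known entity names

def word_features(sentence, i):
    """Return a dict of features for the word at position i of a tokenized sentence."""
    word = sentence[i]
    return {
        "is_capitalized": word[0].isupper(),        # Is the first letter capital?
        "has_vowel": any(c in "aeiouAEIOU" for c in word),
        "in_gazetteer": word in GAZETTEER,          # Is the word in a gazetteer?
        "has_hyphen": "-" in word,
    }

print(word_features("Ram works at Google".split(), 0))
```

Each word is thus mapped to a feature dictionary that downstream feature functions can consume.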

CRF Process Example

  • Input: Sentence "Ram is cool" with NER labels: PER O O (Ram - person, other words - not entities).
  • Tags explained:
    • PER: Person (Ram)
    • O: Other (non-entities)
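The sentence and its labels line up token by token, which can be written out directly:

```python
# The labeled example as aligned token/tag pairs.
tokens = "Ram is cool".split()
tags = ["PER", "O", "O"]   # PER = person, O = other (non-entity)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```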

Feature Function Signature

  • Signature format: f(x, y_i, y_{i-1}, i)
    • x: Sentence (e.g., "Ram is cool")
    • y_i: Tag of the current word
    • y_{i-1}: Tag of the previous word
    • i: Index of the current word in the sentence
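A minimal sketch of one binary feature function with this signature; the function name and the specific feature it encodes are illustrative.

```python
# Illustrative binary feature function with signature f(x, y_i, y_{i-1}, i):
# fires when the word at position i is capitalized and tagged PER.
def f_cap_is_per(x, y, y_prev, i):
    return 1 if x[i][0].isupper() and y == "PER" else 0

x = "Ram is cool".split()
print(f_cap_is_per(x, "PER", None, 0))   # 1: "Ram" is capitalized and tagged PER
print(f_cap_is_per(x, "O", "PER", 1))    # 0: "is" is lowercase and tagged O
```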

Main Equation in CRF

  • Equation for the probability of a tag sequence y given the sentence x:
    • P(y|x) = (1/Z(x)) exp(∑_i ∑_j w_j f_j(x, y_i, y_{i-1}, i))
    • Z(x) = ∑_ŷ exp(∑_i ∑_j w_j f_j(x, ŷ_i, ŷ_{i-1}, i)), summed over every possible tag sequence ŷ

Explanation of the Equation

  • f_j(x, y_i, y_{i-1}, i): The j-th feature function, evaluated at each position i of the sentence.
  • w_j: Weight assigned to each feature function, shared across positions.
  • Z(x): Normalization constant that sums the unnormalized scores over all possible tag sequences, so the probabilities sum to 1.
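For intuition, the equation can be evaluated by brute force on a toy model. The two feature functions and their weights below are made up for illustration (real CRF implementations use dynamic programming instead of enumerating every tag sequence).

```python
import itertools
import math

TAGS = ["PER", "O"]

# Toy feature functions with signature f(x, y_i, y_prev, i); weights are
# illustrative, not learned.
def f_cap_per(x, y, y_prev, i):
    return 1 if x[i][0].isupper() and y == "PER" else 0

def f_after_per_o(x, y, y_prev, i):
    return 1 if y_prev == "PER" and y == "O" else 0

FEATURES = [f_cap_per, f_after_per_o]
WEIGHTS = [2.0, 1.0]

def score(x, ys):
    """Unnormalized log-score: sum over positions i and features j of w_j * f_j."""
    return sum(w * f(x, ys[i], ys[i - 1] if i > 0 else None, i)
               for i in range(len(x)) for f, w in zip(FEATURES, WEIGHTS))

def prob(x, ys):
    """P(y|x) = exp(score) / Z(x), with Z(x) summed over all tag sequences."""
    Z = sum(math.exp(score(x, list(cand)))
            for cand in itertools.product(TAGS, repeat=len(x)))
    return math.exp(score(x, ys)) / Z

x = "Ram is cool".split()
print(prob(x, ["PER", "O", "O"]))
```

Because Z(x) sums the exponentiated scores of every candidate sequence, the probabilities over all 2³ = 8 taggings of this three-word sentence sum to 1.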

Calculating Weights (Training)

  • Weights are learned from training data by maximizing the conditional log-likelihood, typically with gradient-based optimization such as gradient descent.
  • Training adjusts how much influence each feature function has on predictions.
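A sketch of one gradient-ascent step on a toy setup: for each weight, the gradient of the log-likelihood is the observed feature count minus the count the model expects. The feature and names are illustrative, and real implementations compute expectations with forward-backward rather than enumerating all sequences.

```python
import itertools
import math

TAGS = ["PER", "O"]

def f_cap_per(x, y, y_prev, i):
    # toy feature: word at i is capitalized and tagged PER
    return 1 if x[i][0].isupper() and y == "PER" else 0

FEATURES = [f_cap_per]

def feat_counts(x, ys):
    """Total count of each feature over the whole sequence."""
    return [sum(f(x, ys[i], ys[i - 1] if i else None, i) for i in range(len(x)))
            for f in FEATURES]

def grad_step(x, ys, w, lr=0.1):
    """One gradient-ascent step on log P(y|x): observed minus expected counts."""
    cands = [list(c) for c in itertools.product(TAGS, repeat=len(x))]
    scores = [sum(wj * cj for wj, cj in zip(w, feat_counts(x, c))) for c in cands]
    Z = sum(math.exp(s) for s in scores)
    expected = [sum(math.exp(s) / Z * feat_counts(x, c)[j]
                    for c, s in zip(cands, scores))
                for j in range(len(FEATURES))]
    observed = feat_counts(x, ys)
    return [wj + lr * (o - e) for wj, o, e in zip(w, observed, expected)]

w = [0.0]
for _ in range(50):
    w = grad_step("Ram is cool".split(), ["PER", "O", "O"], w)
print(w)  # the learned weight is positive: "capitalized word tagged PER" is rewarded
```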

Conclusion

  • Understanding CRF's role in NER helps in effectively tagging entities in text.
  • The complexity of feature functions and tag assignment is managed through statistical models like CRFs.