Understanding CRFs for Named Entity Recognition

Aug 2, 2024

Conditional Random Fields (CRF) for Named Entity Recognition (NER)

Introduction

  • Discussion of how Conditional Random Fields (CRFs) work for extracting named entities from text.
  • Overview of topics:
    • Named Entity Recognition (NER)
    • Information Extraction (IE)
    • CRF basics

Information Extraction

  • Definition: Extracting structured, important information (such as entity names) from unstructured text.
  • Examples of information extracted:
    • Names of persons, organizations, locations.
    • E.g., "Ram works at Google" (Entities: Ram - person, Google - organization).

Named Entity Recognition (NER)

  • A subtask of Information Extraction focused on identifying and classifying named entities, which are typically proper nouns.
  • Tags for entities:
    • Person: PER
    • Organization: ORG
    • Location: LOC
    • Geopolitical Entity: GPE

Challenges in NER

  1. Segmentation Ambiguity:
    • Example: "New York" as a city name.
    • Difficulty in training models to recognize multi-word (compound) entities as a single unit.
  2. Tag Assignment Ambiguity:
    • Example: "Nirma" can refer to a person or a brand.
    • Example: "Apple" can refer to a fruit or a tech organization.

Approaches to NER

  • Common methods include:
    • Linear CRFs
    • Maximum Entropy Markov Models
    • BiLSTMs

Conditional Random Fields (CRF)

  • Definition: A linear-chain CRF assigns each word a tag based on the previous word's tag and on features of the sentence.
  • Example sentence: "CRF is among the most prominent approaches used for NER."
  • To assign a tag, the CRF considers the tag of the previous word in conjunction with feature functions.

Feature Functions

  • Definition: Functions that generate useful features for individual words.
  • Examples of feature functions:
    • Is the first letter capital?
    • Is a vowel present?
    • Is the word in a gazetteer?
    • Word embeddings, presence of hyphens, etc.
  • Output of feature functions helps in building word-level features.
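The checks listed above can be sketched as a small feature extractor. This is a minimal illustration: the gazetteer contents and feature names below are made up for the example, not taken from any particular library.

```python
# A sketch of word-level feature extraction; the gazetteer and feature
# names are illustrative.
GAZETTEER = {"Ram", "Google"}  # toy lookup of known entity names

def word_features(sentence, i):
    """Return a dict of features for the word at position i of a tokenized sentence."""
    word = sentence[i]
    return {
        "is_capitalized": word[0].isupper(),        # Is the first letter capital?
        "has_vowel": any(c in "aeiouAEIOU" for c in word),
        "in_gazetteer": word in GAZETTEER,          # Is the word in a gazetteer?
        "has_hyphen": "-" in word,
    }

print(word_features("Ram works at Google".split(), 0))
```

Each word is thus mapped to a feature dictionary that downstream feature functions can consume.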

CRF Process Example

  • Input: Sentence "Ram is cool" with NER labels: PER O O (Ram - person, other words - not entities).
  • Tags explained:
    • PER: Person (Ram)
    • O: Other (non-entities)
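The sentence and its labels line up token by token, which can be written out directly:

```python
# The labeled example as aligned token/tag pairs.
tokens = "Ram is cool".split()
tags = ["PER", "O", "O"]   # PER = person, O = other (non-entity)
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```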

Feature Function Signature

  • Signature format: f(x, y_i, y_{i-1}, i)
    • x: Sentence (e.g., "Ram is cool")
    • y_i: Tag of the current word
    • y_{i-1}: Tag of the previous word
    • i: Index of the current word in the sentence
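A minimal sketch of one binary feature function with this signature; the function name and the specific feature it encodes are illustrative.

```python
# Illustrative binary feature function with signature f(x, y_i, y_{i-1}, i):
# fires when the word at position i is capitalized and tagged PER.
def f_cap_is_per(x, y, y_prev, i):
    return 1 if x[i][0].isupper() and y == "PER" else 0

x = "Ram is cool".split()
print(f_cap_is_per(x, "PER", None, 0))   # 1: "Ram" is capitalized and tagged PER
print(f_cap_is_per(x, "O", "PER", 1))    # 0: "is" is lowercase and tagged O
```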

Main Equation in CRF

  • Equation for the probability of a tag sequence y given the sentence x:
    • P(y|x) = (1/Z(x)) exp(∑_i ∑_j w_j f_j(x, y_i, y_{i-1}, i))
    • Z(x) = ∑_ŷ exp(∑_i ∑_j w_j f_j(x, ŷ_i, ŷ_{i-1}, i)), summed over every possible tag sequence ŷ

Explanation of the Equation

  • f_j(x, y_i, y_{i-1}, i): The j-th feature function, evaluated at each position i of the sentence.
  • w_j: Weight assigned to each feature function, shared across positions.
  • Z(x): Normalization constant that sums the unnormalized scores over all possible tag sequences, so the probabilities sum to 1.
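For intuition, the equation can be evaluated by brute force on a toy model. The two feature functions and their weights below are made up for illustration (real CRF implementations use dynamic programming instead of enumerating every tag sequence).

```python
import itertools
import math

TAGS = ["PER", "O"]

# Toy feature functions with signature f(x, y_i, y_prev, i); weights are
# illustrative, not learned.
def f_cap_per(x, y, y_prev, i):
    return 1 if x[i][0].isupper() and y == "PER" else 0

def f_after_per_o(x, y, y_prev, i):
    return 1 if y_prev == "PER" and y == "O" else 0

FEATURES = [f_cap_per, f_after_per_o]
WEIGHTS = [2.0, 1.0]

def score(x, ys):
    """Unnormalized log-score: sum over positions i and features j of w_j * f_j."""
    return sum(w * f(x, ys[i], ys[i - 1] if i > 0 else None, i)
               for i in range(len(x)) for f, w in zip(FEATURES, WEIGHTS))

def prob(x, ys):
    """P(y|x) = exp(score) / Z(x), with Z(x) summed over all tag sequences."""
    Z = sum(math.exp(score(x, list(cand)))
            for cand in itertools.product(TAGS, repeat=len(x)))
    return math.exp(score(x, ys)) / Z

x = "Ram is cool".split()
print(prob(x, ["PER", "O", "O"]))
```

Because Z(x) sums the exponentiated scores of every candidate sequence, the probabilities over all 2³ = 8 taggings of this three-word sentence sum to 1.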

Calculating Weights (Training)

  • Weights are learned from training data by maximizing the conditional log-likelihood, typically with gradient-based optimization such as gradient descent.
  • Training adjusts how much influence each feature function has on predictions.
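A sketch of one gradient-ascent step on a toy setup: for each weight, the gradient of the log-likelihood is the observed feature count minus the count the model expects. The feature and names are illustrative, and real implementations compute expectations with forward-backward rather than enumerating all sequences.

```python
import itertools
import math

TAGS = ["PER", "O"]

def f_cap_per(x, y, y_prev, i):
    # toy feature: word at i is capitalized and tagged PER
    return 1 if x[i][0].isupper() and y == "PER" else 0

FEATURES = [f_cap_per]

def feat_counts(x, ys):
    """Total count of each feature over the whole sequence."""
    return [sum(f(x, ys[i], ys[i - 1] if i else None, i) for i in range(len(x)))
            for f in FEATURES]

def grad_step(x, ys, w, lr=0.1):
    """One gradient-ascent step on log P(y|x): observed minus expected counts."""
    cands = [list(c) for c in itertools.product(TAGS, repeat=len(x))]
    scores = [sum(wj * cj for wj, cj in zip(w, feat_counts(x, c))) for c in cands]
    Z = sum(math.exp(s) for s in scores)
    expected = [sum(math.exp(s) / Z * feat_counts(x, c)[j]
                    for c, s in zip(cands, scores))
                for j in range(len(FEATURES))]
    observed = feat_counts(x, ys)
    return [wj + lr * (o - e) for wj, o, e in zip(w, observed, expected)]

w = [0.0]
for _ in range(50):
    w = grad_step("Ram is cool".split(), ["PER", "O", "O"], w)
print(w)  # the learned weight is positive: "capitalized word tagged PER" is rewarded
```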

Conclusion

  • Understanding CRF's role in NER helps in effectively tagging entities in text.
  • The complexity of feature functions and tag assignment is managed through statistical models like CRFs.