Text Processing Basics and Challenges

Jul 11, 2024

Final Lecture of the First Week

Recap of Last Lecture

  • Discussed various empirical laws
    • Zipf's Law
    • Heaps' Law
  • Highlights:
    • Vocabulary distribution is non-uniform
    • ~100 words make up 50% of tokens
    • 50% of vocabulary words occur only once
    • Relationships between vocabulary size and number of tokens
    • Zipf's Law: relation between word frequency and rank
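The frequency–rank relation can be seen even on a toy word count. The sketch below is illustrative only (the corpus is made up, not from the lecture); under Zipf's Law, frequency is roughly proportional to 1/rank, so rank × frequency stays roughly constant:

```python
from collections import Counter

# Toy corpus (illustrative only). Under Zipf's Law, frequency is roughly
# proportional to 1/rank, so rank * freq should stay roughly constant.
tokens = ("the quick brown fox jumps over the lazy dog "
          "the dog barks and the fox runs").split()
counts = Counter(tokens)
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(f"{rank:>2}  {word:<6} freq={freq}  rank*freq={rank * freq}")
```

On a real corpus the product rank × freq flattens out much more convincingly than on sixteen tokens.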

Basic Text Processing

Tokenization

  • The process of segmenting a string of characters into tokens (roughly, words)
  • Sentence segmentation is usually performed as a pre-step

Challenges in Sentence Segmentation

  • Deciding where sentences begin and end
  • Example issues with punctuation:
    • Periods do not always mark the end of a sentence (e.g., abbreviations such as Dr., Mr., etc.)
    • Periods also appear inside numbers (e.g., 4.4)

Solutions for Sentence Segmentation

  • Treat each candidate punctuation mark as a binary classification problem: end of sentence vs. not end of sentence
  • Rule-based methods (e.g., decision trees with if-then-else)
    • Example features include punctuation, capitalization, and context
  • Machine learning approaches (e.g., decision trees, support vector machines)
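A rule-based classifier of the kind described above can be sketched as a few if-then-else tests over simple features. The abbreviation list and the rules below are illustrative assumptions, not the lecture's exact procedure:

```python
# A toy rule-based end-of-sentence classifier (illustrative sketch,
# not a production segmenter). It inspects each period-bearing token
# in context, using abbreviation, digit, and capitalization features.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e."}

def is_sentence_end(tokens, i):
    """Decide whether the period-bearing token at index i ends a sentence."""
    tok = tokens[i].lower()
    if tok in ABBREVIATIONS:             # "Dr." rarely ends a sentence
        return False
    if any(ch.isdigit() for ch in tok):  # "4.4" is a number, not a boundary
        return False
    if i + 1 < len(tokens) and tokens[i + 1][0].isupper():
        return True                      # next token capitalized: likely boundary
    return i + 1 == len(tokens)          # end of text is always a boundary

tokens = "Dr. Smith paid 4.4 dollars. He left.".split()
ends = [t for i, t in enumerate(tokens)
        if t.endswith(".") and is_sentence_end(tokens, i)]
print(ends)  # -> ['dollars.', 'left.']
```

Each if-test corresponds to one node of the decision tree mentioned above; a machine-learned classifier would instead learn weights for these same features.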

Deep Dive into Tokenization

  • Token and type distinction
    • Example: “I have a can opener but I can’t open these cans”
      • Tokens: 11 running words
      • Types: 10 distinct words (“I” occurs twice; note that “can”, “can’t”, and “cans” count as distinct types)
  • Toolkits for tokenization:
    • NLTK in Python
    • CoreNLP in Java
  • Challenges in Tokenization
    • Handling contractions (e.g., I've, won't)
    • Proper nouns (e.g., San Francisco)
    • Hyphens (e.g., show-time)
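The token/type distinction on the example sentence can be verified with a small regex tokenizer. This is a minimal sketch that keeps contractions intact; NLTK's word_tokenize and CoreNLP handle many more edge cases:

```python
import re

# Minimal regex tokenizer (illustrative; real toolkits cover far more cases).
# The optional "'[A-Za-z]+" group keeps contractions like "can't" whole.
def tokenize(text):
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

tokens = tokenize("I have a can opener but I can't open these cans")
types = set(tokens)
print(len(tokens), len(types))  # -> 11 10  ("I" occurs twice)
```

Splitting contractions differently (e.g., "can't" → "ca" + "n't", as NLTK does) would change both counts, which is exactly why tokenization choices matter downstream.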

Language-Specific Challenges

  • Chinese and Japanese: No spaces between words
    • Word tokenization is non-trivial
  • Sanskrit: Sandhi operation combines words, altering characters at boundaries
  • German: Compound words need splitting (e.g., Lebensversicherungsgesellschaftsangestellter, a four-part compound meaning “life insurance company employee”)

Normalization

  • Matching different surface forms of the same word (e.g., U.S.A. vs. USA)
  • Equivalence classes and case folding
    • Generic case folding to lowercase, with some exceptions for named entities
  • Lemmatization and Stemming
    • Lemmatization: Reduce a word to its base dictionary form (lemma), e.g., am/are/is → be
    • Stemming: A cruder simplification that chops off affixes (e.g., -s, -ed, -ing)
  • Example algorithms:
    • Porter’s algorithm for stemming, with rules to handle different suffixes
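Porter's algorithm applies ordered phases of suffix rules, each guarded by conditions on the remaining stem. The toy stripper below only imitates the suffix-chopping idea and is not the real algorithm (NLTK's nltk.stem.PorterStemmer provides a faithful implementation):

```python
# A toy suffix stripper in the spirit of stemming (NOT Porter's algorithm:
# the real algorithm applies ordered rule phases with measure conditions
# on the stem; this sketch just removes the first matching suffix).
SUFFIXES = ["ing", "ies", "es", "ed", "s"]  # checked longest-ish first

def crude_stem(word):
    for suf in SUFFIXES:
        # Require at least 3 characters left so "is" doesn't become "i".
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["cats", "ponies", "walked", "running"]:
    print(w, "->", crude_stem(w))
```

Note the over- and under-stemming a crude rule produces ("running" → "runn"); Porter's measure conditions and later rule phases exist precisely to clean up such cases.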

Conclusion

  • Summarized various pre-processing challenges and solutions
  • Next lecture topic: Spelling correction