Text Processing Basics and Challenges

Jul 11, 2024

Final Lecture of the First Week

Recap of Last Lecture

Discussed various empirical laws
- Zipf's Law
- Heaps' Law
Highlights:
- Vocabulary distribution is non-uniform
- ~100 words make up 50% of tokens
- 50% of vocabulary words occur only once
- Relationships between vocabulary size and number of tokens
- Zipf's Law: relation between word frequency and rank

Basic Text Processing

Tokenization

The process of converting a string of characters into words
Deals with segmenting text into individual words
Sentence segmentation as a pre-step

Challenges in Sentence Segmentation

Deciding where sentences begin and end
Example issues with punctuation:
- Periods may not always indicate the end of a sentence (e.g., abbreviations like Dr., Mr., etc.)
- Numbers (e.g., 4.4)

Solutions for Sentence Segmentation

Binary classification: End of the sentence vs. Not end of the sentence
Rule-based methods (e.g., decision trees with if-then-else)
- Example features include punctuation, capitalization, and context
Machine learning approaches (e.g., decision trees, support vector machines)

Deep Dive into Tokenization

Token and type distinction
- Example: “I have a can opener but I can’t open these cans”
  - Tokens: 11 words
  - Types: 10 unique words (I repeated twice)
Toolkits for tokenization:
- NLTK in Python
- CoreNLP in Java
Challenges in Tokenization
- Handling contractions (e.g., I've, won't)
- Proper nouns (e.g., San Francisco)
- Hyphens (e.g., show-time)

Language-Specific Challenges

Chinese and Japanese: No spaces between words
- Word tokenization is non-trivial
Sanskrit: Sandhi operation combines words, altering characters at boundaries
German: Compound words need splitting (e.g., four-word compounds)

Normalization

Matching different forms of the same word (e.g., U.S.A vs USA)
Equivalence classes and case folding
- Generic case folding to lowercase, with some exceptions for named entities
Lemmatization and Stemming
- Lemmatization: Find base dictionary form
- Stemming: Simplifies by chopping off affixes
Example algorithms:
- Porter’s algorithm for stemming, with rules to handle different suffixes

Conclusion

Summarized various pre-processing challenges and solutions
Next lecture topic: Spelling correction

Full transcript