Text Processing Basics and Challenges
Jul 11, 2024
Final Lecture of the First Week
Recap of Last Lecture
Discussed various empirical laws
Zipf's Law
Heaps' Law
Highlights:
Vocabulary distribution is non-uniform
The ~100 most frequent words account for about 50% of tokens
About 50% of vocabulary words occur only once
Heaps' Law: relation between vocabulary size and number of tokens
Zipf's Law: relation between word frequency and rank
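The two laws in the recap can be explored with a few lines of counting code. The sketch below is only an illustration on a toy sentence (the function names and the sample text are assumptions, not from the lecture). Zipf's Law is usually stated as frequency being roughly proportional to 1/rank, and Heaps' Law as vocabulary size V growing roughly like K * N^beta for N tokens; a corpus this small cannot confirm either, it only shows what is being measured.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so output is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]

def vocabulary_growth(tokens):
    """Return the vocabulary size seen after each token (the curve Heaps' Law describes)."""
    seen = set()
    sizes = []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    return sizes

def hapax_fraction(tokens):
    """Fraction of vocabulary words that occur exactly once."""
    counts = Counter(tokens)
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)

# Toy corpus, chosen only for illustration.
toy_tokens = "the cat sat on the mat and the dog sat on the log".split()

for rank, word, freq in rank_frequency(toy_tokens)[:3]:
    print(rank, word, freq)
print(vocabulary_growth(toy_tokens)[-1], "types in", len(toy_tokens), "tokens")
print(hapax_fraction(toy_tokens))
```

On a real corpus, plotting log(frequency) against log(rank) from rank_frequency is the usual way to check how closely the data follows Zipf's Law.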
Basic Text Processing
Tokenization
The process of segmenting a string of characters into individual words (tokens)
Sentence segmentation as a pre-step
Challenges in Sentence Segmentation
Deciding where sentences begin and end
Example issues with punctuation:
Periods may not always indicate the end of a sentence (e.g., abbreviations like Dr., Mr., etc.)
Numbers (e.g., 4.4)
Solutions for Sentence Segmentation
Binary classification: End of the sentence vs. Not end of the sentence
Rule-based methods (e.g., hand-written if-then-else rules, which together form a decision tree)
Example features include punctuation, capitalization, and context
Machine learning approaches (e.g., decision trees, support vector machines)
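The binary decision described above (end of sentence vs. not end of sentence) can be sketched as a small set of hand-written rules. This is a minimal illustration, not the lecture's classifier: the abbreviation list, the function names, and the exact rules are assumptions chosen to cover the Dr./Mr. and 4.4 cases from the notes.

```python
import re

# Hypothetical abbreviation list for illustration; a real system would need a larger one.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc"}

def is_sentence_end(text, i):
    """Classify the punctuation mark at text[i]: True if it ends a sentence."""
    ch = text[i]
    if ch in "!?":
        return True
    if ch != ".":
        return False
    # Rule 1: a period between two digits is a decimal point (e.g., 4.4).
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return False
    # Rule 2: a period right after a known abbreviation is not a boundary.
    match = re.search(r"([A-Za-z]+)$", text[:i])
    if match and match.group(1).lower() in ABBREVIATIONS:
        return False
    # Rule 3: accept the period at the end of the text, or when followed by
    # whitespace and then a capital letter or an opening quote.
    rest = text[i + 1:]
    if rest.strip() == "":
        return True
    return bool(re.match(r"\s+[A-Z\"']", rest))

def split_sentences(text):
    """Split text at every position the rules classify as a sentence end."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?" and is_sentence_end(text, i):
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith paid 4.4 dollars. He left."))
```

The rules are deliberately crude: an abbreviation that really does end a sentence ("... and so on, etc.") is missed by Rule 2. A learned classifier would use the same kinds of features (punctuation, capitalization, context) but set the decision thresholds from data instead of by hand.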
Deep Dive into Tokenization
Token and type distinction
Example: “I have a can opener but I can’t open these cans”
Tokens: 11 words
Types: 10 unique words ("I" occurs twice)
Toolkits for tokenization:
NLTK in Python
CoreNLP in Java
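The token and type counts for the example sentence can be checked with plain whitespace splitting. This sketch assumes whitespace tokenization and case-sensitive comparison, which is what the counts of 11 and 10 rely on; a toolkit tokenizer that splits contractions such as "can't" into two pieces would give different numbers.

```python
sentence = "I have a can opener but I can't open these cans"

# Whitespace tokenization: every space-separated chunk is one token.
tokens = sentence.split()

# Types are the distinct tokens; "can", "can't" and "cans" count as different types.
types = set(tokens)

print(len(tokens), "tokens")
print(len(types), "types")
```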
Challenges in Tokenization
Handling contractions (e.g., I've, won't)
Multiword proper nouns (e.g., San Francisco) that should stay together as one unit
Hyphens (e.g., show-time)
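One way to handle the three cases above is a regular expression that keeps internal apostrophes and hyphens inside a token, followed by a lookup that rejoins known multiword names. This is a sketch under assumptions: the pattern, the function names, and the one-entry MULTIWORD list are illustrative choices, and whether "show-time" should be one token or two depends on the application.

```python
import re

# A token is a run of word characters, optionally continued by an apostrophe or
# hyphen plus more word characters; any other non-space character is its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

# Hypothetical list of two-word names to keep together.
MULTIWORD = {("San", "Francisco")}

def tokenize(text):
    """Split text into tokens, keeping contractions and hyphenated words whole."""
    return TOKEN_PATTERN.findall(text)

def merge_multiword(tokens):
    """Rejoin adjacent token pairs that appear in MULTIWORD."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

raw = tokenize("I've seen show-time in San Francisco, won't you?")
print(raw)
print(merge_multiword(raw))
```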
Language-Specific Challenges
Chinese and Japanese: No spaces between words
Word tokenization is non-trivial
Sanskrit: Sandhi operation combines words, altering characters at boundaries
German: Long compound words need splitting into their parts (a single compound may join four or more words)
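When text has no spaces between words, one simple baseline is greedy maximum matching against a word list: at each position, take the longest dictionary word that starts there. The notes do not say which method the lecture used, so treat this as an illustrative baseline only; the lexicon and the unspaced English string stand in for a real dictionary and real Chinese or Japanese text.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest matching word."""
    max_len = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if candidate in lexicon or size == 1:
                words.append(candidate)
                i += size
                break
    return words

# Toy lexicon, chosen only for illustration.
lexicon = {"the", "cat", "in", "hat"}
print(max_match("thecatinthehat", lexicon))
```

Greedy matching can go wrong when a long word overlaps the correct boundary, which is part of why word tokenization in these languages is non-trivial. The same dictionary-lookup idea is one possible starting point for splitting compounds, though real compounds often include linking letters that a plain lookup does not handle.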
Normalization
Matching different forms of the same word (e.g., U.S.A. vs. USA)
Equivalence classes and case folding
Case folding generally lowercases everything, with exceptions for named entities whose capitalization carries meaning
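The normalization steps above can be sketched as a single function that maps variant spellings to one equivalence class and lowercases everything except a short exception list. The KEEP_CASE set and the acronym pattern are assumptions made for this example; a real system would decide exceptions from a named-entity list or from context.

```python
import re

# Hypothetical exceptions: forms whose capitalization distinguishes them
# from an ordinary word (US the country vs. "us" the pronoun).
KEEP_CASE = {"USA", "US"}

# Dotted acronyms such as U.S.A. or U.S.A (at least two letter-plus-period groups).
DOTTED_ACRONYM = re.compile(r"(?:[A-Za-z]\.){2,}[A-Za-z]?")

def normalize(token):
    """Map a token to its equivalence-class representative."""
    # Collapse dotted acronyms so that U.S.A., U.S.A and USA all match.
    if DOTTED_ACRONYM.fullmatch(token):
        token = token.replace(".", "")
    # Keep case for listed exceptions; lowercase everything else.
    if token in KEEP_CASE:
        return token
    return token.lower()

for tok in ["U.S.A", "U.S.A.", "USA", "The", "US", "us"]:
    print(tok, "->", normalize(tok))
```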
Lemmatization and Stemming
Lemmatization: finds the base (dictionary) form of a word
Stemming: a cruder simplification that chops off affixes to reduce a word to its stem
Example algorithms:
Porter’s algorithm for stemming, with rules to handle different suffixes
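To show what a suffix rule looks like, the sketch below implements only the first step of Porter's algorithm (step 1a, which handles plural-like endings). It is not the full stemmer: the later steps, which depend on measuring the remaining stem, are omitted, and the function name is an assumption for this example. Rules are tried in order and the first matching suffix wins.

```python
# Step 1a of Porter's algorithm: (suffix, replacement), longest suffix first.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (blocks the plain "s" rule)
    ("s", ""),       # cats     -> cat
]

def porter_step_1a(word):
    """Apply the first matching step-1a rule; return the word unchanged if none match."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats", "open"]:
    print(w, "->", porter_step_1a(w))
```

The output "poni" is not a dictionary word, which illustrates the difference from lemmatization: a stemmer only needs related forms to collapse to the same string.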
Conclusion
Summarized various pre-processing challenges and solutions
Next lecture topic: Spelling correction