Label Ambiguity and Data Quality in Machine Learning
Jul 4, 2024
Lecture Notes: Label Ambiguity and Data Quality in Machine Learning
Introduction
Label Ambiguity: Ambiguity in bounding boxes for images and other data types.
Examples: Speech recognition, structured data (e.g., user ID merge).
Speech Recognition
Example Scenario: Audio clip with background noise.
Multiple transcriptions are possible: "um nearest gas station", "umm nearest gas station", with the pause marked by a comma or an ellipsis.
Ambiguities: Spelling of "um" (one 'm' or two), use of ellipses or commas, and whether the last part of the clip is intelligible.
Transcription Standards: Standardize transcription conventions so that labels are consistent, which improves speech recognition algorithms.
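A transcription convention can be partly enforced in code after labeling. The following is a minimal sketch, assuming a hypothetical convention that is not from the lecture: the filler is always spelled "um" and a pause is always written as a comma. The function name `normalize_transcript` is illustrative.

```python
import re

# Hypothetical convention (an assumption for illustration):
# - the filler word is spelled "um", never "umm" or "ummm"
# - a pause is written as a comma, never as an ellipsis
def normalize_transcript(text: str) -> str:
    text = text.strip().lower()
    # Collapse "umm", "ummm", ... to "um".
    text = re.sub(r"\bum+\b", "um", text)
    # Replace an ellipsis ("..." or "…") with a comma.
    text = re.sub(r"\s*(\.{3}|…)\s*", ", ", text)
    # Collapse repeated whitespace and drop a dangling comma at either end.
    text = re.sub(r"\s+", " ", text)
    return text.strip(" ,")

print(normalize_transcript("Umm... nearest gas station"))
print(normalize_transcript("um, nearest gas station"))
```

Both calls produce the same string, so two labelers who differed only in spelling and punctuation end up with one label. Normalization cannot settle whether a pause was heard at all or whether the end of the clip is intelligible; those still need written instructions for labelers.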
Structured Data: User ID Merge
Scenario: Merging data records from different sources (e.g., website vs. mobile app).
Ambiguity: Determining whether two records belong to the same person.
Supervised Learning: Train an algorithm to predict whether two records are from the same person.
Ground truth can be gathered through explicit user linkage or manual labeling by humans.
Labeling Consistency: Important despite inherent ambiguities.
User Privacy: Respect user data and privacy; use data only with proper permission.
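The supervised-learning setup for user ID merge can be sketched as a pairwise problem: each training example is a pair of records, and the label says whether they belong to the same person. The sketch below is one possible framing, not the lecture's implementation. The record fields (`name`, `email`, `zip`), the sample records, and the function `pair_features` are all hypothetical.

```python
from difflib import SequenceMatcher

def pair_features(a: dict, b: dict) -> list:
    """Turn two records into numeric features for a same-person classifier."""
    # Fuzzy name similarity in [0, 1], case-insensitive.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    # Exact-match indicators for email (case-insensitive) and zip code.
    same_email = float(a["email"].lower() == b["email"].lower())
    same_zip = float(a["zip"] == b["zip"])
    return [name_sim, same_email, same_zip]

# Made-up records from two sources (website and mobile app).
web = {"name": "Maria Lopez", "email": "mlopez@example.com", "zip": "94301"}
app = {"name": "maria lopez", "email": "MLopez@example.com", "zip": "94301"}
other = {"name": "Tom Becker", "email": "tbecker@example.com", "zip": "10001"}

# (record pair, label): 1 = same person, 0 = different people.
# Labels would come from explicit user linkage or from human labelers.
labeled_pairs = [((web, app), 1), ((web, other), 0)]
X = [pair_features(a, b) for (a, b), _ in labeled_pairs]
y = [label for _, label in labeled_pairs]
```

`X` and `y` could then be passed to any binary classifier. Pairs that human labelers disagree on are the ambiguous cases the notes describe, so the labeling instructions should say how to handle them.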
Other Structured Data Examples
Bot/Spam Detection: Ambiguous labels for user accounts.
Fraud Detection: Ambiguous whether a transaction is fraudulent.
Job Seeking Behavior: Hard to infer a user's intent from behavior on job sites.
Improving Data Quality
Input Quality (X): Ensure data is high-quality and informative.
Example: Improve lighting for smartphone defect detection images.
Labeling Consistency (Y): Provide clear instructions to labelers to reduce noise and randomness in labeling.
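One way to check whether labeling instructions are working is to have two labelers label the same examples and measure how often they agree. The sketch below uses Cohen's kappa, a standard agreement statistic that corrects for chance agreement. The lecture notes do not name a specific metric, so this choice is an assumption, and the labels are made up.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two labelers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of examples where the two labelers gave the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeler assigned labels independently
    # at their own observed label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for four smartphone images.
labeler_1 = ["defect", "defect", "ok", "ok"]
labeler_2 = ["defect", "defect", "ok", "defect"]
print(cohen_kappa(labeler_1, labeler_2))  # 0.5
```

A value near 1 suggests the labelers apply the instructions the same way. A low value suggests the instructions leave room for interpretation and may need revising before more data is labeled.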
Systematic Approach to Data Issues
Key Questions:
What is the input (X)?
What should be the target label (Y)?
How to ensure consistent labeling?
Solutions: Improve sensor quality and data quality, and provide clear labeling instructions.
Moving Forward
Next Steps: Systematic framework to address data and labeling issues for better machine learning performance.