Label Ambiguity and Data Quality in Machine Learning

Jul 4, 2024

Lecture Notes: Label Ambiguity and Data Quality in Machine Learning

Introduction

  • Label Ambiguity: The correct label is often unclear, not only for bounding boxes in images but for other data types as well.
  • Examples: Speech recognition, structured data (e.g., user ID merge).

Speech Recognition

  • Example Scenario: Audio clip with background noise.
    • Multiple transcriptions possible: "um nearest gas station" or "umm nearest gas station", each written with either a comma or an ellipsis.
    • Ambiguities: Spelling of "um" (one "m" or two), use of an ellipsis or a comma, and whether the last part of the clip is treated as intelligible.
  • Transcription Standards: Standardizing transcription conventions gives the speech recognition algorithm consistent labels to learn from.
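
One way to enforce a transcription convention is to normalize labels in code before training. The following is a minimal sketch, assuming a hypothetical convention (lowercase text, the filler spelled "um", followed by a comma); the notes do not prescribe which convention to pick, only that one should be chosen and applied consistently.

```python
import re

# Assumed convention (hypothetical, not from the notes): lowercase text,
# the filler word spelled "um", and a comma after it.
_FILLER = re.compile(r"\bu+m+\b\s*(?:\.\.\.|,)?\s*")


def normalize_transcript(text: str) -> str:
    """Map variant spellings and punctuation of the filler word to one form."""
    # Lowercase and turn the single-character ellipsis into three dots.
    text = text.strip().lower().replace("\u2026", "...")
    # "um", "umm", "ummm" plus an optional comma or ellipsis become "um, ".
    text = _FILLER.sub("um, ", text)
    # Collapse repeated whitespace and drop a comma left dangling at the end.
    return re.sub(r"\s+", " ", text).strip().rstrip(",")


variants = [
    "um nearest gas station",
    "Umm, nearest gas station",
    "Umm... nearest gas station",
]
normalized = {normalize_transcript(v) for v in variants}
```

Under this assumed convention, all three variants collapse to a single label, so the learning algorithm no longer sees them as different targets.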

Structured Data: User ID Merge

  • Scenario: Merging data records from different sources (e.g., website vs. mobile app).
  • Ambiguity: Determining whether two records belong to the same person.
  • Supervised Learning: Train an algorithm that takes a pair of records as input and predicts whether they belong to the same person.
    • Ground truth can be gathered through explicit user linkage or manual labeling by humans.
    • Labeling Consistency: Even when a pair is inherently ambiguous, labelers should apply the same convention so that labels stay consistent.
  • User Privacy: Respect user data and privacy; use data only with proper permission.
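
The record-matching setup above can be sketched as a function that turns a pair of records into features for a classifier. This is an illustration only: the `UserRecord` fields (`name`, `email`, `zip_code`) and the choice of features are assumptions, not taken from the notes, and the string similarity uses the standard library's `difflib.SequenceMatcher`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class UserRecord:
    # Hypothetical fields; real records will differ by source.
    name: str
    email: str
    zip_code: str


def pair_features(a: UserRecord, b: UserRecord) -> list:
    """Build input features (X) for one candidate pair of records."""
    # Fuzzy similarity of the two names, between 0.0 and 1.0.
    name_sim = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    # Exact-match indicators after light normalization.
    same_email = float(a.email.strip().lower() == b.email.strip().lower())
    same_zip = float(a.zip_code.strip() == b.zip_code.strip())
    return [name_sim, same_email, same_zip]


web = UserRecord("Nova Ng", "nova@example.com", "94301")
app = UserRecord("nova ng", "NOVA@example.com", "94301")
x = pair_features(web, app)  # features for this pair
```

The target label (Y) for each pair, same person or not, would come from explicit user linkage or human labelers; the feature vector would then be passed to whatever supervised learning algorithm is chosen.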

Other Structured Data Examples

  • Bot/Spam Detection: Whether a given user account is a bot or spam account can be ambiguous.
  • Fraud Detection: Whether a given transaction is fraudulent can be ambiguous.
  • Job Seeking Behavior: Hard to infer a user's intent from their behavior on job sites.

Improving Data Quality

  • Input Quality (X): Ensure the inputs are high-quality and informative.
    • Example: Improve the lighting when capturing images for smartphone defect detection.
  • Labeling Consistency (Y): Provide clear instructions to labelers to reduce noise and randomness in the labels.
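
Labeling consistency can be measured by having two labelers annotate the same examples and comparing the results. The sketch below computes the raw agreement rate and Cohen's kappa, a common chance-corrected agreement statistic; the notes do not name a specific metric, so this choice and the sample labels are assumptions for illustration.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of examples on which two labelers gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(labels_a)
    p_observed = agreement_rate(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both labelers pick the same class
    # if each labeled independently at their own class frequencies.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both labelers used one identical class throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up labels from two hypothetical labelers on the same four images.
labeler_1 = ["defect", "ok", "ok", "defect"]
labeler_2 = ["defect", "ok", "defect", "defect"]
rate = agreement_rate(labeler_1, labeler_2)
kappa = cohens_kappa(labeler_1, labeler_2)
```

A low score on a shared batch suggests the labeling instructions are unclear for some cases, and the disagreeing examples are a natural place to look when revising them.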

Systematic Approach to Data Issues

  • Key Questions:
    • What is the input (X)?
    • What should be the target label (Y)?
    • How to ensure consistent labeling?
  • Solutions: Improve sensor quality and data quality, and provide clear labeling instructions.

Moving Forward

  • Next Steps: A systematic framework for addressing data and labeling issues to get better machine learning performance.