Data Agnosticism

Jul 12, 2024

Speaker: Nick Kridler

Introduction

  • Generalist, applied mathematician
  • Numerical solutions to partial differential equations
  • Data Scientist at Accretive Health in Chicago
  • Competed in several Kaggle competitions (Master tier)
  • General message: Responsible data analysis and quick iteration can produce high-performing predictive models without domain expertise

Domain Expertise vs. Machine Learning Skill

  • Ongoing debate in data science: which matters more, domain expertise or ML skill?
  • Arguments for both sides:
    • Domain expertise helps to know which data/features are important
    • Algorithms only care about the quality of data (garbage in, garbage out)
  • Takeaway: Algorithms don't need to know the source of the data

Data Agnosticism

  • Algorithms process data agnostically
  • Models can be used to find useful features without domain expertise
  • Importance of responsible data analysis and quick iteration
  • Example: Kaggle competitions on whale detection and healthcare

Process for Finding Whales (Kaggle Competition)

  • North Atlantic right whale call detection
  • Spectrograms used to identify whale calls
  • Started with a simple correlation-based model
    • Averaged all right whale spectrograms
    • Generated features based on max normalized cross-correlation
    • Used random forests for classification
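The correlation-based features described above can be sketched as follows. This is a minimal illustration, not the speaker's code: the function names, the synthetic "upsweep" call, the clip sizes, and the choice to crop the averaged template to the call region are all assumptions made for the example. In the workflow from the talk, the resulting scores would then be passed to a classifier such as a random forest.

```python
import numpy as np

def average_template(spectrograms):
    """Average a stack of equally sized spectrograms with shape (n, freq, time)."""
    return np.mean(np.asarray(spectrograms, dtype=float), axis=0)

def max_normalized_xcorr(spec, template):
    """Slide the template along the time axis of spec and return the
    largest normalized cross-correlation, a value in [-1, 1]."""
    spec = np.asarray(spec, dtype=float)
    template = np.asarray(template, dtype=float)
    n_freq, t_len = template.shape
    if spec.shape[0] != n_freq or spec.shape[1] < t_len:
        raise ValueError("template must match spec in frequency and fit in time")
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    scores = []
    for start in range(spec.shape[1] - t_len + 1):
        window = spec[:, start:start + t_len]
        w = window - window.mean()
        denom = t_norm * np.linalg.norm(w)
        # A flat window or flat template carries no shape information.
        scores.append(float(np.sum(t * w) / denom) if denom > 0 else 0.0)
    return max(scores)

# Synthetic data (hypothetical): a diagonal "upsweep" buried in Gaussian noise.
rng = np.random.default_rng(0)
n_freq, n_time, call_len = 16, 40, 10
pattern = np.zeros((n_freq, call_len))
for k in range(call_len):
    pattern[3 + k, k] = 5.0

def make_clip(offset=None):
    clip = rng.normal(0.0, 1.0, size=(n_freq, n_time))
    if offset is not None:
        clip[:, offset:offset + call_len] += pattern
    return clip

# Average the aligned positive examples, then crop to the call region.
positives = [make_clip(offset=15) for _ in range(50)]
template = average_template(positives)[:, 15:25]

score_call = max_normalized_xcorr(make_clip(offset=5), template)
score_noise = max_normalized_xcorr(make_clip(), template)
print(f"call clip: {score_call:.2f}, noise clip: {score_noise:.2f}")
```

Taking the maximum over all time offsets makes the feature insensitive to where in the clip the call occurs, which is why a single averaged template can still be useful.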

Improvement Cycle

  • Identified samples with poor signal-to-noise ratio
  • Applied contrast enhancement
  • Quick iteration and evaluation
  • Consistent data-driven improvements
  • Python tools helped in rapid iteration
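The notes do not say which contrast enhancement method was used. As one plausible illustration, the sketch below subtracts a per-frequency background level (the median over time) and clips negative values, which suppresses stationary noise bands so that short transient calls stand out. The function name, the percentile parameter, and the synthetic spectrogram are assumptions for this example.

```python
import numpy as np

def enhance_contrast(spec, floor_percentile=50.0):
    """Subtract a per-frequency background estimate, clip at zero,
    and rescale so the strongest remaining value is 1."""
    spec = np.asarray(spec, dtype=float)
    # Background level of each frequency row, estimated over time.
    background = np.percentile(spec, floor_percentile, axis=1, keepdims=True)
    enhanced = np.clip(spec - background, 0.0, None)
    peak = enhanced.max()
    return enhanced / peak if peak > 0 else enhanced

# Synthetic spectrogram (hypothetical): a loud constant band plus a weak transient.
rng = np.random.default_rng(1)
spec = rng.uniform(0.0, 0.1, size=(8, 30))
spec[2, :] += 5.0        # stationary noise band that dominates the raw image
spec[5, 10:15] += 1.0    # short, much weaker "call"

out = enhance_contrast(spec)
raw_row = np.unravel_index(spec.argmax(), spec.shape)[0]
enhanced_row = np.unravel_index(out.argmax(), out.shape)[0]
print("strongest row before:", raw_row, "after:", enhanced_row)
```

Before enhancement the constant band holds the largest values; afterwards the transient does, which is the kind of change that can help a correlation-based feature on low signal-to-noise samples.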

Example Results

  • Simple model yielded area under curve (AUC) of 0.92 (higher than benchmark)
  • Contrast enhancement improved AUC to 0.94
  • Continuous improvements through quick iterations
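For readers unfamiliar with the metric, and assuming AUC here means the area under the ROC curve (its usual meaning), the score equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison:
    the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: one positive (score 0.6) is ranked below one negative (score 0.7).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 8 of 9 pairs are ordered correctly
```

Because the metric depends only on the ranking of the scores, it gives a quick, threshold-free way to compare one iteration of a model against the next.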

Good vs. Bad Cycle

  • Good: Iterative improvement cycle (prediction, evaluation, model improvement)
  • Bad: Random walk through algorithms without data focus

Tools and Final Outcomes

  • Python libraries facilitated quick data analysis and model improvements
  • Data-driven strategy led to high leaderboard positions in Kaggle competitions
  • Better data matters more than more complex algorithms

Conclusion

  • Algorithms care about getting better data
  • Responsible data analysis and quick iteration lead to successful models
  • Code available on GitHub

Q&A

  • Applied same process to work projects and competitions
  • Used random forest feature importances for feature selection
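Feature selection with random forest importances can be sketched with scikit-learn as below. The synthetic data, the number of trees, and the decision to keep the top two features are assumptions for illustration; the notes do not describe the speaker's exact settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical): only features 0 and 1 influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]   # most important feature first
keep = sorted(int(i) for i in ranking[:2])

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
print("selected features:", keep)
```

No knowledge of what the columns mean is needed to produce this ranking, which is the sense in which the approach is agnostic to the source of the data.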

Contact: GitHub link for code and hiring info at Accretive Health.