Data Agnosticism

Jul 12, 2024

Speaker: Nick Kridler

Introduction

Generalist, applied mathematician
Numerical solutions to partial differential equations
Data Scientist at Accretive Health in Chicago
Competed in several Kaggle competitions (Master tier)
General message: Responsible data analysis and quick iteration can produce high-performing predictive models without domain expertise

Domain Expertise vs. Machine Learning Skill

Data science debate on domain expertise vs. ML skill
Arguments for both sides:
- Domain expertise helps to know which data/features are important
- Algorithms only care about the quality of data (garbage in, garbage out)
Takeaway: Algorithms don't need to know the source of the data

Data Agnosticism

Algorithms process data agnostically
Can use models to find features without domain expertise
Importance of responsible data analysis and quick iteration
Example: Kaggle competitions on whale detection and healthcare

Process for Finding Whales (Kaggle Competition)

North Atlantic right whale call detection
Spectrograms used to identify whale calls
Started with a simple correlation-based model
- Averaged all right whale spectrograms
- Generated features based on max normalized cross-correlation
- Used random forests for classification
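The template-matching step above can be sketched as follows. This is a minimal illustration and not the speaker's actual code: the function names, the synthetic diagonal "call", the clip sizes, and the noise level are all assumptions made for the example. It averages positive spectrograms into a template, slides that template along the time axis of a clip, and keeps the peak normalized cross-correlation as a single feature.

```python
import numpy as np

def average_template(spectrograms):
    """Mean of a stack of equally shaped positive-class spectrograms."""
    return np.mean(np.asarray(spectrograms, dtype=float), axis=0)

def max_normalized_xcorr(spec, template):
    """Slide the template along the time axis and return the peak
    normalized cross-correlation (a value between -1 and 1)."""
    spec = np.asarray(spec, dtype=float)
    template = np.asarray(template, dtype=float)
    width = template.shape[1]
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    best = -1.0
    for start in range(spec.shape[1] - width + 1):
        patch = spec[:, start:start + width]
        p = patch - patch.mean()
        denom = np.linalg.norm(p) * t_norm
        if denom > 0:
            best = max(best, float(np.sum(p * t) / denom))
    return best

# Synthetic data (assumption): a diagonal sweep stands in for a whale call.
rng = np.random.default_rng(0)
call = np.eye(16)[::-1]
positives = [call + rng.normal(scale=0.3, size=call.shape) for _ in range(50)]
template = average_template(positives)

noise_clip = rng.normal(scale=0.3, size=(16, 40))
call_clip = rng.normal(scale=0.3, size=(16, 40))
call_clip[:, 10:26] += call  # embed the call at a time offset

score_call = max_normalized_xcorr(call_clip, template)
score_noise = max_normalized_xcorr(noise_clip, template)
print("clip with call:", round(score_call, 3))
print("noise-only clip:", round(score_noise, 3))
```

Each clip's score would then become one column of the feature matrix passed to a classifier such as a random forest.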

Improvement Cycle

Identified samples with poor signal-to-noise ratio
Applied contrast enhancement
Quick iteration and evaluation
Consistent data-driven improvements
Python tools helped in rapid iteration
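One simple way to apply contrast enhancement to a spectrogram is sketched below. The notes do not say which scheme the speaker used, so the median-subtraction approach, the function name `enhance_contrast`, and the toy spectrogram are assumptions for illustration. The idea is to remove each frequency band's typical level and each time frame's typical level, so that a faint call stands out against steady background noise.

```python
import numpy as np

def enhance_contrast(spec):
    """Subtract per-band and per-frame medians, clip negatives, scale to [0, 1]."""
    spec = np.asarray(spec, dtype=float)
    # Remove the typical level of each frequency band (steady hum).
    out = spec - np.median(spec, axis=1, keepdims=True)
    # Remove the typical level of each time frame (broadband bursts).
    out = out - np.median(out, axis=0, keepdims=True)
    out = np.clip(out, 0.0, None)
    peak = out.max()
    return out / peak if peak > 0 else out

# Toy spectrogram (assumption): a loud constant hum in band 3, plus a faint sweep.
spec = np.zeros((8, 12))
spec[3, :] += 5.0
for i in range(8):
    spec[i, i + 2] += 1.0

enhanced = enhance_contrast(spec)
print("hum before/after:", spec[3, 0], enhanced[3, 0])
print("call before/after:", spec[0, 2], enhanced[0, 2])
```

In the raw array the hum is five times louder than the call. After enhancement the hum is gone and the call pixels are the brightest values.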

Example Results

Simple model yielded an area under the curve (AUC) of 0.92, above the benchmark
Contrast enhancement improved AUC to 0.94
Continuous improvements through quick iterations
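For reference, the AUC figures above can be read as the probability that a randomly chosen positive clip scores higher than a randomly chosen negative one. This assumes AUC here means area under the ROC curve, which the notes do not state explicitly. A minimal sketch of that calculation, with a hypothetical `roc_auc` helper and made-up labels and scores:

```python
import numpy as np

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Made-up example: 3 positives, 3 negatives, one positive ranked below a negative.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(auc)  # 8 of 9 pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a move from 0.92 to 0.94 removes a quarter of the remaining misordered pairs.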

Good vs. Bad Cycle

Good: Iterative improvement cycle (prediction, evaluation, model improvement)
Bad: Random walk through algorithms without data focus
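The good cycle can be sketched as a small evaluation harness: hold the model fixed, change the data or features, and keep a change only if the cross-validated score improves. This is an illustrative sketch using scikit-learn, not the speaker's code. The helper `cv_auc`, the synthetic features, and the choice of 5-fold cross-validation are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_auc(X, y, seed=0):
    """Mean 5-fold cross-validated ROC AUC for a fixed random forest."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    return float(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

# Synthetic data (assumption): one weak feature, one stronger candidate feature.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
weak = y + rng.normal(scale=2.0, size=n)
strong = y + rng.normal(scale=0.5, size=n)

baseline = weak.reshape(-1, 1)
candidate = np.column_stack([weak, strong])

# Evaluate the baseline, then evaluate the change with the same model and metric.
baseline_score = cv_auc(baseline, y)
candidate_score = cv_auc(candidate, y)

# Keep the change only if the held-out score improves.
best_X, best_score = baseline, baseline_score
if candidate_score > best_score:
    best_X, best_score = candidate, candidate_score

print("baseline AUC:", round(baseline_score, 3))
print("candidate AUC:", round(candidate_score, 3))
```

Because the model and metric stay fixed, any change in score can be attributed to the data change. Swapping algorithms at random gives no such attribution.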

Tools and Final Outcomes

Python libraries facilitated quick data analysis and model improvements
Data-driven strategy led to high leaderboard positions in Kaggle competitions
Better data mattered more than more complex algorithms

Conclusion

Algorithms care about getting better data
Responsible data analysis and quick iteration lead to successful models
Code available on GitHub

Q&A

Applied same process to work projects and competitions
Used random forest feature importances for feature selection
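Feature selection by random forest importance can be sketched with scikit-learn's `feature_importances_` attribute. The synthetic data and feature names below are assumptions for illustration, and the notes do not say which library or importance measure the speaker used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): one informative feature and two pure-noise features.
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    y + rng.normal(scale=0.5, size=n),  # informative
    rng.normal(size=n),                 # noise
    rng.normal(size=n),                 # noise
])
names = ["informative", "noise_a", "noise_b"]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
ranked = [(names[i], float(forest.feature_importances_[i])) for i in order]
for name, importance in ranked:
    print(name, round(importance, 3))
```

These importances are impurity-based and sum to 1. Low-ranked features are candidates for removal, and each removal should be checked against a held-out score.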

Contact: GitHub link for code and hiring info at Accretive Health.
