Data Agnosticism
Speaker: Nick Kridler
Introduction
- Generalist, applied mathematician
- Numerical solutions to partial differential equations
- Data Scientist at Accretive Health in Chicago
- Competed in several Kaggle competitions (Master tier)
- General message: Responsible data analysis and quick iteration can produce high-performing predictive models without domain expertise
Domain Expertise vs. Machine Learning Skill
- Data science debate on domain expertise vs. ML skill
- Arguments for both sides:
  - For domain expertise: it helps identify which data and features are important
  - For ML skill: algorithms only care about the quality of the data (garbage in, garbage out)
- Takeaway: Algorithms don't need to know the source of the data
Data Agnosticism
- Algorithms process data agnostically
- Can use models to find features without domain expertise
- Importance of responsible data analysis and quick iteration
- Example: Kaggle competitions on whale detection and healthcare
Process for Finding Whales (Kaggle Competition)
- North Atlantic right whale call detection
- Spectrograms used to identify whale calls
- Started with a simple correlation-based model
- Averaged all right whale spectrograms
- Generated features based on max normalized cross-correlation
- Used random forests for classification
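The template-matching step above can be sketched as follows. This is a minimal illustration, not the speaker's code: the synthetic spectrograms, the cropping window, and the function names (`synthetic_spectrogram`, `max_normalized_cross_correlation`) are all assumptions made for the example. It averages the positive examples into a template, slides a crop of that template over each new spectrogram, and keeps the maximum normalized cross-correlation as a single feature.

```python
import numpy as np

rng = np.random.default_rng(0)


def synthetic_spectrogram(has_call):
    """Hypothetical stand-in for a real spectrogram: noise plus an optional upsweep."""
    spec = rng.normal(0.0, 1.0, size=(20, 30))
    if has_call:
        for k in range(10):
            spec[14 - k, 10 + k] += 6.0  # diagonal streak imitating a rising call
    return spec


def max_normalized_cross_correlation(image, template):
    """Slide template over image and return the largest normalized correlation.

    Returns -1.0 if every window (or the template) has zero variance.
    """
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best = -1.0
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            if denom > 0:
                best = max(best, float((p * t).sum() / denom))
    return best


# Average the positive training examples, then crop around the call (assumed window).
train_positives = [synthetic_spectrogram(True) for _ in range(40)]
template = np.mean(train_positives, axis=0)[4:16, 9:21]

# One feature per held-out sample: the maximum normalized cross-correlation.
pos_features = [max_normalized_cross_correlation(synthetic_spectrogram(True), template)
                for _ in range(10)]
neg_features = [max_normalized_cross_correlation(synthetic_spectrogram(False), template)
                for _ in range(10)]
```

On this toy data the feature is clearly larger for samples containing a call than for pure noise, which is what makes it useful as an input to a classifier such as a random forest.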
Improvement Cycle
- Identified samples with poor signal-to-noise ratio
- Applied contrast enhancement
- Quick iteration and evaluation
- Consistent data-driven improvements
- Python tools helped in rapid iteration
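The notes do not say which contrast enhancement was applied. As one plausible sketch, assuming the noise floor differs per frequency band, the function below subtracts a per-row background estimate (the row median by default), clips negatives to zero, and rescales to [0, 1]. The function name `enhance_contrast` and the `floor_percentile` parameter are hypothetical.

```python
import numpy as np


def enhance_contrast(spec, floor_percentile=50.0):
    """Subtract a per-frequency-band background level and rescale to [0, 1]."""
    spec = np.asarray(spec, dtype=float)
    # Background estimate for each row (frequency band) of the spectrogram.
    background = np.percentile(spec, floor_percentile, axis=1, keepdims=True)
    enhanced = np.clip(spec - background, 0.0, None)
    peak = enhanced.max()
    return enhanced / peak if peak > 0 else enhanced


# Toy spectrogram: the noise floor rises with the row index and hides a weak event.
rng = np.random.default_rng(1)
spec = rng.normal(0.0, 0.1, size=(8, 16)) + np.arange(8)[:, None]
spec[2, 5] += 3.0  # the event of interest

enhanced = enhance_contrast(spec)
raw_peak = np.unravel_index(spec.argmax(), spec.shape)
enhanced_peak = np.unravel_index(enhanced.argmax(), enhanced.shape)
```

In the raw array the brightest pixel sits in the noisiest band. After enhancement the brightest pixel is the event itself, so a correlation-based feature computed afterwards is less dominated by background.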
Example Results
- The simple model yielded an area under the ROC curve (AUC) of 0.92, higher than the benchmark
- Contrast enhancement improved AUC to 0.94
- Continuous improvements through quick iterations
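For reference, AUC can be computed directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below is a plain NumPy illustration with a hypothetical function name (`auc_score`); it is quadratic in the number of samples, so it suits quick checks rather than large datasets.

```python
import numpy as np


def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # diff[i, j] > 0 means positive i outranks negative j.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum()
    ties = (diff == 0).sum()
    return float((wins + 0.5 * ties) / diff.size)


example_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the example, three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75. A single number like this is what makes each iteration of the cycle quick to evaluate.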
Good vs. Bad Cycle
- Good: Iterative improvement cycle (prediction, evaluation, model improvement)
- Bad: Random walk through algorithms without data focus
Tools and Final Outcomes
- Python libraries facilitated quick data analysis and model improvements
- Data-driven strategy led to high leaderboard positions in Kaggle competitions
- Importance of better data over complex algorithms
Conclusion
- Algorithms care about getting better data
- Responsible data analysis and quick iteration lead to successful models
- Code available on GitHub
Q&A
- Applied same process to work projects and competitions
- Used random forest feature importances for feature selection
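A minimal sketch of feature selection by random forest importance, assuming scikit-learn is available. The data is synthetic (two informative columns followed by three pure-noise columns) and the cutoff of two features is chosen only for the example; none of this is taken from the speaker's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples = 400

# Columns 0 and 1 determine the label; columns 2 to 4 are noise.
informative = rng.normal(size=(n_samples, 2))
noise = rng.normal(size=(n_samples, 3))
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one per column, summing to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important first
top_features = ranking[:2]               # keep the two strongest features
X_selected = X[:, top_features]
```

The ranking needs no knowledge of what the columns mean, which is the point of the talk: the model itself indicates which features are worth keeping for the next iteration.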
Contact: GitHub link for code and hiring info at Accretive Health.