Lecture on Kaggle Competitions for Click Prediction

Sep 6, 2024

Overview

Today's talk covers three main Kaggle competitions focused on click prediction:

  1. Display Advertising Challenge
  2. Outbrain Click Prediction
  3. Mobile Ad Click Prediction

1. Display Advertising Challenge

  • Timeframe: 5-7 years ago
  • Objective: Predict if a user will click on a given ad
  • Context Data:
    • User details
    • Visited page details
    • Click label (1 for click, 0 for no click)

Data Features

  • I1 to I13: Integer values, mostly counts
  • C1 to C26: Categorical values, anonymized to protect user privacy

Dataset Size

  • Training data: 45 million samples
  • Test data: 6 million samples to predict
  • After one-hot encoding: About 33 million features, so the feature matrix is extremely sparse

Evaluation Metric

  • Log loss: Measures how closely the predicted click probabilities match the binary labels; lower is better, and confident wrong predictions are penalized heavily
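
As a rough illustration of the metric, here is a minimal sketch of binary log loss in plain Python. The function name, the clipping constant `eps`, and the toy labels and predictions are assumptions made for this example, not part of the competition's tooling.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy. y_true holds 0/1 labels, y_pred click probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Toy data (hypothetical): four impressions, two of them clicked.
labels = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]   # mostly right and confident
hedged = [0.5, 0.5, 0.5, 0.5]      # always predicts 50%

print(log_loss(labels, confident))
print(log_loss(labels, hedged))
```

The constant 50% predictor scores ln(2) per sample, which is a handy baseline: a useful model should score below it.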

Winning Strategy: Team '3 Idiots'

  • Workflow:
    1. Pre-process data to create 39 features
    2. Use Gradient Boosted Decision Trees (GBDT) to generate features
    3. Transform data using GBDT
    4. Train Field-aware Factorization Machine (FFM)
    5. Calibrate outputs for final results
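
Steps 2 and 3 can be pictured as follows: once the trees are trained, each sample is routed through every tree, and the index of the leaf it lands in becomes a new categorical feature. The sketch below covers only that encoding step. It assumes the per-tree leaf indices are already available from some GBDT library; the function name and the example values are hypothetical.

```python
def encode_leaves(leaf_indices, depth):
    """Map one leaf index per tree to a global one-hot feature id.

    leaf_indices[t] is the leaf (0 .. 2**depth - 1) that the sample reached in tree t.
    """
    leaves_per_tree = 2 ** depth  # a full binary tree of depth D has 2^D leaves
    feature_ids = []
    for tree_idx, leaf in enumerate(leaf_indices):
        if not 0 <= leaf < leaves_per_tree:
            raise ValueError(f"leaf {leaf} out of range for depth {depth}")
        # Offset by the tree index so every (tree, leaf) pair gets a unique id.
        feature_ids.append(tree_idx * leaves_per_tree + leaf)
    return feature_ids

# Hypothetical sample: 3 trees of depth 2; the sample lands in leaves 1, 3 and 0.
print(encode_leaves([1, 3, 0], depth=2))  # [1, 7, 8]
```

Each sample therefore contributes exactly one active feature per tree, and these ids can be handed to the next model alongside the original features.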

Detailed Steps

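  • GBDT: Trains trees sequentially, each new tree fitting the residual errors of the trees before it
  • GBDT Leaf Encoding: With N trees of depth D, each tree has up to 2^D leaves; every sample lands in exactly one leaf per tree, so the leaf indices give N new categorical features
  • Log Transformation & Grouping: Log-transform numerical features; group rare categorical values into a shared bucket
  • Hashing: Maps feature strings to integer indices via a hash function, which bounds the size of the feature space
  • FFM: Models pairwise feature interactions with latent vectors, where each feature learns a separate latent vector for every field it can interact with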
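
To make the hashing and FFM steps more concrete, here is a minimal sketch of how a hashed sample could be scored by an FFM. It shows prediction only, with randomly initialized weights and no training loop. The bucket count, latent dimension, field names, and sample values are all assumptions chosen for illustration, and MD5 is used only as a convenient deterministic hash.

```python
import hashlib
import math
import random

NUM_BUCKETS = 1000  # size of the hashed feature space (hypothetical)
NUM_FIELDS = 3      # number of fields in the toy sample
K = 4               # latent vector length (hypothetical)

def hash_feature(field, value, num_buckets=NUM_BUCKETS):
    """Hashing trick: map a 'field=value' string to an integer bucket."""
    digest = hashlib.md5(f"{field}={value}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def init_weights(seed=0):
    """weights[bucket][field] is the latent vector that bucket uses against that field."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(K)]
             for _ in range(NUM_FIELDS)]
            for _ in range(NUM_BUCKETS)]

def ffm_score(weights, features):
    """Sum of field-aware pairwise interactions.

    features is a list of (field_index, bucket) pairs, each with implicit value 1.
    """
    score = 0.0
    for a in range(len(features)):
        f1, j1 = features[a]
        for b in range(a + 1, len(features)):
            f2, j2 = features[b]
            v1 = weights[j1][f2]  # feature j1's vector for the field of j2
            v2 = weights[j2][f1]  # feature j2's vector for the field of j1
            score += sum(x * y for x, y in zip(v1, v2))
    return score

def predict_click_probability(weights, features):
    """Squash the interaction score into (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-ffm_score(weights, features)))

# Hypothetical sample with three categorical fields.
field_index = {"site": 0, "device": 1, "ad": 2}
sample = {"site": "news.example", "device": "phone", "ad": "ad_42"}
features = [(field_index[name], hash_feature(name, value))
            for name, value in sample.items()]

weights = init_weights()
print(predict_click_probability(weights, features))
```

The difference from a plain factorization machine is the extra field index on the latent vectors: a feature does not use a single vector for all interactions, but picks the one matching the field of its partner.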

2. Outbrain Click Prediction

  • Objective: Predict which recommended content a user will click on news sites, based on user behavior
  • Platform: Content discovery; recommendations appear as embedded widgets in news articles

Data Features

  • User's Page Views and Clicks: Tracks historical views, documents, platform, location, traffic source
  • Click Data: Display IDs, ad IDs, click status, metadata
  • Document Meta-data: Publisher channel, publish time, topics, entities, categories

Evaluation Metric

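  • Mean Average Precision at 12 (MAP@12): For each display, the ads are ranked by predicted click probability; the score rewards placing the clicked ad near the top of the first 12 positions and is averaged over all displays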
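
A minimal sketch of the metric, under the assumption that each display has exactly one clicked ad, in which case average precision reduces to the reciprocal of the clicked ad's rank. The function names and the toy rankings are hypothetical.

```python
def average_precision_at_k(ranked_ads, clicked_ad, k=12):
    """1/rank of the clicked ad if it appears in the top k, otherwise 0."""
    for rank, ad in enumerate(ranked_ads[:k], start=1):
        if ad == clicked_ad:
            return 1.0 / rank
    return 0.0

def map_at_k(rankings, clicks, k=12):
    """Average the per-display scores over all displays."""
    scores = [average_precision_at_k(rankings[d], clicks[d], k) for d in rankings]
    return sum(scores) / len(scores)

# Toy data (hypothetical): ads per display, ordered by predicted click probability.
rankings = {
    "d1": ["a", "b", "c"],
    "d2": ["x", "y"],
    "d3": ["p", "q", "r"],
}
clicks = {"d1": "a", "d2": "y", "d3": "r"}  # clicked ad sits at rank 1, 2 and 3

print(map_at_k(rankings, clicks))  # (1 + 1/2 + 1/3) / 3
```

Because only the ordering within a display matters, a model can score well on this metric without producing calibrated probabilities, which is one reason rank-based losses show up in the solutions.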

Third Place Solution

  • Feature Extraction & Model Stacking:
    • FFM with softmax click probability & pairwise rank loss
    • XGBoost with pairwise rank loss
  • Additional Features: Page view counts, ad landing page views, impressions by ad, document vectors
  • Encoding Strategy: Aggregates the document vectors from a user's history into a profile, then uses the inner product to measure similarity with a candidate document
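
One way to read the encoding strategy is sketched below. It assumes document vectors are sparse mappings from a topic or category key to a confidence weight, and that "aggregating" means averaging; both are assumptions for illustration, as are the function names and toy vectors.

```python
from collections import defaultdict

def aggregate_user_vector(viewed_docs):
    """Average the sparse vectors of the documents a user has viewed."""
    if not viewed_docs:
        return {}
    total = defaultdict(float)
    for doc in viewed_docs:
        for key, weight in doc.items():
            total[key] += weight
    return {key: weight / len(viewed_docs) for key, weight in total.items()}

def inner_product(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(key, 0.0) for key, weight in u.items())

# Toy history (hypothetical): two viewed documents with topic weights.
history = [
    {"sports": 0.9, "finance": 0.1},
    {"sports": 0.5, "travel": 0.5},
]
user_vec = aggregate_user_vector(history)

sports_ad = {"sports": 1.0}
finance_ad = {"finance": 1.0}
print(inner_product(user_vec, sports_ad))   # higher: matches the user's history
print(inner_product(user_vec, finance_ad))  # lower
```

The resulting similarity score can then be fed to the model as one more numeric feature per candidate ad.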

3. Mobile Ad Click Prediction

  • Objective: Predict if a mobile ad will be clicked

Data Features

  • Ad Identifiers and Click Status
  • Time Details: Year, month, day, hour
  • Categorical Variables: Anonymized, includes site, app, device details

Evaluation Metric

  • Log loss: Again used for model evaluation

Winning Strategy: Expanded '3 Idiots' Team

  • Feature Engineering & Model Ensembling
  • Generated Features: Count features, bag features, click history
  • Hashing Trick: Used to transform text features
  • Advanced Encoding and Model Averaging: Uses logistic function-based geometric averaging
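
The averaging step is not spelled out in these notes. One plausible reading of "logistic function-based geometric averaging" is to map each model's probability to logit space, average there, and map back with the logistic function, which amounts to taking a geometric mean of the odds. The sketch below follows that interpretation; the function names, the optional weights, and the example predictions are assumptions.

```python
import math

def logit(p):
    """Inverse of the logistic function: probability to log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Logistic function: log-odds back to probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_average(probs, weights=None):
    """Weighted average of model predictions in logit space."""
    if weights is None:
        weights = [1.0] * len(probs)
    z = sum(w * logit(p) for w, p in zip(weights, probs)) / sum(weights)
    return sigmoid(z)

# Hypothetical predictions from three models for the same impression.
model_preds = [0.1, 0.3, 0.2]
print(logit_average(model_preds))
```

Compared with a plain arithmetic mean, this gives more influence to models that are confident in either direction, since probabilities near 0 or 1 map to large values in logit space.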

Conclusion

  • Importance of FFM: Proved highly effective across multiple competitions
  • Feature Engineering: Critical for model performance
  • GBDT and Hashing: Useful for feature transformation
  • Current Trends: The field has likely shifted towards deep learning; the AdKDD workshop is a good place to follow recent advances

Questions & Wrap-Up

  • Open floor for questions and comments on the discussed topics.