Understanding Adversarial Attacks in Machine Learning

Aug 1, 2024

Adversarial Attacks in Machine Learning

Classifier Outcomes

  • Classifiers produce four possible outcomes (a small worked example follows this list):
    • True Positive: Correct positive prediction
    • True Negative: Correct negative prediction
    • False Positive: Incorrect positive prediction; also known as a Type I error
    • False Negative: Incorrect negative prediction; also known as a Type II error
  • For more information, refer to the video on Classification Metrics and Evaluation.
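
As a quick illustration of these four counts, here is a minimal Python sketch using scikit-learn's confusion_matrix; the label and prediction vectors are made-up values chosen only for this example.

    # Count the four outcomes for a toy set of labels and predictions.
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels (the "Oracle")
    y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # classifier predictions

    # With labels=[0, 1], ravel() returns (tn, fp, fn, tp).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    print(f"TP={tp}  TN={tn}  FP (Type I)={fp}  FN (Type II)={fn}")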

Inconsistencies in Classifiers

  • Any trained classifier can disagree with an authoritative reference classifier, known as the Oracle.
  • Humans often act as the Oracle.
  • Adversarial attacks exploit these inconsistencies to increase the classifier's errors.

Types of Adversarial Attacks

Type I Adversarial Attack

  • Aims to increase False Positives.
  • Requires a significant modification of the input so that the Oracle sees a different class while the trained classifier keeps its original, now incorrect, label.
  • Typically achieved with generative techniques such as autoencoders and GANs.
  • Example: Transforming a digit (say, a 3) until a human sees an 8, while the trained model still outputs 3 (a sketch follows this list).
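
The notes stop at the idea, but the latent-space version of this attack can be sketched roughly as below. Everything in the sketch is an assumption made for illustration: decoder stands in for the decoder of a trained autoencoder (or a GAN generator), clf for the trained digit classifier, z for the latent code of the original 3, and z_target for the latent code of an 8; both networks are untrained placeholders so the code runs on its own.

    # Rough latent-space Type I attack sketch (untrained placeholder networks).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    LATENT_DIM = 16

    # Stand-ins for a trained decoder and a trained classifier.
    decoder = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, 28 * 28))
    clf = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))

    z = torch.randn(1, LATENT_DIM, requires_grad=True)   # latent code of the original "3"
    z_target = torch.randn(1, LATENT_DIM)                # latent code of an "8" (what the Oracle should see)
    original_label = torch.tensor([3])

    optimizer = torch.optim.Adam([z], lr=0.05)
    for step in range(200):
        optimizer.zero_grad()
        x = decoder(z)                                   # decoded image
        # Push the latent code toward the target class (a large, human-visible change)...
        semantic_loss = F.mse_loss(z, z_target)
        # ...while pinning the trained classifier's prediction to the original label.
        keep_label_loss = F.cross_entropy(clf(x), original_label)
        loss = semantic_loss + 10.0 * keep_label_loss
        loss.backward()
        optimizer.step()

    adversarial_image = decoder(z).detach()  # looks like an 8 to the Oracle, still a "3" to the model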

Type II Adversarial Attack

  • Aims to increase False Negatives.
  • Involves producing instances that look identical to the Oracle but receive a different label from the trained classifier.
  • Typically requires only a minimal change, often just adding a small amount of noise to the input.
  • Example: Adding carefully chosen noise to a digit image so the trained model misclassifies it as an 8, while a human still recognizes the original digit (a sketch follows this list).
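
The notes do not name a specific method, but the Fast Gradient Sign Method (FGSM) is a standard way of producing this kind of small, label-flipping noise. The model and input below are untrained placeholders used only to make the sketch self-contained.

    # Minimal FGSM sketch (a common way to mount a Type II-style attack).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in classifier
    x = torch.rand(1, 28 * 28)        # stand-in for a flattened digit image
    true_label = torch.tensor([3])
    epsilon = 0.1                     # perturbation budget; small enough that the Oracle sees no change

    x_adv = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), true_label)
    loss.backward()

    # Step in the direction that increases the loss, then clamp to the valid pixel range.
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

    print("clean prediction:      ", model(x).argmax(dim=1).item())
    print("adversarial prediction:", model(x_adv).argmax(dim=1).item())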

Differences Between Type I and Type II Attacks

  • Type II Attack: Moves a data point along features the trained classifier relies on but the Oracle ignores (unnecessary features, e.g., x(3)), resulting in a false negative.
  • Type I Attack: Moves a data point along features the Oracle uses but the trained classifier has not learned (missing features, e.g., x(2)), leading to a false positive.

Defensive Strategies Against Adversarial Attacks

  1. Adversarial Training:

    • Incorporate adversarial examples during training so the model learns to recognize and resist them.
    • Mix Type I and Type II attack examples to reduce the success rate of both (see the training-loop sketch after this list).
  2. Ensemble Models:

    • Use multiple classifiers and combine their outputs to increase robustness.
    • Feature Squeezing: Compress the input by reducing its feature precision, classify both the original and the squeezed version, and flag inputs whose predictions disagree as likely adversarial (see the detection sketch after this list).
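
The notes give no code for adversarial training, but the idea in point 1 can be sketched as a training loop that mixes each clean batch with an FGSM-perturbed copy of it. The model, data, and hyperparameters are placeholders chosen only so the sketch runs on its own.

    # Rough adversarial-training sketch (placeholder model and random stand-in data).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    epsilon = 0.1

    def fgsm(x, y):
        """Generate small Type II-style perturbations of a clean batch."""
        x = x.clone().requires_grad_(True)
        F.cross_entropy(model(x), y).backward()
        return (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()

    for step in range(100):                  # stand-in training loop on random data
        x = torch.rand(32, 28 * 28)          # placeholder batch of "images"
        y = torch.randint(0, 10, (32,))      # placeholder labels
        x_adv = fgsm(x, y)                   # adversarial copies of the same batch

        optimizer.zero_grad()
        # Train on clean and adversarial examples together so both are classified correctly.
        loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()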

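For feature squeezing (point 2), a common detection recipe is to reduce the input's precision, classify both versions, and flag large disagreements. The classifier and threshold below are placeholders for illustration.

    # Rough feature-squeezing detection sketch (placeholder classifier and threshold).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in classifier

    def squeeze(x, bits=4):
        """Reduce input precision (here: bit depth) to wash out small adversarial noise."""
        levels = 2 ** bits - 1
        return torch.round(x * levels) / levels

    def looks_adversarial(x, threshold=0.5):
        """Flag inputs whose predictions change a lot after squeezing."""
        p_original = F.softmax(model(x), dim=1)
        p_squeezed = F.softmax(model(squeeze(x)), dim=1)
        return (p_original - p_squeezed).abs().sum(dim=1) > threshold

    x = torch.rand(8, 28 * 28)               # placeholder batch
    print(looks_adversarial(x))              # one True/False flag per input
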
Conclusion

  • Overview of adversarial attacks and their types.
  • Differences between Type I and Type II attacks.
  • Discussion of defensive strategies to protect classifiers.