Understanding Adversarial Attacks in Machine Learning
Aug 1, 2024
Classifier Outcomes
Every prediction a binary classifier makes falls into one of four outcome categories:
True Positive: Correct positive prediction
True Negative: Correct negative prediction
False Positive: Incorrect positive prediction (the true label is negative); also known as a Type I error
False Negative: Incorrect negative prediction (the true label is positive); also known as a Type II error
For more information, refer to the video on Classification Metrics and Evaluation.
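To make these four outcomes concrete, here is a minimal Python sketch (not part of the lecture) that tallies them for a binary classifier; the label arrays are made-up examples.

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # correct positive predictions
tn = np.sum((y_pred == 0) & (y_true == 0))  # correct negative predictions
fp = np.sum((y_pred == 1) & (y_true == 0))  # Type I errors
fn = np.sum((y_pred == 0) & (y_true == 1))  # Type II errors

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```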
Inconsistencies in Classifiers
Any trained classifier may be inconsistent with an authoritative reference classifier, called the Oracle, on some inputs.
In practice, humans often act as the Oracle.
Adversarial attacks exploit these inconsistencies to induce classification errors.
Types of Adversarial Attacks
Type I Adversarial Attack
Aims to increase False Positives.
Requires a significant modification of the input so that the Oracle sees a different class while the trained classifier still assigns the original class.
Typically achieved with generative techniques such as autoencoders and GANs (see the sketch below).
Example: A digit image is changed so much that it looks like an 8 to a human, yet the model still classifies it as a 3.
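As an illustration of how such an attack could be mounted with a generative model, here is a hedged PyTorch sketch: the decoder, the victim classifier, and the surrogate "oracle" classifier are untrained stand-ins (so the code runs end to end), and the latent-space optimization is an illustrative assumption rather than the lecture's exact method. The idea is to push the decoded image toward the target class for the oracle surrogate while keeping the victim classifier's prediction fixed on the source class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Untrained stand-ins; in practice these would be trained on MNIST-like data.
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
victim = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))            # the attacked classifier
oracle_surrogate = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))  # proxy for human judgment

source_class, target_class = 3, 8            # start from a "3", make it look like an "8"
z = torch.randn(1, 16, requires_grad=True)   # latent code we optimize
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    x = decoder(z).view(1, 1, 28, 28)
    # Push the image toward looking like the target class to the oracle surrogate...
    oracle_loss = nn.functional.cross_entropy(oracle_surrogate(x), torch.tensor([target_class]))
    # ...while keeping the victim classifier's prediction stuck on the source class.
    victim_loss = nn.functional.cross_entropy(victim(x), torch.tensor([source_class]))
    loss = oracle_loss + victim_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

x_adv = decoder(z).detach()
print("victim predicts:", victim(x_adv.view(1, 1, 28, 28)).argmax().item())
```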
Type II Adversarial Attack
Aims to increase False Negatives.
Involves producing instances that look unchanged to the Oracle but are assigned a different class by the trained classifier.
Typically requires only a minimal change, often just adding a small, carefully crafted perturbation (noise) to the input (see the sketch below).
Example: Adding noise to a digit image so that the trained model misclassifies it as an 8, while a human still recognizes the original digit.
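One well-known way to craft this kind of small perturbation is the Fast Gradient Sign Method (FGSM); the lecture does not name a specific method, so treat this PyTorch sketch as one illustrative instance. The tiny model and random input are placeholders for a trained digit classifier and a real image.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder classifier and input; in practice use a trained model and a real image.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
x = torch.rand(1, 1, 28, 28)              # "clean" digit image in [0, 1]
y_true = torch.tensor([3])                # its correct label

# FGSM: take one step of size eps in the direction that increases the loss.
eps = 0.1
x.requires_grad_(True)
loss = nn.functional.cross_entropy(model(x), y_true)
loss.backward()
x_adv = (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

print("clean prediction:    ", model(x).argmax(dim=1).item())
print("perturbed prediction:", model(x_adv).argmax(dim=1).item())
```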
Differences Between Type I and Type II Attacks
Type II Attack: Moves a data point along features the trained classifier uses but the Oracle does not need (unnecessary features, e.g., x(3)), producing a false negative.
Type I Attack: Moves a data point along features the Oracle uses but the trained classifier does not capture (missing features, e.g., x(2)), producing a false positive.
Defensive Strategies Against Adversarial Attacks
Adversarial Training:
Incorporate adversarial examples into the training data so the model learns to classify them correctly.
Training on a mix of Type I and Type II adversarial examples reduces the success rate of both attack types (see the sketch below).
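A minimal sketch of such a training loop, assuming a PyTorch classifier and using FGSM-style (Type II-style) perturbations only, for brevity; the model, data, and hyperparameters are placeholders rather than the lecture's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))   # placeholder classifier
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, y, eps=0.1):
    """Generate a perturbed copy of x that increases the loss (a Type II-style example)."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

for step in range(100):                       # toy loop over random "batches"
    x = torch.rand(32, 1, 28, 28)             # placeholder images
    y = torch.randint(0, 10, (32,))           # placeholder labels
    x_adv = fgsm(x, y)                        # craft adversarial counterparts
    # Train on a mix of clean and adversarial examples.
    loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```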
Ensemble Models:
Combine the predictions of several independently trained classifiers; an adversarial input crafted against one model is less likely to fool the whole ensemble (see the sketch below).
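A small sketch of the ensemble idea, assuming a few independently trained scikit-learn classifiers combined by majority vote; the synthetic dataset and model choices are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0), SVC()]
for m in models:
    m.fit(X, y)

def ensemble_predict(x):
    """Majority vote across the ensemble; an input must fool most models to succeed."""
    votes = np.array([m.predict(x) for m in models])   # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

print(ensemble_predict(X[:5]), y[:5])
```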
Feature Squeezing: Compress the input (for example, by reducing the number or precision of its features), classify both the original and the squeezed version, and compare the predictions; a large disagreement signals a likely adversarial input (see the sketch below).
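A sketch of the detection idea, assuming the "squeeze" is a simple bit-depth reduction of pixel values and that a large change in the predicted class probabilities flags the input; the model and threshold are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10), nn.Softmax(dim=1))  # placeholder

def squeeze_bit_depth(x, bits=3):
    """Reduce pixel precision, discarding fine-grained perturbations."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def is_adversarial(x, threshold=0.5):
    """Flag inputs whose predictions change a lot after squeezing (L1 distance of probabilities)."""
    p_orig = model(x)
    p_squeezed = model(squeeze_bit_depth(x))
    return (p_orig - p_squeezed).abs().sum(dim=1) > threshold

x = torch.rand(4, 1, 28, 28)          # placeholder batch of images
print(is_adversarial(x))
```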
Conclusion
Overview of adversarial attacks and their types.
Differences between Type I and Type II attacks.
Discussion of defensive strategies to protect classifiers.