Understanding Adversarial Attacks in Machine Learning
Aug 1, 2024
Classifier Outcomes
Every prediction a binary classifier makes falls into one of four outcome categories:
True Positive: Correct positive prediction
True Negative: Correct negative prediction
False Positive: Incorrect positive prediction (the true label is negative); also known as a Type I error
False Negative: Incorrect negative prediction (the true label is positive); also known as a Type II error
For more information, refer to the video on Classification Metrics and Evaluation.
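To make these four outcomes concrete, here is a minimal Python sketch (not part of the lecture) that tallies them for a binary classifier; the label arrays are made-up examples.

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # correct positive predictions
tn = np.sum((y_pred == 0) & (y_true == 0))  # correct negative predictions
fp = np.sum((y_pred == 1) & (y_true == 0))  # Type I errors
fn = np.sum((y_pred == 0) & (y_true == 1))  # Type II errors

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```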
Inconsistencies in Classifiers
Any trained classifier may be inconsistent with an authoritative reference classifier, called the Oracle, on some inputs.
In practice, humans often act as the Oracle.
Adversarial attacks exploit these inconsistencies to induce classification errors.
Types of Adversarial Attacks
Type I Adversarial Attack
Aims to increase False Positives.
Requires a significant modification of the input so that the Oracle sees a different class while the trained classifier still assigns the original class.
Typically achieved with generative techniques such as autoencoders and GANs (see the sketch below).
Example: A digit image is changed so much that it looks like an 8 to a human, yet the model still classifies it as a 3.
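As an illustration of how such an attack could be mounted with a generative model, here is a hedged PyTorch sketch: the decoder, the victim classifier, and the surrogate "oracle" classifier are untrained stand-ins (so the code runs end to end), and the latent-space optimization is an illustrative assumption rather than the lecture's exact method. The idea is to push the decoded image toward the target class for the oracle surrogate while keeping the victim classifier's prediction fixed on the source class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Untrained stand-ins; in practice these would be trained on MNIST-like data.
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
victim = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))            # the attacked classifier
oracle_surrogate = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))  # proxy for human judgment

source_class, target_class = 3, 8            # start from a "3", make it look like an "8"
z = torch.randn(1, 16, requires_grad=True)   # latent code we optimize
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    x = decoder(z).view(1, 1, 28, 28)
    # Push the image toward looking like the target class to the oracle surrogate...
    oracle_loss = nn.functional.cross_entropy(oracle_surrogate(x), torch.tensor([target_class]))
    # ...while keeping the victim classifier's prediction stuck on the source class.
    victim_loss = nn.functional.cross_entropy(victim(x), torch.tensor([source_class]))
    loss = oracle_loss + victim_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

x_adv = decoder(z).detach()
print("victim predicts:", victim(x_adv.view(1, 1, 28, 28)).argmax().item())
```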
Type II Adversarial Attack
Aims to increase False Negatives.
Involves producing instances that look unchanged to the Oracle but are assigned a different class by the trained classifier.
Typically requires only a minimal change, often just adding a small, carefully crafted perturbation (noise) to the input (see the sketch below).
Example: Adding noise to a digit image so that the trained model misclassifies it as an 8, while a human still recognizes the original digit.
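One well-known way to craft this kind of small perturbation is the Fast Gradient Sign Method (FGSM); the lecture does not name a specific method, so treat this PyTorch sketch as one illustrative instance. The tiny model and random input are placeholders for a trained digit classifier and a real image.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder classifier and input; in practice use a trained model and a real image.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
x = torch.rand(1, 1, 28, 28)              # "clean" digit image in [0, 1]
y_true = torch.tensor([3])                # its correct label

# FGSM: take one step of size eps in the direction that increases the loss.
eps = 0.1
x.requires_grad_(True)
loss = nn.functional.cross_entropy(model(x), y_true)
loss.backward()
x_adv = (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

print("clean prediction:    ", model(x).argmax(dim=1).item())
print("perturbed prediction:", model(x_adv).argmax(dim=1).item())
```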
Differences Between Type I and Type II Attacks
Type II Attack: Moves a data point along features the trained classifier uses but the Oracle does not need (unnecessary features, e.g., x(3)), producing a false negative.
Type I Attack: Moves a data point along features the Oracle uses but the trained classifier does not capture (missing features, e.g., x(2)), producing a false positive.
Defensive Strategies Against Adversarial Attacks
Adversarial Training:
Incorporate adversarial examples into the training data so the model learns to classify them correctly.
Training on a mix of Type I and Type II adversarial examples reduces the success rate of both attack types (see the sketch below).
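A minimal sketch of such a training loop, assuming a PyTorch classifier and using FGSM-style (Type II-style) perturbations only, for brevity; the model, data, and hyperparameters are placeholders rather than the lecture's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))   # placeholder classifier
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def fgsm(x, y, eps=0.1):
    """Generate a perturbed copy of x that increases the loss (a Type II-style example)."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

for step in range(100):                       # toy loop over random "batches"
    x = torch.rand(32, 1, 28, 28)             # placeholder images
    y = torch.randint(0, 10, (32,))           # placeholder labels
    x_adv = fgsm(x, y)                        # craft adversarial counterparts
    # Train on a mix of clean and adversarial examples.
    loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```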
Ensemble Models:
Combine the predictions of several independently trained classifiers; an adversarial input crafted against one model is less likely to fool the whole ensemble (see the sketch below).
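A small sketch of the ensemble idea, assuming a few independently trained scikit-learn classifiers combined by majority vote; the synthetic dataset and model choices are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0), SVC()]
for m in models:
    m.fit(X, y)

def ensemble_predict(x):
    """Majority vote across the ensemble; an input must fool most models to succeed."""
    votes = np.array([m.predict(x) for m in models])   # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

print(ensemble_predict(X[:5]), y[:5])
```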
Feature Squeezing: Compress the input (for example, by reducing the number or precision of its features), classify both the original and the squeezed version, and compare the predictions; a large disagreement signals a likely adversarial input (see the sketch below).
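A sketch of the detection idea, assuming the "squeeze" is a simple bit-depth reduction of pixel values and that a large change in the predicted class probabilities flags the input; the model and threshold are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10), nn.Softmax(dim=1))  # placeholder

def squeeze_bit_depth(x, bits=3):
    """Reduce pixel precision, discarding fine-grained perturbations."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def is_adversarial(x, threshold=0.5):
    """Flag inputs whose predictions change a lot after squeezing (L1 distance of probabilities)."""
    p_orig = model(x)
    p_squeezed = model(squeeze_bit_depth(x))
    return (p_orig - p_squeezed).abs().sum(dim=1) > threshold

x = torch.rand(4, 1, 28, 28)          # placeholder batch of images
print(is_adversarial(x))
```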
Conclusion
Overview of adversarial attacks and their types.
Differences between Type I and Type II attacks.
Discussion of defensive strategies to protect classifiers.