Transcript for:
Understanding Adversarial Attacks in Machine Learning

In this video, we're going to talk about adversarial attacks in machine learning. A classifier, as you may know, can produce four different outcomes: True Positive, True Negative, False Positive, and False Negative. If these terms are unfamiliar to you, I recommend watching my video on Classification Metrics and Evaluation. True positives and true negatives are both correct outputs. False positives and false negatives are the two kinds of classifier mistakes, that is, erroneous outputs. A false positive is also known as a Type I error, and a false negative is sometimes referred to as a Type II error.

Any trained classifier will have inconsistencies with a classifier that has complete authority on the subject, which we refer to as the "oracle." Humans, for example, can act as oracles in many circumstances. An attacker can take advantage of these inconsistencies to make the trained classifier commit more errors.

Based on the kind of mistake it aims to increase, adversarial attacks can be divided into two types. A "Type I adversarial attack" is an approach aimed at making the model produce more false positives. A "Type II adversarial attack," on the other hand, concentrates on raising the number of false negatives.

To perform a Type II attack, the attacker has to produce instances that look the same to the oracle but that the trained classifier assigns to two different classes. So, while the two inputs are in fact from the same class, the classifier falsely places them in different classes, causing a false negative. As shown in this figure, a Type II adversarial attack requires only a small change to the data, which is easily accomplished by adding noise to the input. The Type II attack has received more attention than the Type I attack because it is easier to execute.

In a Type I adversarial attack, the attacker must produce an example that the classifier assigns to the same class as inputs that are clearly distinct in the oracle's view. Unlike in a Type II attack, the input has to be modified drastically, so that the oracle recognizes the difference while the classifier still assigns it to the same class as before. This significant change is usually achieved using autoencoders and GANs.

Here is an example of Type I and Type II attacks on a digit classifier. In the Type II attack, we simply add some noise to the input image, so that while it is obvious to a human that the digit is the same, the model classifies the noisy input as an 8. In the Type I attack, we change the input digit drastically, so that it looks like an 8 to a human, but we make this transition in a way that the model does not notice the difference and still classifies the new input as a 3.

Let's take a closer look at these two sorts of attacks and how they differ. Suppose we have several points in a three-dimensional space corresponding to two classes of positive and negative training samples. The oracle uses features x(1) and x(2) to categorize these points; hence, these two features are the best characteristics any classifier could use to classify these training samples. However, our trained classifier is unaware of the optimal features and classifies the points using x(1) and x(3). Like the oracle, the trained classifier achieves accurate results on the training data. To perform a Type II attack on this classifier, we have to take advantage of the "unnecessary feature" x(3).
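To make this toy example concrete, here is a minimal Python sketch of the setup. The linear decision rules, thresholds, and coordinates are illustrative assumptions rather than values taken from the figure; the sketch also includes the Type I manipulation along x(2) that we will walk through in a moment.

```python
import numpy as np

# Toy setup: the oracle classifies on the ideal features x1 and x2,
# while the trained classifier was fit on x1 and the spurious feature x3.
# The decision rules and thresholds below are illustrative assumptions.

def oracle(point):
    """Ground-truth labeler: looks only at x1 and x2."""
    x1, x2, x3 = point
    return "positive" if x1 + x2 > 1.0 else "negative"

def trained_classifier(point):
    """Learned model: looks only at x1 and the unnecessary feature x3."""
    x1, x2, x3 = point
    return "positive" if x1 + x3 > 1.0 else "negative"

# A clean training point that both models label "positive".
p = np.array([0.8, 0.6, 0.5])
print(oracle(p), trained_classifier(p))              # positive positive

# Type II attack: push the point along the "unnecessary" feature x3.
# The oracle's view (x1, x2) is untouched, but the model's output flips,
# producing a false negative.
p_type2 = p + np.array([0.0, 0.0, -0.6])
print(oracle(p_type2), trained_classifier(p_type2))  # positive negative

# Type I attack: push the point along the "missing" feature x2.
# The oracle now sees a genuinely negative sample, but the model,
# blind to x2, still says "positive" -- a false positive.
p_type1 = p + np.array([0.0, -1.2, 0.0])
print(oracle(p_type1), trained_classifier(p_type1))  # negative positive
```

Running the sketch prints the oracle's and the classifier's labels before and after each perturbation, matching the behavior described next.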
As you can see here, moving a datapoint in the direction of x(3) can change the trained classifier's output. Meanwhile, the oracle will not observe any change, since the point's location on the x(1) and x(2) axes is unchanged. So, even though the true class of the datapoint has not changed, our classifier now labels it "negative," resulting in a false negative. This is a Type II adversarial attack.

Now, if we wanted to perform a Type I attack, we would have to make use of the "missing feature" x(2). We call x(2) the missing feature because, although it is one of the two essential features the oracle uses to classify the points, it was not included in our classifier's model. Just as in the previous example, we move a datapoint in the direction of the missing feature x(2) until the oracle classifies it as a negative sample. However, as you can see, the point's coordinates on the x(1) and x(3) axes remain unchanged, so our classifier continues to label it as positive. By using a Type I adversarial attack, we can effectively mislead a model into making a false positive error.

So far, we've looked at the different kinds of adversarial attacks and their characteristics. But do we have any means of defending our poor models? Yes, we do, and we are going to learn more about them now.

The most straightforward defense is to use some adversarial examples in the training process, so that our model becomes familiar with these inputs and won't be tricked as easily. We can add instances of Type I and Type II attacks to the training data to reduce their rate of success on our model in the future. This method is called "adversarial training."

Using an ensemble of models for classification is another defensive tactic. For example, in the "feature squeezing" defense strategy, we compress the input features and feed the squeezed input to a second classifier, in addition to the standard model that classifies the raw input. If the prediction of the second classifier differs significantly from the prediction of the first classifier, we infer that the input was an adversarial attack. Minimal code sketches of both defenses appear at the end of this transcript.

So, in conclusion: I first introduced adversarial attacks and their different types. Then I explained the difference between the two types of adversarial attacks. And finally, I mentioned some defensive strategies against these attacks.
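To illustrate the adversarial training defense, here is a minimal PyTorch-style sketch. It assumes a differentiable classifier `model` with inputs scaled to [0, 1], and it uses the fast gradient sign method (FGSM) as the example attack; the transcript does not prescribe a particular attack, so FGSM, the epsilon value, and the 50/50 loss mix are assumptions.

```python
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.1):
    """Craft Type II-style adversarial examples by adding a small,
    gradient-aligned perturbation to the inputs (FGSM)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to the valid range.
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.1):
    """One training step that mixes clean and adversarial examples,
    so the model learns to resist small perturbations."""
    x_adv = fgsm_attack(model, x, y, eps)
    optimizer.zero_grad()
    loss = 0.5 * (F.cross_entropy(model(x), y) +
                  F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (assuming `model`, `optimizer`, and `train_loader` exist):
#   for x, y in train_loader:
#       adversarial_training_step(model, optimizer, x, y)
```

In practice, the attack used during training should resemble the attacks expected at test time.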
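And here is a sketch of the feature squeezing idea. The transcript describes feeding the squeezed input to a second classifier; this sketch uses the common variant in which the same model scores both the raw and the squeezed input, and a large disagreement between the two predictions flags a likely attack. The bit-depth squeezer and the detection threshold are illustrative assumptions.

```python
import torch

def squeeze_bit_depth(x, bits=4):
    """'Squeeze' the input by reducing its bit depth, which removes
    much of the fine-grained adversarial noise."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def is_adversarial(model, x, threshold=1.0, bits=4):
    """Flag an input as adversarial when the predictions on the raw and
    squeezed versions disagree too much (L1 distance between softmax outputs)."""
    with torch.no_grad():
        p_raw = torch.softmax(model(x), dim=-1)
        p_squeezed = torch.softmax(model(squeeze_bit_depth(x, bits)), dim=-1)
    return (p_raw - p_squeezed).abs().sum(dim=-1) > threshold
```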