Transcript for:
Understanding Adversarial Attacks in Machine Learning

In this video, we're going to talk about adversarial attacks in machine learning. A classifier, as you may know, can produce four different outcomes: True Positive, True Negative, False Positive, and False Negative. If these terms are unfamiliar to you, I recommend watching my video on Classification Metrics and Evaluation. True positives and true negatives are both correct outputs. False positives and false negatives are the two kinds of classifier mistakes, that is, erroneous outputs. A false positive is also known as a Type I error, and a false negative is sometimes referred to as a Type II error.

Any trained classifier will have inconsistencies with a classifier that has complete authority on the subject, which we refer to as the "oracle." Humans, for example, can act as oracles in many circumstances. An attacker can take advantage of these inconsistencies to make the trained classifier commit more errors.

Based on the kind of mistake it aims to increase, adversarial attacks can be divided into two types. A "Type I adversarial attack" is an approach aimed at making the model produce more false positives. A "Type II adversarial attack," on the other hand, concentrates on raising the number of false negatives.

To perform a Type II attack, the attacker has to produce instances that look the same to the oracle but that the trained classifier assigns to two different classes. So, while the two inputs are in fact from the same class, the classifier falsely places them in different classes, causing a false negative. As shown in this figure, a Type II adversarial attack requires only a small change to the data, which is easily accomplished by adding noise to the input. The Type II attack has received more attention than the Type I attack because it is easier to execute.

In a Type I adversarial attack, the attacker must produce an example that the classifier assigns to the same class as inputs that are clearly distinct in the oracle's view. Unlike in a Type II attack, the input has to be modified drastically, so that the oracle recognizes the difference while the classifier still assigns it to the same class as before. This significant change is usually achieved using autoencoders and GANs.

Here is an example of Type I and Type II attacks on a digit classifier. In the Type II attack, we simply add some noise to the input image, so that while it is obvious to a human that the digit is the same, the model classifies the noisy input as an 8. In the Type I attack, we change the input digit drastically, so that it looks like an 8 to a human, but we make this transition in a way that the model does not notice the difference and still classifies the new input as a 3.

Let's take a closer look at these two sorts of attacks and how they differ. Suppose we have several points in a three-dimensional space corresponding to two classes of positive and negative training samples. The oracle uses features x(1) and x(2) to categorize these points; hence, these two features are the best characteristics any classifier could use to classify these training samples. However, our trained classifier is unaware of the optimal features and classifies the points using x(1) and x(3). Like the oracle, the trained classifier achieves accurate results on the training data. To perform a Type II attack on this classifier, we have to take advantage of the "unnecessary feature" x(3).
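To make this toy example concrete, here is a minimal Python sketch of the setup. The linear decision rules, thresholds, and coordinates are illustrative assumptions rather than values taken from the figure; the sketch also includes the Type I manipulation along x(2) that we will walk through in a moment.

```python
import numpy as np

# Toy setup: the oracle classifies on the ideal features x1 and x2,
# while the trained classifier was fit on x1 and the spurious feature x3.
# The decision rules and thresholds below are illustrative assumptions.

def oracle(point):
    """Ground-truth labeler: looks only at x1 and x2."""
    x1, x2, x3 = point
    return "positive" if x1 + x2 > 1.0 else "negative"

def trained_classifier(point):
    """Learned model: looks only at x1 and the unnecessary feature x3."""
    x1, x2, x3 = point
    return "positive" if x1 + x3 > 1.0 else "negative"

# A clean training point that both models label "positive".
p = np.array([0.8, 0.6, 0.5])
print(oracle(p), trained_classifier(p))              # positive positive

# Type II attack: push the point along the "unnecessary" feature x3.
# The oracle's view (x1, x2) is untouched, but the model's output flips,
# producing a false negative.
p_type2 = p + np.array([0.0, 0.0, -0.6])
print(oracle(p_type2), trained_classifier(p_type2))  # positive negative

# Type I attack: push the point along the "missing" feature x2.
# The oracle now sees a genuinely negative sample, but the model,
# blind to x2, still says "positive" -- a false positive.
p_type1 = p + np.array([0.0, -1.2, 0.0])
print(oracle(p_type1), trained_classifier(p_type1))  # negative positive
```

Running the sketch prints the oracle's and the classifier's labels before and after each perturbation, matching the behavior described next.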
As you can see here, moving a datapoint in the direction of x(3) can change the trained classifier's output. Meanwhile, the oracle will not observe any change, since the point's location on the x(1) and x(2) axes is unchanged. So, even though the true class of the datapoint has not changed, our classifier now labels it "negative," resulting in a false negative. This is a Type II adversarial attack.

Now, if we wanted to perform a Type I attack, we would have to make use of the "missing feature" x(2). We call x(2) the missing feature because, although it is one of the two essential features the oracle uses to classify the points, it was not included in our classifier's model. Just as in the previous example, we move a datapoint in the direction of the missing feature x(2) until the oracle classifies it as a negative sample. However, as you can see, the point's coordinates on the x(1) and x(3) axes remain unchanged, so our classifier continues to label it as positive. By using a Type I adversarial attack, we can effectively mislead a model into making a false positive error.

So far, we've looked at the different kinds of adversarial attacks and their characteristics. But do we have any means of defending our poor models? Yes, we do, and we are going to learn more about them now.

The most straightforward defense is to use some adversarial examples in the training process, so that our model becomes familiar with these inputs and won't be tricked as easily. We can add instances of Type I and Type II attacks to the training data to reduce their rate of success on our model in the future. This method is called "adversarial training."

Using an ensemble of models for classification is another defensive tactic. For example, in the "feature squeezing" defense strategy, we compress the input features and feed the squeezed input to a second classifier, in addition to the standard model that classifies the raw input. If the prediction of the second classifier differs significantly from the prediction of the first classifier, we infer that the input was an adversarial attack. Minimal code sketches of both defenses appear at the end of this transcript.

So, in conclusion: I first introduced adversarial attacks and their different types. Then I explained the difference between the two types of adversarial attacks. And finally, I mentioned some defensive strategies against these attacks.
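To illustrate the adversarial training defense, here is a minimal PyTorch-style sketch. It assumes a differentiable classifier `model` with inputs scaled to [0, 1], and it uses the fast gradient sign method (FGSM) as the example attack; the transcript does not prescribe a particular attack, so FGSM, the epsilon value, and the 50/50 loss mix are assumptions.

```python
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.1):
    """Craft Type II-style adversarial examples by adding a small,
    gradient-aligned perturbation to the inputs (FGSM)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to the valid range.
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.1):
    """One training step that mixes clean and adversarial examples,
    so the model learns to resist small perturbations."""
    x_adv = fgsm_attack(model, x, y, eps)
    optimizer.zero_grad()
    loss = 0.5 * (F.cross_entropy(model(x), y) +
                  F.cross_entropy(model(x_adv), y))
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (assuming `model`, `optimizer`, and `train_loader` exist):
#   for x, y in train_loader:
#       adversarial_training_step(model, optimizer, x, y)
```

In practice, the attack used during training should resemble the attacks expected at test time.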
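And here is a sketch of the feature squeezing idea. The transcript describes feeding the squeezed input to a second classifier; this sketch uses the common variant in which the same model scores both the raw and the squeezed input, and a large disagreement between the two predictions flags a likely attack. The bit-depth squeezer and the detection threshold are illustrative assumptions.

```python
import torch

def squeeze_bit_depth(x, bits=4):
    """'Squeeze' the input by reducing its bit depth, which removes
    much of the fine-grained adversarial noise."""
    levels = 2 ** bits - 1
    return torch.round(x * levels) / levels

def is_adversarial(model, x, threshold=1.0, bits=4):
    """Flag an input as adversarial when the predictions on the raw and
    squeezed versions disagree too much (L1 distance between softmax outputs)."""
    with torch.no_grad():
        p_raw = torch.softmax(model(x), dim=-1)
        p_squeezed = torch.softmax(model(squeeze_bit_depth(x, bits)), dim=-1)
    return (p_raw - p_squeezed).abs().sum(dim=-1) > threshold
```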