Transcript for:
Understanding Activation Functions in Neural Networks

What's going on everyone, this is Jay Patel, and in this video we will be talking about activation functions in neural networks. We will see why we need activation functions, what the different types of activation functions are, and which function to use when. Also, if you are new to this channel, consider subscribing, because I regularly upload new video tutorials like these on machine learning. So hit the red subscribe button and also hit the bell icon so that you get notified. And for now, let's get started.

Let's first see why we need activation functions, and for that let's look at the forward propagation equations. In these equations, if we skip the step of passing z1 through the activation function, then a1 is directly given by z1, that is, a1 = W1 a0 + b1. If we calculate a2 the same way, then a2 = W2 a1 + b2, and expanding this gives a2 = W2 (W1 a0 + b1) + b2 = (W2 W1) a0 + (W2 b1 + b2). Here W2 W1 can be written as W' and W2 b1 + b2 can be written as b'.

So a2 now has the same structure as a1: a2 = W' a0 + b'. And if we continue doing this for the further layers, then a3 can be given by W'' a0 + b''. So no matter how many layers we use, we keep getting the same kind of equation, and thus we cannot take advantage of the multiple layers; it becomes equivalent to using just one layer. Without any activation function, all these layers are linear. So we need to use a nonlinear activation function to get the benefit of the different hidden layers. Also, real-world data is nonlinear, so learning the complex relationships in the data works much better with nonlinear activation functions.
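
To make this concrete, here is a minimal NumPy sketch of that collapse (the video itself shows no code; the layer sizes and random values are arbitrary assumptions):

```python
# Stacking layers WITHOUT an activation function collapses into one linear layer.
import numpy as np

rng = np.random.default_rng(0)
a0 = rng.normal(size=(3, 1))                          # input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))

# Two "layers" with no activation in between:
a1 = W1 @ a0 + b1
a2 = W2 @ a1 + b2

# The same result from ONE equivalent linear layer:
W_dash = W2 @ W1                                      # W'
b_dash = W2 @ b1 + b2                                 # b'
print(np.allclose(a2, W_dash @ a0 + b_dash))          # True -> depth added nothing
```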

That's why we need to use an activation function. Now let's look at the different types of activation functions, starting with the most famous one, the sigmoid function. The sigmoid function, sigmoid(x) = 1 / (1 + e^(-x)), is an S-shaped curve which looks something like this. It outputs values between 0 and 1, and for an input of 0 it gives 0.5 as the output. Since its output lies between 0 and 1, it can be interpreted as a probability if we use it in the output layer.

So, the sigmoid function is most commonly used at the output neuron for binary classification. Let's say we are classifying apples versus oranges: if the output value is greater than 0.5, we classify it as an apple, and if the output value is less than 0.5, we classify it as an orange. Thus, the sigmoid function is very well suited for the output layer in binary classification. But using the sigmoid function in the hidden layers has some drawbacks. Let's look at these drawbacks as we discuss the next activation function, the tanh function.
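
As a rough illustration of this decision rule, here is a small Python sketch (not from the video; the raw output value 1.7 is just an assumed number):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                   # 0.5, as mentioned for an input of 0
print(sigmoid(4.0), sigmoid(-4.0))    # close to 1 and close to 0

# Hypothetical apples-vs-oranges decision rule:
z_output = 1.7                        # assumed raw output of the last neuron
p_apple = sigmoid(z_output)
print("apple" if p_apple > 0.5 else "orange")
```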

Now, the tanh function, or hyperbolic tangent function, is similar to the sigmoid function, except that it is stretched between -1 and 1 instead of 0 and 1, and because of this it has some advantages over the sigmoid function when we use it in the hidden layers. In backpropagation, or gradient descent, we need the derivative of the cost function with respect to the weights, and since the cost function depends on the activation function, we need to take the derivative of the activation function as well. If we look at these derivatives, the maximum value of the derivative of the sigmoid function is only 0.25, while that of tanh is 1. Thus, using the tanh activation function, we can train our model much faster: because the derivative values are larger, it converges more quickly. Also, with the tanh function, the average of the outputs will be close to zero.

And if we use the tanh function in the hidden layers, it works much better than the sigmoid function. This is because the outputs of a hidden layer, which are also the inputs to the next layer, will have values whose average is close to zero; or we can say the data is normalized, or centered around zero. And when we pass normalized data to the next layer, it makes training much easier. That is why we use the tanh function in the hidden layers and not the sigmoid function.
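
Here is a quick NumPy check of those claims (my own sketch, not code from the video; the random inputs are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                 # peaks at 0.25 when z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2         # peaks at 1.0 when z = 0

z = np.linspace(-5, 5, 1001)
print(d_sigmoid(z).max(), d_tanh(z).max())    # 0.25 vs 1.0

# tanh outputs are centred around zero, sigmoid outputs around 0.5:
x = np.random.default_rng(1).normal(size=10_000)
print(np.tanh(x).mean(), sigmoid(x).mean())   # ~0.0 vs ~0.5
```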

Now, both the tanh and the sigmoid function also have a disadvantage. When we take the derivative of these functions while doing gradient descent, we see that for a very high value of x or a very low value of x, the derivative becomes very small, so the learning will be very slow. This is called the vanishing gradient problem: learning becomes very slow and the model does not train effectively.
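
A tiny numeric sketch (again my own, not from the video) makes this vanishing gradient visible:

```python
import numpy as np

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, d_sigmoid(z), d_tanh(z))
# At z = 10 both derivatives are around 1e-5 or smaller, so almost no
# gradient signal flows back through such a neuron during training.
```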

Now, this problem can be overcome by using another type of activation function called the ReLU function, or rectified linear unit. It is a linear function where we rectify any input that is less than zero. The graph of ReLU looks something like this, and the equation is ReLU(x) = max(0, x): for every x less than or equal to 0 it gives 0 as the output, and for every x greater than 0 it gives x itself as the output. Now, since the derivative value is the same for all x greater than 0, it overcomes the vanishing gradient problem.

Also, its derivative value is 1, which is the maximum value that the derivative of the tanh function could reach. So learning can also be very fast with the ReLU function. Also, note one more thing here.

This function is not entirely linear: it gives 0 as the output for all x less than 0, while it gives a linear output for all x greater than 0. So it can also be called a piecewise linear function, and this actually makes it even better, as it gets the advantages of both linear and nonlinear behaviour. The linearity helps overcome the vanishing gradient problem and makes training faster, and because of the nonlinearity we can still take advantage of, and get the benefits of, using multiple hidden layers in our neural network.
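
Here is a minimal NumPy sketch of ReLU and its derivative (not shown in the video):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x): zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)

def d_relu(x):
    """Derivative: 0 for x < 0 and 1 for x > 0 (at exactly 0 it is undefined;
    in practice it is simply set to 0 or 1 there)."""
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0, 100.0])
print(relu(x))       # [  0.    0.    0.    0.5   3.  100. ]
print(d_relu(x))     # [0. 0. 0. 1. 1. 1.] -> gradient does not shrink for x > 0
```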

Now, there are other variations of ReLU available as well: Leaky ReLU and the exponential linear unit, or ELU. In Leaky ReLU, instead of outputting 0 for negative inputs, we use a very small linear component of x.

It can be given by 0.01 multiplied by x. And in ELU, we define the negative part with an exponential component instead of 0. Both Leaky ReLU and ELU work well and can be used, but nowadays plain ReLU is the most commonly used. Now let's look at another type of activation function, the softmax function.
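
As a rough sketch of these two variants (the ELU coefficient of 1.0 is a common default assumed here; only the 0.01 slope comes from the video):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Negative inputs get a small linear slope instead of a hard zero."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """Negative inputs follow alpha * (e^x - 1) instead of zero."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))   # [-0.05  -0.01   0.     1.     5.   ]
print(elu(x))          # [-0.993 -0.632  0.     1.     5.   ]
```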

We saw that for binary classification we use the sigmoid function at the output neuron. But what if we are solving a multiclass classification problem? What kind of activation function do we use then?

In this case we can't use the sigmoid function, as it is just not suited for this purpose, so we use a different kind of activation function, which we call the softmax function. Let's say we are making predictions for four different classes, and our final output values before the activation function come out to be 7, 5, 1, and 3. I'm just assuming hypothetical data here.

So from this data, it is clear that our winner should be class 1, whose output value is 7. But these output values can be made much more useful by scaling them into probabilities. This can be achieved by calculating the exponent of each value in the list and dividing it by the sum of the exponent values.

And if we do that, we get these values, and the summation of these values is equal to 1. Thus, we have now converted our output values into probabilities, and we simply pick the class with the highest probability.

Now, note one thing here: we could have directly divided the values in the list by their sum, and that could still have been interpreted as a probability. But taking the exponent makes it much better, because the exponential makes the larger values relatively much larger and the smaller values relatively much smaller. Thus, it sharpens the final output prediction and we can be more confident in our final classification.

As you can see, using the exponent the winning output probability is about 0.87, while without the exponent it is only about 0.44. So the equation of the softmax activation function is softmax(z_i) = e^(z_i) / (sum over j of e^(z_j)). This equation is for the i-th neuron, so we calculate it for every neuron in the output layer when doing multiclass classification.
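
Here is a small, numerically stable softmax sketch in NumPy applied to the hypothetical scores above (my own code, not the video's):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())        # subtracting the max avoids overflow
    return e / e.sum()

scores = np.array([7.0, 5.0, 1.0, 3.0])
probs = softmax(scores)
print(probs)                       # ~[0.865 0.117 0.002 0.016]
print(probs.sum())                 # 1.0
print(scores / scores.sum())       # plain normalisation: ~[0.44 0.31 0.06 0.19]
# The exponent sharpens the distribution, so the winning class stands out more.
```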

So far we have discussed activation functions for classification problems. But what if our final output is a continuous value that can take any number, like a house price prediction where the price can range anywhere from 0 to 1 million or more? In this case it is not suitable to use any kind of nonlinear activation function in the output neuron, and only in the output neuron: we still use activation functions in the hidden layers, but we don't use any activation function at the output neuron. Thus, for the output neuron in linear regression we just take the linear output, without an activation function.
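
As a minimal sketch of such a regression network (the architecture and numbers here are my own assumptions, not from the video):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))                  # e.g. 4 house features

W1, b1 = rng.normal(size=(8, 4)), np.zeros((8, 1))
W2, b2 = rng.normal(size=(1, 8)), np.zeros((1, 1))

hidden = np.maximum(0.0, W1 @ x + b1)        # ReLU in the hidden layer
price = W2 @ hidden + b2                     # output neuron: no activation,
print(price)                                 # so it can be any real value
```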

Okay, so let's summarize everything we have learned so far. For binary classification, we use the sigmoid function in the output neuron because its output can be interpreted as a probability. But we don't use the sigmoid function in the hidden layers, because the tanh and ReLU functions generally outperform it there. So for the hidden layers, we use either the tanh function or the ReLU function.

Now, both the tanh and the sigmoid function had one disadvantage, the vanishing gradient problem: for very large or very small values of x the derivative becomes very small, the learning becomes very slow, and this can lead to our model not training effectively. It can be overcome with the help of the ReLU function, which is a piecewise linear function and has the advantages of both linear and nonlinear behaviour. There are other variations of the ReLU function available as well, such as Leaky ReLU and the exponential linear unit. While solving a multiclass classification problem, we use the softmax function in the output layer. And when solving a linear regression problem, we skip using any nonlinear activation function at the output and directly pass z as our output.
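
If you want to see how these choices typically look in code, here is a minimal Keras-style sketch (the video shows no code; TensorFlow/Keras and the layer sizes are my own assumptions):

```python
import tensorflow as tf

# Binary classification: ReLU (or tanh) in the hidden layer, sigmoid output.
binary_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Multiclass classification (say, 4 classes): softmax output.
multiclass_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="tanh"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Linear regression: no activation on the output neuron (linear output).
regression_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
```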

Now, note one more thing here. There are other types of activation functions available as well, but these are the most commonly used. And there is no fixed rule that says we must use either the ReLU or the tanh function in the hidden layers.

It depends on the application we are trying to build, and it also depends on the input data. So if you want, you can try both the tanh and the ReLU activation functions and see which one works better. I hope you found this video valuable, and if you are new to this channel, please subscribe.

Thank you. And don't forget to check out my other videos in the neural network series. You can find their links in the description box or by clicking the i button in the upper right corner. Now, in the next video we will continue our discussion on neural networks and we will derive the equations for backpropagation.

So I'll see you in the next one.