Activation Functions in Neural Networks

Jan 5, 2023 · Junhong Liu · 5 min read

1 Introduction to Activation Functions

There are five commonly used activation functions: Sigmoid, ReLU, Leaky ReLU (LReLU), Randomized Leaky ReLU, and ELU (Tanh is also discussed below). A small NumPy sketch of each is given after the list below.

(Figure: plots of the activation functions listed below.)

  • ReLU (Rectified Linear Unit) is cheap to compute since it involves only a maximum operation. It is continuous but not differentiable everywhere: from its definition one can check that ReLU is not differentiable at zero. ReLU is widely used in modern neural networks due to its simplicity.

  • Sigmoid and Tanh are differentiable. However, both use the exponential function and require more computation (an exponential is more expensive to compute than a maximum). The behaviors of Sigmoid and Tanh are very similar; the difference is that Sigmoid outputs values in [0, 1], while Tanh outputs values in [−1, 1].

  • LReLU (Leaky ReLU) and Randomized LReLU. LReLU is defined as $f(x) = \max(\alpha x, x)$, where $\alpha$ is usually 0.01. If instead $\alpha$ is sampled randomly from a uniform distribution during training, we call it Randomized LReLU. The original paper reported that Randomized LReLU can achieve better results than LReLU, and gave the empirical value $\frac{2}{11}$ for the test-time slope, which works better than 0.01. As for why Randomized LReLU performs better, one explanation is that the random slope in the negative region introduces randomness into the optimization, and this random noise can help the parameters escape local optima and saddle points.

  • ELU, whose expression is $$ f(x) = \begin{cases} x & \text{when } x \geq 0 \\ \alpha (e^x - 1) & \text{when } x < 0, \end{cases} $$ where the hyperparameter $\alpha$ is usually 1.
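
As a reference, here is a minimal NumPy sketch of the activation functions above. The function names and default values are my own choices, not from any particular library:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^-x), outputs in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # outputs in (-1, 1)
    return np.tanh(x)

def relu(x):
    # max(0, x); gradient is 0 for x < 0 and 1 for x > 0
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x) with a small fixed negative slope
    return np.maximum(alpha * x, x)

def randomized_leaky_relu(x, low=3.0, high=8.0, rng=None):
    # Random negative slope during training: the slope is 1/a with a ~ U(low, high);
    # at test time a fixed slope (e.g. 2/11) is used instead.
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(low, high, size=np.shape(x))
    return np.where(x >= 0, x, x / a)

def elu(x, alpha=1.0):
    # x for x >= 0, alpha * (e^x - 1) for x < 0
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))
```

For example, `elu(-3.0)` returns about `-0.95`, already close to the saturation value $-\alpha = -1$.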

2 Vanishing Gradients

2.1 Non-saturating Neurons and Saturating Neurons

Simply put, if the gradient of a neuron is not close to 0 when backpropagation passes through it, so that it can still effectively adjust the neuron's weights, we call it a non-saturating neuron. Otherwise, it is a saturating neuron.

2.2 Vanishing Gradients

Saturated neurons make the problem of vanishing gradients worse. Suppose the input to a Sigmoid neuron is particularly large or small; the corresponding gradient is then very close to 0. Even if the gradient arriving from the previous layer is large, the gradients of the neuron's weight (w) and bias (b) will still approach 0. As a result, these parameters cannot be effectively updated.
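
To make this concrete, here is a small numerical check (my own illustration, not from the original post). The derivative of Sigmoid is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, and it is already tiny for inputs of moderate magnitude:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid: s(x) * (1 - s(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_grad(x):.2e}")
# x =   0.0  sigmoid'(x) = 2.50e-01
# x =   2.0  sigmoid'(x) = 1.05e-01
# x =   5.0  sigmoid'(x) = 6.65e-03
# x =  10.0  sigmoid'(x) = 4.54e-05
```

Any upstream gradient multiplied by a factor like 4.5e-05 is effectively wiped out, which is exactly the vanishing-gradient behavior described above.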

3 Dying ReLU problem

3.1 ReLU solves Vanishing Gradients

The ReLU activation function was proposed to solve the problem of vanishing gradients (LSTMs also mitigate it, but only for RNN models). The gradient of ReLU can only take two values: 0 or 1. When the input is less than 0, the gradient is 0; when the input is greater than 0, the gradient is 1. The advantage is that products of ReLU gradients do not shrink toward 0: the product can only be 0 or 1. If the value is 1, the gradient passes through unchanged during backpropagation; if the value is 0, the gradient stops propagating from that position.
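
A quick way to see the difference (my own illustration, using a toy one-unit-per-layer chain of 20 layers): multiply the per-layer derivative factors along the chain and compare Sigmoid with ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.normal(loc=1.0, scale=1.0, size=20)  # assume mostly positive pre-activations

# Per-layer derivative factors along the chain.
sigmoid = 1.0 / (1.0 + np.exp(-pre_activations))
sigmoid_factors = sigmoid * (1.0 - sigmoid)          # each factor is at most 0.25
relu_factors = (pre_activations > 0).astype(float)   # each factor is exactly 0 or 1

print("product of sigmoid factors:", np.prod(sigmoid_factors))  # far smaller than 1
print("product of ReLU factors:   ", np.prod(relu_factors))     # exactly 0.0 or exactly 1.0
```

The Sigmoid product shrinks because every factor is at most 0.25, whereas the ReLU product either passes the gradient through unchanged or cuts it off entirely, which is the behavior described above.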

3.2 One-side saturation

The Sigmoid function saturates on both sides: its output flattens out in both the positive and the negative direction. The ReLU function saturates on one side only: its output flattens out only in the negative direction. Strictly speaking, saturation means the gradient approaches 0 rather than being exactly 0 (as it is on ReLU's negative side), but the effect is the same.

But what are the benefits of one-sided saturation?

Let’s imagine neurons as switches that each detect a certain feature. High-level neurons are responsible for detecting high-level/abstract features (with richer semantic information), such as eyes or tires, while low-level neurons are responsible for detecting low-level/concrete features, such as curves or edges. When a neuron is activated, it means the corresponding feature has been detected in the input, and the larger the positive value, the more pronounced the feature. Intuitively, though, we use a negative value to represent the absence of a feature and do not care about the magnitude of that negative value. Therefore, it is more convenient and reasonable to use a constant value of 0 to indicate that no feature was detected. An activation with one-sided saturation, like ReLU, meets this requirement.

3.3 Dying ReLU problem

But ReLU also has disadvantages. Although sparsity can improve computational efficiency, it may also hinder training. The input to the activation function usually includes a bias (b). Suppose the bias becomes so negative that the input to the activation function is negative for every sample; then the gradient during backpropagation is always 0, and the corresponding weights and bias can no longer be updated. If the input to the activation function is negative for all samples, the neuron cannot learn anymore, which means the neuron is dead.
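
Here is a minimal sketch of a dead ReLU unit (my own illustration; the single-neuron setup and squared-error loss are assumptions, not from the original post):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))        # 100 samples, 3 features
w = rng.normal(size=3)
b = -50.0                            # bias has drifted very negative

z = x @ w + b                        # pre-activation: negative for every sample
y = relu(z)                          # output is 0 everywhere
t = rng.normal(size=100)             # arbitrary targets

# Gradient of a squared-error loss L = mean((y - t)^2) w.r.t. w:
# dL/dw = mean(2 * (y - t) * relu'(z) * x), and relu'(z) = 1[z > 0] = 0 here.
grad_w = (2 * (y - t) * (z > 0))[:, None] * x
print("all pre-activations negative:", np.all(z < 0))   # True
print("gradient w.r.t. w:", grad_w.mean(axis=0))         # [0. 0. 0.]
```

No matter how long training continues, this neuron receives zero gradient, so it stays dead.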

4 LeakyReLU solves the problem of neuron “death”

LeakyReLU was proposed to solve the problem of neuron “death”. LeakyReLU differs from ReLU only when the input is less than 0: ReLU outputs 0 there, while LeakyReLU outputs a small negative value with a small, non-zero slope. The advantage is that a gradient can still be computed when the input of LeakyReLU is less than 0 (instead of being 0 as with ReLU).
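
Continuing the dead-neuron sketch above, swapping in Leaky ReLU restores a non-zero gradient in the negative region (again my own illustration; $\alpha = 0.01$ is the usual default mentioned earlier):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    # slope is 1 for z >= 0 and alpha for z < 0, never exactly 0
    return np.where(z >= 0, 1.0, alpha)

z = np.array([-50.0, -1.0, 0.5, 3.0])
print(leaky_relu_grad(z))   # 0.01 for negative inputs, 1.0 for non-negative inputs
```

Because the slope in the negative region is small but non-zero, even a neuron whose pre-activations are all negative still receives a gradient and can recover.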

5 ELU (Exponential Linear Unit)

From the above discussion, it can be seen that the ideal activation function should satisfy two conditions:

  • The distribution of the output is zero-centered, which can speed up training.
  • The activation function is one-sided saturated, which leads to better convergence.

LeakyReLU satisfies the first condition but not the second, while ReLU meets the second condition but not the first. ELU, however, satisfies both.
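
As a quick numerical check (my own illustration, assuming standard-normal inputs and $\alpha = 1$), compare the mean output of ReLU and ELU on zero-mean inputs, and note how ELU saturates on the negative side:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                       # zero-mean inputs

relu_out = np.maximum(0.0, x)
elu_out = np.where(x >= 0, x, np.exp(x) - 1.0)     # ELU with alpha = 1

print("mean ReLU output:", relu_out.mean())        # ~0.40: clearly shifted positive
print("mean ELU output: ", elu_out.mean())         # ~0.16: much closer to zero
print("ELU(-10):", np.exp(-10.0) - 1.0)            # ~-1.0: saturates at -alpha on the negative side
```

ELU's negative branch keeps the mean output nearer zero while still flattening out toward $-\alpha$, giving both the zero-centered output and the one-sided saturation listed above.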

6 How to choose an appropriate activation function

  • ReLU is the most popular choice. Although we pointed out some disadvantages of ReLU, many people achieve good results with it. Following Occam’s razor (“entities should not be multiplied unnecessarily”), give priority to the simplest option: compared with other activation functions, ReLU has the lowest computational cost and the simplest implementation.

  • Try out Leaky ReLU sometimes.

  • Try out Tanh, but do not expect much (it is normally used at the last layer).

  • Use Sigmoid only at the last layer (see the sketch after this list).
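
Putting this advice together, here is a minimal PyTorch sketch (my own illustration; the layer sizes and the binary-classification setup are assumptions): ReLU in the hidden layers, Sigmoid only at the output.

```python
import torch
import torch.nn as nn

# A small binary classifier: ReLU in the hidden layers, Sigmoid only at the output.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),            # default hidden-layer choice: cheap and usually works well
    nn.Linear(64, 64),
    nn.ReLU(),            # could be swapped for nn.LeakyReLU(0.01) or nn.ELU() if units die
    nn.Linear(64, 1),
    nn.Sigmoid(),         # output in (0, 1), interpreted as a probability
)

x = torch.randn(8, 20)    # a batch of 8 samples with 20 features
print(model(x).shape)     # torch.Size([8, 1])
```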