Pre-processing in Neural Networks

Jan 7, 2023 · Junhong Liu · 4 min read

1 Normalization (归一化)

The formula of normalization is $$ x^* = \frac{x-x_{min}}{x_{max}-x_{min}} $$

where $x$ is the raw data (input), $x_{min}$ and $x_{max}$ are the minimum and maximum values over all data respectively, and $x^*$ is the normalized value. According to this formula, the raw data $x$ is mapped into [0,1], which removes the effect of magnitude on the final result.
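As a concrete illustration, here is a minimal NumPy sketch of this formula (the helper name `min_max_normalize` and the sample values are my own, not from any particular library):

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization: x* = (x - x_min) / (x_max - x_min), mapping values into [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min)

raw = np.array([2.0, 10.0, 50.0, 100.0])
print(min_max_normalize(raw))  # -> 0, 0.0816..., 0.4898..., 1
```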

2 Standardization (标准化)

The formula of standardization is $$ x^* = \frac{x-\mu}{\sigma} $$

where $x$ is the raw data (input), $\mu$ is the mean, $\sigma$ is the standard deviation, and $x^*$ is the standardized value. Thus, the standardized data has a mean of 0 and a standard deviation of 1.
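A matching sketch for standardization, again with a hypothetical helper name:

```python
import numpy as np

def standardize(x):
    """Standardization: x* = (x - mu) / sigma."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

raw = np.array([2.0, 10.0, 50.0, 100.0])
z = standardize(raw)
print(z.mean(), z.std())  # ~0.0 and 1.0
```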

Note:

  • If you want variables of all dimensions to be treated equally, choose standardization.
  • If you want to retain the relative weight relationships among values, or the data does not follow a normal distribution, choose normalization.

3 Zero-centered (零均值化)

The formula of zero-centering is $$ Data(x^*) = Data(x) - mean(Data) $$

Zero-centering is usually applied to image data (note: the mean should be computed over the whole training set).
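A minimal sketch of this, assuming image batches stored as NumPy arrays of shape (N, H, W, C) and one mean per channel (whether the mean is per pixel or per channel is a design choice the text does not fix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training/test batches of RGB images, shape (N, H, W, C), values in [0, 255].
train = rng.integers(0, 256, size=(100, 32, 32, 3)).astype(float)
test = rng.integers(0, 256, size=(20, 32, 32, 3)).astype(float)

# The mean is computed on the training set only (here one mean per channel) ...
channel_mean = train.mean(axis=(0, 1, 2))        # shape (3,)

# ... and the same mean is subtracted from every split.
train_centered = train - channel_mean
test_centered = test - channel_mean
print(train_centered.mean(axis=(0, 1, 2)))       # ~[0. 0. 0.]
```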

*(Figure: zero-centering)*

Normally, zero-centering avoids the “Z-type update” problem described below, which speeds up the convergence of the neural network.

3.1 Sigmoid function

The function is $$ Sigmoid(x) = \frac{1}{1+e^{-x}} $$

*(Figure: the Sigmoid function)*

The value range of the Sigmoid function is (0,1).

If we do not zero-center, the input data $\mathbf{X}$ of the first hidden layer is all positive (because raw pixel values lie in [0,255]). Taking a neuron in the first hidden layer as an example, when backpropagation reaches this neuron, the chain rule gives $$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial f} \cdot \frac{\partial f}{\partial w} $$

where $L$ is the loss and $f$ is the Sigmoid activation of the neuron; $\frac{\partial L}{\partial f}$ is the gradient passed down from the upper layer. For every weight $w$ of this neuron, $\frac{\partial L}{\partial f}$ is the same, so the sign of $\frac{\partial L}{\partial w}$ depends on the sign of $\frac{\partial f}{\partial w}$.

Before calculating $\frac{\partial f}{\partial w}$, we first calculate $Sigmoid(x)'$ as follows $$ \begin{aligned} Sigmoid(x)' &= (\frac{1}{1+e^{-x}})' \\ &= \frac{0 \cdot (1+e^{-x}) - 1 \cdot (-e^{-x})}{(1+e^{-x})^2} \\ &= \frac{e^{-x}}{(1+e^{-x})^2} \\ &= \frac{1 + e^{-x} - 1}{(1+e^{-x})^2} \\ &= \frac{1}{1+e^{-x}} \cdot (1 - \frac{1}{1+e^{-x}}) \\ &= Sigmoid(x) \cdot (1 - Sigmoid(x)) \end{aligned} $$
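A quick numerical sanity check of this identity, using a central finite difference (a minimal NumPy sketch, not part of the derivation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(x) * (1.0 - sigmoid(x))

# Central finite difference as an independent estimate of the derivative.
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)

print(np.max(np.abs(analytic - numeric)))  # tiny (~1e-10 or smaller): the identity holds
```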

Now, we know $$ \frac{\partial f}{\partial w} = \frac{\partial Sigmoid(wx + b)}{\partial w} = Sigmoid(wx + b) \cdot (1 - Sigmoid(wx + b)) \cdot x $$

Meanwhile, $Sigmoid(wx + b) \in (0,1)$ and $(1 - Sigmoid(wx + b)) \in (0,1)$, so the sign of $\frac{\partial f}{\partial w}$ depends only on $x$. Since every component of $x$ is positive, all weights $w$ of the neuron are updated in the same direction, which causes the “Z-type update” phenomenon.
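The sign argument can be checked with a small sketch: one hypothetical Sigmoid neuron, an all-positive input, and an assumed scalar upstream gradient $\frac{\partial L}{\partial f}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# All-positive input (e.g. pixel values scaled to [0, 1]) and arbitrary weights.
x = rng.uniform(0.0, 1.0, size=5)
w = rng.normal(size=5)
b = 0.1

f = sigmoid(w @ x + b)
dL_df = -0.7                       # assumed scalar gradient from the upper layer
dL_dw = dL_df * f * (1.0 - f) * x  # chain rule from the text

print(np.sign(dL_dw))              # every component has the same sign
```

Flipping the sign of `dL_df` flips all components of the gradient together; within one update step the weights can never move in opposite directions, which is exactly the “Z-type update” behaviour.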

*(Figure: the “Z-type update” path)*

Take a two-dimensional example with two weights $w_1$ and $w_2$. Because the gradients of $w_1$ and $w_2$ have the same sign, they are either both positive or both negative. When the sign is positive, the update moves toward the first quadrant; when it is negative, the update moves toward the third quadrant. If the optimal solution lies in the fourth quadrant, the actual training path is much longer than the shortest path, which slows down the network's convergence.

However, the Sigmoid function itself is not zero-centered: the outputs of the first layer all lie in (0,1), so the inputs to the next layer are again all positive. Therefore, zero-centering the input solves the “Z-type update” problem only in the first layer and cannot help in the later layers.

3.2 Tanh function

The function is $$ Tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$

*(Figure: the Tanh function)*

The value range of the Tanh function is (-1,1). Obviously, the Tanh function is zero-centered. The derivative $Tanh(x)'$ is obtained as follows $$ \begin{aligned} Tanh(x)' &= (\frac{e^x - e^{-x}}{e^x + e^{-x}})' \\ &= \frac{(e^x - (-e^{-x})) \cdot (e^x + e^{-x}) - (e^x - e^{-x}) \cdot (e^x - e^{-x})}{(e^x + e^{-x})^2} \\ &= \frac{(e^x + e^{-x})^2 - (e^x - e^{-x})^2}{(e^x + e^{-x})^2} \\ &= 1 - (\frac{e^x - e^{-x}}{e^x + e^{-x}})^2 \\ &= 1 - Tanh(x)^2 \end{aligned} $$
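The same finite-difference check can be applied to this identity (again just a sketch):

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 13)
analytic = 1.0 - np.tanh(x) ** 2

# Central finite difference check of Tanh(x)' = 1 - Tanh(x)^2.
h = 1e-5
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2.0 * h)

print(np.max(np.abs(analytic - numeric)))  # ~1e-10 or smaller
```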

So that $$ \frac{\partial f}{\partial w} = (1 - Tanh(wx+b)^2) \cdot x $$

Since $Tanh(wx+b) \in (-1,1)$, we have $Tanh(wx+b)^2 \in [0,1)$ and hence $1 - Tanh(wx+b)^2 \in (0,1]$. Thus, the sign of $\frac{\partial f}{\partial w}$ again depends on $x$ in the first layer.

Therefore, for the Tanh function, zero-centering the input solves the “Z-type update” problem in the first layer. Meanwhile, because Tanh outputs are themselves zero-centered, the later layers do not suffer from the “Z-type update” either.

3.3 ReLU function

The function is $$ ReLU(x) = \max(0, x) $$

*(Figure: the ReLU function)*

The derivative $ReLU(x)'$ is as follows $$ ReLU(x)' = \begin{cases} 1,& x \geq 0 \\ 0,& x < 0. \end{cases} $$

so $$ \frac{\partial f}{\partial w} = \begin{cases} x,& wx+b \geq 0 \\ 0,& wx+b < 0. \end{cases} $$
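This piecewise gradient can be written directly as a small sketch (hypothetical helper, following the convention $ReLU(x)' = 1$ at $x = 0$ used above):

```python
import numpy as np

def relu_grad_w(w, x, b):
    """d ReLU(w·x + b) / dw: equals x when the pre-activation is non-negative, else 0."""
    pre = w @ x + b
    return x if pre >= 0 else np.zeros_like(x)

x = np.array([0.3, 0.8, 0.5])      # all-positive input (e.g. scaled pixels)
w = np.array([0.2, -0.4, 0.1])
print(relu_grad_w(w, x, b=0.05))   # either x itself or all zeros
```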

Obviously, the ReLU function is not zero-centered: its outputs are all non-negative. Hence, zero-centering the input solves the “Z-type update” problem in the first layer, but cannot help in the later layers.