Optimisation Algorithms in Neural Computation

Dec 22, 2022 · Junhong Liu · 17 min read

In this post we introduce several more advanced optimization algorithms that are widely used to solve large-scale optimization problems in machine learning and data science. For each algorithm, we discuss both the motivation and the implementation.

1 Optimization with Momentum

1.1 Problems with Gradient Descent

While gradient descent is fundamental and easy to implement, it suffers from some drawbacks. One disadvantage is that it moves very slowly in flat regions, i.e., regions where the function values change slowly. This phenomenon can be illustrated by its update formula.

[Figure: Problems with gradient descent in flat regions]

At initialization, the function value changes quickly, so the magnitude of the gradient is large. According to the update $w^{(t+1)} = w^{(t)} - \eta \nabla C(w^{(t)})$, gradient descent then updates the iterates quickly, in the sense that $\|w^{(t+1)} - w^{(t)}\|_2$ is large. However, if we enter a flat region, the gradient becomes small and the updates become very slow: we get trapped in that flat region. If we are fortunate enough to escape it, we can continue to decrease the function value. However, once we are close to a local optimum we get trapped there, because the gradient at the local optimum is zero (by the first-order optimality condition).

This motivates the following important question: Can we develop a better algorithm to speed up gradient descent?

One answer is to use a velocity (also called momentum).

1.2 Gradient Descent with Momentum

We first introduce some intuition: if we are repeatedly asked to go in the same direction, we should gain some confidence and start taking bigger steps in that direction.

To explain this, assume we want to drive a car to UoB but do not know the way, so we have to ask other people for directions. Suppose we meet a person and she tells us to head north. This is useful information, but we are not certain it is accurate, so we drive north slowly. After 10 minutes we ask another person, who also tells us to head north. Now we are more confident that north is the correct direction, so we can speed up. After another 20 minutes, yet another person says we should continue north. By now we are quite certain and drive even faster.

This process is similar to a ball gaining velocity while rolling down a slope. Imagine a ball initially at rest: gravity drags it downhill, and in the process the ball gains velocity and moves faster and faster.

A key concept in the above example is the velocity, which stores the historical information. Motivated by this phenomenon, we introduce a velocity to track historic gradients.

  • In the beginning, we initialize the velocity $v^{(0)} = 0$. Then we update the velocity as follows $$ v^{(t+1)} = \alpha \cdot v^{(t)} - \eta \nabla C(w^{(t)}) $$ According to this formula, we use both the previous velocity $v^{(t)}$ and the new information $\nabla C(w^{(t)})$ to update the velocity. Since previous velocity stores historic information, the new velocity will store both the historic information and the new information.

  • Here $\alpha \in [0,1)$ is the momentum parameter. We then update the model $$ w^{(t+1)} = w^{(t)} + v^{(t+1)} $$ (a code sketch of these updates follows the note below).

By the update of velocity, we know that velocity is an exponentially decaying moving average of negative gradients $$ \begin{aligned} v^{(1)} &= \alpha \cdot v^{(0)} - \eta \nabla C(w^{(0)}) = - \eta \nabla C(w^{(0)}) \\ v^{(2)} &= \alpha \cdot v^{(1)} - \eta \nabla C(w^{(1)}) = -\alpha \cdot \eta \nabla C(w^{(0)}) - \eta \nabla C(w^{(1)}) \\ & \vdots \\ v^{(t)} &= -\alpha^{t-1} \cdot \eta \nabla C(w^{(0)}) -\alpha^{t-2} \cdot \eta \nabla C(w^{(1)}) - ... - \eta \nabla C(w^{(t-1)}) \end{aligned} $$

It is then clear that velocity accumulates previous gradients. We give larger weights to recent gradients in the velocity. This is reasonable since the recent gradient is more useful to the current update than the old gradients.

Note: There are two hyperparameters in gradient descent with momentum.

  • $\eta$: learning rate
  • $\alpha$: determines the influence of past gradients on the current update (usually set as 0.8, 0.9 or 0.99)
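To make these updates concrete, here is a minimal sketch of gradient descent with momentum in Python. The cost function, learning rate, momentum value and iteration count are illustrative assumptions, not choices from the text.

```python
import numpy as np

# Minimal sketch of gradient descent with momentum.
# grad_C, eta, alpha and the iteration count are illustrative assumptions.

def grad_C(w):
    return 2 * w                      # e.g. gradient of C(w) = ||w||^2

eta, alpha = 0.1, 0.9                 # learning rate and momentum parameter
w = np.array([1.0, -2.0])             # initial point w^(0)
v = np.zeros_like(w)                  # initial velocity v^(0) = 0

for t in range(100):
    v = alpha * v - eta * grad_C(w)   # v^(t+1) = alpha * v^(t) - eta * grad C(w^(t))
    w = w + v                         # w^(t+1) = w^(t) + v^(t+1)
```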

Example 1

One-dimensional Problem: Now we apply gradient descent (GD) and gradient descent with momentum (Momentum) to minimize a one-dimensional problem $C(w) = \frac{1}{20}w^2, w \in \mathbb{R}$. It is clear that the gradient $\nabla C(w) = \frac{1}{10}w$. Suppose we initialize the point $w^{(0)} = -0.8$. The left figure shows the behavior of GD while the right figure shows the behavior of Momentum.

[Figure: GD (left) vs. Momentum (right) on $C(w) = \frac{1}{20}w^2$]

From these figures we can see that GD updates more and more slowly. The underlying reason is that the magnitude of the gradient is $|w|/10$, which becomes smaller and smaller. After 3 updates, GD is still at a point far from the optimum. On the other hand, Momentum moves faster and faster. The underlying reason is that the magnitude of the velocity becomes larger and larger, since the velocity accumulates the previous gradient information. Within 3 updates, Momentum finds a point very close to the global minimum.
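As a rough reproduction of this experiment, the following sketch runs both methods for 3 updates. The learning rate and momentum value are illustrative assumptions; the exact settings behind the figures are not given in the text, so the numbers will differ, but the qualitative behavior (Momentum accelerating while GD slows down) is the same.

```python
# GD vs. GD-with-momentum on C(w) = w^2 / 20, starting from w^(0) = -0.8.
# eta and alpha are illustrative assumptions.

def grad(w):
    return w / 10.0              # gradient of C(w) = w^2 / 20

eta, alpha = 1.0, 0.9
w_gd = w_mom = -0.8              # initial point w^(0)
v = 0.0                          # initial velocity v^(0)

for t in range(3):
    w_gd = w_gd - eta * grad(w_gd)        # plain gradient descent
    v = alpha * v - eta * grad(w_mom)     # momentum: accumulate velocity
    w_mom = w_mom + v
    print(f"t={t+1}: GD w={w_gd:.4f}, Momentum w={w_mom:.4f}")
```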

Example 2

One-dimensional Problem: Recall that GD has the problem of getting trapped in flat regions and at local optima. By using the velocity, Momentum can mitigate these problems. The underlying reason is that even in a flat region with a very small gradient, the velocity can still have a large magnitude. This velocity allows us to escape the flat region and even a local optimum!

1.3 Nesterov Accelerated Gradient (Nesterov Momentum)

A problem with Momentum is that it may frequently overshoot minima! This can be illustrated by the following example, where the minimum is achieved at zero. We start at the point $w^{(0)} = -5$ and move to the right with increasing speed. The figure shows the iterates $w^{(t)}$.

[Figure: Momentum overshooting the minimum]

  • $w^{(14)}$ is the first iterate moving across optimum
  • $w^{(24)}$ is the first iterate moving across optimum after $w^{(14)}$
  • $w^{(34)}$ is the first iterate moving across optimum after $w^{(24)}$
  • $w^{(44)}$ is the first iterate moving across optimum after $w^{(34)}$ and so on

According to the above figure, these iterates oscillate around the global minimum. The reason is that the magnitude of the velocity increases quickly until the iterate goes past the optimum. To avoid this problem, we need to slow down the growth of the velocity.

An algorithm to reduce the oscillation of Momentum is Nesterov Accelerated Gradient (Nesterov Momentum). The intuition is to look ahead before updating.

  • Recall that for Momentum: $v^{(t+1)} = \alpha \cdot v^{(t)} - \eta \nabla C(w^{(t)})$
  • According to Momentum, we are going to move by at least $\alpha \cdot v^{(t)}$ and then a bit more by $-\eta \nabla C(w^{(t)})$. The term $\alpha \cdot v^{(t)}$ is already known, while the term $\nabla C(w^{(t)})$ requires a gradient computation.
  • Since we already know the term $\alpha v^{(t)}$, why not virtually move by $\alpha v^{(t)}$ to get $w^{(ahead)} = w^{(t)} + \alpha v^{(t)}$, and do the gradient computation at the point $w^{(ahead)}$?
  • This has the same computation cost as Momentum. However, $\nabla C(w^{(ahead)})$ may be more precise than $\nabla C(w^{(t)})$, since the point $w^{(ahead)}$ can be a better point than $w^{(t)}$ (note that we make better use of the velocity in getting $w^{(ahead)}$).

This leads to the famous algorithm Nesterov Accelerated Gradient (NAG), which can be described as follows:

  • Build the look-ahead point $$ w^{(ahead)} = w^{(t)} + \alpha v^{(t)} $$

  • Calculate gradient and update velocity $$ v^{(t+1)} = \alpha v^{(t)} - \eta \nabla C(w^{(ahead)}) $$

  • Update the model $$ w^{(t+1)} = w^{(t)} + v^{(t+1)} $$

The only difference between NAG and Momentum is that we choose a different point for the gradient computation. The gradient computation is the most expensive part of gradient-based methods, so we should be careful in choosing the point at which to compute it. The following figure, together with the code sketch after it, visualizes the difference between Momentum and NAG:

[Figure: Momentum vs. NAG]
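To make the difference concrete, here is a minimal sketch of a single update of each method. The gradient function and hyperparameter values are illustrative assumptions.

```python
# One update of Momentum vs. NAG; the only difference is where the gradient
# is evaluated. grad_C, eta and alpha are illustrative assumptions.

def grad_C(w):
    return w / 10.0                                # e.g. gradient of C(w) = w^2 / 20

eta, alpha = 1.0, 0.9

def momentum_step(w, v):
    v_new = alpha * v - eta * grad_C(w)            # gradient at the current point
    return w + v_new, v_new

def nag_step(w, v):
    w_ahead = w + alpha * v                        # look-ahead point
    v_new = alpha * v - eta * grad_C(w_ahead)      # gradient at the look-ahead point
    return w + v_new, v_new
```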

So, how does NAG reduce oscillation? We now give an example showing how NAG reduces the oscillation of Momentum. The left figure shows that GD with momentum overshoots the global minimum at $w^{(3)}$ due to a large velocity. We now consider NAG at this point. The derivative at $w^{(3)}$ is negative, since the function value is decreasing at this point, while the velocity is positive. According to NAG, we first build a look-ahead point. This look-ahead point already overshoots the global minimum, and the derivative there is positive. We then update the velocity (which is positive) by adding the negative of this gradient, and therefore obtain a new velocity with a smaller magnitude. Consequently, we move more slowly from $w^{(3)}$ to $w^{(4)}$. Although $w^{(4)}$ also overshoots the global minimum, the level of overshooting is smaller than with GD with momentum.

[Figure: The principle of NAG]

2 Stochastic Gradient Descent with Momentum

The function we minimize in machine learning often has a sum structure: $C(w) = \frac{1}{n} \sum^{n}_{i=1}C_i(w)$, where $C_i(w)$ corresponds to the loss on the $i$-th training example.

However, GD does not exploit this sum structure. It just uses the gradient of $C$ to update the iterates. This gradient computation requires going through all the training examples and can be expensive if $n$ is large. Nowadays we often encounter large-scale datasets, and in this case GD is no longer efficient. This motivates a key question: can we use the sum structure to develop a more efficient method?

A very famous algorithm for this purpose is stochastic gradient descent (SGD). The key idea is to estimate the true gradient $\nabla C(w^{(t)})$ based on a randomly selected example.

Let us review Stochastic Gradient Descent (Robbins & Monro, 1951)

  • Initialize the weights $w^{(0)}$
  • For $t = 0,1,...,T$:
    – Draw $i_t$ from $\{1, . . . , n\}$ with equal probability
    – Compute the stochastic gradient $\nabla C_{i_t}(w^{(t)})$ and update $w^{(t+1)} = w^{(t)} - \eta_t \nabla C_{i_t}(w^{(t)})$

Note:

  • We call $\nabla C_{i_t}(w^{(t)})$ a stochastic gradient; it is a random variable. Since we only use a single example to build a stochastic gradient, the computation per iteration is $O(1)$ instead of $O(n)$. This greatly relaxes the computational burden.
  • While this stochastic gradient is not the correct gradient, it is correct on average: $$ \mathbb{E}_{i_t}\big[\nabla C_{i_t}(w^{(t)})\big] = \frac{1}{n} \sum^{n}_{i=1} \nabla C_{i}(w^{(t)}) = \nabla C(w^{(t)}) $$ where $i_t$ has the same likelihood of being any number in $\{1, 2, . . . , n\}$. The term on the left-hand side is the expectation of the stochastic gradient. This shows that, on average, the stochastic gradient is a good estimate of the true gradient, which explains why we can use it in the iterate update. Indeed, it has been theoretically proved that SGD can find a minimizer of the problem under some assumptions.
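Below is a minimal SGD sketch in Python. It uses the squared loss $C_i(w) = \frac{1}{2}(w^\mathsf{T}x^{(i)} - y^{(i)})^2$ that appears again in Section 3.1; the synthetic data, learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

# Minimal SGD sketch on a least-squares objective:
# C_i(w) = 0.5 * (w^T x_i - y_i)^2, so grad C_i(w) = (w^T x_i - y_i) * x_i.
# Data, learning rate and iteration count are illustrative assumptions.

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))            # training inputs x^(i)
y = X @ np.array([1.0, -2.0, 0.5])     # labels generated by a ground-truth model

w = np.zeros(d)                        # w^(0)
eta = 0.1                              # learning rate eta_t (kept constant here)

for t in range(200):
    i = rng.integers(n)                # draw i_t uniformly from {1, ..., n}
    g = (X[i] @ w - y[i]) * X[i]       # stochastic gradient grad C_{i_t}(w^(t))
    w = w - eta * g                    # w^(t+1) = w^(t) - eta_t * grad C_{i_t}(w^(t))

print(w)                               # should approach [1.0, -2.0, 0.5]
```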

Algorithm 1: SGD with Momentum
Set initial $w^{(0)} = 0$ and $v^{(0)} = 0$

For $t = 0,1,...,T$

  • $I$ ← random index from $\{1, 2, . . . , n\}$ with equal probability
  • approximate gradient with the selected example $\hat{g}^{(t)} \leftarrow \nabla C_{I}(w^{(t)})$
  • update the velocity $v^{(t+1)} \leftarrow \alpha v^{(t)} - \eta \cdot \hat{g}^{(t)}$
  • update the model $w^{(t+1)} \leftarrow w^{(t)} + v^{(t+1)}$

end
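A minimal sketch of Algorithm 1 in Python. The per-example gradient interface `grad_Ci(i, w)` and the default hyperparameters are assumptions made for illustration.

```python
import random

# Sketch of Algorithm 1 (SGD with momentum). grad_Ci(i, w) is assumed to
# return the gradient of C_i at w; hyperparameters are illustrative.

def sgd_momentum(grad_Ci, n, w0, eta=0.01, alpha=0.9, T=1000):
    w, v = w0, 0.0 * w0                   # w^(0) and v^(0) = 0
    for t in range(T):
        I = random.randrange(n)           # random index with equal probability
        g = grad_Ci(I, w)                 # stochastic gradient \hat{g}^(t)
        v = alpha * v - eta * g           # update the velocity
        w = w + v                         # update the model
    return w
```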


Algorithm 2: SGD with Nesterov Momentum
Set initial $w^{(0)} = 0$ and $v^{(0)} = 0$

For $t = 0,1,...,T$

  • look ahead $w^{(ahead)} \leftarrow w^{(t)} + \alpha v^{(t)}$
  • $I$ ← random index from $\{1, 2, . . . , n\}$ with equal probability
  • approximate gradient with the selected example $\hat{g}^{(ahead)} \leftarrow \nabla C_{I}(w^{(ahead)})$
  • update the velocity $v^{(t+1)} \leftarrow \alpha v^{(t)} - \eta \cdot \hat{g}^{(ahead)}$
  • update the model $w^{(t+1)} \leftarrow w^{(t)} + v^{(t+1)}$

end

3 AdaGrad, RMSProp and Adam

The learning rate is an important hyperparameter that has a huge influence on the performance of the algorithm. In the previous part, the step size was not adaptive to the dataset, meaning that we did not adjust the learning rate according to the data. In this part, we introduce adaptive gradient methods. We mainly discuss three: AdaGrad, RMSProp and Adam.

3.1 AdaGrad

Before introducing AdaGrad, we first give some motivation. Suppose we consider a binary classification problem on a dataset with three features. The details of the dataset are as follows

|           | $y$ | $x_1$ | $x_2$ | $x_3$ |
|-----------|-----|-------|-------|-------|
| $x^{(1)}$ | 1   | 1     | 0     | 0     |
| $x^{(2)}$ | -1  | 0.5   | 0     | 1     |
| $x^{(3)}$ | 1   | -0.5  | 1     | 0     |
| $x^{(4)}$ | -1  | 0     | 0     | 0     |
| $x^{(5)}$ | 1   | 0.5   | 0     | 0     |
| $x^{(6)}$ | -1  | 1     | 0     | 0     |
| $x^{(7)}$ | 1   | -1    | 1     | 0     |
| $x^{(8)}$ | -1  | -0.5  | 0     | 1     |

Here the features have different properties (we use the sub-index to differentiate features and the super-index to differentiate examples).

  • The feature $x_1$ is a dense feature and is irrelevant to the prediction of the label $y$. By irrelevant we mean that when $x_1$ takes the same value, the label $y$ can have different values. For example, we have $x_1 = 0.5$ for the second example and the corresponding label is −1, while we have $x_1 = 0.5$ in the fifth example and the corresponding label is 1.
  • The feature $x_2$ is a sparse feature and is predictive. By sparse we mean that most examples have 0 in this feature. By predictive we mean that whenever $x_2 = 1$, the corresponding label is always 1.
  • In a similar way, one can show that $x_3$ is a sparse and predictive feature: whenever $x_3 = 1$, the corresponding label is always −1.

Suppose we apply SGD to this problem. To understand the behavior we first compute the gradient.

  • The loss of a model w on the i-th example is $C_i(w) = \frac{1}{2}(w^\mathsf{T}x^{(i)} - y^{(i)})^2$
  • The gradient becomes $\nabla C_i(w) = (w^\mathsf{T}x^{(i)} - y^{(i)})x^{(i)}$

where we define $\alpha^{(i)} := w^\mathsf{T}x^{(i)} - y^{(i)} \in \mathbb{R}$, so that $\nabla C_i(w) = \alpha^{(i)} x^{(i)}$. According to this, the gradient is simply $x^{(i)}$ multiplied by a real number. Stacking the per-example gradients for the dataset above gives

$$ \left( \begin{matrix} (\nabla C_1(w))^\mathsf{T}\\(\nabla C_2(w))^\mathsf{T}\\\vdots\\(\nabla C_n(w))^\mathsf{T} \end{matrix} \right) = \left( \begin{matrix} \alpha^{(1)}(x^{(1)})^\mathsf{T}\\\alpha^{(2)}(x^{(2)})^\mathsf{T}\\\vdots\\\alpha^{(n)}(x^{(n)})^\mathsf{T} \end{matrix} \right) = \left( \begin{matrix} \alpha^{(1)} & 0 & 0 \\ 0.5\alpha^{(2)} & 0 & \alpha^{(2)} \\ -0.5\alpha^{(3)} & \alpha^{(3)} & 0 \\ 0 & 0 & 0 \\ 0.5\alpha^{(5)} & 0 & 0 \\ \alpha^{(6)} & 0 & 0 \\ -\alpha^{(7)} & \alpha^{(7)} & 0 \\ -0.5\alpha^{(8)} & 0 & \alpha^{(8)} \end{matrix} \right) $$

A key observation is that: if $k$-th feature is sparse, then $k$-th coordinate of gradient is sparse!

We now use the above observation to analyze the performance of SGD. Notice that sparse features mean sparse coordinates of the gradients. Consider the second feature $x_2$. This is a sparse feature, which means most examples have 0 in this feature. Therefore, in most iterations SGD will select an example with 0 in this feature, which means SGD will not update the second coordinate $w_2$, i.e., $w^{(t+1)}_2 = w^{(t)}_2$. For example, if $i_t = 6$, we have $$ w^{(t+1)} = w^{(t)} - \eta_t \left( \begin{matrix} \alpha^{(6)} \\ 0 \\ 0 \end{matrix} \right) $$
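This sparsity pattern can be checked numerically. The sketch below builds the per-example gradients for the toy dataset above; the weight vector `w` is an arbitrary illustrative choice.

```python
import numpy as np

# Per-example gradients grad C_i(w) = (w^T x_i - y_i) * x_i for the toy dataset.
# The weight vector w is an arbitrary illustrative choice.

X = np.array([[ 1.0, 0, 0],
              [ 0.5, 0, 1],
              [-0.5, 1, 0],
              [ 0.0, 0, 0],
              [ 0.5, 0, 0],
              [ 1.0, 0, 0],
              [-1.0, 1, 0],
              [-0.5, 0, 1]])
y = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)

w = np.array([0.1, 0.2, -0.3])          # arbitrary model
alphas = X @ w - y                      # alpha^(i) = w^T x^(i) - y^(i)
grads = alphas[:, None] * X             # row i is grad C_i(w) = alpha^(i) x^(i)

# count how many per-example gradients are nonzero in each coordinate
print((grads != 0).sum(axis=0))         # [7, 2, 2]: only the first coordinate is dense
```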

SGD therefore makes only scarce updates on the second coordinate. However, we know that $x_2$ is a predictive feature, so we would like effective updates on the corresponding coordinate. We summarize this behavior as follows:

  • SGD will update frequently along dense features.
  • SGD will scarcely visit examples with non-zero values on sparse features.
  • SGD will update slowly along sparse features.
  • This is misleading if dense features are irrelevant and sparse features are relevant

We want to slow down the updates on dense features and move fast on sparse features! This can be achieved by AdaGrad (Adaptive Gradient Algorithm). The idea is to introduce the accumulated gradient norm square as follows $$ \hat g^{(t)} \leftarrow \nabla C_{i_t}(w^{(t)}) $$ $$ r^{(t+1)} \leftarrow r^{(t)} + \hat g^{(t)} \odot \hat g^{(t)} \Leftrightarrow \left( \begin{matrix} r_1^{(t+1)} \\ r_2^{(t+1)} \\ \vdots \end{matrix} \right) \leftarrow \left( \begin{matrix} r_1^{(t)} + (\hat g_1^{(t)})^2 \\ r_2^{(t)} + (\hat g_2^{(t)})^2 \\ \vdots \end{matrix} \right) $$ $$ w^{(t+1)} \leftarrow w^{(t)} - \frac{\eta}{\delta + \sqrt{r^{(t+1)}}} \odot \hat g^{(t)} \Leftrightarrow \left( \begin{matrix} w_1^{(t+1)} \\ w_2^{(t+1)} \\ \vdots \end{matrix} \right) \leftarrow \left( \begin{matrix} w_1^{(t)} - \frac{\eta \, \hat g_1^{(t)}}{\delta + \sqrt{r_1^{(t+1)}}} \\ w_2^{(t)} - \frac{\eta \, \hat g_2^{(t)}}{\delta + \sqrt{r_2^{(t+1)}}} \\ \vdots \end{matrix} \right) $$

where $\delta$ is a hyperparameter (often chosen as $10^{−6}$). We introduce $\delta$ to make sure that the denominator is not zero. The step size along the $i$-th feature is then $\eta / (\delta + \sqrt{r_i^{(t+1)}})$.

  • Since $r_i^{(t+1)}$ is different for different $i$, we have different learning rates along different features.
  • The learning rate decays in proportion to the update history: we give more decay to a feature if there have already been many updates on that feature. Frequent updates mean that the accumulated gradient norm square is large, which means the step size is small; this achieves the effect of slowing down the update. Conversely, if a feature is updated scarcely, then the corresponding accumulated gradient norm square is small, which means the step size on that feature is large; this achieves the effect of speeding up the update.

Algorithm 3: AdaGrad
Set initial $w^{(0)} = 0$ and $r^{(0)} = (0,...,0)^\mathsf{T}$

For $t = 0,1,...,T$

  • $I$ ← random index from $\{1, 2, . . . , n\}$ with equal probability
  • approximate gradient with the selected example $\hat{g}^{(t)} \leftarrow \nabla C_{I}(w^{(t)})$
  • update the accumulated gradient norm square $r^{(t+1)} \leftarrow r^{(t)} + \hat g^{(t)} \odot \hat g^{(t)}$
  • update the model $w^{(t+1)} \leftarrow w^{(t)} - \frac{\eta}{\delta + \sqrt{r^{(t+1)}}} \odot \hat g^{(t)}$

end
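A minimal Python sketch of Algorithm 3. As before, the per-example gradient interface `grad_Ci(i, w)` and the default hyperparameters are illustrative assumptions.

```python
import random
import numpy as np

# Sketch of Algorithm 3 (AdaGrad). grad_Ci(i, w) is assumed to return the
# gradient of C_i at w as a numpy array; hyperparameters are illustrative.

def adagrad(grad_Ci, n, w0, eta=0.1, delta=1e-6, T=1000):
    w = np.array(w0, dtype=float)
    r = np.zeros_like(w)                        # accumulated gradient norm square r^(0)
    for t in range(T):
        I = random.randrange(n)                 # random index with equal probability
        g = grad_Ci(I, w)                       # stochastic gradient \hat{g}^(t)
        r = r + g * g                           # r^(t+1) = r^(t) + g ⊙ g
        w = w - eta / (delta + np.sqrt(r)) * g  # coordinate-wise learning rates
    return w
```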

Note: AdaGrad Adjusts Learning Rate in Different Features. The update on the $i$-th feature is: $$ r_i^{(t+1)} \leftarrow r_i^{(t)} + (\hat g_i^{(t)})^2 $$

  • If $i$-th feature is sparse, then $r_i^{(t)}$ would be small (most $\hat g_i^{(t)}$ are 0).
  • If $i$-th feature is dense, then $r_i^{(t)}$ would be large.
  • By dividing the learning rate by $\sqrt{r^{(t+1)}}$, AdaGrad increases the learning rate for sparse features and decreases it for dense features $$ w^{(t+1)} \leftarrow w^{(t)} - \frac{\eta}{\delta + \sqrt{r^{(t+1)}}} \odot \hat g^{(t)} $$

3.2 Root Mean Square Propagation (RMSProp)

RMSProp is a simple variant of AdaGrad. The intuition is as follows

  • AdaGrad decays the learning rate very aggressively $$ r^{(t+1)} \leftarrow r^{(t)} + \hat g^{(t)} \odot \hat g^{(t)} $$

  • Since $r^{(t+1)}$ continues to increase, the learning rate becomes smaller and smaller. After a while, even the dense features receive only small updates. This slows down the optimization process.

  • RMSProp is introduced to prevent rapid growth of denominator. The key idea is to update the accumulated gradient norm square as follows $$ r^{(t+1)} \leftarrow \beta \cdot r^{(t)} + (1-\beta) \cdot \hat g^{(t)} \odot \hat g^{(t)} $$

where $\beta$ is a hyperparameter (a typical choice is $\beta$ = 0.9). In this way, $r^{(t+1)}$ grows much more slowly and we can still have effective updates after many iterations.


Algorithm 4: RMSProp
Set initial $w^{(0)} = 0$ and $r^{(0)} = (0,...,0)^\mathsf{T}$

For $t = 0,1,...,T$

  • $I$ ← random index from $\{1, 2, . . . , n\}$ with equal probability
  • approximate gradient with the selected example $\hat{g}^{(t)} \leftarrow \nabla C_{I}(w^{(t)})$
  • update the accumulated gradient norm square $r^{(t+1)} \leftarrow \beta \cdot r^{(t)} + (1-\beta) \cdot \hat g^{(t)} \odot \hat g^{(t)}$
  • update the model $w^{(t+1)} \leftarrow w^{(t)} - \frac{\eta}{\delta + \sqrt{r^{(t+1)}}} \odot \hat g^{(t)}$

end
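A minimal Python sketch of Algorithm 4; it differs from the AdaGrad sketch above only in how $r$ is updated. The gradient interface and hyperparameters are again illustrative assumptions.

```python
import random
import numpy as np

# Sketch of Algorithm 4 (RMSProp). grad_Ci(i, w) is assumed to return the
# gradient of C_i at w; hyperparameters are illustrative.

def rmsprop(grad_Ci, n, w0, eta=0.001, beta=0.9, delta=1e-6, T=1000):
    w = np.array(w0, dtype=float)
    r = np.zeros_like(w)
    for t in range(T):
        I = random.randrange(n)
        g = grad_Ci(I, w)
        r = beta * r + (1 - beta) * g * g        # exponentially decaying average
        w = w - eta / (delta + np.sqrt(r)) * g
    return w
```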

3.3 Adam

Finally, we introduce Adam, one of the most popular optimization algorithms for training deep neural networks. Notice that momentum introduces a velocity to keep historical information about the gradients, while AdaGrad introduces an accumulated gradient norm square to give different learning rates to different features. This motivates the question:

Can we combine the idea of both algorithms to develop an efficient algorithm?

This leads to Adam implemented as follows


Algorithm 5: Adam
Set initial $w^{(0)} = 0$ and $r^{(0)} = 0, s^{(0)} = 0$

For $t = 0,1,...,T$

  • $I$ ← random index from $\{1, 2, . . . , n\}$ with equal probability

  • approximate gradient with the selected example $\hat{g}^{(t)} \leftarrow \nabla C_{I}(w^{(t)})$

  • update the momentum $s^{(t+1)} \leftarrow \beta_1 \cdot s^{(t)} + (1-\beta_1) \cdot \hat g^{(t)}$

  • update the accumulated gradient norm square $r^{(t+1)} \leftarrow \beta_2 \cdot r^{(t)} + (1-\beta_2) \cdot \hat g^{(t)} \odot \hat g^{(t)}$

  • bias correction $\hat s^{(t+1)} \leftarrow s^{(t+1)}/(1-\beta_1^{t+1})$, $\hat r^{(t+1)} \leftarrow r^{(t+1)}/(1-\beta_2^{t+1})$

  • update the model $w^{(t+1)} \leftarrow w^{(t)} - \frac{\eta}{\delta + \sqrt{\hat r^{(t+1)}}} \odot \hat s^{(t+1)}$

end
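A minimal Python sketch of Algorithm 5, using the typical hyperparameter choices listed below. The gradient interface `grad_Ci(i, w)` is an illustrative assumption.

```python
import random
import numpy as np

# Sketch of Algorithm 5 (Adam). grad_Ci(i, w) is assumed to return the
# gradient of C_i at w; the defaults are the typical choices from the text.

def adam(grad_Ci, n, w0, eta=0.001, beta1=0.9, beta2=0.999, delta=1e-8, T=1000):
    w = np.array(w0, dtype=float)
    s = np.zeros_like(w)                              # momentum s^(0)
    r = np.zeros_like(w)                              # accumulated gradient norm square r^(0)
    for t in range(T):
        I = random.randrange(n)                       # random index with equal probability
        g = grad_Ci(I, w)                             # stochastic gradient
        s = beta1 * s + (1 - beta1) * g               # update the momentum
        r = beta2 * r + (1 - beta2) * g * g           # update the accumulated gradient norm square
        s_hat = s / (1 - beta1 ** (t + 1))            # bias correction
        r_hat = r / (1 - beta2 ** (t + 1))
        w = w - eta / (delta + np.sqrt(r_hat)) * s_hat
    return w
```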


Adam uses the idea of momentum to memorize previous gradients, and uses the idea of RMSProp to give different updates along different features. Therefore, we can expect it to inherit the advantages of both algorithms. Notice that there is a bias-correction step; its interpretation is somewhat tricky and we do not give an explanation here. If you are interested, you can check the link for details. There are four hyperparameters in Adam. Here are some typical choices: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\delta = 10^{-8}$.