Neural Computation Fundamentals

Jan 9, 2023 · Junhong Liu · 47 min read

1 Linear Regression

Let us start with an example. Suppose we want to predict the commute time to UoB. The data are shown below:

  • Distance (km): [2.7, 4.1, 1.0, 5.2, 2.8]
  • Day (1: if weekday, 0: if weekend): [1, 1, 0, 1, 0]
  • Commute time (min): [25, 33, 15, 45, 22]

We aim to find a function $f: R^2 \rightarrow R$ such that $f(D, T_d) \approx T_c$, where $D$ is Distance, $T_d$ is Day, $T_c$ is Commute time.

1.1 Linear Model

First, we agree that all vectors are column vectors by default. A linear regression model then has the following form:

$$ f(x) = w_0 + w_1x_1 + \dots + w_dx_d $$

We set $$ (w_0, w_1, w_2, \dots , w_d) = W^\mathsf{T}, \left(\begin{matrix}x_0=1\\x_1\\\vdots\\x_d\end{matrix}\right) = X $$

then we have

$$ f(x) = W^\mathsf{T}X $$

In our example, $d=2$: $x_1$ and $x_2$ correspond to the entries of $D$ and $T_d$, and $w_0$ is a bias term.

1.2 Performance Measure

Once we have initialized a linear regression model, we need a method to measure its performance.

  • The residual on the $i$-th data point can be defined as:

    $$ e^{(i)} = y^{(i)} − W^\mathsf{T}x^{(i)} $$

    which measures the difference between the true output $y^{(i)}$ and the predicted output $W^\mathsf{T}x^{(i)}$ of the model $W$ on the $i$-th example.

  • The following mean squared error (MSE) $C(W)$ takes into account the errors on all $n$ data points:

    $$ C\left(W\right)= \frac{1}{2n}\sum_{i=1}^{n}\left(y^{(i)} − W^\mathsf{T}x^{(i)}\right)^2 $$

    According to the first-order necessary optimality condition (which I will explain later), a minimizer of $C(W)$ satisfies $C'(W)=0$.


FOR EXAMPLE:

When $d = 1$,

$$ \begin{aligned} C(W) &= \frac{1}{2n}\sum_{i=1}^{n}\left(y^{(i)} − Wx^{(i)}\right)^2 \\ &= \frac{1}{2n}\sum_{i=1}^{n}\left( (x^{(i)})^2W^2 - 2y^{(i)}x^{(i)}W + (y^{(i)})^2\right) \end{aligned} $$

where

  • $(x^{(i)})^2W^2$ is quadratic in $W$
  • $-2y^{(i)}x^{(i)}W$ is linear in $W$
  • $(y^{(i)})^2$ is constant in $W$

It then follows that $$ \begin{aligned} C'(W) &= \frac{1}{2n}\sum_{i=1}^{n}\left( 2(x^{(i)})^2W - 2y^{(i)}x^{(i)}\right) \\ &= \frac{1}{n}\sum_{i=1}^{n}\left((x^{(i)})^2W\right) - \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)}x^{(i)}\right) \end{aligned} $$

According to the first-order optimality condition, we know the optimal $W^*$ satisfies $C'(W^*)=0$ , that is

$$ \frac{1}{n}\sum_{i=1}^{n}\left((x^{(i)})^2W^*\right) = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)}x^{(i)}\right) $$

which gives the solution:

$$ W^* = \frac{\sum_{i=1}^{n}\left(y^{(i)}x^{(i)}\right)}{\sum_{i=1}^{n}(x^{(i)})^2} $$
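As a quick numerical check, here is the $d = 1$ closed-form solution applied to the commute data, regressing commute time on distance alone (using only the distance feature, with no bias term, is an assumption made for this illustration):

% Closed-form solution for d = 1 (no bias term), commute data.
x = [2.7 4.1 1.0 5.2 2.8]'; % distance (km)
y = [25 33 15 45 22]';      % commute time (min)
w_star = sum(y.*x) / sum(x.^2); % W* = sum(y x) / sum(x^2)
y_hat  = w_star .* x;           % predictions of the one-parameter model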

FOR EXAMPLE:

For general $d \geq 1$ (with $n$ examples),

$$ C(W)= \frac{1}{2n}\sum_{i=1}^{n}\left({X^{(i)}}^\mathsf{T}W - y^{(i)}\right)^2 $$

where

$$ X^{(i)} = \left(\begin{matrix}x_0^{(i)}=1\\...\\x_j^{(i)}\\...\\x_d^{(i)}\end{matrix}\right) $$

$X$ is an $n×(d+1)$ matrix (including the constant column $x_0 = 1$): the $i$-th row represents the $i$-th sample and the $j$-th column represents the $j$-th feature. That is

$$ X = \left(\begin{matrix}{X^{(1)}}^\mathsf{T}\\...\\{X^{(n)}}^\mathsf{T}\end{matrix}\right), Y = \left(\begin{matrix}y^{(1)}\\...\\y^{(n)}\end{matrix}\right) $$

then we have

$$ XW-Y = \left(\begin{matrix}{X^{(1)}}^\mathsf{T}\\...\\{X^{(n)}}^\mathsf{T}\end{matrix}\right)W - \left(\begin{matrix}y^{(1)}\\...\\y^{(n)}\end{matrix}\right) = \left(\begin{matrix}{X^{(1)}}^\mathsf{T}W-y^{(1)}\\...\\{X^{(n)}}^\mathsf{T}W-y^{(n)}\end{matrix}\right) $$

It follows that

$$ \begin{aligned} &(XW-Y)^\mathsf{T}(XW-Y) \\ =& \left(\begin{matrix}{X^{(1)}}^\mathsf{T}W-y^{(1)},&...&,{X^{(n)}}^\mathsf{T}W-y^{(n)}\end{matrix}\right) \left(\begin{matrix}{X^{(1)}}^\mathsf{T}W-y^{(1)}\\...\\{X^{(n)}}^\mathsf{T}W-y^{(n)}\end{matrix}\right) \\ =& \sum_{i=1}^{n}\left({X^{(i)}}^\mathsf{T}W - y^{(i)}\right)^2 \\ =& 2nC(W) \end{aligned} $$

Now we have proved the mathematical relationship between the loss function $C(W)$ and the residual $(XW-Y)$, which can be used to speed up calculations.

$$ \begin{aligned} C(W) &= \frac{1}{2n}(XW-Y)^\mathsf{T}(XW-Y) \\ &= \frac{1}{2n}(W^\mathsf{T}X^\mathsf{T}-Y^\mathsf{T})(XW-Y) \\ &= \frac{1}{2n}(W^\mathsf{T}X^\mathsf{T}XW-W^\mathsf{T}X^\mathsf{T}Y-Y^\mathsf{T}XW+Y^\mathsf{T}Y) \\ &= \frac{1}{2n}(W^\mathsf{T}X^\mathsf{T}XW-2W^\mathsf{T}X^\mathsf{T}Y+Y^\mathsf{T}Y) \end{aligned} $$

Formula Derivation Tips:
Because $W^\mathsf{T}X^\mathsf{T}Y$ is a real number, and $W^\mathsf{T}X^\mathsf{T} = (XW)^\mathsf{T}$, we can get $W^\mathsf{T}X^\mathsf{T}Y = (W^\mathsf{T}X^\mathsf{T}Y)^\mathsf{T} = Y^\mathsf{T}XW$.

Meanwhile

  • $W^\mathsf{T}X^\mathsf{T}XW$ is quadratic
  • $2W^\mathsf{T}X^\mathsf{T}Y$ is linear
  • $Y^\mathsf{T}Y$ is constant

We can get

$$ C'(W) =\frac{1}{2n}(2X^\mathsf{T}XW-2X^\mathsf{T}Y) $$

By the first-order necessary condition, we know that optimal $W^∗$ satisfies $C'(W^*)=0$ , thus

$$ X^\mathsf{T}XW^* = X^\mathsf{T}Y $$

If $X^\mathsf{T}X$ is invertible, we get $$ \begin{cases} W^* = (X^\mathsf{T}X)^{-1}X^\mathsf{T}Y \\ Y^* = XW^* = X(X^\mathsf{T}X)^{-1}X^\mathsf{T}Y \end{cases} $$

1.3 Algorithm Implementation

Now, let us return to our example.

  • Distance(Km): [2.7, 4.1, 1.0, 5.2, 2.8]
  • Day(1: if weekday, 0: if weekend): [1, 1, 0, 1, 0]
  • Commute time(min): [25 33 15 45 22]

The code:

x0 = [1 1 1 1 1];
x1 = [2.7 4.1 1.0 5.2 2.8];
x2 = [1 1 0 1 0];
x = [x0;x1;x2]';

y = [25 33 15 45 22]';

% Linear regression: plot the training data
plot3(x1,x2,y,"*-")

% Algorithm: closed-form solution W* = (X'X)^(-1) X'Y
w = (x'*x)\(x'*y); % backslash solves the normal equations

y1 = w(1)+w(2).*x1+w(3).*x2;
hold on
plot3(x1,x2,y1,"o-")
hold off
legend("training data","prediction")
title("Linear Regression Examples")

The results: $$ W^* = \left(\begin{matrix}6.08613445378149\\6.53361344537816\\2.11274509803922\end{matrix}\right), Y^* = \left(\begin{matrix}25.8396358543418\\34.9866946778712\\12.6197478991597\\42.1736694677872\\24.3802521008403\end{matrix}\right), Y = \left(\begin{matrix}25\\33\\15\\45\\22\end{matrix}\right) $$

(Figure: linear regression example)

2 Polynomial Regression

A basic nonlinear model is a low-degree polynomial, which leads to polynomial regression problems with the function of the following form:

$$ f(x) = w_0 + w_1x + w_2x^2 + \dots + w_Mx^M $$

where $M\in \mathbb{N}$ is the degree of the polynomial.

In this way, we transform a univariate nonlinear problem to a multivariate linear problem.

$$ f(x) = w_0 + w_1x + w_2x^2 + \dots + w_Mx^M = W^\mathsf{T}\phi(x) $$

where

$$ \phi(x) = \left(\begin{matrix}1\\x\\x^2\\\vdots\\x^M\end{matrix}\right) $$

and we know

$$ X = \left(\begin{matrix}1&(x^{(1)})^1&(x^{(1)})^2&...&(x^{(1)})^M\\1&(x^{(2)})^1&(x^{(2)})^2&...&(x^{(2)})^M\\\vdots&\vdots&\vdots&\ddots&\vdots\\1&(x^{(n)})^1&(x^{(n)})^2&...&(x^{(n)})^M\end{matrix}\right) $$

so we have the solution: $$ \begin{cases} W^* = (X^\mathsf{T}X)^{-1}X^\mathsf{T}Y \\ Y^* = XW^* = X(X^\mathsf{T}X)^{-1}X^\mathsf{T}Y \end{cases} $$
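To make this concrete, here is a minimal MATLAB sketch of polynomial regression; the degree $M = 2$ and the toy data are assumptions for illustration only:

% Polynomial regression via the linear least-squares solution.
M = 2;                      % assumed degree
x = [0.0 0.5 1.0 1.5 2.0]'; % toy inputs (assumed)
y = [1.1 1.8 3.2 5.9 9.8]'; % toy outputs (assumed)
X = x.^(0:M);               % n-by-(M+1) design matrix [1, x, x^2]
w = (X'*X)\(X'*y);          % W* = (X'X)^(-1) X'Y
y_star = X*w;               % fitted values Y* = XW*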

3 Regularization

Ideally, we want a model that behaves well on training examples while not being too complex, so that it generalizes well (neither underfitting nor overfitting). One way to achieve this is to encourage the weights to be small (so that no feature has too much influence on the prediction). This is called regularization.

3.1 Regularized Least Square Regression

Given a dataset $S = \lbrace (x^{(1)},y^{(1)}), \dots ,(x^{(n)},y^{(n)}) \rbrace$ and a regularization parameter $\lambda > 0$, find a model that minimizes:

$$ C(W)= \frac{1}{2n}\sum_{i=1}^{n}\left(y^{(i)} - W^\mathsf{T}X^{(i)} \right)^2 + \frac{\lambda}{2}||W||_2^2 $$

where $||W||_2$ is the 2-norm.


p-norm:
For a vector $v = (v_1, . . . , v_d)\in R^d$, the p-norm (the 2-norm is the Euclidean norm) is defined as:

$$ ||v||_p = {(|v_1|^p + |v_2|^p + ... + |v_d|^p)}^{\frac{1}{p}} $$
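As a small sanity check, the definition can be compared against MATLAB's built-in norm function:

% p-norm by the definition above, checked against the built-in norm.
v = [3 -4 12];
p = 2;
pnorm = sum(abs(v).^p)^(1/p); % gives 13 for this vector
builtin_norm = norm(v, p);    % same value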

The solution of the regularized least squares regression is

$$ W^* = {(\frac{1}{n}X^\mathsf{T}X + \lambda\mathbb{I})}^{-1}(\frac{1}{n}X^\mathsf{T}Y) $$

where $\mathbb{I}\in R^{(d+1) \times (d+1)}$ is the identity matrix (matching the dimension of $X^\mathsf{T}X$).

PROOF:
According to our previous analysis, we know

$$ C(W) = \frac{1}{2n}(XW-Y)^\mathsf{T}(XW-Y) + \frac{\lambda}{2}||W||_2^2 $$

Because the gradient of the squared 2-norm is $$ \nabla_W ||W||_2^2 = \nabla_W \left(w_1^2 + w_2^2 + ... + w_d^2\right) = \left(\begin{matrix}2w_1\\2w_2\\\vdots\\2w_d\end{matrix}\right) = 2W $$

We have

$$ C'(W) =\frac{1}{2n}(2X^\mathsf{T}XW-2X^\mathsf{T}Y) + \lambda W $$

According to the first-order optimality condition, we know $C'(w^∗ ) = 0$,
so $$ \frac{1}{n}(X^\mathsf{T}XW^*-X^\mathsf{T}Y) + \lambda W^* = 0 $$

we can get $$ (\frac{1}{n}X^\mathsf{T}X + \lambda \mathbb{I})W^* = \frac{1}{n}X^\mathsf{T}Y $$

It then follows that

$$ W^* = (\frac{1}{n}X^\mathsf{T}X + \lambda \mathbb{I})^{-1}(\frac{1}{n}X^\mathsf{T}Y) $$
  • A nice property is that we do not need to assume that the matrix $X^\mathsf{T}X$ is invertible: for $\lambda > 0$, the matrix $\frac{1}{n}X^\mathsf{T}X + \lambda\mathbb{I}$ is always invertible, so we can always find the above solution for the regularized least squares regression problem.

  • If $\lambda = 0$, this becomes the solution of the least squares regression problem. If $\lambda = \infty$, we get $W^∗ = 0$, which is a trivial solution. We need to choose an appropriate $\lambda$.
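The closed-form solution above translates directly into MATLAB. The sketch below applies it to the commute data from Section 1; the choice $\lambda = 0.1$ is arbitrary and made only for illustration:

% Regularized least squares on the commute data (lambda assumed).
x0 = [1 1 1 1 1];
x1 = [2.7 4.1 1.0 5.2 2.8];
x2 = [1 1 0 1 0];
X  = [x0;x1;x2]';
Y  = [25 33 15 45 22]';

n = length(Y);
lambda = 0.1;                       % arbitrary illustrative value
I = eye(size(X,2));
w = (X'*X/n + lambda*I) \ (X'*Y/n); % W* = (X'X/n + lambda*I)^(-1)(X'Y/n)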

4 Gradient Descent (GD)

Gradient descent is a general algorithm for minimizing an objective function $C(W)$ (first proposed by Cauchy in 1847). It is an iterative algorithm: starting from $W^{(0)}$, it produces a new iterate $W^{(t+1)}$ at each iteration as

$$ W^{(t+1)} = W^{(t)} - \eta_t\nabla C(W^{(t)}),\ (t = 0,1,2,\dots) $$

where $\eta_t > 0$ is the learning rate or step size. GD therefore uses the negative gradient as a search direction and moves along this direction with step size $\eta_t$.

PROOF:
By the first-order Taylor expansion of $C$ around $W$, we have

$$ C(V) \approx C(W) + \nabla C(W)^\mathsf{T}(V-W) $$

We take

$$ V = W - \eta \nabla C(W), (\eta \rightarrow 0) $$

So $$ \begin{aligned} C(V) &\approx C(W) + \nabla C(W)^\mathsf{T}(W - \eta \nabla C(W) - W) \\ &\approx C(W) + \nabla C(W)^\mathsf{T}(-\eta \nabla C(W)) \\ &\approx C(W) -\eta \nabla C(W)^\mathsf{T}\nabla C(W) < C(W) \end{aligned} $$

where

$$ \nabla C(W)^\mathsf{T}\nabla C(W) = ||\nabla C(W)||^2_2 > 0 $$

4.1 Gradient Descent for Linear Regression

  • For least square regression: $$ C(W) = \frac{1}{2n}(W^\mathsf{T}X^\mathsf{T}XW-2W^\mathsf{T}X^\mathsf{T}Y+Y^\mathsf{T}Y) $$ we can get $$ \nabla C(W) = \frac{1}{n}(X^\mathsf{T}XW - X^\mathsf{T}Y) $$

  • GD updates $W^{(t)}$ by $$ \begin{aligned} W^{(t+1)} &= W^{(t)} - \eta_t\nabla C(W^{(t)}) \\ &= W^{(t)} - \frac{\eta}{n}(X^\mathsf{T}XW^{(t)} - X^\mathsf{T}Y) \end{aligned} $$

4.1.1 Algorithm Implementation

Let us return to previous example.

  • Distance (km): [2.7, 4.1, 1.0, 5.2, 2.8]
  • Day (1: if weekday, 0: if weekend): [1, 1, 0, 1, 0]
  • Commute time (min): [25, 33, 15, 45, 22]

We run GD with $W^{(0)} = (0, 0, 0)^\mathsf{T}$ and $\eta = 0.02$, stopping when every coordinate of $W$ changes by less than $0.02$ in one iteration. Then we get

x0 = [1 1 1 1 1];
x1 = [2.7 4.1 1.0 5.2 2.8];
x2 = [1 1 0 1 0];
x = [x0;x1;x2]';

a = 0.02; % step size
b = 0.02; % stopping tolerance

y = [25 33 15 45 22]';


% Linear regression: plot the training data
plot3(x1,x2,y,"*-")

% Gradient descent algorithm
w_last = [0,0,0]';
w_next = w_last - a/length(y).*(x'*x*w_last - x'*y);
while (sum(abs(w_last-w_next)>=b))
    w_last = w_next;
    w_next = w_last - a/length(y).*(x'*x*w_last - x'*y);
end

y1 = w_next(1)+w_next(2).*x1+w_next(3).*x2;
hold on
plot3(x1,x2,y1,"o-")
hold off
legend("training data","prediction")
title("Linear Regression Examples")
$$ W^* = \left(\begin{matrix}2.23437521245057\\7.61927519021201\\1.51466332589316\end{matrix}\right), Y^* = \left(\begin{matrix}24.3210815519162\\34.9880668182130\\9.85365040266258\\43.3692695274462\\23.5683457450442\end{matrix}\right), Y = \left(\begin{matrix}25\\33\\15\\45\\22\end{matrix}\right) $$

4.2 Stochastic Gradient Descent

As we know,

$$ \nabla C(W^{(t)}) = \frac{1}{n}\sum_{i=1}^{n}\nabla C_i(W^{(t)}) $$

where $C_i(W^{(t)})$ is the cost by using the model $W^{(t)}$ to do prediction on the $i$-th example.

However, gradient computation can be expensive if $n$ is large, since it requires to go through all training examples to compute the gradient. So, people have developed a variant of gradient descent called Stochastic Gradient Descent (SGD).

$$ W^{(t+1)} = W^{(t)} - \eta_t\nabla C_i(W^{(t)}), (\eta_t = C/\sqrt{t}) $$

where the index $i$ is chosen randomly and uniformly from all examples at every iteration, so $\nabla C_i(W^{(t)})$ changes from iteration to iteration. Besides, $\eta_t$ should be an appropriately decreasing learning rate sequence, since the stochastic gradient is a noisy estimate of the true gradient. A typical choice is $\eta_t = C/\sqrt{t}$, where $C$ is a parameter tuned to the problem and $t$ is the iteration number. That is, we use smaller and smaller learning rates as we run more and more iterations.

Notice: $\nabla C_i(W^{(t)})$ is not the true gradient; it comes from just one of the examples. No one can guarantee that SGD behaves as smoothly as GD (it may not decrease the objective function at every iteration). However, since every example is equally likely to be chosen, the stochastic gradient on average equals $\frac{1}{n}\sum_{i=1}^{n}\nabla C_i(W^{(t)}) = \nabla C(W^{(t)})$. Then, in the long run, we expect SGD to solve the optimization problem.

  • If we want a solution with a high accuracy and the sample size is small, then GD is a better choice. If we want a solution with a moderate accuracy and the problem size is large, then SGD is a better choice.
  • The computation cost of SGD is independent of the sample size and therefore SGD is especially efficient for large-scale problems.
  • GD is smooth since negative gradient is a direction where function values decrease. SGD is not stable since stochastic gradient is a noisy estimate of the true gradient.

4.3 Minibatch SGD

Minibatch SGD is a natural extension of SGD: instead of choosing a single example to compute a stochastic gradient, it uses several examples.

$$ W^{(t+1)} = W^{(t)} - \frac{\eta_t}{b}\sum_{i\in B}^{}\nabla C_i(W^{(t)}) $$

where $b$ is the batch size (the number of examples used per update) and $B$ is a batch of indices selected randomly from all examples, with $\mathsf{length}(B) = b$.

e.g. if $b = 2$ and $B = \lbrace2, 6\rbrace$ then

$$ W^{(t+1)} = W^{(t)} - \frac{\eta_t}{2}\left(\nabla C_2(W^{(t)}) + \nabla C_6(W^{(t)})\right) $$

It is clear that if $b = 1$, minibatch SGD reduces to SGD.

Notice: There are two choices of random sampling:

  • Sampling with replacement: we sample $b$ indices one by one. After sampling an index $j$, we put it back into the bag, so the index $j$ can be selected again in the future (it is possible that $B = \lbrace 2, 6, 2\rbrace$).
  • Sampling without replacement: we sample $b$ indices one by one. After sampling an index $j$, we do not put it back into the bag, so the index $j$ cannot be selected again (it is not possible that $B = \lbrace 2, 6, 2\rbrace$). Both schemes are sketched in code after this list.
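Here is a minimal sketch of the two sampling schemes in MATLAB (the values of $n$ and $b$ are assumptions for illustration):

% Two ways to draw a batch B of b indices from n examples.
n = 5; b = 2;
B_with = randi(n, [1, b]); % with replacement: repeats are possible
perm = randperm(n);
B_without = perm(1:b);     % without replacement: no repeats
% (randperm(n, b) does the same in a single call)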

4.3.1 Algorithm Implementation

We set $C = 0.06$ and $b = 2$, so $\eta_t/b = 0.06/(2\sqrt{t})$. Besides, when every coordinate of the difference between the current $W_{next}$ and the previous $W_{last}$ is smaller than $e = 0.02$, we stop iterating.

%Minibatch SGD

%Data
x0 = [1 1 1 1 1];
x1 = [2.7 4.1 1.0 5.2 2.8];
x2 = [1 1 0 1 0];
x = [x0;x1;x2]';

y = [25 33 15 45 22]';

plot3(x1,x2,y,"*-")

%Parameter
C = 0.06; % constant C in the learning rate eta_t = C/sqrt(t)
b = 2; % batch size
e = 0.02; % stopping tolerance

%Algorithm
w_last = [0,0,0]';

t = 1; % iteration counter
random = randi(length(x0),[1,b]); % sample b indices with replacement
x_t = x(random,:);
y_t = y(random);
w_next = w_last - C/(b*(t)^(0.5)).*(x_t'*x_t*w_last - x_t'*y_t);

while (sum(abs(w_last-w_next)>=e))
    w_last = w_next;
    t = t+1;
    random = randi(length(x0),[1,b]);
    x_t = x(random,:);
    y_t = y(random);
    w_next = w_last - C/(b*(t)^(0.5)).*(x_t'*x_t*w_last - x_t'*y_t);
end

y1 = w_next(1)+w_next(2).*x1+w_next(3).*x2;
hold on
plot3(x1,x2,y1,"o-")
hold off
legend("training data","prediction")
title("Minibatch SGD Examples")
$$ W^* = \left(\begin{matrix}2.01625066095198\\7.70762264986848\\1.35625066095198\end{matrix}\right), Y^* = \left(\begin{matrix}24.1830824765489\\34.9737541863647\\9.72387331082046\\43.4521391012201\\23.5975940805837\end{matrix}\right), Y = \left(\begin{matrix}25\\33\\15\\45\\22\end{matrix}\right) $$

(Figure: minibatch SGD example)

4.4 Linear Models for Classification

Suppose we have $$ S = \lbrace(x^{(1)}, y^{(1)}), . . . ,(x^{(n)}, y^{(n)})\rbrace,\ y^{(i)} \in \lbrace−1, +1\rbrace. $$

The performance of the model $W$ on $(X, y)$ can be measured by the 0-1 loss: $$ L(\alpha,\beta)\ =\ \mathbb{I}[\alpha \neq \beta] = \begin{cases} 1,& \mathsf{if}\ \alpha \neq \beta \\ 0,& otherwise. \end{cases} $$

Then the behavior of a model on $S$ can be measured by

$$ C(W) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}[\mathsf{sgn}(W^\mathsf{T}x^{(i)}) \neq y^{(i)}] $$

Margin - The margin of a model $W$ on an example $(x, y)$ is defined as $yW^\mathsf{T}x$.

Margin-based loss - We consider the loss function of the form:

$$ L(\hat{y},y) = g(\hat{y}y) $$

where $\hat{y} = W^\mathsf{T}x$ is the predicted output.

Notice: we replace the non-continuous indicator function by a decreasing function $g$. If $g$ has good properties, such as differentiability, then we get an easier optimization problem to solve.


There are several choices of loss function $g$.

$$ g(t) = max\{0, 1 − t\} $$ $$ g(t) = \frac{1}{2}(max\{0, 1 − t\})^2 $$ $$ g(t) = log(1+exp(-t)) $$

In this case, we consider $$ g(t) = \frac{1}{2}(max\{0, 1 − t\})^2 = \begin{cases} 0,& t \geq 1 \\ \frac{1}{2}(1-t)^2,& otherwise. \end{cases} $$

This leads to the following loss function $$ L(\hat{y},y) = \frac{1}{2}(max\{0, 1 − \hat{y}y\})^2 = \frac{1}{2}(max\{0, 1 − yW^\mathsf{T}x\})^2 $$

So, we have

$$ C(W) = \frac{1}{2n}\sum_{i=1}^{n}(max\{0, 1 − y^{(i)}W^\mathsf{T}x^{(i)}\})^2 $$

where

$$ \begin{aligned} C_i(W) &= \frac{1}{2}(max\{0, 1 − y^{(i)}W^\mathsf{T}x^{(i)}\})^2 \\ &= \begin{cases} 0,& \mathsf{if}\ y^{(i)}W^\mathsf{T}x^{(i)} \geq 1 \\ \frac{1}{2}(1 − y^{(i)}W^\mathsf{T}x^{(i)})^2,& otherwise. \end{cases} \end{aligned} $$

Then, we can check the gradient of $C_i$ as follows $$ \nabla C_i(W) = \begin{cases} 0,& \mathsf{if}\ y^{(i)}W^\mathsf{T}x^{(i)} \geq 1 \\ -(1 − y^{(i)}W^\mathsf{T}x^{(i)})y^{(i)}x^{(i)},& otherwise. \end{cases} $$

Since $(y^{(i)})^2 = 1$, we further get $$ \nabla C_i(W) = \begin{cases} 0,& \mathsf{if}\ y^{(i)}W^\mathsf{T}x^{(i)} \geq 1 \\ (W^\mathsf{T}x^{(i)}-y^{(i)})x^{(i)},& otherwise. \end{cases} $$

The first-order optimality condition implies: $$ \frac{1}{n}\sum_{i=1}^{n}\nabla C_i(W^*) = 0 $$

4.5 SGD for Linear Classification Models

Now, we apply SGD to solve the optimization problem in linear classification. Suppose we consider the following loss function for linear classification. $$ C_i(W) = \begin{cases} 0,& \mathsf{if}\ y^{(i)}W^\mathsf{T}X^{(i)} \geq 1 \\ \frac{1}{2}(1 − y^{(i)}W^\mathsf{T}X^{(i)})^2,& otherwise. \end{cases} $$

The $t$-th update (with randomly chosen index $i_t$) is $$ \begin{aligned} W^{(t+1)} &= W^{(t)} - \eta_t \nabla C_{i_t}(W^{(t)}) \\ &= \begin{cases} W^{(t)},& \mathsf{if}\ y^{(i_t)}{W^{(t)}}^\mathsf{T}X^{(i_t)} \geq 1 \\ W^{(t)} + \eta_t(1-y^{(i_t)}{W^{(t)}}^\mathsf{T}X^{(i_t)})y^{(i_t)}X^{(i_t)},& otherwise. \end{cases} \end{aligned} $$

4.5.1 Minibatch SGD Algorithm

Attention:
Pay attention to the difference between `*` and `.*`, and to the order of matrix multiplication.

% Sepal length (cm)
Sepal.Length=[5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5, 5, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5, 5.5, 4.9, 4.4, 5.1, 7, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5, 5.9, 6, 6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7, 6, 5.7, 5.5, 5.5, 5.8, 6, 5.4, 6, 6.7, 6.3, 5.6, 5.5];
% Sepal width (cm)
Sepal.Width=[3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3, 3, 4, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3, 3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.6, 3, 3.4, 3.2, 3.2, 3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2, 3, 2.2, 2.9, 2.9, 3.1, 3, 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3, 2.8, 3, 2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3, 3.4, 3.1, 2.3, 3, 2.5];
% Petal length (cm)
Petal.Length=[1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1, 1.7, 1.9, 1.6, 1.6, 1.5, 1.4, 1.6, 1.6, 1.5, 1.5, 1.4, 1.5, 1.2, 1.3, 1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 4, 4.6, 4.5, 4.7, 3.3, 4.6, 3.9, 3.5, 4.2, 4, 4.7, 3.6, 4.4, 4.5, 4.1, 4.5, 3.9, 4.8, 4, 4.9, 4.7, 4.3, 4.4, 4.8, 5, 4.5, 3.5, 3.8, 3.7, 3.9, 5.1, 4.5, 4.5, 4.7, 4.4, 4.1, 4];
% Petal width (cm)
Petal.Width = [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.4, 0.2, 0.5, 0.2, 0.2, 0.4, 0.2, 0.2, 0.2, 0.2, 0.4, 0.1, 0.2, 0.2, 0.2, 0.2, 0.1, 0.2, 0.2, 1.4, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6, 1, 1.3, 1.4, 1, 1.5, 1, 1.4, 1.3, 1.4, 1.5, 1, 1.5, 1.1, 1.8, 1.3, 1.5, 1.2, 1.3, 1.4, 1.4, 1.7, 1.5, 1, 1.1, 1, 1.2, 1.6, 1.5, 1.6, 1.5, 1.3, 1.3, 1.3];
% Species (setosa = 1; versicolor = 2)
Species = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2];
Species = Species.*2-3; % map labels {1,2} to {-1,+1}
X = [Sepal.Length; Sepal.Width; Petal.Length; Petal.Width];

% Algorithm implementation
SizeSample = length(Species); % number of examples
nsample = 3; % batch size (number of samples drawn per update)
C = 0.02; % step-size constant in eta = C/sqrt(t)
t = 1; % update counter
eta = C/((t)^(0.5));
w_last = ones(4,1);
nupdate = 200; % number of updates

random = randi(SizeSample,[1,nsample]);
X_t = X(:,random);
Y_t = Species(random);
IndexSample = (Y_t.*(w_last'*X_t) < 1);
X_t = X_t.*IndexSample;
Y_t = Y_t.*IndexSample;
w_next = w_last + ((eta/nsample).*(1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';


while (t < nupdate)
    w_last = w_next;
    t = t+1;
    eta = C/((t)^(0.5));
    random = randi(SizeSample,[1,nsample]);
    X_t = X(:,random);
    Y_t = Species(random);
    IndexSample = (Y_t.*(w_last'*X_t) < 1);
    X_t = X_t.*IndexSample;
    Y_t = Y_t.*IndexSample;
    w_next = w_last + ((eta/nsample).*(1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';
end

yy = w_next'*X;

If you want to use SGD, just set nsample = 1.

5 Perceptron and Neural Network

5.1 Perceptron

We first consider binary classification problems. In this case, we have a dataset of $n$ input/output pairs $$ S = \lbrace(x^{(1)}, y^{(1)}), . . . ,(x^{(n)}, y^{(n)})\rbrace,\ y^{(i)} \in \lbrace−1, +1\rbrace. $$

  • Training examples with $y = +1$ are called positive examples.
  • Training examples with $y = −1$ are called negative examples.

We set $$ \mathsf{sgn}(x) = \begin{cases} 1,& \mathsf{if}\ x>0 \\ 0,& \mathsf{if}\ x=0 \\ -1,& \mathsf{if}\ x<0. \end{cases} $$

Then, given a testing example $x$, we predict the output as

$$ \mathsf{sgn}(W^\mathsf{T}X+b) $$

An example is misclassified if $yW^\mathsf{T}X \le 0$, which means $y \neq \mathsf{sgn}(W^\mathsf{T}X)$.


The following table shows an efficient Perceptron algorithm to build a perceptron.

| Algorithm Steps | Commentary |
| --- | --- |
| Initialize $W = 0$ | we assume no bias for simplicity |
| While not all training examples are correctly classified | |
| $\quad$ For $(X,y) \in S$ | loop over each (feature, label) pair in the dataset |
| $\quad\quad$ If $yW^\mathsf{T}X \le 0$ | if the pair $(X,y)$ is misclassified |
| $\quad\quad\quad$ Do $W = W + yX$ | update the weight vector $W$ |

PROOF: $$ \begin{aligned} yW_{new}^\mathsf{T}X &= y(W_{old} + yX)^\mathsf{T}X \\ &= yW_{old}^\mathsf{T}X + y^2X^\mathsf{T}X \\ &= yW_{old}^\mathsf{T}X + ||X||_2^2 \end{aligned} $$

Since $||X||_2^2 > 0$,

$$ yW_{new}^\mathsf{T}X > yW_{old}^\mathsf{T}X $$

so each update increases the margin on the misclassified example; once $yW_{new}^\mathsf{T}X > 0$, the example is classified correctly.
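A minimal sketch of the Perceptron loop, assuming a toy linearly separable dataset (the data below are not from the text):

% Perceptron training loop (no bias), following the table above.
X = [1 2; 2 1; -1 -2; -2 -1]'; % each column is one example (assumed data)
Y = [1 1 -1 -1];
w = zeros(2,1);
misclassified = true;
while misclassified
    misclassified = false;
    for i = 1:length(Y)
        if Y(i)*(w'*X(:,i)) <= 0 % (X,y) is misclassified
            w = w + Y(i)*X(:,i); % update W = W + yX
            misclassified = true;
        end
    end
end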

5.2 Single-Layer Neural Network

A limitation of Perceptron is that it only constructs a linear classifier. Indeed, it classifies an example according to whether the example is on the one-side of the hyperplane or the other.

Here are some further drawbacks of perceptrons:

  • The Perceptron algorithm can stop only if the data are linearly separable; it does not stop if the data are linearly non-separable (there is no hyperplane that can perfectly separate positive examples from negative examples)

  • Weights $W$ are adjusted for misclassified data only (correctly classified data are not considered at all)

  • Multiple perceptrons can form complex (piecewise-linear) decision boundaries, but they are hard to train. For example, with two perceptrons we can partition the space into four regions and separate the positive examples from the negative examples.


In this subsection, we consider a method to improve perceptrons.

The idea is to replace the sgn function with a differentiable non-linear function. A popular choice is the sigmoid function $$ \sigma(x) = \frac{1}{1+e^{-x}} $$

which maps $(-\infty,+\infty) \to (0,1)$ and satisfies $\sigma'(x) = \sigma(x)(1-\sigma(x))$.

PROOF: $$ \begin{aligned} \sigma(x) &= \frac{1}{1+e^{-x}} \\ \sigma'(x) &= \frac{-(-e^{-x})}{(1+e^{-x})^2} \\ \sigma'(x) &= \frac{1}{1+e^{-x}} \bullet \frac{e^{-x}+1-1}{1+e^{-x}} \\ \sigma'(x) &= \frac{1}{1+e^{-x}} (1-\frac{1}{1+e^{-x}}) \\ \sigma'(x) &= \sigma(x)(1-\sigma(x)) \end{aligned} $$
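The identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$ can also be verified numerically with a central finite difference:

% Numerical check of sigma'(x) = sigma(x)(1 - sigma(x)).
sigma = @(x) 1./(1 + exp(-x));
x = 0.7; h = 1e-6;
fd    = (sigma(x+h) - sigma(x-h)) / (2*h); % central finite difference
exact = sigma(x).*(1 - sigma(x));          % closed-form derivative
% fd and exact agree to roughly 1e-10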

The only difference of single-layer neural network from perceptron is that it uses a highly nonlinear sigmoid function to replace the sign function in the perceptron.

The single-layer neural network outputs a continuous real value, which means it can be applied to regression problems. Furthermore, the sign function is not continuous, while the sigmoid function is differentiable.

Thus, we set:

  • Dataset: $n$ input/output pairs $S =\lbrace(x^{(1)}, y^{(1)}), . . . ,(x^{(n)}, y^{(n)})\rbrace$

  • Mean-Square Error:

    $$ C(W) = \frac{1}{2n}\sum_{i=1}^{n}(\sigma(W^\mathsf{T}X^{(i)}+b)-y^{(i)})^2 $$
  • Unlike linear regression problems, we do not have a closed-form solution for the minimizer of $C(W)$, but we can use gradient descent to obtain a small $C(W)$: $$ W = W - \eta\nabla C(W) \to w_i = w_i - \eta\frac{\partial C(W)}{\partial w_i} $$

Now, we need to find a way to compute gradient.


5.2.1 Chain Rule and Gradient Descent for Single-Layer Neural Network

5.2.1.1 Univariate Chain Rule

The composition of two functions $f$ and $g$ is defined by $$ (f \circ g)(x) = f(g(x)) $$

For example: $$ \mathsf{if} \begin{cases} f(x) = x^3 \\ g(x) = \mathsf{sin} x \end{cases} \\ f \circ g(x) = f(g(x)) = f(\mathsf{sin}x) = \mathsf{sin}^3 (x) \\ g \circ f(x) = g(f(x)) = g(x^3) = \mathsf{sin}(x^3). $$

One Dimensional Chain Rule:
$$ \frac{d}{dx}f(g(x)) = f'(g(x)) * g'(x) $$

In words, the derivative of $f(g(x))$ is the derivative of the outside function $f$ evaluated at the inside function $g$, times the derivative of the inside function.

Univariate Chain Rule:
Let us consider the following function $C$ ($\sigma$ is the sigmoid function) $$ C = \frac{1}{2}(\sigma(wx+b)-y)^2 $$

where $w,b \in \mathbb{R}$

We set $$ C = \frac{1}{2}p^2 \\ p = \sigma(q)-y \\ q = wx+b $$

Then, we have: $$ \begin{aligned} \frac{\partial C}{\partial w} &= \frac{\partial C}{\partial p}\frac{\partial p}{\partial w} \\ &= (\frac{1}{2}p^2)'\frac{\partial p}{\partial q}\frac{\partial q}{\partial w} \\ &= p\sigma'(q)x \\ &= (\sigma(q)-y)\sigma'(q)x \\ &= (\sigma(wx+b)-y)\sigma'(wx+b)x \end{aligned} $$

and

$$ \begin{aligned} \frac{\partial C}{\partial b} &= \frac{\partial C}{\partial p}\frac{\partial p}{\partial b} \\ &= (\frac{1}{2}p^2)'\frac{\partial p}{\partial q}\frac{\partial q}{\partial b} \\ &= p\sigma'(q) \\ &= (\sigma(q)-y)\sigma'(q) \\ &= (\sigma(wx+b)-y)\sigma'(wx+b) \end{aligned} $$

5.2.1.2 Multivariate Chain Rule

Now we consider the multivariate chain rule. Suppose we have a function $f(x, y)$ and functions $x(t)$ and $y(t)$. Then we can use the multivariate chain rule to compute the derivative of $f$ w.r.t. $t$, that is $$ \frac{d}{dt}f(x(t),y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt} $$

For Example: $$ f(x,y) = y + e^{xy} \\ x(t) = \cos t \\ y(t) = t^2 $$

We have $$ \begin{aligned} \frac{df}{dt} &= \frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt} \\ &= (ye^{xy})*(-\sin t)+(1+xe^{xy})*2t \end{aligned} $$

Multivariate Chain Rule:

$$ \frac{\partial f}{\partial x_i} = \sum^{m}_{j=1}\frac{\partial f}{\partial u_j} \bullet \frac{\partial u_j}{\partial x_i} $$

According to the above calculation, we define: $$ C_i(W) = \frac{1}{2}(\sigma(W^\mathsf{T}x^{(i)}+b)-y^{(i)})^2; \\ C(W) = \frac{1}{n}\sum^{n}_{i=1}C_i(W) $$

Then, we have $$ \frac{\partial C_i}{\partial w_j} = (\sigma(W^\mathsf{T}x^{(i)}+b)-y^{(i)})\sigma'(W^\mathsf{T}x^{(i)}+b)x^{(i)}_j $$ $$ \frac{\partial C_i}{\partial W} = (\sigma(W^\mathsf{T}x^{(i)}+b)-y^{(i)})\sigma'(W^\mathsf{T}x^{(i)}+b)x^{(i)} $$ $$ \frac{\partial C_i}{\partial b} = (\sigma(W^\mathsf{T}x^{(i)}+b)-y^{(i)})\sigma'(W^\mathsf{T}x^{(i)}+b) $$

We can plug the above gradient computation into the framework of gradient descent and get the following algorithm for training single-layer neural networks:

| Algorithm Steps | Commentary |
| --- | --- |
| Initialize $W^{(1)} = 0, b^{(1)} = 0$ | start from zero weights and bias |
| for $t = 1, 2, \dots , T$ | $T$ is the number of iterations |
| $\quad$ Update the model: $W^{(t+1)} = W^{(t)} - \eta_t \nabla W,\ b^{(t+1)} = b^{(t)} - \eta_t \nabla b$ | $\nabla W = \frac{1}{n}\sum^{n}_{i=1}\frac{\partial C_i(W^{(t)})}{\partial W^{(t)}},\ \nabla b = \frac{1}{n}\sum^{n}_{i=1}\frac{\partial C_i(W^{(t)})}{\partial b^{(t)}}$ |

5.2.2 Gradient Descent for Single-Layer NN

NOTE: Single-layer Neural Network still yields linear classifiers.

Single-layer neural networks can be used for regression since they output a real number. We can also use them for classification: we can still use the sign of $W^\mathsf{T}X + b$ for prediction. Since $\sigma(0) = \frac{1}{2}$, we predict an example to be positive if $f(x) = \sigma(W^\mathsf{T}X + b) \geq \frac{1}{2}$, i.e., if $W^\mathsf{T}X + b \geq 0$.
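A minimal sketch of the training loop above on the commute data from Section 1 follows. Since the sigmoid outputs values in $(0, 1)$, the targets are rescaled by $1/60$; this rescaling, $\eta = 0.5$ and $T = 5000$ are all assumptions made for illustration:

% Gradient descent for a single-layer neural network (a sketch).
sigma  = @(z) 1./(1 + exp(-z));
dsigma = @(z) sigma(z).*(1 - sigma(z));

X = [2.7 4.1 1.0 5.2 2.8;
     1   1   0   1   0 ];  % each column is one example
y = [25 33 15 45 22] / 60; % targets rescaled into (0,1) (assumption)
n = size(X,2);

w = zeros(2,1); b = 0;
eta = 0.5; T = 5000;       % assumed step size and iteration count
for t = 1:T
    z = w'*X + b;               % 1-by-n pre-activations
    a = sigma(z);               % predictions
    delta = (a - y).*dsigma(z); % dC_i/dz for every example
    gw = X*delta'/n;            % (1/n) sum_i dC_i/dw
    gb = sum(delta)/n;          % (1/n) sum_i dC_i/db
    w = w - eta*gw;
    b = b - eta*gb;
end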

5.3 Feed-Forward Neural Networks

The feed-forward neural network is a very powerful model for nonlinear data. The basic idea of feed-forward neural networks is to use several neurons connected by directed edges.

  • We can connect lots of units/neurons together into a directed acyclic graph, which means if you follow the direction indicated by the edge, you will never return to the neuron you have visited.

  • Typically, units are grouped together into layers. We have an input layer, some hidden layers and an output layer.

  • The number of units in the input layer is equal to the number $d$ of features. The output layer can have one or more neurons, depending on the learning task.

  • Each unit (simplest case) in a layer is connected to each unit in the next layer. This is a fully connected neural network. We call it Feed-Forward neural network or multi-layer perceptron (MLP).

By introducing more neurons structured into layers, we get a highly nonlinear model with improved expression power.

5.3.1 Weights and Bias

We use edges to connect different neurons. Each edge is associated with a number called weight. Each neuron is associated with a number called bias. Both weights and bias are called trainable parameters. We need to train these parameters from the dataset, and once these parameters are fixed we get a model.

We set:

  • $w^3_{24}$ is the weight of the edge from the 4-th neuron in the 2-nd layer to the 2-nd neuron in the 3-rd layer.
  • $b^2_{3}$ is the bias of the 3-rd node in the 2-nd layer.

Each neuron can be considered as a processor. It receives some input information from previous layers (or input feature), does some processing and then outputs it to neuron in the next layer. We need to introduce some notations to explain this:

(Figure: neural network structure)

  • $L$: number of layers (1-st layer is “input layer”, $L$-th layer is “output layer”)

  • $m$: “width” of the network, the number of units in a layer (can vary between layers)

  • $w^\ell_{jk}$: “weight” of the connection from the $k$-th unit in layer $\ell-1$ to the $j$-th unit in layer $\ell$

  • $b^\ell_{j}$: “bias” of $j$-th unit in layer $\ell$

Let us consider the $j$-th unit in the $\ell$-th layer. We associate two numbers: $z^\ell_{j}$ and $a^\ell_{j}$ . $$ z^\ell_{j} = \sum_{k=1}^{m}w^\ell_{jk}a^{\ell-1}_{k} + b^\ell_{j} $$

Now, we get a linear function. To achieve nonlinearity, we apply the activation function to $z^\ell_{j}$ and get “activation” $$ a^\ell_{j} = \sigma(z^\ell_{j}) $$

In summary, each neuron in MLP receives activations from all the neurons in the previous layer, and outputs activations which are passed to all the neurons in the next layer.

5.3.2 Vectorization

While the above equations precisely describe the information flow of an MLP, they are not convenient to implement in programming. We want a vectorized implementation, which means representing the computation in terms of matrices. We set

$$ W^\ell = \left( \begin{matrix} w^\ell_{11}&w^\ell_{12}&\cdots&w^\ell_{1m}\\w^\ell_{21}&w^\ell_{22}&\cdots&w^\ell_{2m}\\\vdots&\vdots&\ddots&\vdots\\w^\ell_{m1}&w^\ell_{m2}&\cdots&w^\ell_{mm}\\ \end{matrix} \right),\ b^\ell = \left( \begin{matrix} b^\ell_{1}\\b^\ell_{2}\\\vdots\\b^\ell_{m} \end{matrix} \right),\ (\ell = 2,...,L) $$

Then, we have $$ a^\ell = \sigma(z^\ell) = \sigma(W^\ell a^{\ell-1}+b^\ell),\ (\ell = 2,...,L) $$

where $$ z^\ell = \left( \begin{matrix} z^\ell_{1}\\z^\ell_{2}\\\vdots\\z^\ell_{m} \end{matrix} \right),\ a^\ell = \left( \begin{matrix} a^\ell_{1}\\a^\ell_{2}\\\vdots\\a^\ell_{m} \end{matrix} \right) $$

PROOF:
$$ z^\ell_j = \sum_{k=1}^{m}w^\ell_{jk}a^{\ell-1}_{k} + b^\ell_{j},\quad a^\ell_j = \sigma(z^\ell_j) $$ $$ \begin{aligned} \left( \begin{matrix} z^\ell_{1}\\z^\ell_{2}\\\vdots\\z^\ell_{m} \end{matrix} \right) &= \left( \begin{matrix} \sum_{k=1}^{m}w^\ell_{1k}a^{\ell-1}_{k} + b^\ell_{1}\\\sum_{k=1}^{m}w^\ell_{2k}a^{\ell-1}_{k} + b^\ell_{2}\\\vdots\\\sum_{k=1}^{m}w^\ell_{mk}a^{\ell-1}_{k} + b^\ell_{m} \end{matrix} \right) \\ &= \left( \begin{matrix} w^\ell_{11}&w^\ell_{12}&\cdots&w^\ell_{1m}\\w^\ell_{21}&w^\ell_{22}&\cdots&w^\ell_{2m}\\\vdots&\vdots&\ddots&\vdots\\w^\ell_{m1}&w^\ell_{m2}&\cdots&w^\ell_{mm}\\ \end{matrix} \right) \left( \begin{matrix} a^{\ell-1}_{1}\\a^{\ell-1}_{2}\\\vdots\\a^{\ell-1}_{m} \end{matrix} \right) + \left( \begin{matrix} b^\ell_{1}\\b^\ell_{2}\\\vdots\\b^\ell_{m} \end{matrix} \right) \end{aligned} $$
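A single vectorized layer update is then one matrix-vector product plus an elementwise activation. The random weights and inputs below are assumptions for illustration:

% One vectorized layer update: a^l = sigma(W^l a^(l-1) + b^l).
sigma = @(z) 1./(1 + exp(-z));
m = 4;                          % assumed layer width
a_prev = rand(m,1);             % activations from layer l-1 (assumed)
W = randn(m,m); b = randn(m,1); % trainable parameters of layer l (assumed)
z = W*a_prev + b;               % pre-activations z^l
a = sigma(z);                   % activations a^l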

5.3.3 Forward Propagation to Compute Prediction

The following figure shows the implementation process of forward propagation.

(Figure: forward propagation)

There are several choices of activation functions. Now, we introduce another two activation functions: the ReLU activation and the Tanh activation. These activation functions are essential in constructing nonlinear models by neural networks. Their behavior is illustrated in the following figures.

(Figure: activation functions)

  • ReLU is simple to compute since it involves only the maximum operator. It is continuous but not differentiable: it can be checked from the definition that ReLU is not differentiable at the zero point. ReLU is widely used in modern neural networks due to its simplicity.

  • Sigmoid and Tanh are differentiable. However, they use the exponential function and require more computation (an exponential is more difficult to compute than a maximum operator). The behaviors of sigmoid and Tanh are very similar; the difference is that the sigmoid function outputs values in $(0, 1)$, while Tanh outputs values in $(-1, 1)$.

5.3.4 Training MLPs

Now we discuss how to train an MLP from the dataset. As we discussed before, the problem becomes tuning the weights and biases. We use a loss function to measure the performance of the MLP on an example $(X, y)$; we have $$ C_{(X, y)}(W^2,...,W^L,b^2,...,b^L) = (y - \sigma(W^L(\sigma...\sigma(W^2X+b^2))+b^L))^2 $$

which is the squared difference between the true output $y$ and the predicted output $a^L$. The objective function we want to minimize becomes $$ C(W,b) = \frac{1}{n}\sum_{(X,y)\in S}C_{(X, y)}(W^2,...,W^L,b^2,...,b^L) $$

where $S$ is the training dataset. We apply gradient descent to solve this problem. This leads to $$ \begin{aligned} W^2 &\Leftarrow W^2-\frac{\eta}{n}\sum_{(X,y)\in S}\frac{\partial C_{(X, y)}(W^2,...,W^L,b^2,...,b^L)}{\partial W^2} \\ W^3 &\Leftarrow W^3-\frac{\eta}{n}\sum_{(X,y)\in S}\frac{\partial C_{(X, y)}(W^2,...,W^L,b^2,...,b^L)}{\partial W^3} \\ \vdots \\ W^L &\Leftarrow W^L-\frac{\eta}{n}\sum_{(X,y)\in S}\frac{\partial C_{(X, y)}(W^2,...,W^L,b^2,...,b^L)}{\partial W^L} \end{aligned} $$

NOTE: To train MLPs, we need an efficient way to compute the gradients. This is achieved by the very famous backpropagation algorithm. The mathematical foundation is chain rule. Backpropagation uses a clever idea to avoid repeated computation!

6 Backpropagation Algorithm

Let us review the chain rule for computing the derivative of composite functions. It decomposes the derivative of a composite function into derivatives of simple functions.

One Dimensional Chain Rule: $$ \frac{d}{dx}f(g(x)) = f'(g(x)) * g'(x) $$

Multivariate Chain Rule: $$ \frac{\partial f}{\partial x_i} = \sum^{m}_{j=1}\frac{\partial f}{\partial u_j} * \frac{\partial u_j}{\partial x_i} $$ $$ \frac{\partial C}{\partial w_j} = (\sigma(w^\mathsf{T}x+b)-y)\sigma'(w^\mathsf{T}x+b)x_j $$ $$ \frac{\partial C}{\partial b} = (\sigma(w^\mathsf{T}x+b)-y)\sigma'(w^\mathsf{T}x+b) $$

While this approach is correct, it is not efficient to implement in computers. The underlying reason is that there are some repeated computations.

  • $(\sigma(w^\mathsf{T}x+b)-y)$ and $\sigma'(w^\mathsf{T}x+b)$ appear in both $\frac{\partial C}{\partial w}$ and $\frac{\partial C}{\partial b}$

This means that we have computed $(\sigma(w^\mathsf{T}x+b)-y)$ and $\sigma'(w^\mathsf{T}x+b)$ twice. Furthermore, there is no structure in this computation.

So, we can do the computation in a more structured way. The basic idea is to introduce some intermediate variables. In this way, we can store intermediate results to avoid repeated computation.

$$ C = \frac{1}{2}(\sigma(wx+b)-y)^2 $$

where $w,b \in \mathbb{R}$

We set: $$ \begin{aligned} z &= wx+b \\ a &= \sigma(z) \\ C &= \frac{1}{2}(a-y)^2 \end{aligned} $$

Then, we have: $$ V_1 = \frac{\partial C}{\partial a} = a-y $$ $$ V_2 = \frac{\partial C}{\partial z} = \frac{\partial C}{\partial a}*\sigma'(z) = V_1*\sigma'(z) $$ $$ \frac{\partial C}{\partial w} = V_2*x $$ $$ \frac{\partial C}{\partial b} = V_2 $$

  • $\frac{\partial C}{\partial z}$ is used twice. We can store it once it is computed, which avoids a repeated computation.

NOTE: Our goal is not to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives!

6.1 Chain Rule in Computation Graph

The computation graph for linear regression is shown below:

(Figure: computation graph)

The computation graph provides a clear explanation of the chain rule. Let $\{y_1, . . . , y_n\}$ be the successors of $x$. Then $$ \frac{\partial z}{\partial x} = \sum^{n}_{j=1}\frac{\partial z}{\partial y_j} \cdot \frac{\partial y_j}{\partial x} $$

(Figure: chain rule in a computation graph)

Now, we introduce two types of gradients for our explanation.

  • Backpropagated gradient: the gradient of the final output w.r.t. successor ($\frac{\partial z}{\partial y_j}$).

  • Local gradient: the gradient of successor w.r.t. $x$ ($\frac{\partial y_j}{\partial x}$).

To compute the gradient of $z$ w.r.t. $x$, we need to compute the gradient of $z$ w.r.t. the successors of $x$ (backpropagated gradients). This means that we need a backward pass in computing gradients. We first compute the backpropagated gradient, which is multiplied by a local gradient (the local gradient is usually easy to compute thanks to the computation graph). This finishes the computation along one route; then we take a summation over all the routes.

Example
We now give an example of using the computation graph to compute the loss function via forward propagation. We take a forward pass along the computation graph. In the beginning, we assign $w = 5$ and $x = 1$. The computation nodes then give $u = 5$ and $\hat{y} = 6$. We continue this process until we compute the loss $J = 16$.

(Figure: forward computation)

We now apply backward propagation to compute the gradient for this example. We compute the backpropagated gradients $\frac{\partial J}{\partial J}$, $\frac{\partial J}{\partial e}$, $\frac{\partial J}{\partial \hat{y}}$ and $\frac{\partial J}{\partial u}$ sequentially. In this process, we can reuse results that have already been computed.

(Figure: backward computation)
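A sketch of both passes in code, assuming the graph computes $u = w \cdot x$, $\hat{y} = u + b$, $e = \hat{y} - y$ and $J = e^2$ with $b = 1$ and $y = 2$ (these assumed values reproduce the numbers quoted above, $u = 5$, $\hat{y} = 6$, $J = 16$):

% Forward and backward pass on a tiny computation graph (assumed graph).
w = 5; x = 1; b = 1; y = 2;
% forward pass: compute and store intermediates
u    = w*x;      % 5
yhat = u + b;    % 6
e    = yhat - y; % 4
J    = e^2;      % 16
% backward pass: backpropagated gradient times local gradient
dJ_de    = 2*e;          % J = e^2      => dJ/de = 2e
dJ_dyhat = dJ_de * 1;    % e = yhat - y => local gradient 1
dJ_du    = dJ_dyhat * 1; % yhat = u + b => local gradient 1
dJ_dw    = dJ_du * x;    % u = w*x      => local gradient x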

6.2 Backpropagation Algorithm in MLPs

We have the following relationship between $z^\ell_{j}$ and $w^\ell_{jk}$

$$ z^\ell_{j} = \sum_{k=1}^{m}w^\ell_{jk}a^{\ell-1}_{k} + b^\ell_{j} $$ $$ a^{\ell}_{j} = \sigma (z^\ell_{j}) $$ $$ \frac{\partial z^\ell_{j}}{\partial w^\ell_{jk}} = \frac{\partial (w^\ell_{jk}a^{\ell-1}_{k})}{\partial w^\ell_{jk}} = a^{\ell-1}_{k} $$

Then, we get

$$ \frac{\partial C}{\partial w^\ell_{jk}} = \frac{\partial C}{\partial z^\ell_{j}} \cdot \frac{\partial z^\ell_{j}}{\partial w^\ell_{jk}} = \frac{\partial C}{\partial z^\ell_{j}} \cdot a^{\ell-1}_{k} $$ $$ \frac{\partial C}{\partial b^\ell_{j}} = \frac{\partial C}{\partial z^\ell_{j}} \cdot \frac{\partial z^\ell_{j}}{\partial b^\ell_{j}} = \frac{\partial C}{\partial z^\ell_{j}} $$

Now we know that the quantity $\delta^\ell_j := \frac{\partial C}{\partial z^\ell_{j}}$, called the back-propagated gradient, suffices to compute the derivatives w.r.t. the parameters.


Vectorization

We have derived $$ \frac{\partial C}{\partial w^\ell_{jk}} = \delta^\ell_j \cdot a^{\ell-1}_{k} $$ $$ \frac{\partial C}{\partial b^\ell_{j}} = \delta^\ell_j $$

which can be written in terms of matrices $$ \begin{aligned} \left( \begin{matrix} \frac{\partial C}{\partial w^\ell_{11}}&\cdots&\frac{\partial C}{\partial w^\ell_{1m}}\\\vdots&\ddots&\vdots\\\frac{\partial C}{\partial w^\ell_{m1}}&\cdots&\frac{\partial C}{\partial w^\ell_{mm}} \end{matrix} \right) &= \left( \begin{matrix} \delta^\ell_1 \cdot a^{\ell-1}_{1}&\cdots&\delta^\ell_1 \cdot a^{\ell-1}_{m}\\\vdots&\ddots&\vdots\\\delta^\ell_m \cdot a^{\ell-1}_{1}&\cdots&\delta^\ell_m \cdot a^{\ell-1}_{m} \end{matrix} \right) \\ &= \left( \begin{matrix} \delta^\ell_1 \\ \vdots \\ \delta^\ell_m \end{matrix} \right) \left( \begin{matrix} a^{\ell-1}_{1}, \cdots, a^{\ell-1}_{m} \end{matrix} \right) \\ &= \delta^\ell \cdot (a^{\ell-1})^{\mathsf{T}} \end{aligned} $$

In summary, we can represent the gradient w.r.t. both the weight matrices and bias in terms of backpropagated gradients $\delta^\ell$

$$ \frac{\partial C}{\partial W^\ell} = \delta^\ell \cdot (a^{\ell-1})^{\mathsf{T}} \\ \frac{\partial C}{\partial b^\ell} = \delta^\ell $$

6.3 Recursive Relationship on Backpropagated Gradients

It remains to compute the backpropagated gradients $\delta^\ell$. In this section, we provide a recursive relationship on backpropagated gradients. The basic idea is to first compute the backpropagated gradient $\delta^L$ for the output layer. Then we show how to compute $\delta^{L-1}$ as a function of $\delta^L$. We repeat this process until we arrive at the input layer.

Back-propagated Gradient for Output Layer
We know that $z^L_j$ has a successor $a^L_j$. By the chain rule, we get the backpropagated gradient for the output layer as follows $$ \begin{aligned} \delta^L_j &= \frac{\partial C}{\partial z^L_j} \\ &= \frac{\partial C}{\partial a^L_j} \bullet \frac{\partial a^L_j}{\partial z^L_j} \\ &= \frac{\partial C}{\partial a^L_j} \bullet \frac{\partial \sigma (z^L_j)}{\partial z^L_j} \\ &= \frac{\partial C}{\partial a^L_j} \bullet \sigma' (z^L_j) \end{aligned} $$

where $\sigma' (z^L_j)$ is easy to compute. The term $\frac{\partial C}{\partial a^L_j}$ depends on the cost function. For example, for a regression problem in $m$ dimensions, one could define $$ C(a^L_1, ... , a^L_m) = \frac{1}{2} \sum^m_{k=1}(y_k-a^L_k)^2 $$

In this case, we know $$ \frac{\partial C}{\partial a^L_j} = -(y_j-a^L_j) $$

Therefore, we get the Back-propagated Gradient for Output Layer as follows $$ \delta^L_j =-\sigma' (z^L_j)(y_j-a^L_j) $$

Back-propagated Gradient for Hidden Layer
We now consider the hidden layer. According to the structure, the node $z^\ell_j$ has a successor $a^\ell_j$, which has successors $z^{\ell+1}_k$ in the next layer $(k = 1, ... , m)$. $$ z^{\ell+1}_k = \sum_r (w^{\ell+1}_{kr}a^\ell_r) + b^{\ell+1}_k \\ a^\ell_j = \sigma(z^{\ell}_j) $$

According to the chain rule, to compute $\frac{\partial C}{\partial a^\ell_j}$ we need to take a summation over all of the successors $$ \frac{\partial C}{\partial a^\ell_j} = \sum_k (\frac{\partial C}{\partial z^{\ell+1}_k} \frac{\partial z^{\ell+1}_k}{\partial a^{\ell}_j}) $$

This leads to $$ \begin{aligned} \delta^\ell_j &= \frac{\partial C}{\partial z^\ell_j} = \frac{\partial C}{\partial a^\ell_j} \bullet \frac{\partial a^\ell_j}{\partial z^\ell_j} \\ &= (\sum_k \frac{\partial C}{\partial z^{\ell+1}_k} \frac{\partial z^{\ell+1}_k}{\partial a^{\ell}_j}) \sigma'(z^{\ell}_j) \\ &= \sigma'(z^{\ell}_j) \sum_k (\delta^{\ell+1}_k w^{\ell+1}_{kj}) \end{aligned} $$

  • Note $\delta^{\ell+1}_k$ has already been computed since we compute back-propagated gradients from top to bottom!
  • This shows that the backpropagated gradients in the layer $\ell$ can be computed in terms of the backpropagated gradients in the layer $\ell+1$.
  • Therefore, we can first compute $\delta^L$. This is used to compute $\delta^{L-1}$, which is further used to compute $\delta^{L-2}$ and so on.

Summary
We can summarize the above discussions as follows:

  • Gradients w.r.t. $w^{\ell}_{jk}$ and $b^\ell_j$ can be represented by back-propagated gradients. $$ \frac{\partial C}{\partial w^{\ell}_{jk}} = \delta^\ell_j \cdot a^{\ell-1}_k,\ \frac{\partial C}{\partial b^\ell_j} = \delta^\ell_j $$

  • Back-propagated gradients can be computed in a backward manner $$ \delta^{\ell}_j = \begin{cases} \sigma' (z^L_j) \cdot \frac{\partial C}{\partial a^L_j},& \mathsf{if}\ \ell = L(output\ layer) \\ \sigma'(z^{\ell}_j) \sum_k (\delta^{\ell+1}_k w^{\ell+1}_{kj}),& otherwise(hidden\ layer). \end{cases} $$

Vectorization
We now show how to represent the computation of backpropagated gradients via matrices. We use $\odot$ to denote Hadamard product (elementwise multiplications), e.g, $(1, 2) \odot (3, 4) = (3, 8)$.

When $\ell = L$, we have the following vectorization for the output layer $$ \left( \begin{matrix} \delta^L_1 \\ \vdots \\ \delta^L_m \end{matrix} \right) = \left( \begin{matrix} \frac{\partial C}{\partial a^L_1} \cdot \sigma'(z^L_1) \\ \vdots \\ \frac{\partial C}{\partial a^L_m} \cdot \sigma'(z^L_m) \end{matrix} \right) \\ = \left( \begin{matrix} \frac{\partial C}{\partial a^L_1} \\ \vdots \\ \frac{\partial C}{\partial a^L_m} \end{matrix} \right) \odot \left( \begin{matrix} \sigma'(z^L_1)\\ \vdots \\ \sigma'(z^L_m) \end{matrix} \right) \\ = \nabla_{a^L}C \odot \sigma'(Z^L) $$

When $\ell < L$, we have the following vectorization for the hidden layer $$ \left( \begin{matrix} \delta^\ell_1 \\ \vdots \\ \delta^\ell_m \end{matrix} \right) = \left( \begin{matrix} \sigma'(z^{\ell}_1) \sum_k (\delta^{\ell+1}_k w^{\ell+1}_{k1}) \\ \vdots \\ \sigma'(z^{\ell}_m) \sum_k (\delta^{\ell+1}_k w^{\ell+1}_{km}) \end{matrix} \right) = \left( \begin{matrix} \sigma'(z^{\ell}_1) \\ \vdots \\ \sigma'(z^{\ell}_m) \end{matrix} \right) \odot \left( \begin{matrix} \sum_k (w^{\ell+1})^\mathsf{T}_{1k} \cdot \delta^{\ell+1}_k \\ \vdots \\ \sum_k (w^{\ell+1})^\mathsf{T}_{mk} \cdot \delta^{\ell+1}_k \end{matrix} \right) $$

we get

$$ \left( \begin{matrix} \delta^\ell_1 \\ \vdots \\ \delta^\ell_m \end{matrix} \right) = \sigma'(z^{\ell}) \odot ((W^{\ell+1})^\mathsf{T} \cdot \delta^{\ell+1}) $$

In summary, we get the following vectorization of the back-propagated gradient $$ \delta^{\ell} = \begin{cases} \nabla_{a^L}C \odot \sigma'(z^L),& \mathsf{if}\ \ell = L\ (output\ layer) \\ \sigma'(z^{\ell}) \odot ((W^{\ell+1})^\mathsf{T} \cdot \delta^{\ell+1}),& otherwise\ (hidden\ layer). \end{cases} $$

6.4 Backpropagation Algorithm

Now we are ready to explicitly state the backpropagation algorithm. We first apply the forward pass to compute $z^\ell$ and $a^\ell$ for $\ell = 1,2,...,L$. After that we apply the backward pass to compute the backpropagated gradients $\delta^L, \delta^{L-1},...,\delta^2$. Note that in this process we need to use $a^\ell$, which has already been computed in the forward pass.

(Figure: backpropagation algorithm)

Backpropagation Algorithm Steps
Input: instance $(x, y)$ and parameters $W^\ell, b^\ell,\ \ell = 2,...,L$.
Output: gradients of the cost w.r.t. all parameters.

1: Set $a^1 = x$ ($x$ is input)

2: for $\ell = 2,...,L$ do

3: Compute the activations in $\ell$-th layer via $$ z^\ell = W^\ell a^{\ell-1} + b^\ell,\ a^\ell = \sigma(z^\ell) $$

4: Compute back-propagated gradient for output layer $$ \delta^L = \nabla_{a^L}C \odot \sigma'(z^L),\ \frac{\partial C}{\partial W^L} = \delta^L(a^{L-1})^\mathsf{T},\ \frac{\partial C}{\partial b^L} = \delta^L $$

5: for $\ell = L-1,...,2$ do

6: Compute back-propagated gradient for hidden layer $$ \delta^\ell = \sigma'(z^{\ell}) \odot ((W^{\ell+1})^\mathsf{T} \cdot \delta^{\ell+1}) $$

7: Compute gradients w.r.t. parameters $$ \frac{\partial C}{\partial W^\ell} = \delta^\ell(a^{\ell-1})^\mathsf{T},\ \frac{\partial C}{\partial b^\ell} = \delta^\ell $$
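The steps above translate into a short MATLAB sketch. The 2-3-1 architecture, the random parameters and the example $(x, y)$ are assumptions for illustration; the loss is taken as $C = \frac{1}{2}(y - a^L)^2$:

% Backpropagation for a small MLP (a sketch of steps 1-7 above).
sigma  = @(z) 1./(1 + exp(-z));
dsigma = @(z) sigma(z).*(1 - sigma(z));
sizes = [2 3 1]; L = numel(sizes); % assumed 2-3-1 architecture
x = [0.5; -1]; y = 1;              % assumed example
W = cell(1,L); b = cell(1,L);
for l = 2:L
    W{l} = randn(sizes(l), sizes(l-1));
    b{l} = randn(sizes(l), 1);
end
% forward pass: z{l} = W{l} a{l-1} + b{l}, a{l} = sigma(z{l})
a = cell(1,L); z = cell(1,L);
a{1} = x;
for l = 2:L
    z{l} = W{l}*a{l-1} + b{l};
    a{l} = sigma(z{l});
end
% backward pass: delta{L} first, then delta{l} for l = L-1,...,2
delta = cell(1,L); dW = cell(1,L); db = cell(1,L);
delta{L} = (a{L} - y) .* dsigma(z{L}); % grad_{a^L} C = a{L} - y
dW{L} = delta{L}*a{L-1}'; db{L} = delta{L};
for l = L-1:-1:2
    delta{l} = dsigma(z{l}) .* (W{l+1}'*delta{l+1});
    dW{l} = delta{l}*a{l-1}'; db{l} = delta{l};
end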

6.5 Gradient Descent for Feedforward Networks

The backpropagation algorithm provides an efficient way to compute gradients. We can put these gradients into the framework of gradient descent. Note that the above algorithm computes the gradient of the loss on a single training example $(x, y)$, while the dataset $S$ has $n$ examples.

  • Assume $n$ training examples $(x^{(1)}, y^{(1)}), ... , (x^{(n)}, y^{(n)})$ and a cost function $$ C = \frac{1}{n} \sum^n_{i = 1}C_i, $$

where $C_i$ is the cost on the $i$-th example. E.g., we can define $C_i = \frac{1}{2}(y^{(i)}-a^L)^2$ where $a^L$ is the output of the network when $a^1 = x^{(i)}$.

  • Backpropagation gives us the gradient of the overall cost function as follows $$ \begin{aligned} \frac{\partial C}{\partial W^\ell} &= \frac{1}{n} \sum^n_{i = 1}\frac{\partial C_i}{\partial W^\ell} \\ \frac{\partial C}{\partial b^\ell} &= \frac{1}{n} \sum^n_{i = 1}\frac{\partial C_i}{\partial b^\ell} \end{aligned} $$

  • We can use gradient descent (or any gradient-based method) to optimize the weights $W$ and biases $b$

Mini-Batch Gradient Descent
Computing the gradient is expensive when the number of training examples $n$ is large.

  • We can randomly select a “mini-batch” $I ⊂ \{1, . . . , n\}$ of size $s$ per iteration:
    when $1 < s < n$, this is mini-batch gradient descent;
    when $s = 1$, this is stochastic gradient descent.
    A common choice is $s \in [20, 100]$.

  • Build stochastic gradients $$ \begin{aligned} \frac{\partial C}{\partial W^\ell} &\approx \frac{1}{s} \sum_{i \in I}\frac{\partial C_i}{\partial W^\ell} \\ \frac{\partial C}{\partial b^\ell} &\approx \frac{1}{s} \sum_{i \in I}\frac{\partial C_i}{\partial b^\ell} \end{aligned} $$

6.6 Summary

Neural nets can be very large, so it is impractical to write down the gradient formula by hand for every parameter. We need to use the structure of neural networks.

Computation Graph: decompose complex computations into several sequences of much simpler calculations (simplify programming)

  • forward pass to compute the cost (compute result of an operation and save any intermediates needed for gradient computation in memory)

  • backward pass to compute the gradient based on chain rule. If we want to compute the gradient for a node, we need to compute the gradients for all the successors.

Backpropagation algorithm: recursive application of the chain rule along a computation graph to compute the gradients of all parameters.

  • It suffices to compute back-propagated gradients (the gradients on the weights and bias can be computed once we know the backpropagated gradients).

  • back-propagated gradients can be computed recursively (from top to bottom).


Exercise 1

Question 1 (Gradient)

The gradient of a function $f: \mathbb{R}^d \to \mathbb{R}$ is denoted by $\nabla f$. Which of the following statements is correct?
(A) The gradient $\nabla f$ is a vector with positive elements.
(B) The gradient $\nabla f$ is a function which maps vectors to vectors.
(C) The gradient $\nabla f$ is a function which maps vectors to scalars.
(D) The gradient $\nabla f$ is a vector with negative elements.

Question 2 (Minibatch SGD)

Which of the following statements is not true ($n$ is the sample size)?
(A) If the batch size is 1, then minibatch SGD becomes SGD.
(B) If the batch size is $n$, then minibatch SGD (sampling without replacement) becomes gradient descent.
(C) If the batch size is $n$, then minibatch SGD (sampling with replacement) becomes gradient descent.
(D) As compared to SGD, minibatch SGD builds a better gradient estimator by using more examples.

Question 3 (Propagation)

Which of the following statements is not true?
(A) Forward propagation goes from the input layer to the output layer.
(B) Backward propagation aims to compute a gradient of a function.
(C) Backward propagation is based on a chain rule.
(D) Forward and backward propagation are two independent processes.

Question 4 (Multi-Layer Perceptron)

Consider a fully-connected MLP with 4 layers: 1 input layer, 1 output layer and 2 hidden layers. Assume the input layer has 6 nodes, the two hidden layers have 5 and 10 nodes respectively, and the output layer has 3 nodes.
How many trainable parameters are there in this MLP?
(A): 100
(B): 128
(C): 160
(D): 180

Question 5 (Minimization)

Let $a, b, c, d$ be four numbers. Consider two points $X_1 = (a,b)^\mathsf{T}$ and $X_2 = (c,d)^\mathsf{T}$. Consider the following minimization problem

$$ \min_{X \in \mathbb{R}^2} ||X-X_1||^2_2+2||X-X_2||^2_2 $$

Which of the following is the minimizer?
(A):$ \left( \begin{matrix} \frac{a}{3}+\frac{c}{3} \\ \frac{b}{3}+\frac{d}{3} \end{matrix} \right) $ $~~~$ (B):$ \left( \begin{matrix} \frac{a}{2}+\frac{c}{2} \\ \frac{b}{2}+\frac{d}{2} \end{matrix} \right) $ $~~~$ (C):$ \left( \begin{matrix} \frac{a}{3}+\frac{2c}{3} \\ \frac{b}{3}+\frac{2d}{3} \end{matrix} \right) $ $~~~$ (D):$ \left( \begin{matrix} \frac{2a}{3}+\frac{c}{3} \\ \frac{2b}{3}+\frac{d}{3} \end{matrix} \right) $

Question 6 (Gradient Descent)

Consider a binary classification problem with the following training examples

X = [-0.5, -1, 0.5, -0.2, -0.8, -0.15, -1, 0;
    0.25, -0.1, 0, -0.3, 0, -0.5, 0, -0.25;
    -0.8, -0.1, 0.25, 0.2, -0.8, 0.05, -1, 0.25;
    -1, -1, 0.1, 0, -1, -0.25, -1, 0.1];

Y  = [1, 1, -1, -1, 1, -1, 1, -1];

Suppose we consider a linear model for classification $X \to W^\mathsf{T}X$, where $W = (w_1, w_2, w_3, w_4)^\mathsf{T} \in \mathbb{R}^4$. We minimize the objective function (for simplicity we do not consider the bias in the linear model)

$$ C(W) = \frac{1}{n}\sum_{i=1}^{n}C_i(W) $$

where $$ C_i(W) = \begin{cases} 0,& \mathsf{if}\ y^{(i)}W^\mathsf{T}X^{(i)} \geq 1 \\ \frac{1}{2}(1 − y^{(i)}W^\mathsf{T}X^{(i)})^2,& otherwise. \end{cases} $$

Suppose we run gradient descent with $w^{(0)} = (0, 0, 0, 0)^\mathsf{T}$ and step size $\eta_t = \eta = 0.5$. What is $w^{(25)}$? (We only preserve three digits after the decimal point.)
(A): $ \left( \begin{matrix} −0.721 \\ 1.262 \\ −1.204 \\ −0.484 \end{matrix} \right) $ $~~~$ (B): $ \left( \begin{matrix} −0.527 \\ 1.220 \\ −1.047 \\ −0.445 \end{matrix} \right) $ $~~~$ (C): $ \left( \begin{matrix} -0.536 \\ 1.260 \\ −1.257 \\ −0.565 \end{matrix} \right) $ $~~~$ (D): $ \left( \begin{matrix} -0.637 \\ 1.120 \\ −1.056 \\ −0.395 \end{matrix} \right) $

Question 7 (Stochastic Gradient Descent)

Let us consider Problem 6, with the same objective function. Suppose we run SGD to minimize the loss function. Let $w^{(0)} = (0, 0, 0, 0)^\mathsf{T}$ and $\eta_t = \eta = 0.1$. Let $i_t = (t \mod 8) +1$, i.e., $i_0 = 1, i_7 = 8, i_8 = 1$. What is $w^{(80)}$? (We only preserve three digits after the decimal point.)
(A): $ \left( \begin{matrix} −0.469 \\ 0.890 \\ −0.799 \\ −0.449 \end{matrix} \right) $ $~~~$ (B): $ \left( \begin{matrix} −0.480 \\ 0.706 \\ −0.879 \\ −0.549 \end{matrix} \right) $ $~~~$ (C): $ \left( \begin{matrix} -0.569 \\ 0.990 \\ −0.579 \\ −0.479 \end{matrix} \right) $ $~~~$ (D): $ \left( \begin{matrix} -0.669 \\ 0.706 \\ −0.764 \\ −0.462 \end{matrix} \right) $

Question 8 (Momentum)

Let us consider Problem 6, with the same objective function. Suppose we run gradient descent with momentum to minimize the loss function. Let $w^{(0)} = (0, 0, 0, 0)^\mathsf{T}$ and $\eta_t = \eta = 0.5$. Let the parameter $\alpha$ in the momentum method be 0.5. What is $w^{(25)}$? (We only preserve three digits after the decimal point.)
(A): $ \left( \begin{matrix} −0.798 \\ 1.939 \\ −1.697 \\ −0.408 \end{matrix} \right) $ $~~~$ (B): $ \left( \begin{matrix} −0.798 \\ 1.849 \\ −1.667 \\ −0.422 \end{matrix} \right) $ $~~~$ (C): $ \left( \begin{matrix} -0.698 \\ 1.839 \\ −1.657 \\ −0.402 \end{matrix} \right) $ $~~~$ (D): $ \left( \begin{matrix} −0.598 \\ 1.829 \\ −1.557 \\ −0.502 \end{matrix} \right) $

Question 9 (Adaptive Gradient Descent)

Consider the same objective function as in Question 6. Suppose we run AdaGrad to minimize the loss function. Let $w^{(0)} = (0, 0, 0, 0)^\mathsf{T}$ and $\eta_t = \eta = 0.1$. Let the parameter $\delta$ in AdaGrad be $10^{−6}$. Let $i_t = (t \bmod 8) + 1$, i.e., $i_0 = 1, i_7 = 8, i_8 = 1$. What is $w^{(80)}$? (We only keep three digits after the decimal point.)
(A): $ \left( \begin{matrix} −0.478 \\ 0.849 \\ −0.722 \\ −0.380 \end{matrix} \right) $ $~~~$ (B): $ \left( \begin{matrix} −0.498 \\ 0.858 \\ −0.612 \\ −0.360 \end{matrix} \right) $ $~~~$ (C): $ \left( \begin{matrix} −0.398 \\ 0.868 \\ −0.623 \\ −0.354 \end{matrix} \right) $ $~~~$ (D): $ \left( \begin{matrix} −0.698 \\ 0.862 \\ −0.632 \\ −0.352 \end{matrix} \right) $

Question 10 (Perceptron)

Let us consider the dataset in Question 6. Suppose we apply the Perceptron algorithm to find a linear model, initialized at $w^{(0)} = (0, 0, 0, 0)^\mathsf{T}$. Suppose we go through the dataset once in this order: $(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \ldots, (x^{(8)}, y^{(8)})$. What is $w^{(8)}$?
(A): $ \left( \begin{matrix} −0.45 \\ 0.70 \\ −0.82 \\ −0.70 \end{matrix} \right) $ $~~~$ (B): $ \left( \begin{matrix} −0.40 \\ 0.75 \\ −0.84 \\ −0.70 \end{matrix} \right) $ $~~~$ (C): $ \left( \begin{matrix} −0.35 \\ 0.75 \\ −0.85 \\ −0.75 \end{matrix} \right) $ $~~~$ (D): $ \left( \begin{matrix} −0.48 \\ 0.70 \\ −0.75 \\ −0.85 \end{matrix} \right) $

Standard Answer

  1. B
  2. C
  3. D
  4. B
  5. C
  6. B
  7. A
  8. D
  9. A
  10. C

Answer 4

We have $5+10+3 = 18$ biases and $6\times5+5\times10+10\times3 = 110$ weights, so there are $18+110 = 128$ trainable parameters in total.
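
As a quick check, the count can be reproduced in MATLAB; this is a minimal sketch, assuming the layer sizes $6 \to 5 \to 10 \to 3$ implied by the sums above:

% Count trainable parameters of a fully connected network.
sizes   = [6 5 10 3];                             % assumed layer widths
weights = sum(sizes(1:end-1) .* sizes(2:end));    % 6*5 + 5*10 + 10*3 = 110
biases  = sum(sizes(2:end));                      % 5 + 10 + 3 = 18
total   = weights + biases                        % 128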

Answer 5

We set $f(X) = ||X-x_1||^2_2+2||X-x_2||^2_2$ and write $X = (u,v)^\mathsf{T}$, to avoid a clash with the points $x_1, x_2$: $$ f(u,v) = (u-a)^2+(v-b)^2+2(u-c)^2+2(v-d)^2 \\ \downarrow \\ \begin{cases} \frac{\partial f(u,v)}{\partial u} = 2(u-a) + 4(u-c) = 0 \\ \frac{\partial f(u,v)}{\partial v} = 2(v-b) + 4(v-d) = 0 \end{cases} \\ \downarrow \\ \begin{cases} u = \frac{a}{3}+\frac{2c}{3} \\ v = \frac{b}{3}+\frac{2d}{3} \end{cases} $$
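
The closed form can also be verified numerically; below is a minimal MATLAB check, where the values of $a, b, c, d$ are arbitrary choices of ours:

% Numerically verify the minimizer against the closed form.
a = 1.2; b = -0.7; c = 0.4; d = 2.0;               % arbitrary test values
f = @(X) sum((X - [a; b]).^2) + 2*sum((X - [c; d]).^2);
Xnum  = fminsearch(f, [0; 0])                      % numerical minimizer
Xstar = [a/3 + 2*c/3; b/3 + 2*d/3]                 % option (C); should match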

Answer 6
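
For reference, this is the gradient the script below implements: whenever $y^{(i)}W^\mathsf{T}X^{(i)} < 1$,

$$ \nabla C_i(W) = -\left(1 - y^{(i)}W^\mathsf{T}X^{(i)}\right)y^{(i)}X^{(i)}, $$

and $\nabla C_i(W) = 0$ otherwise. Full-batch gradient descent therefore updates

$$ w^{(t+1)} = w^{(t)} + \frac{\eta}{n}\sum_{i=1}^{n}\mathbb{1}\left[y^{(i)}{w^{(t)}}^\mathsf{T}X^{(i)} < 1\right]\left(1 - y^{(i)}{w^{(t)}}^\mathsf{T}X^{(i)}\right)y^{(i)}X^{(i)}; $$

the mask IndexSample in the code zeroes out the examples whose margin is at least 1.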

X = [-0.5 -1 0.5 -0.2 -0.8 -0.15 -1 0;
    0.25 -0.1 0 -0.3 0 -0.5 0 -0.25;
    -0.8 -0.1 0.25 0.2 -0.8 0.05 -1 0.25;
    -1 -1 0.1 0 -1 -0.25 -1 0.1];

Species  = [1 1 -1 -1 1 -1 1 -1];

% Full-batch gradient descent on the squared hinge loss
nsample = length(Species);  % number of training examples
C = 0.5;                    % step size
t = 1;                      % update counter
eta = C;
w_last = zeros(4,1);
nupdate = 25;               % total number of updates

% First update: mask out examples with margin y*(w'*x) >= 1, then step
X_t = X;
Y_t = Species;
IndexSample = (Y_t.*(w_last'*X_t) < 1);
X_t = X_t.*IndexSample;
Y_t = Y_t.*IndexSample;
w_next = w_last + ((eta/nsample).*(1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';

while (t < nupdate)
    w_last = w_next;
    t = t+1;
    eta = C;
    X_t = X;
    Y_t = Species;
    IndexSample = (Y_t.*(w_last'*X_t) < 1);
    X_t = X_t.*IndexSample;
    Y_t = Y_t.*IndexSample;
    w_next = w_last + ((eta/nsample).*(1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';
end

yy = w_next'*X; % predictions of the final model on the training data

Answer 7
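
SGD applies the same masked update as in Answer 6, but using only the example $i_t = (t \bmod 8) + 1$ at step $t$:

$$ w^{(t+1)} = w^{(t)} + \eta\,\mathbb{1}\left[y^{(i_t)}{w^{(t)}}^\mathsf{T}X^{(i_t)} < 1\right]\left(1 - y^{(i_t)}{w^{(t)}}^\mathsf{T}X^{(i_t)}\right)y^{(i_t)}X^{(i_t)}. $$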

X = [-0.5 -1 0.5 -0.2 -0.8 -0.15 -1 0;
    0.25 -0.1 0 -0.3 0 -0.5 0 -0.25;
    -0.8 -0.1 0.25 0.2 -0.8 0.05 -1 0.25;
    -1 -1 0.1 0 -1 -0.25 -1 0.1];

Species  = [1 1 -1 -1 1 -1 1 -1];

% Stochastic gradient descent: one example per update, i_t = mod(t,8)+1
nsample = 1;          % single example per SGD step
C = 0.1;              % step size
t = 0;                % update counter
eta = C;
w_last = zeros(4,1);
nupdate = 80;         % total number of updates

% First update, on example i_0 = 1
X_t = X(:,mod(t,8)+1);
Y_t = Species(mod(t,8)+1);
IndexSample = (Y_t.*(w_last'*X_t) < 1);
X_t = X_t.*IndexSample;
Y_t = Y_t.*IndexSample;
w_next = w_last + ((eta/nsample).*(1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';

while (t < nupdate)
    w_last = w_next;
    t = t+1;
    eta = C;
    X_t = X(:,mod(t,8)+1);
    Y_t = Species(:,mod(t,8)+1);
    IndexSample = (Y_t.*(w_last'*X_t) < 1);
    X_t = X_t.*IndexSample;
    Y_t = Y_t.*IndexSample;
    w_next = w_last + ((eta/nsample).*(1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';
end

yy = w_next'*X; % predictions of the final model on the training data

Answer 8
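
Momentum keeps a velocity vector $v$; with $\alpha = 0.5$, the updates used below are

$$ v^{(t+1)} = \alpha v^{(t)} - \eta\nabla C(w^{(t)}), \qquad w^{(t+1)} = w^{(t)} + v^{(t+1)}, $$

where $-\eta\nabla C$ is the same masked, averaged term as in Answer 6.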

X = [-0.5 -1 0.5 -0.2 -0.8 -0.15 -1 0;
    0.25 -0.1 0 -0.3 0 -0.5 0 -0.25;
    -0.8 -0.1 0.25 0.2 -0.8 0.05 -1 0.25;
    -1 -1 0.1 0 -1 -0.25 -1 0.1];

Species  = [1 1 -1 -1 1 -1 1 -1];

% Full-batch gradient descent with Momentum (alpha = 0.5)
nsample = length(Species);  % number of training examples
C = 0.5;                    % step size
t = 1;                      % update counter
eta = C;
w_last = zeros(4,1);
v_last = zeros(4,1);        % velocity
nupdate = 25;               % total number of updates


X_t = X;
Y_t = Species;
IndexSample = (Y_t.*(w_last'*X_t) < 1);
X_t = X_t.*IndexSample;
Y_t = Y_t.*IndexSample;
v_next = 0.5.*v_last + ((eta/nsample).*(1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';
w_next = w_last + v_next;

while (t < nupdate)
    w_last = w_next;
    v_last = v_next;
    t = t+1;
    eta = C;
    X_t = X;
    Y_t = Species;
    IndexSample = (Y_t.*(w_last'*X_t) < 1);
    X_t = X_t.*IndexSample;
    Y_t = Y_t.*IndexSample;
    v_next = 0.5.*v_last + ((eta/nsample).*(1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';
    w_next = w_last + v_next;
end

yy = w_next'*X; % predictions of the final model on the training data

Answer 9
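
Writing $g_t = -\nabla C_{i_t}(w^{(t)})$ (the sign the script computes directly), AdaGrad accumulates squared gradients coordinate-wise and rescales each step:

$$ r^{(t+1)} = r^{(t)} + g_t \odot g_t, \qquad w^{(t+1)} = w^{(t)} + \frac{\eta}{\delta + \sqrt{r^{(t+1)}}} \odot g_t, $$

with $\delta = 10^{-6}$.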

X = [-0.5 -1 0.5 -0.2 -0.8 -0.15 -1 0;
    0.25 -0.1 0 -0.3 0 -0.5 0 -0.25;
    -0.8 -0.1 0.25 0.2 -0.8 0.05 -1 0.25;
    -1 -1 0.1 0 -1 -0.25 -1 0.1];

Species  = [1 1 -1 -1 1 -1 1 -1];

% AdaGrad with one example per update, i_t = mod(t,8)+1, delta = 1e-6
C = 0.1;              % step size
t = 0;                % update counter
eta = C;
w_last = zeros(4,1);
r_last = zeros(4,1);  % accumulated squared gradients
nupdate = 80;         % total number of updates


X_t = X(:,mod(t,8)+1);
Y_t = Species(mod(t,8)+1);
IndexSample = (Y_t.*(w_last'*X_t) < 1);
X_t = X_t.*IndexSample;
Y_t = Y_t.*IndexSample;
g = ((1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';
r_next = r_last + g.^2;
w_next = w_last + (eta./(10^(-6)+r_next.^(0.5))).*g;

while (t < nupdate)
    w_last = w_next;
    r_last = r_next;
    t = t+1;
    eta = C;
    X_t = X(:,mod(t,8)+1);
    Y_t = Species(:,mod(t,8)+1);
    IndexSample = (Y_t.*(w_last'*X_t) < 1);
    X_t = X_t.*IndexSample;
    Y_t = Y_t.*IndexSample;
    g = ((1-Y_t.*(w_last'*X_t))*(Y_t'.*X_t'))';
    r_next = r_last + g.^2;
    w_next = w_last + (eta./(10^(-6)+r_next.^(0.5))).*g;
end

yy = w_next'*X; % predictions of the final model on the training data

Answer 10
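
The Perceptron updates only on misclassified examples, i.e. whenever $y^{(i)}w^\mathsf{T}x^{(i)} \leq 0$:

$$ w \leftarrow w + y^{(i)}x^{(i)}. $$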

X = [-0.5 -1 0.5 -0.2 -0.8 -0.15 -1 0;
    0.25 -0.1 0 -0.3 0 -0.5 0 -0.25;
    -0.8 -0.1 0.25 0.2 -0.8 0.05 -1 0.25;
    -1 -1 0.1 0 -1 -0.25 -1 0.1];

Species  = [1 1 -1 -1 1 -1 1 -1];

% Perceptron: update on the first misclassified example, at most 8 updates
t = 0;                % update counter
w_last = zeros(4,1);
nupdate = 8;          % number of updates (one pass)

X_t = X;
Y_t = Species;

while (sum(Y_t.*(w_last'*X_t) <= 0) && t<nupdate)
    % scan for the first misclassified example and update on it
    for i = 1:8
        if Y_t(i)*(w_last'*X_t(:,i)) <= 0
            t = t+1;
            w_next = w_last + (X_t(:,i)*Y_t(i)');
            break;
        end
    end
    w_last = w_next;
end

Exercise 2

Question 1

We build a simple classification network with the goal of classifying 32 by 32 greyscale images into 10 classes. The network consists of 3 layers: the first two are convolutional layers, the output of the second convolutional layer is stretched into a 1D vector, and a fully connected layer is applied to obtain the output. For the convolutional layers:

  • $K_i$ corresponds to the number of filters for the $i$-th layer.
  • $F_i$ corresponds to the filter size (assuming square filters, and the correct number of channels) for the $i$-th layer.
  • $S_i$ corresponds to the stride for the $i$-th layer.
  • $P_i$ corresponds to the zero padding for the $i$-th layer.

We describe the fully connected layer using the format $n_{in} \rightarrow n_{out}$, where $n_{in}$ is the number of neurons input into the layer, and $n_{out}$ is the number of neurons output from the layer.

Among the network architectures described below, check all that are feasible.

  1. Layer 1: $K_1 = 1$, $F_1 = 7$, $S_1 = 1$, $P_1 = 0$; Layer 2: $K_2 = 2$, $F_2 = 11$, $S_2 = 1$, $P_2 = 0$; Layer 3: $512 \rightarrow 10$

  2. Layer 1: $K_1 = 1$, $F_1 = 3$, $S_1 = 1$, $P_1 = 0$; Layer 2: $K_2 = 5$, $F_2 = 3$, $S_2 = 1$, $P_2 = 1$; Layer 3: $900 \rightarrow 10$

  3. Layer 1: $K_1 = 1$, $F_1 = 3$, $S_1 = 1$, $P_1 = 0$; Layer 2: $K_2 = 2$, $F_2 = 11$, $S_2 = 1$, $P_2 = 0$; Layer 3: $1152 \rightarrow 11$

  4. Layer 1: $K_1 = 1$, $F_1 = 3$, $S_1 = 1$, $P_1 = 0$; Layer 2: $K_2 = 1$, $F_2 = 3$, $S_2 = 1$, $P_2 = 1$; Layer 3: $900 \rightarrow 10$

Question 2

Given the following statements relating to CNNs, check all true statements:

  1. CNNs are more efficient than MLPs for inputs with a large number of features.

  2. We can use Stochastic Gradient Descent to optimize both convolutional neural networks and fully connected networks.

  3. CNNs use a different version of backpropagation compared to fully connected networks called ‘backpropagation through space’.

  4. The gradient for each weight in the convolutional kernel only ever depends on one pixel in the image/feature map it is applied to.

  5. CNNs can be made more memory efficient by downsampling the feature maps.

Question 3

We apply $N$ convolutional kernels of dimension $H \times W \times C$ ($H$, $W$ and $C$ are the height, width and number of channels respectively) with stride 1 and padding 0 to an 11x11 colour image. This results in a 7 by 7 feature map with 10 channels. What are the correct values for $N$, $H$, $W$ and $C$?

  1. N = 3, H = 3, W = 3 and C = 10

  2. N = 10, H= 5, W = 5 and C = 3

  3. N = 10, H = 5, W=5 and C = 1

  4. N = 10, H = 3, W = 3 and C=1

  5. N = 3, H = 5, W = 5 and C = 10

Question 4

Within a CNN we apply a 3x3 kernel with stride 1 and padding 1 to a 50x50 greyscale image to obtain an output image of size $H \times W$. In order to produce an $H \times W$ output from the same input image, how many more parameters (including biases) would a fully connected layer require than the aforementioned convolutional layer?

  1. 6,250,490

  2. 2490

  3. 6,252,490

  4. 6,250,000

  5. 2500

Question 5

Given the following statements about RNNs, check all true statements:

  1. The forward pass of an RNN can be applied to individual elements of the same sequence at exactly the same time.

  2. RNNs are primarily used to model data with a grid-like structure.

  3. RNNs are only suitable for sequence-to-sequence modelling.

  4. As well as a hidden state LSTMs also have an additional state called the cell state.

  5. RNNs can be used to learn latent representations of sequential data.

Question 6

Given the following statements about Autoencoders, check all true statements:

  1. Autoencoders implicitly perform clustering.

  2. Autoencoders can generate unseen data effectively.

  3. The encoder of an autoencoder can be used to train a classification network with limited training data.

  4. If an autoencoder was applied to the MNIST dataset, each class would have a single latent representation.

  5. There may be some sparse locations in the latent space produced by an autoencoder.

Question 7

Given the following statements about the bottleneck layer found in Autoencoders, check all true statements:

  1. The bottleneck forces the autoencoder to learn fine-grained details specific to individual images.

  2. The bottleneck layer should be the same size as the output of the network.

  3. The number of neurons in the bottleneck layer and the length of the latent representation are the same in an autoencoder.

  4. The bottleneck layer should be smaller than both the input and output layers.

  5. We need a bottleneck layer to make sure the network doesn’t learn the identity function.

Question 8

Given the following statements about latent variables/features, check all true statements:

  1. If we re-train an autoencoder twice with different random seeds it will learn the same features in latent space.

  2. Increasing the size of the latent vector reduces the reconstruction loss.

  3. Latent variables can be directly observed in the raw data.

  4. Training an autoencoder to represent latent variables is an example of unsupervised learning.

Question 9

Given the following statements about the reparameterization trick, check all true statements:

  1. The reparameterization trick for variational autoencoders utilises that we can sample a normally distributed variable with mean $\mu$ and variance $\sigma^2$ using the following: $z = \mu + \sigma^2 \epsilon$, where $\epsilon \sim N(0,1)$

  2. The reparameterization trick is required to allow gradients to flow through the sampling procedure.

  3. The reparameterization trick is only used at inference time, and not training time.

  4. The reparameterization trick for VAEs is required to make sampling more efficient.

Question 10

Given the following statements relating to variational autoencoders, check all true statements:

  1. KL divergence is measured between the reconstructed image and the ground truth image.

  2. Adding KL divergence to the loss function improves the reconstruction accuracy of VAEs.

  3. The regularisation loss is a function of the encoder parameters only.

  4. KL divergence ensures that the predicted distribution in the latent space is a Uniform distribution.

  5. Given two datapoints $x_1$ and $x_2$ and their latent means $\mu_1$ and $\mu_2$ respectively, to interpolate between $x_1$ and $x_2$ we decode the latent variable $\mu^* = (1 - \alpha) \mu_1 + \alpha \mu_2$ for $\alpha \in [0,1]$

Standard Answer

  1. 1,4
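    Solution:

    Feasibility can be checked with the standard output-size formula $\lfloor (n - F + 2P)/S \rfloor + 1$ for an $n \times n$ input. Below is a minimal MATLAB sketch (the helper name convout is our own):

    % Spatial output size of a square convolution, then flattened length.
    convout = @(n, F, S, P) floor((n - F + 2*P)/S) + 1;
    o = convout(convout(32, 7, 1, 0), 11, 1, 0); o^2 * 2  % option 1: 512, feasible
    o = convout(convout(32, 3, 1, 0),  3, 1, 1); o^2 * 5  % option 2: 4500, not 900
    o = convout(convout(32, 3, 1, 0), 11, 1, 0); o^2 * 2  % option 3: 800, not 1152
    o = convout(convout(32, 3, 1, 0),  3, 1, 1); o^2 * 1  % option 4: 900, feasible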

  2. 1,2,5

  3. 2
    Solution:

    With stride 1 and no padding, the spatial output size satisfies $11 - H + 1 = 7$, so $H = W = 5$; a colour image has $C = 3$ input channels; and 10 output channels require $N = 10$ kernels, which is option 2.

  4. 3
    Solution:

    According to the question, the output image has the same size as the input, $50 \times 50$.

    For a convolutional layer, the number of parameters per filter is $F \times F \times D$, so the total number of parameters (weights plus biases) is

    $$ F \times F \times D \times K + K $$

    where $F$ is the kernel size, $D$ is the number of input channels, and $K$ is the number of filters. Here the image is greyscale ($D = 1$) and a single-channel output requires $K = 1$, so the convolutional layer has $3 \times 3 \times 1 \times 1 + 1 = 10$ parameters.

    For the fully connected layer, both the input and the output have $50 \times 50 = 2500$ neurons, so the number of parameters (including biases) is $2500 \times 2500 + 2500 = 6{,}252{,}500$.

    The difference is $6{,}252{,}500 - 10 = 6{,}252{,}490$, which is option 3.
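
    The same arithmetic in MATLAB, as a quick check:

    conv_params = 3*3*1*1 + 1;           % 10 parameters for the conv layer
    fc_params   = 2500*2500 + 2500;      % 6,252,500 for the fully connected layer
    fc_params - conv_params              % 6,252,490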

  5. 4,5

  6. 1,2,4

  7. 3,4,5

  8. 2,4

  9. 2

  10. 3,5