Neural Network Basics


Inference#
$$z^l_j = \sum_k \omega^l_{jk} a^{l-1}_k + b^l_j$$

$$a^l_j = \sigma\left(z^l_j\right)$$
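In code, an inference pass over these two equations might look like the sketch below; the layer sizes, variable names, and sigmoid activation are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def sigma(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, weights, biases):
    """Inference pass: for each layer, z^l = W^l a^{l-1} + b^l and a^l = sigma(z^l).
    Returns the weighted inputs z^l and activations a^l, both of which are
    needed later for back-propagation."""
    zs, activations = [], [a]
    for W, b in zip(weights, biases):
        z = W @ a + b   # z^l_j = sum_k w^l_jk a^{l-1}_k + b^l_j
        a = sigma(z)    # a^l_j = sigma(z^l_j)
        zs.append(z)
        activations.append(a)
    return zs, activations
```

Here `weights[l]` has shape `(n_l, n_{l-1})`, so each matrix-vector product implements the sum over `k` in one step.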

Cost (Loss) function#

Quadratic cost function#

$$C \equiv \frac{1}{2} \lVert \mathbf{y} - \mathbf{a}^L \rVert^2 = \frac{1}{2} \sum_i \left(y_i - a^L_i\right)^2$$

$$C \geq 0 \quad \left(\mathbf{y}\text{ is the desired output}\right)$$
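A minimal sketch of this cost for a single training example (the function name is made up for illustration):

```python
import numpy as np

def quadratic_cost(a_L, y):
    """C = 1/2 * ||y - a^L||^2 : non-negative, zero only when a^L == y."""
    return 0.5 * np.sum((y - a_L) ** 2)
```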

Neural Network Training#

What we need to find through neural network training are the weights and biases that minimize the cost function. When $\mathbf{w}$ is a vector representing the weights and biases,

$$C_{next} = C + \Delta C \approx C + \nabla C \cdot \Delta \mathbf{w}$$

Since $C$ should decrease, we need $\nabla C \cdot \Delta \mathbf{w} < 0$. Therefore, $\Delta \mathbf{w}$ can be determined as

$$\Delta \mathbf{w} = - \eta \nabla C = - \epsilon \frac{\nabla C}{\lVert \nabla C \rVert} \quad (\epsilon > 0)$$

$\eta$ is called the learning rate and $\epsilon$ is called the step size. If the step is too large, $C$ may diverge; if it is too small, convergence may be slow, so an appropriate value should be chosen.

Once $\Delta \mathbf{w}$ is determined, $\mathbf{w}_{next}$ is

$$\mathbf{w}_{next} = \mathbf{w} + \Delta \mathbf{w}$$
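The update rule above can be sketched on a toy cost whose gradient is known in closed form; the cost $C(\mathbf{w}) = \lVert \mathbf{w} \rVert^2$ and the hyperparameters here are illustrative assumptions:

```python
import numpy as np

def gradient_step(w, grad, eta=0.1):
    """w_next = w + Delta w, with Delta w = -eta * grad C(w)."""
    return w - eta * grad

# Toy cost C(w) = ||w||^2 with gradient 2w: repeated steps drive C toward 0.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_step(w, 2.0 * w, eta=0.1)
```

With this learning rate each step scales `w` by 0.8, so the cost shrinks geometrically; a much larger `eta` would make it diverge instead.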

Stochastic gradient descent#

$$\nabla C = \frac{1}{n}\sum_x \nabla C_x$$

When the number of training inputs is very large, computing this average over all of them can take a long time. Stochastic gradient descent works by estimating $\nabla C$ from a small number $m$ of randomly chosen training inputs.

$$\nabla C = \frac{1}{n}\sum_x \nabla C_x \approx \frac{1}{m}\sum^m_{i=1} \nabla C_{X_i}$$

Those random training inputs $X_1, X_2, \ldots, X_m$ are called a mini-batch.
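The approximation can be checked numerically. The per-example "gradients" below are synthetic stand-ins for $\nabla C_x$, and the sizes $n$ and $m$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
per_example_grads = rng.standard_normal(10_000)   # stand-ins for grad C_x

full_gradient = per_example_grads.mean()          # exact average over all n inputs
mini_batch = rng.choice(per_example_grads, size=256, replace=False)
sgd_estimate = mini_batch.mean()                  # average over m << n inputs
```

With $m = 256$ out of $n = 10{,}000$, the mini-batch mean is a noisy but cheap estimate of the full average; its standard error shrinks like $1/\sqrt{m}$.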


Back-propagation#
Back-propagation is used to find $\nabla C$, because obtaining $\nabla C$ by differentiating the cost function directly with respect to every weight and bias is computationally impractical.

The error $\delta^l_j$ of neuron $j$ in layer $l$ is defined as

$$\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}$$

Since $z^l_j$ was obtained from inference, if we know $\boldsymbol{\delta}^{l+1}$ we can get $\delta^l_j$ as below.

$$\begin{aligned} \delta^l_j = \frac{\partial C}{\partial z^l_j} &= \sum_i \frac{\partial C}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial z^l_j} \quad \left( \frac{\partial z^{l+1}_i}{\partial z^l_j} = \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right) \right)\\ &= \sum_i \frac{\partial C}{\partial z^{l+1}_i} \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right) \\ &= \sum_i \delta^{l+1}_i \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right) \end{aligned}$$

Since $a^L_j$ was obtained from inference and, for the quadratic cost, $\delta^L_j = \left(a^L_j - y_j\right) \sigma'\left(z^L_j\right)$, we can get all the errors like this:

$$\delta^L_j = \left(a^L_j - y_j\right) \sigma'\left(z^L_j\right)$$

$$\delta^{L-1}_j = \sum_i \delta^L_i \omega^L_{ij} \, \sigma'\left(z^{L-1}_j\right)$$

$$\vdots$$
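The backward recursion for the errors can be sketched as below. The quadratic cost and sigmoid activation are assumed as above; shapes and names are illustrative:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)  # derivative of the sigmoid

def backprop_deltas(zs, a_L, y, weights):
    """Errors per layer, computed from the output layer backwards:
    delta^L = (a^L - y) * sigma'(z^L)
    delta^l = (W^{l+1}.T @ delta^{l+1}) * sigma'(z^l)
    """
    delta = (a_L - y) * sigma_prime(zs[-1])
    deltas = [delta]
    for l in range(len(zs) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigma_prime(zs[l])
        deltas.insert(0, delta)
    return deltas
```

The transpose `W.T @ delta` implements the sum over $i$ of $\delta^{l+1}_i \omega^{l+1}_{ij}$ for every $j$ at once.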

Finally, $\nabla C$ can be obtained by using the errors computed above.

$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j$$

$$\frac{\partial C}{\partial \omega^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial \omega^l_{jk}} = \delta^l_j a^{l-1}_k$$
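Given the errors, these two formulas reduce to one outer product per layer; a sketch, with `activations` holding $a^{l-1}$ for each layer (the input plus all hidden activations):

```python
import numpy as np

def gradients_from_deltas(deltas, activations):
    """dC/db^l_j = delta^l_j and dC/dw^l_jk = delta^l_j * a^{l-1}_k."""
    grad_b = [d.copy() for d in deltas]
    grad_w = [np.outer(d, a_prev) for d, a_prev in zip(deltas, activations)]
    return grad_w, grad_b
```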


Set the initial weights and biases to random values and repeat the process Inference -> Back-propagation -> weights and biases update. When it is judged that $C$ cannot be made smaller, the final weights and biases are determined.
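Putting the pieces together, that loop might look like the sketch below: full-batch gradient descent with the quadratic cost and sigmoid activations, as above. The dataset (the AND function) and all hyperparameters are made-up illustrations, not from the text:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def train(X, Y, sizes, eta=0.5, epochs=500, seed=0):
    """Random init, then repeat: inference -> back-propagation -> update.
    Returns the final parameters plus the average cost per epoch."""
    rng = np.random.default_rng(seed)
    Ws = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
    bs = [rng.standard_normal(n) for n in sizes[1:]]
    history = []
    for _ in range(epochs):
        gW = [np.zeros_like(W) for W in Ws]
        gb = [np.zeros_like(b) for b in bs]
        cost = 0.0
        for x, y in zip(X, Y):
            # inference: record z^l and a^l for every layer
            a, acts, zs = x, [x], []
            for W, b in zip(Ws, bs):
                z = W @ a + b
                a = sigma(z)
                zs.append(z)
                acts.append(a)
            cost += 0.5 * np.sum((y - a) ** 2)
            # back-propagation: errors from the output layer backwards
            delta = (a - y) * sigma_prime(zs[-1])
            deltas = [delta]
            for l in range(len(zs) - 2, -1, -1):
                delta = (Ws[l + 1].T @ delta) * sigma_prime(zs[l])
                deltas.insert(0, delta)
            for l in range(len(Ws)):
                gW[l] += np.outer(deltas[l], acts[l])
                gb[l] += deltas[l]
        # update: w_next = w - eta * grad C, averaged over the batch
        n = len(X)
        Ws = [W - eta * g / n for W, g in zip(Ws, gW)]
        bs = [b - eta * g / n for b, g in zip(bs, gb)]
        history.append(cost / n)
    return Ws, bs, history
```

In practice one would stop when the cost plateaus rather than after a fixed number of epochs; the fixed `epochs` here just keeps the sketch simple.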

