
Neural Network Basics

Neuron (Perceptron)

$$z^l_j = \sum_k{\omega^l_{jk} a^{l-1}_k} + b^l_j$$
$$a^l_j = \sigma\left(z^l_j\right)$$
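
A minimal NumPy sketch of these two equations (the layer sizes, sigmoid activation, and random values below are illustrative assumptions): the weighted sum is just a matrix-vector product.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes: 3 inputs feeding a layer of 2 neurons (illustrative values only).
rng = np.random.default_rng(0)
a_prev = rng.normal(size=3)     # a^{l-1}_k: activations from the previous layer
W = rng.normal(size=(2, 3))     # w^l_{jk}: weight from neuron k (layer l-1) to neuron j (layer l)
b = rng.normal(size=2)          # b^l_j: biases of layer l

z = W @ a_prev + b              # z^l_j = sum_k w^l_{jk} a^{l-1}_k + b^l_j
a = sigmoid(z)                  # a^l_j = sigma(z^l_j)
print(z, a)
```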

Loss function

L2 loss function

$$Loss \equiv \frac{1}{2} \lVert \mathbf{y} - \mathbf{a}^L \rVert^2 = \frac{1}{2} \sum_i{\left(y_i - a^L_i\right)^2}$$
$$Loss \geq 0 \quad \left(\mathbf{y} \text{ is the given target}\right)$$
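
A short NumPy sketch of this loss (the example vectors are made up):

```python
import numpy as np

def l2_loss(y, a_L):
    # Loss = 0.5 * ||y - a^L||^2, summed over the output neurons; always >= 0.
    return 0.5 * np.sum((y - a_L) ** 2)

print(l2_loss(np.array([1.0, 0.0]), np.array([0.8, 0.3])))  # 0.065
```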

Neural Network Training

The goal of neural network training is to find the weights and biases that minimize the $Loss$. With $\mathbf{w}$ denoting the vector of weights and biases,

$$Loss_{next} = Loss + \Delta Loss \approx Loss + \nabla Loss \cdot \Delta \mathbf{w}$$

Since the $Loss$ must decrease, the condition $\nabla Loss \cdot \Delta \mathbf{w} < 0$ has to be satisfied. $\Delta \mathbf{w}$ can therefore be chosen as follows.

$$\Delta \mathbf{w} = - \eta \nabla Loss = - \epsilon \frac{\nabla Loss}{\lVert \nabla Loss \rVert} \quad (\epsilon > 0)$$

Here $\eta$ is the learning rate and $\epsilon$ is the step size. If the step is too large, the $Loss$ diverges; if it is too small, convergence becomes slow. Choosing an appropriate value is therefore important.

Once $\Delta \mathbf{w}$ is determined, $\mathbf{w}_{next}$ is obtained as follows.

$$\mathbf{w}_{next} = \mathbf{w} + \Delta \mathbf{w}$$
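
A minimal sketch of this update rule, assuming a toy quadratic loss whose gradient is available in closed form and an arbitrary learning rate:

```python
import numpy as np

def gradient_descent_step(w, grad_loss, eta):
    # w_next = w + delta_w, with delta_w = -eta * grad(Loss)
    return w - eta * grad_loss

# Toy quadratic loss Loss(w) = 0.5 * ||w||^2, whose gradient is just w.
w = np.array([1.0, -2.0])
for _ in range(10):
    w = gradient_descent_step(w, grad_loss=w, eta=0.3)
print(w)  # approaches the minimizer at the origin
```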

Stochastic gradient descent

$$\nabla Loss = \frac{1}{n}\sum_x{\nabla Loss_x}$$

When the training dataset is large, computing this average can take a long time. Selecting $m$ samples from the full dataset to form a mini-batch and training on that is called stochastic gradient descent.

$$\nabla Loss = \frac{1}{n}\sum_x{\nabla Loss_x} \approx \frac{1}{m}\sum^m_{i=1}{\nabla Loss_{X_i}}$$

The dataset $X_1, X_2, ..., X_m$ obtained by sampling randomly from the full dataset is called a mini-batch.
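
A minimal sketch of the mini-batch gradient estimate; `grad_loss_single` is a hypothetical per-sample gradient function (e.g. produced by back-propagation), and the toy usage below just pretends each sample is its own "gradient":

```python
import numpy as np

def minibatch_gradient(xs, ys, grad_loss_single, m, rng):
    # Average the per-sample gradients over a random mini-batch of size m,
    # as an estimate of the full-dataset gradient.
    idx = rng.choice(len(xs), size=m, replace=False)
    return sum(grad_loss_single(xs[i], ys[i]) for i in idx) / m

# Toy usage with made-up data and a placeholder per-sample "gradient".
rng = np.random.default_rng(0)
xs = rng.normal(size=(100, 4))
ys = rng.normal(size=(100, 3))
g = minibatch_gradient(xs, ys, lambda x, y: x, m=16, rng=rng)
print(g.shape)  # (4,)
```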

Forward-propagation

Forward-propagation (or the forward pass) is the process of computing the network layer by layer from input to output and storing the results along the way.
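
A minimal sketch of a forward pass, assuming sigmoid activations and weights/biases stored as lists of NumPy arrays; the stored `zs` and `activations` are what back-propagation will reuse:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Compute every z^l and a^l from input to output, keeping them for later use.
    zs, activations = [], [x]
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b       # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)      # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```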

Back-propagation

$$z^l_j = \sum_k{\omega^l_{jk} a^{l-1}_k} + b^l_j$$
$$a^l_j = \sigma\left(z^l_j\right)$$

Because differentiating the $Loss$ directly with respect to every weight and bias is difficult, back-propagation is used to compute $\nabla Loss$.

The error $\delta^l_j$ of neuron $j$ in layer $l$ is defined as follows.

$$\delta^l_j \equiv \frac{\partial Loss}{\partial z^l_j}$$

Since $z^l_j$ is already computed during forward propagation, once $\mathbf{\delta}^{l+1}$ is known, $\delta^l_j$ can be obtained as follows.

$$\begin{aligned} \delta^l_j = \frac{\partial Loss}{\partial z^l_j} & = \sum_i{\frac{\partial Loss}{\partial z^{l+1}_i} \frac{\partial z^{l+1}_i}{\partial z^l_j}} \quad \left( \frac{\partial z^{l+1}_i}{\partial z^l_j} = \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right) \right) \\ & = \sum_i{\frac{\partial Loss}{\partial z^{l+1}_i} \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right)} \\ & = \sum_i{\delta^{l+1}_i \omega^{l+1}_{ij} \, \sigma' \left(z^l_j\right)} \end{aligned}$$

If the L2 loss is used, $a^L_j$ is available from forward propagation and $\delta^L_j = (a^L_j - y_j) \, \sigma'\left(z^L_j\right)$, so the errors can be computed layer by layer as follows.

$$\delta^L_j = (a^L_j - y_j) \, \sigma' \left( z^L_j \right)$$
$$\delta^{L-1}_j = \sum_i{\delta^L_i \omega^L_{ij} \, \sigma' \left(z^{L-1}_j\right)}$$
$$\vdots$$

As a result, $\nabla Loss$ can be obtained from the expressions above as follows.

$$\frac{\partial Loss}{\partial b^l_j} = \frac{\partial Loss}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j$$
$$\frac{\partial Loss}{\partial \omega^l_{jk}} = \frac{\partial Loss}{\partial z^l_j} \frac{\partial z^l_j}{\partial \omega^l_{jk}} = \delta^l_j a^{l-1}_k$$
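
A minimal sketch of these equations, assuming the sigmoid activation and L2 loss used above and the `zs`/`activations` lists returned by the forward-pass sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward(y, zs, activations, weights):
    # Back-propagation for the L2 loss: returns dLoss/db^l and dLoss/dw^l for every layer.
    L = len(weights)
    grad_b = [None] * L
    grad_W = [None] * L

    # Output layer: delta^L_j = (a^L_j - y_j) * sigma'(z^L_j)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta                             # dLoss/db^L_j = delta^L_j
    grad_W[-1] = np.outer(delta, activations[-2])  # dLoss/dw^L_{jk} = delta^L_j a^{L-1}_k

    # Hidden layers: delta^l_j = sum_i delta^{l+1}_i w^{l+1}_{ij} * sigma'(z^l_j)
    for l in range(L - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_b[l] = delta
        grad_W[l] = np.outer(delta, activations[l])
    return grad_b, grad_W
```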

Training

Training consists of initializing the weights and biases randomly and then repeating forward-propagation -> back-propagation -> weights and biases update. When the $Loss$ is judged to be unable to decrease any further, the weights and biases at that point are the final result.
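
A minimal sketch of this loop, reusing the `forward()` and `backward()` helpers sketched above; the layer sizes, learning rate, random toy data, and fixed epoch count (standing in for a real stopping criterion) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3]   # illustrative layer widths
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
training_data = [(rng.normal(size=4), rng.normal(size=3)) for _ in range(32)]  # toy (x, y) pairs
eta = 0.1

for epoch in range(100):
    for x, y in training_data:
        zs, activations = forward(x, weights, biases)           # forward pass
        grad_b, grad_W = backward(y, zs, activations, weights)  # back-propagation
        # Update: w_next = w - eta * dLoss/dw, b_next = b - eta * dLoss/db
        weights = [W - eta * gW for W, gW in zip(weights, grad_W)]
        biases = [b - eta * gb for b, gb in zip(biases, grad_b)]
```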

Initialization

If the variance of a layer's outputs grows larger than the variance of its inputs, the activations can be pushed into saturating regions of the activation function, which can cause the vanishing gradient problem.

Initializing the parameters appropriately helps with problems such as vanishing and exploding gradients and can speed up training.
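
A small illustration of the variance argument (the layer width and scaling factor below are arbitrary choices): with unit-variance weights the output variance grows with the fan-in, while scaling the weights by $1/\sqrt{fan\_in}$ keeps it comparable to the input variance.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512
x = rng.normal(size=fan_in)                 # inputs with variance ~1

W_naive = rng.normal(size=(fan_in, fan_in))                     # std 1: output variance grows with fan_in
W_scaled = rng.normal(size=(fan_in, fan_in)) / np.sqrt(fan_in)  # variance-preserving scale

print(np.var(W_naive @ x))   # roughly fan_in (~512): activations get pushed toward saturation
print(np.var(W_scaled @ x))  # roughly 1: variance stays comparable to the input
```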

