Optimizers#
Good old SGD#
Use a single global step size (learning rate) for all weights
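A minimal sketch of the update, assuming NumPy arrays and a toy quadratic loss (all names here are my own):

```python
import numpy as np

def sgd_step(params, grads, lr=0.01):
    """One plain SGD step: every weight moves by the same global step size."""
    return params - lr * grads

# Toy usage: one step on w for the loss (w - 3)^2, whose gradient is 2*(w - 3).
w = np.array([0.0])
g = 2 * (w - 3.0)
w = sgd_step(w, g, lr=0.1)
```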
LR scheduling#
A simple schedule is:
\[LR = \frac{1}{c+t}\]
where c is a hyperparam and t is the step number
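A sketch of this schedule as a function (the function name and the example value of c are my own):

```python
def lr_schedule(t, c=10.0):
    """Decaying learning rate: LR = 1 / (c + t)."""
    return 1.0 / (c + t)

# The step size shrinks as training progresses.
print([round(lr_schedule(t), 4) for t in range(0, 50, 10)])
```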
Momentum#
Adjust each weight using a fraction of the previous step's momentum (the accumulated gradient) plus this step's gradient
\[ \begin{align}\begin{aligned}m_t = \alpha*m_{t-1} + g_t\\w_t = w_{t-1} - \mu * m_t\\
\alpha \text{ is a momentum-weighting hyperparam, usually 0.9}\\\mu \text{ is the learning rate}\\g_t \text{ is the gradient at step } t\end{aligned}\end{align} \]
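A sketch of one momentum step under the formulas above; the velocity `m` is carried across steps, and the function and variable names are assumptions:

```python
import numpy as np

def momentum_step(w, g, m, alpha=0.9, mu=0.01):
    """m_t = alpha * m_{t-1} + g_t;  w_t = w_{t-1} - mu * m_t."""
    m = alpha * m + g
    w = w - mu * m
    return w, m

w = np.zeros(3)
m = np.zeros_like(w)          # velocity starts at zero
g = np.array([1.0, -2.0, 0.5])
w, m = momentum_step(w, g, m)
```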
Adagrad#
Shrinks the learning rate over time to fine-tune
Has a different learning rate for each param; in theory, different parameters are different distances away from their optima
Large accumulated gradients lead to smaller steps over time
\[ \begin{align}\begin{aligned}\Delta w_t = \frac{\mu}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}} * g_t\\\mu \text{ is the learning rate}\end{aligned}\end{align} \]
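A sketch of one Adagrad step; `G` holds the per-parameter sum of squared gradients, and the small `eps` added to the denominator (to avoid division by zero) is my own addition:

```python
import numpy as np

def adagrad_step(w, g, G, mu=0.1, eps=1e-8):
    """G accumulates g_t^2 per parameter; larger accumulated gradients get smaller steps."""
    G = G + g ** 2
    w = w - (mu / (np.sqrt(G) + eps)) * g
    return w, G

w = np.zeros(3)
G = np.zeros_like(w)          # per-parameter sum of squared gradients
g = np.array([1.0, -2.0, 0.5])
w, G = adagrad_step(w, g, G)
```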
Adam#
Combines momentum with per-parameter updates
Momentum:
\[m_t = B_1 * m_{t-1} + (1-B_1) * g_t\]
Exponentially weighted average of squared gradients:
\[ \begin{align}\begin{aligned}v_t = B_2*v_{t-1} + (1-B_2)*g^2_t\\B_1, B_2 \text{ set close to 1}\end{aligned}\end{align} \]
Since m_t and v_t are initialized to 0 and B_1, B_2 are close to 1, the early estimates are biased toward 0. Correct with:
\[ \begin{align}\begin{aligned}\hat{m}_t = \frac{m_t}{1-B_1^t}\\\hat{v}_t = \frac{v_t}{1-B_2^t}\end{aligned}\end{align} \]
Final Update:
\[\Delta w_t = \frac{\mu}{\sqrt{\hat{v}_t} + \epsilon} * \hat{m}_t\]
Common hyperparam values:
\[ \begin{align}\begin{aligned}B_1 = 0.9\\B_2 = 0.999\\\epsilon = 10^{-8}\end{aligned}\end{align} \]
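A sketch that puts the pieces together as one Adam step; the state `m`, `v`, and the step counter `t` are carried across calls, and all names are assumptions:

```python
import numpy as np

def adam_step(w, g, m, v, t, mu=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: momentum + exponentially weighted squared gradients + bias correction."""
    m = b1 * m + (1 - b1) * g            # first moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2       # second moment (squared gradients)
    m_hat = m / (1 - b1 ** t)            # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - (mu / (np.sqrt(v_hat) + eps)) * m_hat
    return w, m, v

w = np.zeros(3)
m = np.zeros_like(w)
v = np.zeros_like(w)
g = np.array([1.0, -2.0, 0.5])
w, m, v = adam_step(w, g, m, v, t=1)
```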
Exercises#
Derive all optimizers
Why weight the squared gradients exponentially in Adam instead of summing them as in Adagrad? - The sum could grow too large and permanently suppress the learning rate
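A quick numeric illustration of that answer, assuming a constant squared gradient of 1: Adagrad's accumulated denominator keeps growing, while Adam's exponential average levels off near 1:

```python
import numpy as np

steps = 10_000
adagrad_sum, adam_v, b2 = 0.0, 0.0, 0.999
for t in range(steps):
    g2 = 1.0                              # constant squared gradient
    adagrad_sum += g2                     # grows without bound -> step size -> 0
    adam_v = b2 * adam_v + (1 - b2) * g2  # converges to ~1
print(np.sqrt(adagrad_sum), np.sqrt(adam_v))  # ~100 vs ~1
```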