Optimizers
==================

Good old SGD
------------------

Uses a single step size (learning rate) for all weights.

LR scheduling
******************

A simple schedule is

.. math::

    LR = \frac{1}{c+t}

where :math:`c` is a hyperparam and :math:`t` is the step number.

Momentum
------------------

Adjust weights by a fraction of the previous step's momentum term plus this step's gradient:

.. math::

    m_t = \alpha \, m_{t-1} + g_t

    w_t = w_{t-1} - \mu \, m_t

    g_t \text{ is the gradient at step } t

    \alpha \text{ is a momentum-weighting hyperparam, usually 0.9}

    \mu \text{ is the learning rate}

Adagrad
------------------

* Shrinks the learning rate over time to fine tune
* Has a different learning rate for each param; in theory, different parameters are different distances away from their optimums
* Large accumulated gradients lead to smaller steps over time

.. math::

    \Delta w_t = \frac{\mu}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}} \, g_t

    \mu \text{ is the learning rate}

Adam
-----------

* Combines momentum with per-parameter adaptive updates

Momentum (exponentially weighted gradients):

.. math::

    m_t = \beta_1 \, m_{t-1} + (1-\beta_1) \, g_t

Exponentially weighted squares of grads:

.. math::

    v_t = \beta_2 \, v_{t-1} + (1-\beta_2) \, g_t^2

    \beta_1, \beta_2 \text{ are set close to 1}

Since :math:`m_0` and :math:`v_0` are initialized to zero and :math:`\beta_1, \beta_2` are close to 1, :math:`m_t` and :math:`v_t` are biased toward zero in early steps. Correct by:

.. math::

    \hat{m}_t = \frac{m_t}{1-\beta_1^t}

    \hat{v}_t = \frac{v_t}{1-\beta_2^t}

Final update:

.. math::

    \Delta w_t = \frac{\mu}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t

Common hyperparam values:

.. math::

    \beta_1 = 0.9

    \beta_2 = 0.999

    \epsilon = 10^{-8}

Exercises
----------------------

* Derive all optimizers
* Why exponentially weight the squares of the grads in Adam (rather than summing them as in Adagrad)?

  - The sum could grow too large and permanently suppress the LR
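
Sketches
----------------------

The snippets below are minimal Python/NumPy sketches of the update rules above, not any library's actual API; function names and default values are illustrative.

Plain SGD with the :math:`LR = 1/(c+t)` schedule:

.. code-block:: python

    def sgd_step(w, grad, t, c=1.0):
        """One SGD step with the simple LR = 1/(c+t) schedule."""
        lr = 1.0 / (c + t)
        return w - lr * grad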
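
Momentum, following the equations above (``alpha`` is the momentum weight, ``mu`` the learning rate; both defaults are illustrative):

.. code-block:: python

    def momentum_step(w, m, grad, mu=0.01, alpha=0.9):
        """One SGD-with-momentum step; returns updated weights and momentum."""
        m = alpha * m + grad   # decaying accumulation of past gradients
        w = w - mu * m         # step along the accumulated direction
        return w, m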
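
Adagrad keeps a per-parameter running sum of squared gradients; ``eps`` is a small constant added here (an assumption, not in the formula above) to avoid division by zero:

.. code-block:: python

    import numpy as np

    def adagrad_step(w, g2_sum, grad, mu=0.01, eps=1e-8):
        """One Adagrad step; g2_sum accumulates squared gradients element-wise."""
        g2_sum = g2_sum + grad ** 2
        w = w - mu / (np.sqrt(g2_sum) + eps) * grad
        return w, g2_sum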
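
Adam with bias correction, using the common hyperparam values from above (``mu=0.001`` is an assumed default step size):

.. code-block:: python

    import numpy as np

    def adam_step(w, m, v, grad, t, mu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam step at time step t (t starts at 1)."""
        m = beta1 * m + (1 - beta1) * grad       # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - mu / (np.sqrt(v_hat) + eps) * m_hat
        return w, m, v

    # toy usage: minimize f(w) = ||w||^2 / 2, whose gradient is w itself
    w = np.array([1.0, -2.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, 1001):
        w, m, v = adam_step(w, m, v, grad=w, t=t)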