Notes - Machine Learning MT23, Gradient descent

Flashcards
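In general, given data $\mathcal D = \langle (\pmb x _ i, y _ i) \rangle^N _ {i = 1}$ and some parameters $\pmb w$ to optimise, what does the objective function in a machine learning setting look like?
\[\mathcal L(\pmb w, \mathcal D) = \frac{1}{N} \sum^N_{i=1} \ell(\pmb w; \pmb x_i, y_i) + \lambda \mathcal R(\pmb w)\]
That is, the average of a per-example loss $\ell$ plus a regularisation term $\mathcal R$ weighted by $\lambda$.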
What is one of the most important hyperparameters related to gradient descent?
The learning rate.
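As a minimal illustration of why the learning rate matters, the sketch below runs plain gradient descent on the toy function $f(w) = w^2$. The function, the name `gradient_descent` and the two step sizes are assumptions chosen for illustration, not something taken from the course.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Run gradient descent on f(w) = w**2, whose derivative is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w  # update: w <- w - lr * f'(w)
    return w

small = gradient_descent(lr=0.1)  # each step multiplies w by 0.8, so w shrinks towards 0
large = gradient_descent(lr=1.1)  # each step multiplies w by -1.2, so |w| grows without bound
print(abs(small), abs(large))
```

With the smaller step size the iterates approach the minimiser at $w = 0$; with the larger one every step overshoots and the iterates diverge.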
The gradient of a typical objective function in a machine learning setting often looks like
\[\nabla_{\pmb w} \mathcal L(\pmb w, \mathcal D) = \frac{1}{N} \sum^N_{i=1} \nabla_{\pmb w} \ell(\pmb w; \pmb x_i, y_i) + \lambda \nabla_{\pmb w} \mathcal R(\pmb w)\]
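A small sketch of this formula for one concrete case may help. It assumes a squared loss $\ell(\pmb w; \pmb x_i, y_i) = (\pmb w^\top \pmb x_i - y_i)^2$ and regulariser $\mathcal R(\pmb w) = \|\pmb w\|^2$. Both are illustrative choices, as are the names `objective` and `gradient`. The result is checked against finite differences.

```python
import numpy as np

def objective(w, X, y, lam):
    """Mean squared loss plus lam * ||w||^2 (assumed loss and regulariser)."""
    residual = X @ w - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def gradient(w, X, y, lam):
    """(1/N) * sum_i grad loss_i + lam * grad R, for the choices above."""
    residual = X @ w - y
    data_term = 2 * X.T @ residual / len(y)  # average of 2 * (w.x_i - y_i) * x_i over i
    reg_term = 2 * lam * w                   # gradient of lam * ||w||^2
    return data_term + reg_term

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)
lam = 0.1

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
numeric = np.array([
    (objective(w + eps * e, X, y, lam) - objective(w - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(3)
])
matches = np.allclose(gradient(w, X, y, lam), numeric, atol=1e-5)
print(matches)
```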
What is the idea behind stochastic gradient descent, and why does it work?
Pick a data point $i$ uniformly at random and use $\pmb g _ i = \nabla _ {\pmb w} \ell(\pmb w; \pmb x _ i, y _ i)$ as an estimate of the gradient. This works because
\[\mathbb E[\pmb g_i] = \frac{1}{N} \sum^N_{i=1} \nabla_{\pmb w} \ell(\pmb w; \pmb x_i, y_i)\]
which is exactly the gradient of the objective without the regularisation term, so $\pmb g _ i$ is an unbiased estimate of the data term. Then add the gradient of the regularisation term (which doesn’t depend on the data point used, and might need to be scaled appropriately) and perform a gradient update.
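The following is a minimal sketch of stochastic gradient descent, assuming a squared loss and the regulariser $\lambda \|\pmb w\|^2$ on synthetic data. The data, the learning rate, the step count and the helper names are all illustrative assumptions. It first checks the expectation identity numerically, then runs the updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])          # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=N)
lam = 0.01

def objective(w):
    """Mean squared loss plus lam * ||w||^2."""
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def per_example_grad(w, i):
    """Gradient of the squared loss on data point i only."""
    return 2 * (X[i] @ w - y[i]) * X[i]

# Unbiasedness: averaging g_i over all i recovers the full data-term gradient.
w0 = np.zeros(d)
mean_of_g = np.mean([per_example_grad(w0, i) for i in range(N)], axis=0)
full_data_grad = 2 * X.T @ (X @ w0 - y) / N

# SGD: one uniformly sampled point per step, plus the regulariser's gradient.
w = w0.copy()
lr = 0.01
for _ in range(5000):
    i = rng.integers(N)                          # uniform over {0, ..., N-1}
    g = per_example_grad(w, i) + 2 * lam * w     # data estimate + gradient of lam * ||w||^2
    w -= lr * g

print(objective(w0), objective(w))
```

Because the objective here averages the loss over the data, the regulariser's gradient is added at full weight on every step. An objective that sums the loss instead would need a different scaling.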
What is minibatching in relation to stochastic gradient descent, and why is this better than standard stochastic gradient descent?
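Pick a set of data points at random and use the average of their gradients. This is still an unbiased estimate of the gradient, but it has lower variance than a single-point estimate (for $B$ independently sampled points, the variance falls by a factor of $B$).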
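The variance reduction can be observed empirically. The sketch below assumes a squared loss on synthetic data and samples batches with replacement. The batch sizes, the trial count and the function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = np.zeros(d)

def minibatch_grad(batch_size):
    """Average squared-loss gradient over a batch sampled with replacement."""
    idx = rng.integers(N, size=batch_size)
    residual = X[idx] @ w - y[idx]
    return 2 * X[idx].T @ residual / batch_size

def estimator_variance(batch_size, trials=2000):
    """Variance of the minibatch gradient estimate, summed over coordinates."""
    grads = np.array([minibatch_grad(batch_size) for _ in range(trials)])
    return grads.var(axis=0).sum()

variances = {b: estimator_variance(b) for b in (1, 16, 64)}
print(variances)
```

The printed variances should shrink roughly in proportion to the batch size, up to sampling noise in the estimate itself.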
What is a practical, nonmathematical advantage of stochastic gradient descent over normal gradient descent?
You don’t need to load all the data into memory at once.
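One way this can look in practice is sketched below. It assumes the features are stored in a `.npy` file and read through NumPy's memory mapping. The file name `features.npy` and the sizes are hypothetical.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.npy")   # hypothetical file name
    np.save(path, rng.normal(size=(N, d)))     # stand-in for a dataset already on disk

    # mmap_mode="r" maps the file rather than reading it all into RAM.
    X_on_disk = np.load(path, mmap_mode="r")

    idx = np.sort(rng.integers(N, size=32))    # row indices of a random minibatch
    batch = np.asarray(X_on_disk[idx])         # the selected rows are copied into an in-memory array
    batch_shape = batch.shape

    del X_on_disk                              # release the mapping before the directory is removed

print(batch_shape)
```

Each update then only touches the rows of the current minibatch, whereas full gradient descent would need a pass over every row for a single step.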