Stochastic gradient descent
Outline
Topics
- Gradient descent
- Stochastic gradient descent (SGD)
Rationale
Black-box VI uses SGD (or similar algorithms) to perform optimization over the variational parameters.
The “stochasticity” comes from the fact we approximate the integral in \(L\) using Monte Carlo.¹
Gradient descent
Recall: the gradient \(\nabla L(\phi) = (\partial L/\partial \phi_1, \dots, \partial L/\partial \phi_K)\) is a vector pointing in the direction of steepest ascent.
Gradient descent: at each iteration, move the current point in the direction opposite to the gradient, \[\phi^{(t+1)} \gets \phi^{(t)} - \epsilon_t \nabla L(\phi^{(t)}),\] where \(\epsilon_t > 0\) is called a step size.
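The update rule above can be sketched as follows; the quadratic objective and its gradient are illustrative choices, not from the lecture:

```python
import numpy as np

# Minimal sketch of gradient descent on a convex quadratic,
# L(phi) = 0.5 * ||phi - target||^2, whose gradient is (phi - target).
# `target` and the fixed step size `eps` are illustrative assumptions.

def grad_L(phi, target):
    return phi - target

target = np.array([2.0, -1.0])
phi = np.zeros(2)          # phi^(0)
eps = 0.1                  # constant step size eps_t = 0.1

for t in range(200):
    phi = phi - eps * grad_L(phi, target)   # phi^(t+1) = phi^(t) - eps * grad

print(phi)  # converges toward `target`
```

With a fixed step size the iterates contract toward the minimizer geometrically on this quadratic; for general convex functions the step-size condition in the property below is what guarantees convergence.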
Property: for a suitable choice of step size \(\epsilon_t\), gradient descent converges to the minimum of a convex objective function (Boyd and Vandenberghe, 2004).
Stochastic gradient descent
Idea: replace \(\nabla L\) by a randomized approximation, \(\hat \nabla L\).
Definition: we say \(\hat \nabla L\) is unbiased if \(\mathbb{E}[\hat \nabla L] = \nabla L\).
Note: the expectation in “\(\mathbb{E}[\hat \nabla L]\)” is with respect to the randomness used to create the random approximation. In our situation it will correspond to the randomness of sampling from our current variational family, \(q_{\phi^{(t)}}\).
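As a concrete illustration of unbiasedness (a sketch with an assumed toy objective, not the lecture's \(L\)): take \(L(\phi) = \mathbb{E}_{z \sim \mathcal{N}(\phi, 1)}[z^2] = \phi^2 + 1\), so \(\nabla L(\phi) = 2\phi\). The score-function estimator \(z^2 \, \nabla_\phi \log q_\phi(z) = z^2 (z - \phi)\) is unbiased for this gradient, which we can check by averaging many samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy check of unbiasedness: for L(phi) = E_{z ~ N(phi,1)}[z^2],
# the true gradient is dL/dphi = 2*phi.  The score-function estimator
# z^2 * (z - phi) has this gradient as its expectation, where the
# expectation is over z sampled from q_phi = N(phi, 1).
phi = 0.5
samples = rng.normal(phi, 1.0, size=1_000_000)
grad_hat = samples**2 * (samples - phi)   # one gradient estimate per sample

print(grad_hat.mean())  # approximately 2 * phi = 1.0
```

Averaging a large number of independent estimates recovers \(\nabla L(\phi)\), which is exactly the sense in which \(\hat\nabla L\) is unbiased with respect to sampling from \(q_{\phi^{(t)}}\).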
Property: if \(\hat \nabla L\) is unbiased, the objective function is convex, and the variance of \(\hat \nabla L\) is bounded, stochastic gradient descent converges to the optimal solution as \(t \to \infty\) with \(\epsilon_t = t^{-\alpha}\) for \(\alpha \in (1/2, 1]\) (Benveniste et al., 1990).
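A minimal sketch of this property, assuming a toy one-dimensional quadratic objective with the true gradient corrupted by zero-mean Gaussian noise (so the estimate is unbiased with bounded variance), and the decaying step sizes \(\epsilon_t = t^{-\alpha}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# SGD on L(phi) = 0.5 * (phi - target)^2 with a noisy but unbiased
# gradient estimate: grad_hat = (phi - target) + N(0, 1).
# Step sizes eps_t = t^{-alpha}, alpha in (1/2, 1].
# `target`, the noise scale, and alpha are illustrative assumptions.
target = 3.0
phi = 0.0
alpha = 0.8

for t in range(1, 20001):
    grad_hat = (phi - target) + rng.normal(0.0, 1.0)  # unbiased, bounded variance
    eps_t = t ** (-alpha)
    phi = phi - eps_t * grad_hat

print(phi)  # close to `target` despite the noisy gradients
```

The decaying step sizes average out the gradient noise while still moving far enough in total (\(\sum_t \epsilon_t = \infty\), \(\sum_t \epsilon_t^2 < \infty\) for \(\alpha \in (1/2, 1]\)), which is the classical Robbins–Monro condition behind the cited result.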
Footnotes
¹ On top of that, we may “sub-sample” datapoints at each iteration, adding additional stochasticity.