Stochastic gradient descent

Outline

Topics

  • Gradient descent
  • Stochastic gradient descent (SGD)

Rationale

Black-box VI uses SGD (or similar algorithms) to perform optimization over the variational parameters.

The “stochasticity” comes from the fact we approximate the integral in \(L\) using Monte Carlo.1

Gradient descent

Recall: the gradient \(\nabla L(\phi) = (\partial L/\partial \phi_1, \dots, \partial L/\partial \phi_K)\) is a vector pointing towards the direction of steepest ascent.

Gradient descent: at each iteration, move the current point in the direction opposite to the gradient, \[\phi^{(t+1)} \gets \phi^{(t)} - \epsilon_t \nabla L(\phi^{(t)}),\] where \(\epsilon_t > 0\) is called a step size.

Property: for the right choice of step size \(\epsilon_t\), the gradient descent algorithm converges to the minimum of the convex function (Boyd and Vandenberghe, 2004).

Stochastic gradient descent

Idea: replace \(\nabla L\) by a randomized approximation, \(\hat \nabla L\).

Definition: we say \(\hat \nabla L\) is unbiased if \(\mathbb{E}[\hat \nabla L] = \nabla L\).

Note: the expectation in “\(\mathbb{E}[\hat \nabla L]\)” is with respect to the randomness used to create the random approximation. In our situation it will correspond to the randomness of sampling from our current variational family, \(q_{\phi^{(t)}}\).

Property: if \(\hat \nabla L\) is unbiased, the objective function is convex, and the variance of \(\mathbb{E}[\hat \nabla L]\) is bounded, stochastic gradient descent converges to the optimal solution \(t \to \infty\) with \(\epsilon_t = t^{-\alpha}\) for \(\alpha \in (1/2, 1]\) (Benveriste et al., 1990).

Footnotes

  1. On top of that, we may “sub-sample” datapoints at each iteration, adding additional stochasticity.↩︎