# Stochastic gradient descent

## Outline

### Topics

- Gradient descent
- Stochastic gradient descent (SGD)

### Rationale

Black-box VI uses SGD (or similar algorithms) to perform optimization over the variational parameters.

The “stochasticity” comes from the fact that we approximate the integral in \(L\) using Monte Carlo.[^1]

## Gradient descent

**Recall:** the gradient \(\nabla L(\phi) = (\partial L/\partial \phi_1, \dots, \partial L/\partial \phi_K)\) is a vector pointing in the direction of steepest ascent.

**Gradient descent:** at each iteration, move the current point in the direction opposite to the gradient, \[\phi^{(t+1)} \gets \phi^{(t)} - \epsilon_t \nabla L(\phi^{(t)}),\] where \(\epsilon_t > 0\) is called a **step size**.
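As a minimal sketch of the update rule above (the quadratic objective and all names here are illustrative, not from the source):

```python
import numpy as np

def gradient_descent(grad, phi0, step_size=0.1, iters=100):
    """Repeat phi <- phi - eps * grad(phi) for a fixed step size eps."""
    phi = np.asarray(phi0, dtype=float)
    for _ in range(iters):
        phi = phi - step_size * grad(phi)
    return phi

# Toy convex objective L(phi) = ||phi - c||^2 / 2, with gradient phi - c,
# so the minimizer is phi* = c.
c = np.array([1.0, -2.0])
phi_star = gradient_descent(lambda phi: phi - c, phi0=np.zeros(2))
```

With a fixed step size of 0.1, the error shrinks by a factor of 0.9 per iteration, so the iterate is very close to \(c\) after 100 steps.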

**Property:** for a suitable choice of step size \(\epsilon_t\), gradient descent converges to the minimum of a convex function (Boyd and Vandenberghe, 2004).

## Stochastic gradient descent

**Idea:** replace \(\nabla L\) by a randomized approximation, \(\hat \nabla L\).

**Definition:** we say \(\hat \nabla L\) is **unbiased** if \(\mathbb{E}[\hat \nabla L] = \nabla L\).

**Note:** the expectation in “\(\mathbb{E}[\hat \nabla L]\)” is with respect to the randomness used to create the random approximation. In our situation it will correspond to the randomness of sampling from our current variational family, \(q_{\phi^{(t)}}\).
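To illustrate unbiasedness on a toy problem (the objective and reparameterization below are illustrative assumptions, not from the source): for \(L(\phi) = \mathbb{E}_{z \sim N(\phi,1)}[z^2] = \phi^2 + 1\), writing \(z = \phi + \varepsilon\) with \(\varepsilon \sim N(0,1)\) gives the single-sample estimate \(\hat\nabla L = 2(\phi + \varepsilon)\), whose expectation over \(\varepsilon\) is \(2\phi = \nabla L(\phi)\):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 1.5

# Toy objective: L(phi) = E_{z ~ N(phi, 1)}[z^2] = phi^2 + 1, so grad L = 2*phi.
# Reparameterize z = phi + eps with eps ~ N(0, 1); a single-sample
# gradient estimate is then 2*(phi + eps), which is unbiased.
eps = rng.standard_normal(100_000)
grad_hat = 2 * (phi + eps)

print(grad_hat.mean())  # close to 2*phi = 3.0
```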

**Property:** if \(\hat \nabla L\) is unbiased, the objective function is convex, and the variance of \(\hat \nabla L\) is bounded, stochastic gradient descent converges to the optimal solution as \(t \to \infty\) with \(\epsilon_t = t^{-\alpha}\) for \(\alpha \in (1/2, 1]\) (Benveniste et al., 1990).
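A minimal sketch of SGD with the decaying step-size schedule \(\epsilon_t = t^{-\alpha}\), run on a toy objective with noisy but unbiased gradients (the toy problem and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd(grad_hat, phi0, alpha=0.75, iters=5000):
    """SGD with Robbins-Monro step sizes eps_t = t^(-alpha), alpha in (1/2, 1]."""
    phi = float(phi0)
    for t in range(1, iters + 1):
        phi -= t ** (-alpha) * grad_hat(phi)
    return phi

# Toy objective L(phi) = E[(phi + eps)^2] = phi^2 + 1 with minimizer phi* = 0;
# the estimator 2*(phi + eps), eps ~ N(0, 1), is noisy but unbiased.
phi_hat = sgd(lambda phi: 2 * (phi + rng.standard_normal()), phi0=5.0)
```

Because the step sizes shrink, the gradient noise is averaged out and the iterate settles near the minimizer rather than bouncing around it, as a fixed step size would.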

[^1]: On top of that, we may “sub-sample” datapoints at each iteration, adding further stochasticity.