# The KL divergence

## Outline

### Topics

- Kullback-Leibler (KL) divergence
- Evidence Lower Bound (ELBO)

### Rationale

On the previous page we saw that we need a notion of “closeness” between distributions. The KL divergence is the most common choice in the context of variational inference. Here we define the KL divergence and explain why it is used in variational inference.

## Definition

Given two distributions \(\pi\) and \(q\), define the **Kullback-Leibler (KL) divergence** as: \[\operatorname{KL}(q \| \pi) = \int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x.\]
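For discrete distributions the integral reduces to a sum, which makes the definition easy to check numerically. The sketch below (plain Python; the two distributions are arbitrary illustrative choices) computes \(\operatorname{KL}(q \| \pi)\) this way.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as probability lists.

    Terms with q_i = 0 contribute 0 (by the convention 0 * log 0 = 0).
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
print(kl(q, p))  # a small positive number
print(kl(q, q))  # 0.0: a distribution has zero divergence to itself
```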

**Note:**

- The KL is asymmetric in its two arguments: in general, \(\operatorname{KL}(q \| \pi) \ne \operatorname{KL}(\pi \| q)\).
- We will put the variational approximation \(q\) as the distribution we average over.
- This choice is sometimes called the “backward” or “reverse” KL.
- More on this choice below.
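The asymmetry is easy to verify numerically. A minimal sketch (plain Python; the two discrete distributions are chosen arbitrarily for illustration):

```python
import math

def kl(q, p):
    # KL(q || p) for discrete distributions given as probability lists
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.9, 0.1]
p = [0.5, 0.5]
print(kl(q, p))  # ~0.368
print(kl(p, q))  # ~0.511, so KL(q || p) != KL(p || q)
```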

## Why does variational inference use the reverse KL?

We cover the two key reasons below: it captures “closeness”, and it can be optimized without knowing the normalization constant.

### Reverse KL captures the notion of “closeness”

**Property:** \(\operatorname{KL}(q_1 \| q_2 ) \ge 0\) with equality iff \(q_1 = q_2\).

**Proof:** since \(\log\) is a concave function, Jensen’s inequality gives \(\operatorname{E}[\log X] \le \log \operatorname{E}[X]\), so

\[\begin{align*} \operatorname{KL}(q_1 \| q_2 ) &= \int q_1(x) \log \frac{q_1(x)}{q_2(x)} \mathrm{d}x \;\;\text{(definition)} \\ &= - \int q_1(x) \log \frac{q_2(x)}{q_1(x)} \mathrm{d}x \;\;\text{($-\log a = \log a^{-1}$)} \\ &\ge - \log \int {\color{red} q_1(x)} \frac{q_2(x)}{{\color{red} q_1(x)}} \mathrm{d}x \;\;\text{(Jensen's)} \\ &= - \log \int q_2(x) \mathrm{d}x \;\;\text{(red factors cancel)} \\ &= 0 \;\;\text{($q_2$ is a probability density)}. \end{align*}\]
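The nonnegativity property (sometimes called Gibbs’ inequality) can be sanity-checked numerically in the discrete case. A sketch, with random illustrative distributions:

```python
import math
import random

random.seed(0)

def normalize(w):
    total = sum(w)
    return [x / total for x in w]

# Gibbs' inequality: KL(q1 || q2) >= 0; allow a tiny float tolerance
for _ in range(1000):
    q1 = normalize([random.random() + 1e-3 for _ in range(5)])
    q2 = normalize([random.random() + 1e-3 for _ in range(5)])
    kl = sum(a * math.log(a / b) for a, b in zip(q1, q2))
    assert kl > -1e-12
print("all KL values nonnegative")
```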

### Reverse KL can be optimized

**Requirement:** we want to be able to optimize the objective function without having to compute the intractable normalization constant \(Z\).

Many other notions of distribution “closeness” do not satisfy this.

## Towards optimization of the reverse KL

We show that optimizing the reverse KL does not require knowing the intractable normalization constant \(Z\):

\[\begin{align*} \operatorname{arg\,min}_\phi \operatorname{KL}(q_\phi \| \pi) &= \operatorname{arg\,min}_\phi \int q_\phi(x) \log \frac{q_\phi(x)}{\pi(x)} \mathrm{d}x \\ &= \operatorname{arg\,min}_\phi \int q_\phi(x) \log \frac{q_\phi(x) Z}{\gamma(x)} \mathrm{d}x \\ &= \operatorname{arg\,min}_\phi \int q_\phi(x) \left[ \log q_\phi(x) + \log Z - \log \gamma(x) \right] \mathrm{d}x \\ &= \operatorname{arg\,min}_\phi \underbrace{\int q_\phi(x) \left[ \log q_\phi(x) - \log \gamma(x) \right] \mathrm{d}x}_{L(\phi)} + {\color{red} \log Z} \\ &= \operatorname{arg\,min}_\phi L(\phi) \;\;\text{(red term does not depend on $\phi$)}. \end{align*}\]

**Notice:** \(L(\phi)\) does not involve \(Z\)!

**Terminology:** the negative value of \(L\) is called the Evidence Lower BOund (ELBO), \(\operatorname{ELBO}(\phi) = -L(\phi)\).
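To make this concrete, here is a sketch (plain Python; the target, the grid, and the quadrature settings are all illustrative choices) that minimizes \(L(\phi)\) over a Gaussian family \(q_\phi = \mathcal{N}(m, s^2)\), \(\phi = (m, s)\), using only the unnormalized density \(\gamma\). The target is a Gaussian with mean 2 and standard deviation 0.7, accessed only through \(\log \gamma\).

```python
import math

MU, SIGMA = 2.0, 0.7  # target parameters; only log_gamma is exposed below

def log_gamma(x):
    # unnormalized log target; Z = SIGMA * sqrt(2 * pi) is never computed
    return -(x - MU) ** 2 / (2 * SIGMA ** 2)

def L(m, s, lo=-6.0, hi=12.0, n=1500):
    # L(phi) = E_{q_phi}[log q_phi(x) - log gamma(x)], via a midpoint Riemann sum
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        log_q = -0.5 * math.log(2 * math.pi * s * s) - (x - m) ** 2 / (2 * s * s)
        total += math.exp(log_q) * (log_q - log_gamma(x)) * dx
    return total

# grid search over phi = (m, s); the minimizer should recover (MU, SIGMA)
best = min(((L(m / 10, s / 10), m / 10, s / 10)
            for m in range(0, 41) for s in range(2, 21)),
           key=lambda t: t[0])
print(best)  # minimum near (m, s) = (2.0, 0.7)
```

Note that only \(\log \gamma\) enters the objective; swapping in any other unnormalized density leaves the code unchanged, which is exactly the point of working with \(L(\phi)\) rather than the KL itself. At the minimizer, \(\operatorname{KL}(q_\phi \| \pi) = 0\), so \(L(\phi) = -\log Z\).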

**Question:** how did we get \(\log Z\) outside of the integral (step in red)?

1. By definition.
2. \(-\log a = \log a^{-1}\).
3. Jensen’s inequality.
4. Because \(\pi\) is a probability distribution.
5. Because \(q\) is a probability distribution.

**Answer:** 5. Since \(\log Z\) does not depend on \(x\) and \(q_\phi\) integrates to one,

\[\int q_\phi(x) \log Z \mathrm{d}x = \log Z \int q_\phi(x) \mathrm{d}x = \log Z.\]