The KL divergence
Outline
Topics
- Kullback-Leibler (KL) divergence
- Evidence Lower Bound (ELBO)
Rationale
In the previous page we saw that we need a notion of "closeness" between distributions. The KL divergence is the most common choice in the context of variational inference. On this page we define the KL divergence and explain why it is used.
Definition
Given two distributions with densities $q(x)$ and $p(x)$, the Kullback-Leibler (KL) divergence is defined as
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{X \sim q}\!\left[\log \frac{q(X)}{p(X)}\right] = \int q(x) \log \frac{q(x)}{p(x)} \,\mathrm{d}x.$$
Note:
- The KL is asymmetric in its two arguments: in general, $\mathrm{KL}(q \,\|\, p) \neq \mathrm{KL}(p \,\|\, q)$ (see the numerical sketch after this list).
- We will put the variational approximation $q$ in the first argument, i.e., as the distribution we average over: we minimize $\mathrm{KL}(q \,\|\, \pi)$, where $\pi$ is the intractable target (posterior) distribution.
- This is sometimes called the "backward" or "reverse" KL.
- More on this choice below.
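To make the definition and the asymmetry concrete, here is a minimal numerical sketch (my own illustration, not part of the original notes; the two distributions are arbitrary) computing the KL divergence in both directions for a pair of discrete distributions:

```python
import numpy as np

# Two arbitrary distributions on {0, 1, 2}, chosen only for illustration.
q = np.array([0.5, 0.4, 0.1])
p = np.array([0.2, 0.3, 0.5])

def kl(a, b):
    """KL(a || b) = sum_x a(x) log(a(x) / b(x)) for discrete distributions."""
    return np.sum(a * np.log(a / b))

print(kl(q, p))  # ~0.41: the "reverse" direction used in variational inference
print(kl(p, q))  # ~0.54: a different value, illustrating the asymmetry
```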
Why does variational inference use the reverse KL?
We cover the two key reasons below: it captures “closeness” and it can be optimized.
Reverse KL captures the notion of “closeness”
Property: $\mathrm{KL}(q \,\|\, p) \geq 0$, with equality if and only if $q = p$. In other words, the KL divergence is zero exactly when the two distributions coincide, so a small value does capture "closeness".
Proof: since $-\log$ is convex, Jensen's inequality gives (assuming for simplicity that $p$ and $q$ have the same support)
$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{X \sim q}\!\left[-\log \frac{p(X)}{q(X)}\right] \geq -\log \mathbb{E}_{X \sim q}\!\left[\frac{p(X)}{q(X)}\right] = -\log \int p(x)\,\mathrm{d}x = 0,$$
with equality if and only if $p(X)/q(X)$ is constant $q$-almost surely, i.e., $q = p$.
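As a quick numerical check of this property (again my own sketch, with arbitrary distributions, not part of the original notes), the snippet below confirms that the KL divergence vanishes when the two arguments coincide and is strictly positive for randomly drawn pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(a, b):
    # KL(a || b) for discrete distributions with full support.
    return np.sum(a * np.log(a / b))

q = np.array([0.5, 0.4, 0.1])
print(kl(q, q))  # 0.0: the KL of a distribution to itself vanishes

# Random pairs of distributions: the KL is strictly positive when a != b.
for _ in range(5):
    a = rng.dirichlet(np.ones(3))
    b = rng.dirichlet(np.ones(3))
    assert kl(a, b) > 0
```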
Reverse KL can be optimized
Requirement: we want to be able to optimize the objective function without having to compute the intractable normalization constant $Z$ of the target $\pi(x) = \gamma(x)/Z$, where $\gamma$ is an unnormalized density (e.g., prior times likelihood) that we can evaluate pointwise.
Many other notions of "closeness" between distributions do not satisfy this requirement.
Towards optimization of the reverse KL
We show that optimizing the reverse KL does not require knowing the intractable normalization constant $Z$.
Notice: writing the target as $\pi(x) = \gamma(x)/Z$, we have
$$\mathrm{KL}(q \,\|\, \pi) = \mathbb{E}_{X \sim q}\!\left[\log \frac{q(X)}{\pi(X)}\right] = \mathbb{E}_{X \sim q}\!\left[\log \frac{q(X)}{\gamma(X)}\right] + \log Z.$$
Since $\log Z$ is a constant that does not depend on $q$, minimizing the reverse KL over the variational family is equivalent to minimizing $\mathbb{E}_{X \sim q}[\log q(X) - \log \gamma(X)]$, which only involves quantities we can evaluate.
Terminology: the negative value of this tractable objective,
$$\mathrm{ELBO}(q) = \mathbb{E}_{X \sim q}\!\left[\log \gamma(X) - \log q(X)\right],$$
is called the Evidence Lower Bound (ELBO). Minimizing the reverse KL is therefore equivalent to maximizing the ELBO.
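Here is a minimal numerical sketch (my own illustration, not part of the original notes; the unnormalized density `gamma` and the variational distribution `q` below are arbitrary) verifying on a small discrete example that $\mathrm{KL}(q \,\|\, \pi) = \log Z - \mathrm{ELBO}(q)$, so the ELBO is computable from the unnormalized density alone and differs from the negative reverse KL only by the constant $\log Z$:

```python
import numpy as np

# Unnormalized target gamma on {0, 1, 2} (e.g., prior times likelihood, evaluable pointwise).
gamma = np.array([2.0, 1.0, 0.5])
Z = gamma.sum()      # normalization constant, treated as if it were intractable
pi = gamma / Z       # the target (posterior) distribution

def kl(a, b):
    return np.sum(a * np.log(a / b))

def elbo(q):
    # ELBO(q) = E_q[log gamma(X) - log q(X)]; uses only the unnormalized gamma.
    return np.sum(q * (np.log(gamma) - np.log(q)))

q = np.array([0.6, 0.3, 0.1])  # an arbitrary variational approximation
print(kl(q, pi), np.log(Z) - elbo(q))  # equal: KL(q || pi) = log Z - ELBO(q)
```

Because $\log Z$ does not depend on $q$, ranking candidate approximations by their ELBO is the same as ranking them by reverse KL, which is why the normalization constant never needs to be computed.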
Question: how did we get the name "Evidence Lower Bound"?