Optimality

Outline

Topics

  • Average loss optimality of Bayes estimators
  • Limitations of this notion of optimality

Rationale

Formalize and justify one of the motivations for Bayesian methods covered in the first lecture.

Setup

We return to the decision-theoretic setup from week 2.

Estimator

Definition: an estimator is a function \(\delta\) such that:

  • \(\delta\) takes a dataset \(y\) (i.e., the observations) as input, written \(\delta(y)\), and
  • returns an action, i.e., \(\delta(y) \in A\).

Examples:

  • the Bayes estimator,
  • the maximum likelihood estimator (when the action is a parameter you are trying to estimate); both are sketched in code below.
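
To make the definition concrete, here is a minimal sketch in Python, assuming a coin-flip setting in which datasets are lists of 0/1 outcomes and actions are estimates of the success probability (these choices, and the function names, are illustrative rather than taken from the lecture):

```python
from typing import Callable, List

# An estimator is just a function mapping a dataset y to an action in A.
Estimator = Callable[[List[int]], float]

def mle_proportion(y: List[int]) -> float:
    """Maximum likelihood estimator: the sample proportion of successes."""
    return sum(y) / len(y)

def bayes_posterior_mean(y: List[int]) -> float:
    """Bayes estimator under squared-error loss and a uniform Beta(1, 1)
    prior: the posterior mean, (successes + 1) / (trials + 2)."""
    return (sum(y) + 1) / (len(y) + 2)
```

For example, `mle_proportion([1, 0, 1, 1])` returns 0.75, while `bayes_posterior_mean([1, 0, 1, 1])` returns about 0.67.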

Optimality: motivation

Consider the following situation. You are watching a talk where the presenter explains that they did the following:

  • Let:
    • \(\delta_{\text{B}}\) denote the Bayes estimator, and
    • \(\delta_{\text{o}}\) denote another estimator, say one based on a sophisticated deep neural network.
  • To compare the two estimators, the presenter generated a large number of synthetic datasets \((X^{(m)}, Y^{(m)})\), \(m = 1, \dots, M\), by forward simulation from the model.
  • They evaluated the performance of the Bayes estimator as follows (a code sketch of this procedure appears after this list):
    • apply the estimator: \(a^{(m)}= \delta_{\text{B}}(Y^{(m)})\),
    • compute the loss: \(l^{(m)}= L(a^{(m)}, X^{(m)})\), and
    • return the average loss as the measure of performance: \[\text{average loss} = \frac{1}{M} \sum_{m=1}^{M} l^{(m)}.\]
  • Then, they did the same with the other estimator, \(\delta_{\text{o}}\).
  • Suppose that the presenter reported a lower average loss for the other estimator, \(\delta_{\text{o}}\).
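
In code, the presenter's evaluation protocol might look like the following minimal sketch; `forward_simulate`, `loss`, `delta_bayes` and `delta_other` are hypothetical stand-ins for whatever the presenter actually used:

```python
import numpy as np

def average_loss(estimator, loss, forward_simulate, M=10_000, seed=1):
    """Monte Carlo estimate of the average loss E[L(estimator(Y), X)].

    Assumes `forward_simulate(rng)` returns one pair (x, y) drawn from the
    joint distribution of the model, `estimator(y)` returns an action, and
    `loss(a, x)` returns a real number.
    """
    rng = np.random.default_rng(seed)
    losses = []
    for _ in range(M):
        x, y = forward_simulate(rng)   # synthetic dataset (X^(m), Y^(m))
        a = estimator(y)               # a^(m) = delta(Y^(m))
        losses.append(loss(a, x))      # l^(m) = L(a^(m), X^(m))
    return np.mean(losses)

# The comparison in the talk amounts to:
# average_loss(delta_bayes, loss, forward_simulate)
# average_loss(delta_other, loss, forward_simulate)
```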

Poll: Which of these statements is true?

  1. The deep neural network can perform better because it involves more parameters (over-parameterization)
  2. The deep neural network can perform better because of a combination of lucky initialization and noisy gradients
  3. The deep neural network can perform better because of some other reason
  4. This cannot happen unless the loss function is non-convex
  5. This cannot happen because of some other reason

The correct answer is: “This cannot happen because of some other reason.”

Intuition: the Bayes estimator is designed to minimize the average loss, so you cannot beat Bayes at its own game!

Optimality of Bayesian estimators: theorem

Theorem: let \(\delta_{\text{B}}\) denote the Bayes estimator, and \(\delta_{\text{o}}\), any other estimator. Then: \[\mathbb{E}[L(\delta_{\text{B}}(Y), X)] \le \mathbb{E}[L(\delta_{\text{o}}(Y), X)], \tag{1}\] where the expectations are over the joint distribution of \((X, Y)\).

Proof: by the law of total expectation, we can rewrite the left hand side of Equation 1 as: \[\mathbb{E}[L(\delta_{\text{B}}(Y), X)] = \mathbb{E}[ {\color{red} \mathbb{E}[L(\delta_{\text{B}}(Y), X) | Y]}]. \tag{2}\] We will start by rewriting the part in red.

To do so, denote the objective minimized by the Bayes estimator (for a given dataset \(Y\); the notation leaves the dependence on \(Y\) implicit) by: \[\mathcal{O}(a) = \mathbb{E}[L(a, X) | Y].\]

Note: by definition, we can rewrite our Bayes estimator as: \[\delta_{\text{B}}(Y) = \operatorname{arg\,min}_a \mathcal{O}(a),\] and hence: \[{\color{red} \mathbb{E}[L(\delta_{\text{B}}(Y), X) | Y]} = \mathcal{O}(\delta_{\text{B}}(Y)) = \mathcal{O}(\operatorname{arg\,min}_a \mathcal{O}(a)). \tag{3}\] Plugging Equation 3 into Equation 2, we get:

\[\mathbb{E}[ \mathbb{E}[L(\delta_{\text{B}}(Y), X) | Y]] = \mathbb{E}[ \mathcal{O}( \operatorname{arg\,min}_a \mathcal{O}(a))].\]

Now note that for all \(a' \in A\), \(\mathcal{O}( \operatorname{arg\,min}_a \mathcal{O}(a)) \le \mathcal{O}(a')\); in particular, this is true for \(a' = \delta_{\text{o}}(Y)\), hence:

\[ \begin{align*} \mathbb{E}[L(\delta_{\text{B}}(Y), X)] &= \mathbb{E}[ \mathcal{O}( \operatorname{arg\,min}_a \mathcal{O}(a))] \\ &\le \mathbb{E}[ \mathcal{O}( \delta_{\text{o}}(Y))] \\ &= \mathbb{E}[L(\delta_{\text{o}}(Y), X)]. \end{align*} \]
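
To see the theorem in action, recall that under the squared-error loss \(L(a, x) = (a - x)^2\), the posterior expected loss is minimized by the posterior mean, so \(\delta_{\text{B}}(y) = \mathbb{E}[X | Y = y]\). The following minimal sketch uses an assumed Beta(1, 1)-Binomial toy model (not from the lecture) and compares, by forward simulation, the average loss of the posterior mean against that of the MLE:

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 100_000, 5   # number of synthetic datasets, coin flips per dataset

# Forward simulation from the joint distribution:
# X ~ Beta(1, 1) (uniform prior), Y | X ~ Binomial(n, X).
x = rng.beta(1.0, 1.0, size=M)
y = rng.binomial(n, x)

# Bayes estimator under squared-error loss: the posterior mean (y + 1) / (n + 2).
delta_bayes = (y + 1) / (n + 2)
# Another estimator: the MLE, i.e., the sample proportion y / n.
delta_other = y / n

print("average loss, Bayes estimator:", np.mean((delta_bayes - x) ** 2))
print("average loss, MLE:            ", np.mean((delta_other - x) ** 2))
```

With these settings the Bayes estimator's average loss is roughly 0.024 versus roughly 0.033 for the MLE; the exact numbers vary with the seed and \(M\), but the ordering is the one predicted by Equation 1.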

Discussion: Why do you think the presenter observed \(\mathbb{E}[L(\delta_{\text{B}}(Y), X)] {\color{red} >} \mathbb{E}[L(\delta_{\text{o}}(Y), X)]\) in their experiment?

A key factor to take into account:

  • If the estimator \(\delta_{\text{B}}\) is based on an approximation of the posterior (e.g., SNIS), then the optimality property may not hold! (See the sketch below.)
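
As a hedged illustration of this caveat, the sketch below approximates the posterior mean of the Beta(1, 1)-Binomial toy model above by self-normalized importance sampling (SNIS), with the prior as proposal; with few particles the resulting plug-in estimator is only a noisy approximation of \(\delta_{\text{B}}\), so the guarantee of Equation 1 applies only approximately:

```python
import numpy as np
from scipy.stats import binom

def snis_posterior_mean(y_obs, n, n_particles, rng):
    """Approximate E[X | Y = y_obs] by self-normalized importance sampling,
    using the Beta(1, 1) prior as the proposal distribution."""
    particles = rng.beta(1.0, 1.0, size=n_particles)  # proposals from the prior
    weights = binom.pmf(y_obs, n, particles)           # unnormalized weights = likelihood
    weights /= weights.sum()                           # self-normalization
    return float(np.sum(weights * particles))

rng = np.random.default_rng(0)
n, y_obs = 5, 4
exact = (y_obs + 1) / (n + 2)  # exact posterior mean, for reference
for n_particles in (10, 100, 10_000):
    print(n_particles, "particles:",
          snis_posterior_mean(y_obs, n, n_particles, rng), "   exact:", exact)
```

The approximation error shrinks as the number of particles grows; with too few particles, an estimator built on the SNIS approximation can lose the comparison above, even though the exact Bayes estimator cannot.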

Some other more “practical” factors to keep in mind:

  • Software defects (more frequent than you may think—we will talk more about them later).
  • Publication bias, or even academic dishonesty (e.g., repeating the simulation until a desired result is obtained).

Also, keep in mind that this page implicitly assumes the model is well-specified (since we generate the data from the model's joint distribution). The situation is more complex for mis-specified models (asymptotics becomes critical in that case).