# Optimality

## Outline

### Topics

- Average loss optimality of Bayes estimators
- Limitations of this notion of optimality

### Rationale

Formalize and justify one of the motivations for Bayesian methods covered in the first lecture.

## Setup

We go back to the decision-theoretic setup from week 2.

## Estimator

**Definition:** an estimator is a function \(\delta\) such that:

- \(\delta\) takes a dataset \(y\) (i.e., the observations) as input, and
- \(\delta\) returns an action, i.e., \(\delta(y) \in A\).

**Examples:**

- the Bayes estimator,
- maximum likelihood estimator (when the action is a parameter you are trying to estimate).
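For concreteness, here is a minimal sketch of two such estimators in a hypothetical beta-binomial setup (not from the lecture): \(x\) is a success probability with a uniform \(\text{Beta}(1, 1)\) prior, \(y\) counts successes in \(n\) trials, and we assume squared-error loss, under which the Bayes estimator is the posterior mean. The names `delta_bayes` and `delta_mle` are illustrative.

```python
# Hypothetical beta-binomial setup: estimate the success probability x
# from y successes out of n trials, with a Beta(1, 1) (uniform) prior on x.
n = 20

def delta_bayes(y):
    # Posterior is Beta(1 + y, 1 + n - y); under squared-error loss
    # the Bayes estimator is the posterior mean.
    return (1 + y) / (2 + n)

def delta_mle(y):
    # Maximum likelihood estimator: the empirical frequency.
    return y / n
```

Both functions map a dataset \(y\) to an action (here, a point estimate in \([0, 1]\)), matching the definition above.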

## Optimality: motivation

**Consider the following situation**—you are watching a talk, where the presenter explains they did the following:

- Let:
  - \(\delta_{\text{B}}\) denote the Bayes estimator, and
  - \(\delta_{\text{o}}\), another estimator, say based on a sophisticated deep neural network.
- To compare the two estimators, the presenter generated a large number of synthetic datasets \((X^{(m)}, Y^{(m)})\), \(m = 1, \dots, M\), by forward simulation from the model.
- They evaluated the performance of the Bayes estimator as follows:
  - apply the estimator: \(a^{(m)} = \delta_{\text{B}}(Y^{(m)})\),
  - compute the loss: \(l^{(m)} = L(a^{(m)}, X^{(m)})\), and
  - return the average loss as the measure of performance: \[\text{average loss} = \frac{1}{M} \sum_{m=1}^{M} l^{(m)}.\]

- Then, they did the same with the other estimator, \(\delta_{\text{o}}\).
- Suppose that the presenter reported a lower average loss for the other estimator, \(\delta_{\text{o}}\).

**Poll:** Which of these statements is true?

- The deep neural network can perform better because it involves more parameters (over-parameterization)
- The deep neural network can perform better because of a combination of lucky initialization and noisy gradients
- The deep neural network can perform better because of some other reason
- This cannot happen unless the loss function is non-convex
- This cannot happen because of some other reason

Correct answer is: “This cannot happen because of some other reason.”

**Intuition:** the Bayes estimator is designed to minimize the average loss, so you cannot beat Bayes at its own game!
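This intuition can be checked by Monte Carlo in a toy model. The sketch below (an illustrative setup, not from the lecture) assumes a beta-binomial model with a \(\text{Beta}(1, 1)\) prior and squared-error loss, and follows the presenter's protocol: forward-simulate \((X^{(m)}, Y^{(m)})\) pairs, then compare the average loss of the Bayes estimator (the posterior mean) against an alternative, here the MLE.

```python
import numpy as np

rng = np.random.default_rng(1)
n, M = 20, 200_000  # trials per dataset, number of synthetic datasets

def delta_bayes(y):
    # Posterior mean under a Beta(1, 1) prior (squared-error loss).
    return (1 + y) / (2 + n)

def delta_mle(y):
    # Maximum likelihood estimator: the empirical frequency.
    return y / n

# Forward simulation from the joint distribution of (X, Y).
x = rng.uniform(size=M)    # X ~ prior
y = rng.binomial(n, x)     # Y | X ~ Binomial(n, X)

loss_bayes = np.mean((delta_bayes(y) - x) ** 2)
loss_mle = np.mean((delta_mle(y) - x) ** 2)
# The Bayes estimator attains the lower average loss, as the theorem predicts.
```

Since both estimators are evaluated on the same simulated datasets, the comparison is sharp even for moderate \(M\).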

## Optimality of the Bayes estimator: theorem

**Theorem:** let \(\delta_{\text{B}}\) denote the Bayes estimator, and \(\delta_{\text{o}}\), any other estimator. Then: \[\mathbb{E}[L(\delta_{\text{B}}(Y), X)] \le \mathbb{E}[L(\delta_{\text{o}}(Y), X)]. \tag{1}\]

**Proof:** by the law of total expectation, we can rewrite the left-hand side of Equation 1 as: \[\mathbb{E}[L(\delta_{\text{B}}(Y), X)] = \mathbb{E}[ {\color{red} \mathbb{E}[L(\delta_{\text{B}}(Y), X) | Y]}]. \tag{2}\] We will start by rewriting the part in red.

To do so, denote the objective being minimized in the Bayes estimator by: \[\mathcal{O}(a) = \mathbb{E}[L(a, X) | Y].\]

Note: by definition, we can rewrite our Bayes estimator as: \[\delta_{\text{B}}(Y) = \operatorname{arg\,min}_a \mathcal{O}(a),\] and hence: \[{\color{red} \mathbb{E}[L(\delta_{\text{B}}(Y), X) | Y]} = \mathcal{O}(\delta_{\text{B}}(Y)) = \mathcal{O}(\operatorname{arg\,min}_a \mathcal{O}(a)). \tag{3}\] Plugging Equation 3 into Equation 2, we get:

\[\mathbb{E}[ \mathbb{E}[L(\delta_{\text{B}}(Y), X) | Y]] = \mathbb{E}[ \mathcal{O}( \operatorname{arg\,min}_a \mathcal{O}(a))].\]

Now note that for all \(a' \in A\), \(\mathcal{O}( \operatorname{arg\,min}_a \mathcal{O}(a)) \le \mathcal{O}(a')\); in particular, this holds for \(a' = \delta_{\text{o}}(Y)\), hence:

\[ \begin{align*} \mathbb{E}[L(\delta_{\text{B}}(Y), X)] &= \mathbb{E}[ \mathcal{O}( \operatorname{arg\,min}_a \mathcal{O}(a))] \\ &\le \mathbb{E}[ \mathcal{O}( \delta_{\text{o}}(Y))] \\ &= \mathbb{E}[L(\delta_{\text{o}}(Y), X)]. \end{align*} \]

**Discussion:** Why do you think the presenter observed \(\mathbb{E}[L(\delta_{\text{B}}(Y), X)] {\color{red} >} \mathbb{E}[L(\delta_{\text{o}}(Y), X)]\) in their experiment?

A key factor to take into account:

- If the estimator \(\delta_{\text{B}}\) is based on an approximation of the posterior (e.g. SNIS), then the optimality property may not hold!
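As a hypothetical illustration of this point, the sketch below replaces the exact posterior mean by a crude SNIS approximation in the beta-binomial toy model used earlier (a \(\text{Beta}(1, 1)\) prior and squared-error loss): the posterior mean is approximated with only \(K = 3\) particles drawn from the prior and weighted by the likelihood. The noise introduced by the approximation inflates the average loss above that of the exact Bayes estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, K = 20, 20_000, 3  # trials, synthetic datasets, SNIS particles

def delta_bayes(y):
    # Exact posterior mean under a Beta(1, 1) prior.
    return (1 + y) / (2 + n)

def delta_snis(y):
    # Crude SNIS approximation of the posterior mean: K particles from
    # the prior (the proposal), weighted by the binomial likelihood.
    particles = rng.uniform(size=K)
    w = particles ** y * (1 - particles) ** (n - y)
    return np.sum(w * particles) / np.sum(w)

# Forward simulation from the joint distribution of (X, Y).
x = rng.uniform(size=M)
y = rng.binomial(n, x)

loss_exact = np.mean((delta_bayes(y) - x) ** 2)
loss_snis = np.mean([(delta_snis(yi) - xi) ** 2 for xi, yi in zip(x, y)])
# The approximate estimator's average loss exceeds the exact one's:
# the optimality guarantee applies to the exact Bayes estimator only.
```

Increasing \(K\) shrinks the gap, consistent with the approximation converging to the exact posterior mean.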

Some other more “practical” factors to keep in mind:

- Software defects (more frequent than you may think—we will talk more about them later).
- Publication bias, or even academic dishonesty (e.g. repeating the simulation until a desired result is obtained).

Also, keep in mind that this page implicitly assumes the model is well-specified (since we generate data from the joint distribution). The situation is more complex for mis-specified models (asymptotic analysis becomes critical in that case).