Bayesian calibration: mis-specified case

Outline

Topics

  • Guarantees when the model is mis-specified.
  • The Bernstein-von Mises (BvM) theorem.

Rationale

Model mis-specification is the norm. What happens to calibration in that case?

Misspecified models and calibration

What to do if your model is misspecified?

One obvious possibility is to make the model better. But in the following we show that under certain conditions, another possibility is to collect more data.

  • Now suppose we are using the “wrong prior”, i.e.:
    • data generation uses uniform prior…
    • …but we base our posterior computation on a different, non-uniform prior.
  • Similarly to the last page, let’s do it for small (10 launches), medium (100), and large datasets (1000), plotting the nominal coverage (dashed) against the actual coverage (solid line)
Code
suppressPackageStartupMessages(require("ggplot2"))
suppressPackageStartupMessages(require("dplyr"))
suppressPackageStartupMessages(require("tidyr"))
theme_set(theme_bw())

 # Using now the same non-uniform prior as before for posterior calculation
K <- 1000

rdunif <- function(max) { return(ceiling(max*runif(1))) }

posterior_distribution <- function(prior_probs, n_successes, n_trials){         
  K <- length(prior_probs)-1 # K+1 values that your p can assume
  n_fails <- n_trials - n_successes
  p <- seq(0, 1, 1/K)
  posterior_probs <-                                       # 1. this computes gamma(i)
    prior_probs *                                          #    - prior
    p^n_successes * (1-p)^n_fails                          #    - likelihood 
  posterior_probs <- posterior_probs/sum(posterior_probs)  # 2. normalize gamma(i)
  post_prob <- rbind(p, posterior_probs)
  return(post_prob)
}

high_density_intervals <- function(alpha, posterior_probs){
  ordered_probs = posterior_probs[,order(posterior_probs[2,], decreasing = TRUE)]
  cumulative_probs = cumsum(ordered_probs[2,])
  index = which.max(cumulative_probs >= (1-alpha))
  return(ordered_probs[,1:index, drop=FALSE])
}

hdi_coverage_pr <- function(n_datapoints) {
  n_inclusions <- 0
  for (repetition in seq(1:n_repeats)) {
    i <- rdunif(K + 1) - 1  # Always generate the data using a uniform prior
    true_p <- i/K
    x <- rbinom(1, n_datapoints, true_p)
    post <- posterior_distribution(prior_used_for_computing_posterior, x, n_datapoints)
    
    # This if is just a hacky way to check if true parameter is in the HDI credible interval
    if (sum(abs(true_p - high_density_intervals(alpha, post)[1,]) < 10e-10) == 1) {
      n_inclusions <- n_inclusions + 1
    }
  }
  return(n_inclusions/n_repeats) # Fraction of simulation where the true parameter was in interval
}

prior_used_for_computing_posterior <- dnorm(seq(0, 1, 1/K),mean = 0.2, sd=0.2)
prior_used_for_computing_posterior <- prior_used_for_computing_posterior / sum(prior_used_for_computing_posterior)

set.seed(1)

n_repeats <- 1000
alpha <- 0.1

df <- data.frame("n_observations" = c(10, 100, 1000))
df$coverage_pr <- sapply(df$n_observations, hdi_coverage_pr)

ggplot(data=df, aes(x=n_observations, y=coverage_pr)) +
  ylim(0, 1) +
  xlab("Number of observations") + 
  ylab("Actual coverage") +
  geom_hline(yintercept=1-alpha, linetype="dashed", color = "black") +
  geom_line()

  • Bad news: for small datasets we are no longer calibrated, in the worst possible way
    • Higher than the dashed line: would have meant inference is being conservative, i.e. more right than it actually claimed. That’s not too bad.
    • Lower than the dashed line: we are being overconfident or anti-conservative, which in some cases can be reckless
  • Good news: in this case, the problem gets quickly corrected as the dataset gets larger. Why?

Asymptotic calibration under misspecification of the prior

  • There is no general calibration theory for misspecified models, only special cases (outside of that, use simulation studies or cross-validation techniques)
  • An example where we do have some theory: when the data points \(y_i\) are iid given the parameters \(x\).

Proposition: under certain conditions (outlined below), even when the prior is misspecified, the actual coverage of credible intervals converges to the nominal coverage.

Outline:

  • the above proposition is a consequence of a stronger result: Bernstein–von Mises theorem (BvM).
  • BvM theorems provide conditions where the posterior converges to a normal as the data size increases.
    • Crucially, the parameters of the limiting normal do not depend on the prior!
    • As a result, when BvM holds, prior misspecification is eventually washed out by the likelihood.

BvM Theorem

Setup:

  • Assume there is a single true \(x^*\).
  • Generate datasets of increasing size \(n\),
    • \(Y_{1:2}, Y_{1:3}, \dots\),
    • \(Y_{1:n} = (Y_1, \dots, Y_n)\),
    • \(Y_i \sim p(\cdot | x^*)\).
  • Consider the sequence of (random) posteriors \(\Pi_n = p(\cdot | Y_{1:n})\)

Result: \(\Pi_n\) approaches a normal distribution.

  • Which normal distribution?
  • The same normal that the maximum likelihood estimator converges to!
    • Centered at \(x^*\),
    • variance is the inverse of the observed information.

\[ d(\Pi_n, \mathcal{N}(x^*, I_n^{-1})) \to 0, \text{ as }n\to \infty. \] where:

  • \(d\) is a distance between distributions called the total variation distance,
  • the observed information \(I_n\) is given by \[ I_n = - \frac{\partial^2}{\partial x^2} \log p(Y_{1:n} | x)\Bigg|_{x^*}, \]
  • convergence is in probability, with respect to the distribution \(p(\cdot | x^*)\).

Conditions include the following highlights:

  1. the prior \(p(x)\) is a density assumed to be non-zero in a neighborhood of the true parameter,
  2. we assume the posterior mean is consistent as \(n\to\infty\) (something you will explore in this week’s exercise and lab),
  3. we assume the model is parametric:
    • \(x\) is allowed to be a vector…
    • …but you are not allowed to make it longer as the number of observations increases.

References

  • BDA, Section 2 and Appendix B (note however that they use a non-standard definition of observed information).
  • Full technical conditions in e.g., van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. ISBN 0-521-49603-9.