Exercise 8: Bayesian Workflow

Q.1: Calibration analysis via cross-validation

We have already discussed the concept of calibration for Bayesian inference, where we reviewed the behavior of both well-specified models (the ideal situation) and mis-specified models (the most common case in practice). In the latter case, we demonstrated with a simulation experiment that mis-specification results in credible intervals that are not calibrated. In practice, however, such tests are impossible to carry out because we do not have access to the true data-generating distribution.

One way to approximate the outcome of such an experiment is via cross-validation. In particular, this exercise will teach you how to use leave-one-out cross-validation to assess the calibration of the model for the Hubble dataset. Let’s begin by loading and formatting the data:

suppressPackageStartupMessages(require(ggplot2))
suppressPackageStartupMessages(require(dplyr))

df = read.csv(url("https://github.com/UBC-Stat-ML/web447/raw/0d6eaa346d78abe4cd125e8fc688c9074d6331d9/data/hubble-1.csv")) %>%
  rename(distance = R..Mpc.) %>%
  rename(velocity = v..km.sec.)
df$velocity = df$velocity/1000 # rescale velocity to units of 10^3 km/s

Consider the following Stan code, a starting point for the Hubble regression model from the previous lectures:

data {
  int<lower=0> N; // number of observations
  vector[N] xs;   // independent variable
  vector[N] ys;   // dependent variable
}

parameters {
  real slope;
  real<lower=0> sigma;
}

model {
  // prior
  slope     ~ student_t(3, 0, 100);
  sigma     ~ exponential(0.001);

  // likelihood
  ys ~ normal(slope*xs, sigma);
}
  1. Modify the Stan code above to include:
     1. In the data block: an additional real-valued argument x_pred for the independent variable of the left-out point.
     2. In a generated quantities block: generate a value y_pred given x_pred and the parameters \((\text{slope}, \text{sigma})\).

Create a compiled model called hubble_predict by passing the modified Stan code to the function stan_model.
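
For reference, a sketch of the two additions (the parameters and model blocks stay exactly as above; only the new lines are essential):

data {
  int<lower=0> N; // number of observations
  vector[N] xs;   // independent variable
  vector[N] ys;   // dependent variable
  real x_pred;    // independent variable for the left-out point
}

generated quantities {
  // one posterior predictive draw for the left-out observation
  real y_pred = normal_rng(slope * x_pred, sigma);
}

Assuming the full modified program is stored in a string called hubble_predict_code, it can then be compiled with hubble_predict = stan_model(model_code = hubble_predict_code) from rstan.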

  2. Test your implementation by fitting the version of the model that leaves out the last row of the dataset. Hint: you should pass as data to Stan the train_test_dta list defined below:
N_obs = nrow(df)
N_train = N_obs - 1
train_test_dta = list(
    N  = N_train,
    xs = df$distance[-N_obs],  # training data: all rows except the last
    ys = df$velocity[-N_obs],
    x_pred = df$distance[N_obs] # independent variable of the left-out point
)

Report the leave-one-out 80% credible interval for the left-out observation.
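
A minimal sketch of this step and of extracting the interval, assuming the rstan functions sampling and extract:

suppressPackageStartupMessages(require(rstan))

# fit on all but the last observation, predicting at its x value
fit = sampling(hubble_predict, data = train_test_dta, refresh = 0)
y_pred_draws = extract(fit)$y_pred
quantile(y_pred_draws, probs = c(0.1, 0.9)) # 80% credible interval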

  3. Repeat step 2 for all observations. That is, construct a matrix with nrow(df) rows and 2 columns, called ci_limits, where the i-th row contains an 80% credible interval for the i-th left-out observation. Report the plot generated by the following code:
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))

merged_df = df %>% 
  bind_cols(data.frame(CI_L = ci_limits[,1], CI_R = ci_limits[,2])) %>% 
  mutate(Inside_CI = (velocity >= CI_L & velocity <= CI_R)) 
merged_df %>% 
  ggplot(aes(x = 1:N_obs, y = velocity, ymin = CI_L, ymax = CI_R, color=Inside_CI)) +
  geom_point() + 
  geom_errorbar() +
  theme_minimal() +
  labs(x = "Point", y = "Velocity")

Is the fraction of left-out observations contained in their respective credible intervals similar to the nominal coverage level of 80%?
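
If you are unsure how to assemble ci_limits, one possible approach is to rerun the fit from step 2 once per observation (a sketch, reusing hubble_predict and the rstan calls from above):

ci_limits = matrix(NA, nrow = N_obs, ncol = 2)
for (i in 1:N_obs) {
  loo_dta = list(
    N      = N_obs - 1,
    xs     = df$distance[-i], # leave out the i-th observation
    ys     = df$velocity[-i],
    x_pred = df$distance[i]
  )
  fit_i = sampling(hubble_predict, data = loo_dta, refresh = 0)
  ci_limits[i, ] = quantile(extract(fit_i)$y_pred, probs = c(0.1, 0.9))
}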

Q.2: Estimating asymptotic variance

Consider the code for the random-walk Metropolis–Hastings algorithm from the previous assignment, where we can tune the proposal standard deviation with the parameter proposal_sd:

# simple Metropolis-Hastings algorithm (normal proposal)
simple_mh = function(gamma, initial_point, n_iters, proposal_sd = 1) {
  samples = numeric(n_iters) 
  dim = length(initial_point)
  current_point = initial_point
  for (i in 1:n_iters) {
    proposal = rnorm(dim, mean = current_point, sd = proposal_sd) 
    ratio = gamma(proposal) / gamma(current_point) 
    if (runif(1) < ratio) {
      current_point = proposal
    } else {
      # rejection, nothing to do, i.e. we stay at the current point
    }
    # no matter if we reject or not, accumulate one sample
    samples[i] = current_point
  }
  return(samples)
}

Implement an R function, estimate_asymptotic_variance, that takes as input the parameters:

  • gamma (unnormalized target density)

  • proposal_sd (Metropolis–Hastings proposal standard deviation)

  • C (number of chains)

  • S (number of samples per chain).

The function should return an estimate of the asymptotic variance of \(X_1\) for the random-walk Metropolis–Hastings algorithm (simple_mh), based on the procedure outlined in this week's lecture notes.

Note: it is important to ensure that each of the \(C\) chains is initialized independently, i.e., draw each of the \(C\) initial points independently from, e.g., a standard normal distribution, so that the chains start at different values.
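
A minimal sketch, under the assumption that the procedure from lecture is the variance-of-chain-averages estimator: since \(\operatorname{Var}(\bar{X}) \approx \sigma_a^2 / S\) for a chain of length \(S\), the empirical variance of the \(C\) chain means, scaled by \(S\), estimates the asymptotic variance \(\sigma_a^2\):

estimate_asymptotic_variance = function(gamma, proposal_sd, C, S) {
  # assumption: asymptotic variance estimated as S * Var(chain means)
  chain_means = numeric(C)
  for (c in 1:C) {
    initial_point = rnorm(1) # independent standard normal initialization
    samples = simple_mh(gamma, initial_point, S, proposal_sd)
    chain_means[c] = mean(samples)
  }
  S * var(chain_means) # estimate of the asymptotic variance of X_1
}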

Consider the univariate function \(\gamma(x) = \exp(-\frac{1}{2} x^2)\) (i.e., the unnormalized density of the standard normal distribution).
Evaluate estimate_asymptotic_variance with \(C = 100\) and \(S = 1000\). Repeat this for proposal_sd taking on values in \(\{2^{-10}, 2^{-9}, \ldots, 2^{9}, 2^{10}\}\) and store the results in a data frame.
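
For example, the grid evaluation might look like this (a sketch; the names gamma_std_normal and proposal_sd_values are our choices, and df deliberately matches the plotting code below):

gamma_std_normal = function(x) exp(-0.5 * x^2) # unnormalized standard normal density
proposal_sd_values = 2^(-10:10)
df = data.frame( # note: this overwrites the Hubble df from Q.1
  proposal_sd = proposal_sd_values,
  estimated_mcmc_variance = sapply(
    proposal_sd_values,
    function(s) estimate_asymptotic_variance(gamma_std_normal, s, C = 100, S = 1000)
  )
)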

Given a data frame df with columns proposal_sd and estimated_mcmc_variance (and the vector proposal_sd_values of grid values from above), you may use the following code to plot your results.

library(ggplot2)
ggplot(df, aes(x = proposal_sd, y = estimated_mcmc_variance)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(trans = "log2", breaks = proposal_sd_values) +
  labs(x = "Proposal SD (log2 scale)",
       y = "Estimated asymptotic variance",
       title = "Estimated asymptotic variance vs proposal SD") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

For the target distribution in question, what range of proposal standard deviation choices seems to be optimal (i.e., yields the lowest value of asymptotic variance)?

Do you think this same proposal standard deviation would be optimal for a target such as \(\gamma(x) = \exp(-\frac{1}{2000} x^2)\)? What about \(\gamma(x) = \exp(-\frac{1}{2} (x-100)^2)\)? (Hint: You don’t need to run any code to answer these last two questions.)