suppressPackageStartupMessages(require(ggplot2))
suppressPackageStartupMessages(require(dplyr))
= read.csv(url("https://github.com/UBC-Stat-ML/web447/raw/0d6eaa346d78abe4cd125e8fc688c9074d6331d9/data/hubble-1.csv")) %>%
df rename(distance = R..Mpc.) %>%
rename(velocity = v..km.sec.)
$velocity = df$velocity/1000 df
Exercise 8: Bayesian Workflow
Q.1: Calibration analysis via cross-validation
We have already discussed the concept of calibration for Bayesian inference, where we reviewed the behavior of both well-specified—the ideal situation—and mis-specified models—the most common case in practice. In the latter case, we demonstrated with a simulation experiment that mis-specification results in credible intervals that are not calibrated. However in practice, these tests are impossible to carry out because we do not have access to the true data-generating distribution.
One way to approximate the outcome of such an experiment is via cross-validation. In particular, this exercise will teach you how to use leave-one-out cross-validation to assess the calibration of the model for the Hubble dataset. Let’s begin by loading and formatting the data
Consider the following starting point Stan code for the Hubble regression model from the previous lectures
data {
int<lower=0> N; // number of observations
vector[N] xs; // independent variable
vector[N] ys; // dependent variable
}
parameters {
real slope;
real<lower=0> sigma;
}
model {
// prior
slope ~ student_t(3, 0, 100);
sigma ~ exponential(0.001);
// likelihood
ys ~ normal(slope*xs, sigma);
}
- Modify the Stan code above to include
- In the
data
block: an additional real-valued argumentx_pred
for the independent variable for the left-out point. - In a
generated quantities
block: generate a valuey_pred
givenx_pred
and the parameters \((\text{slope, sigma})\).
Create a compiled model called hubble_predict
by passing the modified stan code to the function stan_model
.
- Test your implementation by fitting the version of the model that leaves out the last row of the dataset. Hint: you should pass as data to stan the
train_test_dta
list defined below
= nrow(df)
N_obs = N_obs-1
N_train = list(
train_test_dta N = N_train,
xs = df$distance[-N_obs],
ys = df$velocity[-N_obs],
x_pred = df$distance[N_obs]
)
Report the leave-one-out 80% credible interval for the left out observation.
- Repeat step 2 for all observations. That is, construct a matrix of
nrow(df)
rows and 2 columns calledci_limits
, where thei
-th row contains an 80% credible interval for thei
-th observation left out. Report the plot generated by the following code
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
= df %>%
merged_df bind_cols(data.frame(CI_L = ci_limits[,1], CI_R = ci_limits[,2])) %>%
mutate(Inside_CI = (velocity >= CI_L & velocity <= CI_R))
%>%
merged_df ggplot(aes(x = 1:N_obs, y = velocity, ymin = CI_L, ymax = CI_R, color=Inside_CI)) +
geom_point() +
geom_errorbar() +
theme_minimal() +
labs(x = "Point", y = "Velocity")
Is the fraction of left-out observations contained in their respective credible intervals similar to the nominal coverage level?
Q.2: Estimating asymptotic variance
Consider the code for the random-walk Metropolis–Hastings algorithm from the previous assignment, where we can tune the proposal standard deviation with the parameter proposal_sd
:
# simple Metropolis-Hastings algorithm (normal proposal)
= function(gamma, initial_point, n_iters, proposal_sd = 1) {
simple_mh = numeric(n_iters)
samples = length(initial_point)
dim = initial_point
current_point for (i in 1:n_iters) {
= rnorm(dim, mean = current_point, sd = proposal_sd)
proposal = gamma(proposal) / gamma(current_point)
ratio if (runif(1) < ratio) {
= proposal
current_point else {
} # rejection, nothing to do, i.e. we stay at the current point
}# no matter if we reject or not, accumulate one sample
= current_point
samples[i]
}return(samples)
}
Implement an R function, estimate_asymptotic_variance
, that takes as input the parameters:
gamma
(unnormalized target density)proposal_sd
(Metropolis–Hastings proposal standard deviation)C
(number of chains)S
(number of samples per chain).
The function should return an estimate of the asymptotic variance of \(X_1\) for the random-walk Metropolis–Hastings algorithm (simple_mh
) based on the procedure outlined this week in the lecture notes.
Note: It is important to ensure that each of the \(C\) chains is initialized independently. I.e., make sure that each of the \(C\) different initial points is drawn from e.g., a standard normal distribution, with each of the chains starting at different values.
Consider the univariate function \(\gamma(x) = \exp(-\frac{1}{2} x^2)\) (i.e., the unnormalized density of the standard normal distribution).
Evaluate estimate_asymptotic_variance
with \(C = 100\) and \(S = 1000\). Repeat this for proposal_sd
taking on values in \(\{2^{-10}, 2^{-9}, \ldots, 2^{9}, 2^{10}\}\) and store the results in a data frame.
Given a data frame with columns proposal_sd
and estimated_mcmc_variance
, you may use the following code to plot your results.
library(ggplot2)
ggplot(df, aes(x = proposal_sd, y = estimated_mcmc_variance)) +
geom_line() +
geom_point() +
scale_x_continuous(trans = "log2", breaks = proposal_sd_values) +
labs(x = "Proposal SD (log2 scale)",
y = "Estimated asymptotic variance",
title = "Estimated asymptotic variance vs proposal SD") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
For the target distribution in question, what range of proposal standard deviation choices seems to be optimal (i.e., yields the lowest value of asymptotic variance)?
Do you think this same proposal standard deviation would be optimal for a target such as \(\gamma(x) = \exp(-\frac{1}{2000} x^2)\)? What about \(\gamma(x) = \exp(-\frac{1}{2} (x-100)^2)\)? (Hint: You don’t need to run any code to answer these last two questions.)