suppressPackageStartupMessages(require(rstan))
More Bayesian bricks
Outline
Topics
- More distributions to complement those tested in quiz 1.
- Motivation, realization and parameterization(s) for each.
- Reparameterization.
Rationale
Recall our model building strategy:
- start with observation, and find a distribution that “match its data type” (this creates the likelihood),
- i.e. such that support of the distribution \(=\) observation data type
- then look at the data types of each of the parameters of the distribution you just picked…
- …and search for a distributions that match each the parameters’ data type (this creates the prior),
- in the case of hierarchical models, recurse this process.
There are a few common data types for which we do not have talked much about distributions having realizations of that datatype. We now fill this gap.
Counts
- Support: \(\{0, 1, 2, 3, \dots\}\).
- Simple common choice is the Poisson distribution:
- \({\mathrm{Poisson}}(\lambda)\)
- Parameter: Mean \(\lambda > 0\).
- Motivation: law of rare events.
- Stan doc.
- \({\mathrm{Poisson}}(\lambda)\)
- Popular alternative, e.g., in bio-informatics: the negative binomial distribution:
- \({\mathrm{NegBinom}}(\mu, \phi)\)
- Mean parameter \(\mu > 0\) and concentration \(\phi > 0\).
- Motivation:
- Poisson’s variance is always the same as its mean.
- Consider \({\mathrm{NegBinom}}\) when empirically the variance is greater than the mean (“over-dispersion”).
- Stan doc.
Positive real numbers
- Support: \(\{x \in \mathbb{R}: x > 0\} = \mathbb{R}^+\)
- More common choice is the gamma distribution:
- \({\mathrm{Gam}}(\alpha, \beta)\)
- Parameters: Shape parameters \(\alpha > 0\) and rate \(\beta > 0\).
- Stan doc.
Question: consider the following Stan model:
data {
int<lower=0> n_obs;
vector<lower=0>[n_obs] observations;
}
parameters {
real<lower=0> shape;
real<lower=0> rate;
}
model {
// priors
1.0/100);
shape ~ exponential(1.0/100);
rate ~ exponential(
// likelihood
observations ~ gamma(shape, rate); }
Notice that neither of the parameters passed in the likelihood can be interpreted as a mean. However, you are asked to report a mean parameter for the population from which the observations come from. How would you proceed?
- Using
generated quantities { real next_obs = gamma_rng(shape, rate); }
, and reporting the posterior mean ofnext_obs
variable. - Use the posterior mean of the
shape
parameter, divided by the posterior mean of therate
parameter. - Using
transformed parameters { real mean_param = shape/rate; }
, and reporting the posterior mean of themean_param
variable. - All of the above
- None of the above
The most direct is 1, which is in line with what we did for prediction in many examples.
We will see very soon that 3 is also actually possible, this is known as Rao-Blackwellization. In summary, it is based on the fact that if \(Y \sim {\mathrm{Gam}}(\alpha, \beta)\), then \(\mathbb{E}[Y] = \alpha/\beta\).
Empirical example that both are equivalent:
data {
int<lower=0> n_obs;
vector<lower=0>[n_obs] observations;
}
parameters {
real<lower=0> shape;
real<lower=0> rate;
}
transformed parameters {
real rao_blackwellized_mean = shape/rate;
}
model {
// priors
1.0/100);
shape ~ exponential(1.0/100);
rate ~ exponential(
// likelihood
observations ~ gamma(shape, rate);
}
generated quantities {
real monte_carlo_mean = gamma_rng(shape, rate);
}
set.seed(1)
= rgamma(5, shape = 2, rate = 3) observations
= sampling(
fit
gamma_with_mean,seed = 1,
refresh = 0,
data = list(n_obs = length(observations), observations = observations),
iter = 100000
)
Warning: There were 1 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
fit
Inference for Stan model: anon_model.
4 chains, each with iter=1e+05; warmup=50000; thin=1;
post-warmup draws per chain=50000, total post-warmup draws=2e+05.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
shape 5.91 0.02 2.91 1.69 3.77 5.43 7.52 12.83 34520
rate 6.63 0.02 3.37 1.72 4.15 6.07 8.50 14.66 34671
rao_blackwellized_mean 0.92 0.00 0.20 0.60 0.79 0.90 1.02 1.38 145150
monte_carlo_mean 0.92 0.00 0.49 0.23 0.60 0.84 1.15 2.09 190050
lp__ -0.66 0.01 1.10 -3.61 -1.08 -0.32 0.13 0.42 41775
Rhat
shape 1
rate 1
rao_blackwellized_mean 1
monte_carlo_mean 1
lp__ 1
Samples were drawn using NUTS(diag_e) at Mon Mar 18 14:14:21 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Counter example that answer 2, a common misconception,1 is incorrect:
5.91 / 6.63
[1] 0.8914027
This is because we are interested in \(\mathbb{E}[\text{shape}/\text{rate}|Y]\), not \(\mathbb{E}[\text{shape}|Y]/\mathbb{E}[\text{rate}|Y]\).
This can be seen with the law of total expectation:
\[\begin{align*} \mathbb{E}[Y_\text{next}|Y_\text{obs}] &= \mathbb{E}[ \mathbb{E}[Y_\text{next} | \text{shape, rate}]|Y_\text{obs}] \\ &= \mathbb{E}[ \text{shape}/\text{rate} |Y_\text{obs}]. \end{align*}\]
An indeed the MC estimates for \(\mathbb{E}[\text{shape}/\text{rate}|Y]\) and \(\mathbb{E}[\text{shape}|Y]/\mathbb{E}[\text{rate}|Y]\) should be expected to be different:
\[ \frac{1}{M} \sum_{m=1}^M \frac{X^{(m)}_\text{shape}}{X^{(m)}_\text{rate}} \neq \frac{\frac{1}{M} \sum_{m=1}^M X^{(m)}_\text{shape}}{ \frac{1}{M} \sum_{m=1}^M X^{(m)}_\text{rate}}. \]
Categories
- Support: \(\{0, 1, 2, 3, \dots, K\}\), for some number of categories \(K\).
- All such distributions captured by the categorical distribution
- We first discussed it in Exercise 3.
- \({\mathrm{Categorical}}(p_1, \dots, p_K)\)
- Probabilities \(p_k > 0\), \(\sum_k p_k = 1\).
- Stan doc.
Simplex
Terminology for the set of valid parameters for the categorical, \(\{(p_1, p_2, \dots, p_K) : p_k > 0, \sum_k p_k = 1\}\): the \(K\)-simplex.
- Hence, if you need a prior over the parameters of a categorical, you need a distribution over the simplex!
- Common choice: the Dirichlet distribution:
- \({\mathrm{Dir}}(\alpha_1, \dots, \alpha_K)\)
- Concentrations \(\alpha_i > 0\).
- Stan doc.
Vectors
- Support: \(\mathbb{R}^K\)
- Common choice: the multivariate normal.
- \(\mathcal{N}(\mu, \Sigma)\)
- Mean vector \(\mu \in \mathbb{R}^K\), covariance matrix \(\Sigma \succ 0\), \(\Sigma\) symmetric.
- Stan doc.
Many others!
References
- Use wikipedia’s massive distribution list,
- and Stan’s documentation.
Alternative approach: reparameterization
- Suppose you need a distribution with support \([0, 1]\).
- We have seen above that one option is to use a beta prior.
- An alternative:
- define \(X \sim \mathcal{N}(0, 1)\),
- use a transformation to map it to \([0, 1]\), e.g. \(Y = \text{logistic}(X)\).
- This approach is known as “re-parametrization”
- For certain variational approaches, this helps inference.
- More on that in the last week.
Footnotes
The root cause of that mis-conception is often an attempt to apply the “plug-in principle” from frequentist statistics to a Bayesian context.↩︎