More Bayesian bricks

Outline

Topics

More distributions to complement those tested in quiz 1.
Motivation, realization and parameterization(s) for each.
Reparameterization.

Rationale

Recall our model building strategy:

start with observation, and find a distribution that “match its data type” (this creates the likelihood),
- i.e. such that support of the distribution \(=\) observation data type
then look at the data types of each of the parameters of the distribution you just picked…
- …and search for a distributions that match each the parameters’ data type (this creates the prior),
in the case of hierarchical models, recurse this process.

There are a few common data types for which we do not have talked much about distributions having realizations of that datatype. We now fill this gap.

Counts

Support: \(\{0, 1, 2, 3, \dots\}\).

Simple common choice is the Poisson distribution:
- \({\mathrm{Poisson}}(\lambda)\)
- Parameter: Mean \(\lambda > 0\).
- Motivation: law of rare events.
- Stan doc.
Popular alternative, e.g., in bio-informatics: the negative binomial distribution:
- \({\mathrm{NegBinom}}(\mu, \phi)\)
- Mean parameter \(\mu > 0\) and concentration \(\phi > 0\).
- Motivation:
  - Poisson’s variance is always the same as its mean.
  - Consider \({\mathrm{NegBinom}}\) when empirically the variance is greater than the mean (“over-dispersion”).
- Stan doc.

Positive real numbers

Support: \(\{x \in \mathbb{R}: x > 0\} = \mathbb{R}^+\)

More common choice is the gamma distribution:
- \({\mathrm{Gam}}(\alpha, \beta)\)
- Parameters: Shape parameters \(\alpha > 0\) and rate \(\beta > 0\).
- Stan doc.

Question: consider the following Stan model:

suppressPackageStartupMessages(require(rstan))

data {
  int<lower=0> n_obs;
  vector<lower=0>[n_obs] observations;
}

parameters {
  real<lower=0> shape;
  real<lower=0> rate;
}

model {
  // priors
  shape ~ exponential(1.0/100);
  rate ~ exponential(1.0/100);
  
  // likelihood
  observations ~ gamma(shape, rate);
}

Notice that neither of the parameters passed in the likelihood can be interpreted as a mean. However, you are asked to report a mean parameter for the population from which the observations come from. How would you proceed?

Click for choices

Using generated quantities { real next_obs = gamma_rng(shape, rate); }, and reporting the posterior mean of next_obs variable.
Use the posterior mean of the shape parameter, divided by the posterior mean of the rate parameter.
Using transformed parameters { real mean_param = shape/rate; }, and reporting the posterior mean of the mean_param variable.
All of the above
None of the above

Click for answer

The most direct is 1, which is in line with what we did for prediction in many examples.

We will see very soon that 3 is also actually possible, this is known as Rao-Blackwellization. In summary, it is based on the fact that if \(Y \sim {\mathrm{Gam}}(\alpha, \beta)\), then \(\mathbb{E}[Y] = \alpha/\beta\).

Empirical example that both are equivalent:

data {
  int<lower=0> n_obs;
  vector<lower=0>[n_obs] observations;
}

parameters {
  real<lower=0> shape;
  real<lower=0> rate;
}

transformed parameters {
  real rao_blackwellized_mean = shape/rate;
}

model {
  // priors
  shape ~ exponential(1.0/100);
  rate ~ exponential(1.0/100);
  
  // likelihood
  observations ~ gamma(shape, rate);
}

generated quantities {
  real monte_carlo_mean = gamma_rng(shape, rate);
}

set.seed(1)
observations = rgamma(5, shape = 2, rate = 3)

fit = sampling(
  gamma_with_mean,
  seed = 1,
  refresh = 0,
  data = list(n_obs = length(observations), observations = observations),       
  iter = 100000                   
)

Warning: There were 2 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.

Warning: Examine the pairs() plot to diagnose sampling problems

fit

Inference for Stan model: anon_model.
4 chains, each with iter=1e+05; warmup=50000; thin=1; 
post-warmup draws per chain=50000, total post-warmup draws=2e+05.

                        mean se_mean   sd  2.5%   25%   50%  75% 97.5%  n_eff
shape                   5.89    0.02 2.87  1.71  3.79  5.42 7.50 12.71  32635
rate                    6.61    0.02 3.34  1.73  4.15  6.07 8.47 14.58  32888
rao_blackwellized_mean  0.92    0.00 0.20  0.60  0.79  0.90 1.02  1.38 138264
monte_carlo_mean        0.92    0.00 0.49  0.23  0.60  0.84 1.15  2.09 184785
lp__                   -0.65    0.01 1.10 -3.60 -1.07 -0.31 0.13  0.42  40801
                       Rhat
shape                     1
rate                      1
rao_blackwellized_mean    1
monte_carlo_mean          1
lp__                      1

Samples were drawn using NUTS(diag_e) at Mon Mar 17 09:59:31 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Counter example that answer 2, a common misconception,¹ is incorrect:

5.91 / 6.63

[1] 0.8914027

This is because we are interested in \(\mathbb{E}[\text{shape}/\text{rate}|Y]\), not \(\mathbb{E}[\text{shape}|Y]/\mathbb{E}[\text{rate}|Y]\).

This can be seen with the law of total expectation:

\[\begin{align*} \mathbb{E}[Y_\text{next}|Y_\text{obs}] &= \mathbb{E}[ \mathbb{E}[Y_\text{next} | \text{shape, rate}]|Y_\text{obs}] \\ &= \mathbb{E}[ \text{shape}/\text{rate} |Y_\text{obs}]. \end{align*}\]

An indeed the MC estimates for \(\mathbb{E}[\text{shape}/\text{rate}|Y]\) and \(\mathbb{E}[\text{shape}|Y]/\mathbb{E}[\text{rate}|Y]\) should be expected to be different:

\[ \frac{1}{M} \sum_{m=1}^M \frac{X^{(m)}_\text{shape}}{X^{(m)}_\text{rate}} \neq \frac{\frac{1}{M} \sum_{m=1}^M X^{(m)}_\text{shape}}{ \frac{1}{M} \sum_{m=1}^M X^{(m)}_\text{rate}}. \]

Simplex

Terminology for the set of valid parameters for the categorical, \(\{(p_1, p_2, \dots, p_K) : p_k > 0, \sum_k p_k = 1\}\): the \(K\)-simplex.

Hence, if you need a prior over the parameters of a categorical, you need a distribution over the simplex!
Common choice: the Dirichlet distribution:
- \({\mathrm{Dir}}(\alpha_1, \dots, \alpha_K)\)
- Concentrations \(\alpha_i > 0\).
- Stan doc.

Vectors

Support: \(\mathbb{R}^K\)

Common choice: the multivariate normal.
- \(\mathcal{N}(\mu, \Sigma)\)
- Mean vector \(\mu \in \mathbb{R}^K\), covariance matrix \(\Sigma \succ 0\), \(\Sigma\) symmetric.
- Stan doc.

Many others!

References

Use wikipedia’s massive distribution list,
and Stan’s documentation.

Alternative approach: reparameterization

Suppose you need a distribution with support \([0, 1]\).
We have seen above that one option is to use a beta prior.
An alternative:
- define \(X \sim \mathcal{N}(0, 1)\),
- use a transformation to map it to \([0, 1]\), e.g. \(Y = \text{logistic}(X)\).
This approach is known as “re-parametrization”
- For certain variational approaches, this helps inference.
- More on that in the last week.

Footnotes

The root cause of that mis-conception is often an attempt to apply the “plug-in principle” from frequentist statistics to a Bayesian context.↩︎