More Bayesian bricks

Outline

Topics

  • More distributions to complement those tested in quiz 1.
  • Motivation, realization and parameterization(s) for each.
  • Reparameterization.

Rationale

Recall our model building strategy:

  • start with the observation, and find a distribution that “matches its data type” (this creates the likelihood),
    • i.e., such that the support of the distribution \(=\) the observation’s data type,
  • then look at the data types of each of the parameters of the distribution you just picked…
    • …and search for a distribution matching each parameter’s data type (this creates the priors),
  • in the case of hierarchical models, recurse on this process.

There are a few common data types for which we have not yet discussed distributions whose realizations have that type. We now fill this gap.

Counts

  • Support: \(\{0, 1, 2, 3, \dots\}\).

  • A simple, common choice is the Poisson distribution.
  • A popular alternative, e.g., in bioinformatics: the negative binomial distribution:
    • \({\mathrm{NegBinom}}(\mu, \phi)\)
    • Mean parameter \(\mu > 0\) and concentration \(\phi > 0\).
    • Motivation:
      • The Poisson’s variance is always equal to its mean.
      • Consider \({\mathrm{NegBinom}}\) when empirically the variance is greater than the mean (“over-dispersion”); see the sketch after this list.
    • Stan doc.
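A minimal R sketch (not from the original notes) of the over-dispersion point; R’s rnbinom with the mu/size parameterization matches Stan’s \({\mathrm{NegBinom}}(\mu, \phi)\), with size playing the role of \(\phi\):

set.seed(1)
mu  = 5
phi = 2
pois_draws = rpois(100000, lambda = mu)
nb_draws   = rnbinom(100000, mu = mu, size = phi)  # size plays the role of phi
c(mean(pois_draws), var(pois_draws))  # variance equals the mean (about 5)
c(mean(nb_draws), var(nb_draws))      # variance is mu + mu^2/phi = 17.5 > 5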

Positive real numbers

  • Support: \(\{x \in \mathbb{R}: x > 0\} = \mathbb{R}^+\)

  • The most common choice is the gamma distribution:
    • \({\mathrm{Gam}}(\alpha, \beta)\)
    • Parameters: shape \(\alpha > 0\) and rate \(\beta > 0\).
    • Stan doc.

Question: consider the following Stan model:

data {
  int<lower=0> n_obs;
  vector<lower=0>[n_obs] observations;
}

parameters {
  real<lower=0> shape;
  real<lower=0> rate;
}

model {
  // priors
  shape ~ exponential(1.0/100);
  rate ~ exponential(1.0/100);
  
  // likelihood
  observations ~ gamma(shape, rate);
}

Notice that neither of the parameters passed to the likelihood can be interpreted as a mean. However, you are asked to report a mean parameter for the population from which the observations come. How would you proceed?

  1. Using generated quantities { real next_obs = gamma_rng(shape, rate); }, and reporting the posterior mean of next_obs variable.
  2. Using the posterior mean of the shape parameter divided by the posterior mean of the rate parameter.
  3. Using transformed parameters { real mean_param = shape/rate; }, and reporting the posterior mean of the mean_param variable.
  4. All of the above
  5. None of the above

The most direct approach is 1, which is in line with what we did for prediction in many examples.

We will see very soon that 3 is actually also possible; this is known as Rao-Blackwellization. In summary, it is based on the fact that if \(Y \sim {\mathrm{Gam}}(\alpha, \beta)\), then \(\mathbb{E}[Y] = \alpha/\beta\).
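A quick Monte Carlo check of this fact in R (the shape and rate values below are arbitrary):

set.seed(1)
alpha = 2
beta  = 3
draws = rgamma(100000, shape = alpha, rate = beta)
c(mean(draws), alpha / beta)  # both close to 2/3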

An empirical check that approaches 1 and 3 give the same answer:

data {
  int<lower=0> n_obs;
  vector<lower=0>[n_obs] observations;
}

parameters {
  real<lower=0> shape;
  real<lower=0> rate;
}

transformed parameters {
  real rao_blackwellized_mean = shape/rate;
}

model {
  // priors
  shape ~ exponential(1.0/100);
  rate ~ exponential(1.0/100);
  
  // likelihood
  observations ~ gamma(shape, rate);
}

generated quantities {
  real monte_carlo_mean = gamma_rng(shape, rate);
}
suppressPackageStartupMessages(require(rstan))

# compile the Stan program above (assumed saved as "gamma_with_mean.stan")
gamma_with_mean = stan_model("gamma_with_mean.stan")

set.seed(1)
observations = rgamma(5, shape = 2, rate = 3)
fit = sampling(
  gamma_with_mean,
  seed = 1,
  refresh = 0,
  data = list(n_obs = length(observations), observations = observations),
  iter = 100000
)
Warning: There were 1 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
fit
Inference for Stan model: anon_model.
4 chains, each with iter=1e+05; warmup=50000; thin=1; 
post-warmup draws per chain=50000, total post-warmup draws=2e+05.

                        mean se_mean   sd  2.5%   25%   50%  75% 97.5%  n_eff
shape                   5.91    0.02 2.91  1.69  3.77  5.43 7.52 12.83  34520
rate                    6.63    0.02 3.37  1.72  4.15  6.07 8.50 14.66  34671
rao_blackwellized_mean  0.92    0.00 0.20  0.60  0.79  0.90 1.02  1.38 145150
monte_carlo_mean        0.92    0.00 0.49  0.23  0.60  0.84 1.15  2.09 190050
lp__                   -0.66    0.01 1.10 -3.61 -1.08 -0.32 0.13  0.42  41775
                       Rhat
shape                     1
rate                      1
rao_blackwellized_mean    1
monte_carlo_mean          1
lp__                      1

Samples were drawn using NUTS(diag_e) at Mon Mar 18 14:14:21 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

A counterexample showing that answer 2, a common misconception,¹ is incorrect:

5.91 / 6.63
[1] 0.8914027

This is because we are interested in \(\mathbb{E}[\text{shape}/\text{rate}|Y]\), not \(\mathbb{E}[\text{shape}|Y]/\mathbb{E}[\text{rate}|Y]\).

This can be seen with the law of total expectation:

\[\begin{align*} \mathbb{E}[Y_\text{next}|Y_\text{obs}] &= \mathbb{E}[ \mathbb{E}[Y_\text{next} | \text{shape, rate}]|Y_\text{obs}] \\ &= \mathbb{E}[ \text{shape}/\text{rate} |Y_\text{obs}]. \end{align*}\]

And indeed, the MC estimates for \(\mathbb{E}[\text{shape}/\text{rate}|Y]\) and \(\mathbb{E}[\text{shape}|Y]/\mathbb{E}[\text{rate}|Y]\) should be expected to differ:

\[ \frac{1}{M} \sum_{m=1}^M \frac{X^{(m)}_\text{shape}}{X^{(m)}_\text{rate}} \neq \frac{\frac{1}{M} \sum_{m=1}^M X^{(m)}_\text{shape}}{ \frac{1}{M} \sum_{m=1}^M X^{(m)}_\text{rate}}. \]
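A small R illustration of this inequality, using arbitrary independent gamma draws standing in for posterior samples:

set.seed(1)
x_shape = rgamma(10000, shape = 5.9, rate = 1)
x_rate  = rgamma(10000, shape = 6.6, rate = 1)
mean(x_shape / x_rate)        # mean of the ratios (about 1.05 here)
mean(x_shape) / mean(x_rate)  # ratio of the means (about 0.89): different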

Categories

  • Support: \(\{1, 2, \dots, K\}\), for some number of categories \(K\).
  • All such distributions are captured by the categorical distribution (see the sketch after this list).
  • We first discussed it in Exercise 3.
    • \({\mathrm{Categorical}}(p_1, \dots, p_K)\)
    • Probabilities \(p_k > 0\), \(\sum_k p_k = 1\).
    • Stan doc.
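A minimal R sketch of categorical draws via sample() (the probabilities below are illustrative):

set.seed(1)
p = c(0.2, 0.5, 0.3)  # a point on the 3-simplex
draws = sample(1:3, size = 10000, replace = TRUE, prob = p)
table(draws) / 10000  # empirical frequencies approximate p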

Simplex

Terminology for the set of valid parameters for the categorical, \(\{(p_1, p_2, \dots, p_K) : p_k > 0, \sum_k p_k = 1\}\): the \(K\)-simplex.

  • Hence, if you need a prior over the parameters of a categorical, you need a distribution over the simplex!
  • Common choice: the Dirichlet distribution (see the sketch after this list):
    • \({\mathrm{Dir}}(\alpha_1, \dots, \alpha_K)\)
    • Concentrations \(\alpha_i > 0\).
    • Stan doc.
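Base R has no Dirichlet sampler; a minimal sketch (the helper rdirichlet_sketch below is ours, not a library function) normalizes independent gamma draws, a standard construction of the Dirichlet:

rdirichlet_sketch = function(n, alpha) {
  # one Gam(alpha_k, 1) draw per coordinate, normalized to sum to one
  g = matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}
set.seed(1)
p = rdirichlet_sketch(3, alpha = c(1, 1, 1))  # Dir(1, 1, 1): uniform on the simplex
rowSums(p)  # each draw sums to 1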

Vectors

  • Support: \(\mathbb{R}^K\)

  • Common choice: the multivariate normal (see the sketch after this list).
    • \(\mathcal{N}(\mu, \Sigma)\)
    • Mean vector \(\mu \in \mathbb{R}^K\), covariance matrix \(\Sigma \succ 0\), \(\Sigma\) symmetric.
    • Stan doc.
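A minimal R sketch using MASS::mvrnorm (the values of \(\mu\) and \(\Sigma\) below are arbitrary):

suppressPackageStartupMessages(require(MASS))
mu    = c(0, 1)
Sigma = matrix(c(1, 0.5, 0.5, 2), nrow = 2)  # symmetric, positive definite
set.seed(1)
draws = mvrnorm(10000, mu = mu, Sigma = Sigma)
colMeans(draws)  # close to mu
cov(draws)       # close to Sigma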

Many others!


Alternative approach: reparameterization

  • Suppose you need a distribution with support \([0, 1]\).
  • We have seen above that one option is to use a beta prior.
  • An alternative:
    • define \(X \sim \mathcal{N}(0, 1)\),
    • use a transformation to map it to \([0, 1]\), e.g. \(Y = \text{logistic}(X)\); see the sketch after this list.
  • This approach is known as “reparameterization”
    • For certain variational approaches, this helps inference.
    • More on that in the last week.
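A minimal R sketch of this transformation (plogis is R’s built-in logistic function):

set.seed(1)
x = rnorm(100000)  # X ~ N(0, 1)
y = plogis(x)      # Y = logistic(X) = 1 / (1 + exp(-X))
range(y)           # all draws lie strictly inside (0, 1)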

Footnotes

  1. The root cause of that misconception is often an attempt to apply the “plug-in principle” from frequentist statistics to a Bayesian context.