`suppressPackageStartupMessages(require(rstan))`

# More Bayesian bricks

## Outline

### Topics

- More distributions to complement those tested in quiz 1.
- Motivation, realization and parameterization(s) for each.
- Reparameterization.

### Rationale

Recall our model building strategy:

- start with observation, and find a distribution that “match its data type” (this creates the likelihood),
- i.e. such that
**support**of the distribution \(=\) observation data type

- i.e. such that
- then look at the data types of each of the parameters of the distribution you just picked…
- …and search for a distributions that match each the parameters’ data type (this creates the prior),

- in the case of hierarchical models, recurse this process.

There are a few common data types for which we do not have talked much about distributions having realizations of that datatype. We now fill this gap.

## Counts

- Support: \(\{0, 1, 2, 3, \dots\}\).

- Simple common choice is the
**Poisson distribution**:- \({\mathrm{Poisson}}(\lambda)\)

- Parameter: Mean \(\lambda > 0\).
- Motivation: law of rare events.
- Stan doc.

- \({\mathrm{Poisson}}(\lambda)\)
- Popular alternative, e.g., in bio-informatics: the
**negative binomial distribution**:- \({\mathrm{NegBinom}}(\mu, \phi)\)
- Mean parameter \(\mu > 0\) and concentration \(\phi > 0\).
- Motivation:
- Poisson’s variance is always the same as its mean.
- Consider \({\mathrm{NegBinom}}\) when empirically the variance is greater than the mean (“over-dispersion”).

- Stan doc.

## Positive real numbers

- Support: \(\{x \in \mathbb{R}: x > 0\} = \mathbb{R}^+\)

- More common choice is the
**gamma distribution**:- \({\mathrm{Gam}}(\alpha, \beta)\)
- Parameters: Shape parameters \(\alpha > 0\) and rate \(\beta > 0\).
- Stan doc.

**Question:** consider the following Stan model:

```
data {
int<lower=0> n_obs;
vector<lower=0>[n_obs] observations;
}
parameters {
real<lower=0> shape;
real<lower=0> rate;
}
model {
// priors
1.0/100);
shape ~ exponential(1.0/100);
rate ~ exponential(
// likelihood
observations ~ gamma(shape, rate); }
```

Notice that neither of the parameters passed in the likelihood can be interpreted as a mean. However, you are asked to report a mean parameter for the population from which the observations come from. How would you proceed?

- Using
`generated quantities { real next_obs = gamma_rng(shape, rate); }`

, and reporting the posterior mean of`next_obs`

variable. - Use the posterior mean of the
`shape`

parameter, divided by the posterior mean of the`rate`

parameter. - Using
`transformed parameters { real mean_param = shape/rate; }`

, and reporting the posterior mean of the`mean_param`

variable. - All of the above
- None of the above

The most direct is 1, which is in line with what we did for prediction in many examples.

We will see very soon that 3 is also actually possible, this is known as Rao-Blackwellization. In summary, it is based on the fact that if \(Y \sim {\mathrm{Gam}}(\alpha, \beta)\), then \(\mathbb{E}[Y] = \alpha/\beta\).

Empirical example that both are equivalent:

```
data {
int<lower=0> n_obs;
vector<lower=0>[n_obs] observations;
}
parameters {
real<lower=0> shape;
real<lower=0> rate;
}
transformed parameters {
real rao_blackwellized_mean = shape/rate;
}
model {
// priors
1.0/100);
shape ~ exponential(1.0/100);
rate ~ exponential(
// likelihood
observations ~ gamma(shape, rate);
}
generated quantities {
real monte_carlo_mean = gamma_rng(shape, rate);
}
```

```
set.seed(1)
= rgamma(5, shape = 2, rate = 3) observations
```

```
= sampling(
fit
gamma_with_mean,seed = 1,
refresh = 0,
data = list(n_obs = length(observations), observations = observations),
iter = 100000
)
```

```
Warning: There were 1 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
```

`Warning: Examine the pairs() plot to diagnose sampling problems`

` fit`

```
Inference for Stan model: anon_model.
4 chains, each with iter=1e+05; warmup=50000; thin=1;
post-warmup draws per chain=50000, total post-warmup draws=2e+05.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
shape 5.91 0.02 2.91 1.69 3.77 5.43 7.52 12.83 34520
rate 6.63 0.02 3.37 1.72 4.15 6.07 8.50 14.66 34671
rao_blackwellized_mean 0.92 0.00 0.20 0.60 0.79 0.90 1.02 1.38 145150
monte_carlo_mean 0.92 0.00 0.49 0.23 0.60 0.84 1.15 2.09 190050
lp__ -0.66 0.01 1.10 -3.61 -1.08 -0.32 0.13 0.42 41775
Rhat
shape 1
rate 1
rao_blackwellized_mean 1
monte_carlo_mean 1
lp__ 1
Samples were drawn using NUTS(diag_e) at Mon Mar 18 14:14:21 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

Counter example that answer 2, a common misconception,^{1} is incorrect:

`5.91 / 6.63`

`[1] 0.8914027`

This is because we are interested in \(\mathbb{E}[\text{shape}/\text{rate}|Y]\), not \(\mathbb{E}[\text{shape}|Y]/\mathbb{E}[\text{rate}|Y]\).

This can be seen with the law of total expectation:

\[\begin{align*} \mathbb{E}[Y_\text{next}|Y_\text{obs}] &= \mathbb{E}[ \mathbb{E}[Y_\text{next} | \text{shape, rate}]|Y_\text{obs}] \\ &= \mathbb{E}[ \text{shape}/\text{rate} |Y_\text{obs}]. \end{align*}\]

An indeed the MC estimates for \(\mathbb{E}[\text{shape}/\text{rate}|Y]\) and \(\mathbb{E}[\text{shape}|Y]/\mathbb{E}[\text{rate}|Y]\) should be expected to be different:

\[ \frac{1}{M} \sum_{m=1}^M \frac{X^{(m)}_\text{shape}}{X^{(m)}_\text{rate}} \neq \frac{\frac{1}{M} \sum_{m=1}^M X^{(m)}_\text{shape}}{ \frac{1}{M} \sum_{m=1}^M X^{(m)}_\text{rate}}. \]

## Categories

- Support: \(\{0, 1, 2, 3, \dots, K\}\), for some number of categories \(K\).
- All such distributions captured by the
**categorical distribution** - We first discussed it in Exercise 3.
- \({\mathrm{Categorical}}(p_1, \dots, p_K)\)
- Probabilities \(p_k > 0\), \(\sum_k p_k = 1\).
- Stan doc.

## Simplex

**Terminology** for the set of valid parameters for the categorical, \(\{(p_1, p_2, \dots, p_K) : p_k > 0, \sum_k p_k = 1\}\): the **\(K\)-simplex**.

- Hence, if you need a prior over the parameters of a categorical, you need a distribution over the simplex!
- Common choice: the
**Dirichlet distribution**:- \({\mathrm{Dir}}(\alpha_1, \dots, \alpha_K)\)
- Concentrations \(\alpha_i > 0\).

- Stan doc.

## Vectors

- Support: \(\mathbb{R}^K\)

- Common choice: the
**multivariate normal**.- \(\mathcal{N}(\mu, \Sigma)\)
- Mean vector \(\mu \in \mathbb{R}^K\), covariance matrix \(\Sigma \succ 0\), \(\Sigma\) symmetric.
- Stan doc.

## Many others!

### References

- Use wikipedia’s massive distribution list,
- and Stan’s documentation.

### Alternative approach: reparameterization

- Suppose you need a distribution with support \([0, 1]\).
- We have seen above that one option is to use a beta prior.
- An alternative:
- define \(X \sim \mathcal{N}(0, 1)\),
- use a transformation to map it to \([0, 1]\), e.g. \(Y = \text{logistic}(X)\).

- This approach is known as “re-parametrization”
- For certain variational approaches, this helps inference.
- More on that in the last week.

## Footnotes

The root cause of that mis-conception is often an attempt to apply the “plug-in principle” from frequentist statistics to a Bayesian context.↩︎