# Data collection mechanisms

## Outline

### Topics

- Generalizing the censoring problem to other data collection mechanisms.
- Examples of other data collection mechanisms.

### Rationale

Rather than memorizing a catalogue of data collection mechanisms, it is more important to recognize that each one is simply a special (but important) instance of probabilistic modelling, and of the first step of our Bayesian recipe.

## Examples

### Truncation

- In censoring, we knew how many \(H_i\)’s were above the detection limit.
- In **truncation**, a different setup, we have even less information:
  - we only observe the \(H_i\)’s that are below the limit…
  - …we don’t know how many were above the limit.

- Mathematically, when the \(H_i\) have a continuous distribution this can be modelled as:

\[\begin{align*} X &\sim \text{prior}() \\ N &\sim \text{DiscreteDistribution}() \\ H_1, \dots, H_N &\sim \text{likelihood}(X) \\ I_i &= \mathbb{1}[H_i \le L] \\ Y &= \{ H_i : I_i = 1 \}. \end{align*}\]

- Here \(I_i\) is an “inclusion indicator”.
- Bayesian analysis will be based on \(X \mid Y\).
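The generative model above can be sketched as a short simulation. All concrete choices below (an exponential likelihood, a fixed \(x\) and \(N\), the value of \(L\)) are hypothetical stand-ins for the abstract prior, likelihood, and discrete distribution in the notes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins (not specified in the notes):
x = 2.0   # latent parameter, pretending it was drawn from prior()
N = 50    # number of latent draws, pretending it came from DiscreteDistribution()
L = 1.0   # detection limit

H = rng.exponential(scale=1.0 / x, size=N)  # H_1, ..., H_N ~ likelihood(X)
I = H <= L                                  # inclusion indicators I_i = 1[H_i <= L]
Y = H[I]                                    # observed set: only values at or below L

# Under truncation we see Y but NOT N: len(Y) alone does not
# tell us how many H_i exceeded the limit.
print(len(Y), N)
```

Note that the simulation discards both the excluded values and their count, which is exactly what distinguishes truncation from censoring.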

**Example:** CRISPR-Cas9 unique molecular identifier (UMI) family size. Families of cells that left zero progeny are not observed!

### Non-ignorable missingness

Truncation can be generalized as follows:

- Instead of a deterministic criterion based on \(H_i\) deciding whether to include it in the set of observations,
- make that decision based on a probability model \(p\) that can depend on \(h_i\) and \(x\), with \(p(x, h_i) \in [0, 1]\):

\[\begin{align*} X &\sim \text{prior}() \\ N &\sim \text{DiscreteDistribution}() \\ H_1, \dots, H_N &\sim \text{likelihood}(X) \\ I_i &\sim {\mathrm{Bern}}(p(X, H_i)) \\ Y &= \{ H_i : I_i = 1 \}. \end{align*}\]
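The only change relative to the truncation sketch is that the inclusion indicators become Bernoulli draws. Again, the specific likelihood and inclusion probability below are hypothetical illustrations, not choices made in the notes:

```python
import numpy as np

rng = np.random.default_rng(2)

def p_include(x, h):
    # Hypothetical inclusion probability depending on both x and h;
    # here, larger h makes inclusion less likely (a "soft" limit).
    return 1.0 / (1.0 + np.exp(h - x))

# Hypothetical stand-ins for the prior and DiscreteDistribution():
x = 2.0
N = 50

H = rng.exponential(scale=1.0, size=N)  # H_1, ..., H_N ~ likelihood(X)
I = rng.random(N) < p_include(x, H)     # I_i ~ Bern(p(X, H_i))
Y = H[I]                                # observed set {H_i : I_i = 1}
```

Because \(p\) depends on the unobserved \(H_i\) themselves, the missingness is non-ignorable: which values go missing carries information about \(X\).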

**Question:** how would you set \(p(x, h)\) to recover truncation as a special case of non-ignorable missingness?

### Additional readings

See this week’s readings, in particular Chapter 8 of Gelman et al. (2013).