Data collection mechanisms
Outline
Topics
- Generalizing the censoring problem to other data collection mechanisms.
- Examples of other data collection mechanisms.
Rationale
Rather than memorizing several data collection mechanisms, it is more important to recognize that each is simply a special (but important) instance of probabilistic modelling and of the first step of our Bayesian recipe.
Examples
Truncation
- In censoring, we knew how many \(H_i\)’s were above the detection limit.
- Truncation is a different setup in which we have even less information:
- we only observe the \(H_i\)’s that are below the limit…
- …we don’t know how many were above the limit.
- Mathematically, when the \(H_i\) have a continuous distribution this can be modelled as:
\[\begin{align*} X &\sim \text{prior}() \\ N &\sim \text{DiscreteDistribution}() \\ H_1, \dots, H_N &\sim \text{likelihood}(X) \\ I_i &= \mathbb{1}[H_i \le L] \\ Y &= \{ H_i : I_i = 1 \}. \end{align*}\]
- Here \(I_i\) is an “inclusion indicator”.
- Bayesian analysis will be based on the conditional distribution \(X \mid Y\).
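The generative model above can be sketched as a forward simulation. This is an illustrative sketch only: the specific distributional choices (exponential prior and likelihood, the way \(N\) is drawn, the value of the detection limit \(L\)) are assumptions made for demonstration, not choices from the notes.

```python
import random

random.seed(1)

L = 2.0  # assumed detection limit (illustrative value)

def simulate_truncated_dataset():
    """Forward-simulate the truncation model: only H_i <= L are observed,
    and neither N nor the excluded H_i are ever seen by the analyst."""
    x = random.expovariate(1.0)                      # X ~ prior()
    n = random.randint(1, 50)                        # N ~ DiscreteDistribution()
    hs = [random.expovariate(x) for _ in range(n)]   # H_1, ..., H_N ~ likelihood(X)
    y = [h for h in hs if h <= L]                    # I_i = 1[H_i <= L]; Y = {H_i : I_i = 1}
    return x, n, y

x, n, y = simulate_truncated_dataset()
print(all(h <= L for h in y))  # every observed value is below the limit
print(len(y) <= n)             # the observed set can only shrink
```

Note that the analyst sees only `y`: unlike censoring, the number of excluded draws, `n - len(y)`, is never recorded.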
Example: CRISPR-Cas9 unique molecular identifier (UMI) family sizes. “Families of cells” that left no progeny are not observed!
Non-ignorable missingness
Truncation can be generalized as follows:
- Instead of a deterministic criterion based on \(H_i\) to decide whether to include it in the set of observations,
- make that decision based on a probability model \(p\) that may depend on \(h_i\) and \(x\), with \(p(x, h_i) \in [0, 1]\):
\[\begin{align*} X &\sim \text{prior}() \\ N &\sim \text{DiscreteDistribution}() \\ H_1, \dots, H_N &\sim \text{likelihood}(X) \\ I_i &\sim {\mathrm{Bern}}(p(X, H_i)) \\ Y &= \{ H_i : I_i = 1 \}. \end{align*}\]
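This generalization can also be forward-simulated. Again a sketch under stated assumptions: the Gaussian prior and likelihood, the range of \(N\), and the logistic form of \(p(x, h)\) below are arbitrary illustrative choices, not part of the notes.

```python
import math
import random

random.seed(2)

def p_include(x, h):
    """An assumed inclusion probability p(x, h) in [0, 1]: here a logistic
    curve, so larger values of h are less likely to be observed."""
    return 1.0 / (1.0 + math.exp(h - x))

def simulate_dataset():
    x = random.gauss(0.0, 1.0)                       # X ~ prior()
    n = random.randint(1, 30)                        # N ~ DiscreteDistribution()
    hs = [random.gauss(x, 1.0) for _ in range(n)]    # H_1, ..., H_N ~ likelihood(X)
    # I_i ~ Bern(p(X, H_i)); Y = {H_i : I_i = 1}
    y = [h for h in hs if random.random() < p_include(x, h)]
    return x, n, y

x, n, y = simulate_dataset()
print(len(y) <= n)  # inclusion can only drop observations
```

The missingness is “non-ignorable” because whether \(H_i\) is observed depends on its value (and on \(X\)), so analyzing `y` as if it were a complete sample would bias the inference.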
Question: how would you set \(p(x, h)\) to recover truncation as a special case of non-ignorable missingness?
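One candidate answer (mine, not stated in the notes) is to make the inclusion probability deterministic, \(p(x, h) = \mathbb{1}[h \le L]\). A quick numerical check that this choice reproduces the truncation rule exactly:

```python
import random

random.seed(3)

L = 2.0  # assumed detection limit (illustrative value)

def p(x, h):
    """Candidate p(x, h) = 1[h <= L]: a degenerate Bernoulli probability."""
    return 1.0 if h <= L else 0.0

x = random.expovariate(1.0)
hs = [random.expovariate(1.0) for _ in range(20)]

# Bernoulli inclusion with this p vs. the deterministic truncation rule:
via_bernoulli = [h for h in hs if random.random() < p(x, h)]
via_truncation = [h for h in hs if h <= L]
print(via_bernoulli == via_truncation)  # prints True: identical observed sets
```

Since `random.random()` returns values in \([0, 1)\), the comparison is always true when \(p = 1\) and always false when \(p = 0\), so the Bernoulli draw is deterministic and the two observed sets coincide.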
Additional readings
See this week’s readings: Chapter 8 of Gelman et al., 2013.