Data collection mechanisms

Outline

Topics

  • Generalizing the censoring problem to other data collection mechanisms.
  • Examples of other data collection mechanisms.

Rationale

Rather than memorizing several data collection mechanisms, it is more important to recognize that each one is simply a special (but important) example of probabilistic modelling, i.e., of the first step of our Bayesian recipe.

Examples

Truncation

  • In censoring, we knew how many \(H_i\)’s were above the detection limit.
  • In truncation, a different setup, we now have even less information:
    • we only observe the \(H_i\)’s that are below the limit…
    • …we don’t know how many were above the limit.
  • Mathematically, when the \(H_i\) have a continuous distribution this can be modelled as:

\[\begin{align*} X &\sim \text{prior}() \\ N &\sim \text{DiscreteDistribution}() \\ H_1, \dots, H_N &\sim \text{likelihood}(X) \\ I_i &= \mathbb{1}[H_i \le L] \\ Y &= \{ H_i : I_i = 1 \}. \end{align*}\]

  • Here \(I_i\) is an “inclusion indicator”.
  • Bayesian analysis will be based on the posterior \(X | Y\).
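To make the generative description above concrete, here is a minimal forward-simulation sketch using only the Python standard library. The specific distributions are illustrative assumptions and not part of these notes: a standard normal prior, a uniform distribution on \(\{0, \dots, 20\}\) for \(N\), and a normal likelihood centred at \(X\). The function name `simulate_truncation` and the parameter `max_n` are likewise hypothetical.

```python
import random


def simulate_truncation(limit, rng, max_n=20):
    """Forward-simulate the truncation model once.

    Assumed (illustrative) choices, not taken from the notes:
    X ~ Normal(0, 1), N ~ Uniform{0, ..., max_n}, H_i | X ~ Normal(X, 1).
    """
    x = rng.gauss(0.0, 1.0)                     # X ~ prior()
    n = rng.randint(0, max_n)                   # N ~ DiscreteDistribution()
    hs = [rng.gauss(x, 1.0) for _ in range(n)]  # H_1, ..., H_N ~ likelihood(X)
    included = [h <= limit for h in hs]         # I_i = 1[H_i <= L]
    y = [h for h, keep in zip(hs, included) if keep]  # Y = {H_i : I_i = 1}
    return x, n, y


rng = random.Random(1)
x, n, y = simulate_truncation(limit=0.5, rng=rng)
# Only y reaches the analyst; n is latent, so the number of H_i above
# the limit is unknown, unlike in the censoring setup.
print(len(y), "values observed; latent total was", n)
```

In this sketch the function returns `x` and `n` only so that the simulation can be inspected; a posterior computation would treat both as unobserved.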

Example: CRISPR-Cas9 unique molecular identifier (UMI) family size. “Families of cells” that left no progeny are not observed!

Non-ignorable missingness

Truncation can be generalized as follows:

  • Instead of using a deterministic criterion based on \(H_i\) to decide whether to include it in the set of observations,
  • make that decision based on some probability model \(p\) that could depend on \(h_i\) and \(x\), \(p(x, h_i) \in [0, 1]\):

\[\begin{align*} X &\sim \text{prior}() \\ N &\sim \text{DiscreteDistribution}() \\ H_1, \dots, H_N &\sim \text{likelihood}(X) \\ I_i &\sim {\mathrm{Bern}}(p(X, H_i)) \\ Y &= \{ H_i : I_i = 1 \}. \end{align*}\]
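The only change from the truncation model is that the inclusion indicator is now random. A minimal forward-simulation sketch follows, again using only the Python standard library. Everything specific here is an assumption made for illustration: the normal prior and likelihood, the uniform distribution for \(N\), and in particular the logistic form of the hypothetical `inclusion_probability`, under which larger values of \(h\) are less likely to be recorded.

```python
import math
import random


def inclusion_probability(x, h):
    """Hypothetical p(x, h) in [0, 1]: larger h is less likely to be kept."""
    return 1.0 / (1.0 + math.exp(h - x))


def simulate_missingness(rng, p=inclusion_probability, max_n=20):
    """Forward-simulate the non-ignorable missingness model once.

    Assumed (illustrative) choices, not taken from the notes:
    X ~ Normal(0, 1), N ~ Uniform{0, ..., max_n}, H_i | X ~ Normal(X, 1).
    """
    x = rng.gauss(0.0, 1.0)                     # X ~ prior()
    n = rng.randint(0, max_n)                   # N ~ DiscreteDistribution()
    hs = [rng.gauss(x, 1.0) for _ in range(n)]  # H_1, ..., H_N ~ likelihood(X)
    # I_i ~ Bern(p(X, H_i)): a uniform draw on [0, 1) falls below p
    # with probability exactly p.
    included = [rng.random() < p(x, h) for h in hs]
    y = [h for h, keep in zip(hs, included) if keep]  # Y = {H_i : I_i = 1}
    return x, n, y


x, n, y = simulate_missingness(random.Random(1))
print(len(y), "values observed; latent total was", n)
```

Passing a different function as `p` changes the data collection mechanism without touching the rest of the model, which is the sense in which this setup generalizes truncation.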

Question: how would you set \(p(x, h)\) to recover truncation as a special case of non-ignorable missingness?

Additional readings

See this week’s readings: Chapter 8 of Gelman et al. (2013).