Censoring

Outline

Topics

  • Recognizing censoring
  • Modelling censoring

Rationale

Censoring is an example of a data collection process that cannot safely be ignored. We will see a case study where ignoring the data collection process literally cost many lives…

Example: Chernobyl, 1986

  • In 1986, the Chernobyl Nuclear Power Plant exploded.
  • Initially, the reactor crew chief, A. Akimov, assumed the reactor core was intact.

Data generation

  • The numbers used in this page are synthetic but inspired by a true story.
  • Let us not look at the data-generating code just yet; we will come back to it.
set.seed(1)

# detection limit: values higher than this stay at the limit
limit = 1.1 

n_measurements = 10

true_mean = 5.6

# data, if we were able to observe perfectly
y = rexp(n_measurements, 1.0/true_mean)

# number of measurements higher than the detection limit
n_above_limit = sum(y >= limit)
n_below_limit = sum(y < limit)

# subset of the measurements that are below the limit:
data_below_limit = y[y < limit]

# measurements: those higher than the limit stay at the limit
measurements = ifelse(y < limit, y, limit)
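
Before moving on, note what the saturation does to a naive summary. A quick check, using the variables defined above:

# the true mean is 5.6, yet the capped readings drag the average down
mean(y)             # average of the true values (normally unobservable)
mean(measurements)  # average of the saturated readings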

Measuring radiation levels

  • A dosimeter is a device measuring ionizing radiation.
    • Unit: roentgens per second (R/s).
  • Suppose that:
    • If ionizing radiation is greater than 1.5 R/s \(\Rightarrow\) we think the reactor is breached (radiation coming out of the reactor core).
    • If ionizing radiation is smaller than 1.5 R/s \(\Rightarrow\) we think the reactor is not breached.
  • We send 10 workers each with a dosimeter.
  • The average over the 10 readings is:
mean(measurements)
[1] 1.012227
  • All good?
  • Let us look at the raw data and a histogram.
suppressPackageStartupMessages(require("ggplot2"))
df = data.frame(measurements = measurements)

ggplot(df, aes(x = measurements)) +
  geom_histogram() + 
  geom_rug(alpha = 0.1) + 
  theme_minimal()

measurements
 [1] 1.1000000 1.1000000 0.8159577 0.7828535 1.1000000 1.1000000 1.1000000
 [8] 1.1000000 1.1000000 0.8234575
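
Seven of the ten readings sit exactly at the detection limit; we can count them directly:

# number of readings pinned at the detection limit (seven of the ten)
sum(measurements == limit)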

Question: is this concerning?

Bayesian approach to censoring

  • What we have encountered in the Chernobyl example is known as censoring.
  • Solution: modelling the censoring process.

Mathematical description

  • Let \(L\) denote the detection limit (here \(L = 1.1\)).
  • Let \(X\) denote the unknown parameter (here, the rate of the exponential; the true mean we try to recover is \(1/X\)).
  • Let \(i\) denote the observation index (worker \(i \in \{1, 2, \dots, 10\}\) in our example).
  • Let \(H_i\) denote the measurement before censoring (i.e., what we would record with a perfect measurement device immune to saturation effects).
  • Let \(C_i\) denote a binary censoring indicator (i.e., equal to one if the imperfect device saturates, zero otherwise).
  • Let \(Y_i = C_i L + (1 - C_i) H_i\) denote the recorded measurement.

In the Chernobyl example, we will use the following model:

\[\begin{align*} X &\sim {\mathrm{Exp}}(1/100) \\ H_i &\sim {\mathrm{Exp}}(X) \\ C_i &= \mathbb{1}[H_i \ge L], \\ Y_i &= C_i L + (1 - C_i) H_i. \end{align*}\]

The goal is to compute the posterior distribution of \(X\) given \(Y_{1:10}\), and in particular the posterior of the mean radiation level \(1/X\).
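
An equivalent route, which the Stan code below does not take, is to marginalize the censored \(H_i\)'s out analytically; under the exponential model each observation's likelihood contribution has a closed form:

\[\begin{align*} P(Y_i = L \mid X) &= P(H_i \ge L \mid X) = e^{-X L}, \\ p(Y_i = y_i \mid X) &= X e^{-X y_i} \quad \text{for } y_i < L. \end{align*}\]

Instead, the implementation below treats the censored \(H_i\)'s as latent variables and lets the sampler average over them.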

Stan implementation

suppressPackageStartupMessages(require(rstan))
data {
  // number of H_i's above the detection limit
  int<lower=0> n_above_limit;
  // number of H_i's below the detection limit
  int<lower=0> n_below_limit;
  real<lower=0> limit;
  // the H_i's below the limit are observed, so they go in the data block
  vector<upper=limit>[n_below_limit] data_below_limit;
}

parameters {
  real<lower=0> rate;
  // the H_i's above the limit are not observed, so they go in the
  // parameters block (a better name would have been "latent block")
  vector<lower=limit>[n_above_limit] data_above_limit;
}

model {
  // prior
  rate ~ exponential(1.0/100);

  // likelihood
  data_above_limit ~ exponential(rate);
  data_below_limit ~ exponential(rate);
}

generated quantities {
  real mean = 1.0/rate;
}
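The sampling call below assumes the Stan program above has been compiled into an object named chernobyl_naive. With plain rstan this can be done roughly as follows, where chernobyl_code is a hypothetical string holding the program:

# hypothetical compile step: chernobyl_code holds the Stan program above as a string
chernobyl_naive = stan_model(model_code = chernobyl_code)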
fit = sampling(
  chernobyl_naive,
  seed = 1,
  chains = 1,
  data = list(
            limit = limit,
            n_above_limit = n_above_limit, 
            n_below_limit = n_below_limit,
            data_below_limit = data_below_limit
          ),       
  iter = 100000                   
)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:     1 / 100000 [  0%]  (Warmup)
Chain 1: Iteration: 10000 / 100000 [ 10%]  (Warmup)
Chain 1: Iteration: 20000 / 100000 [ 20%]  (Warmup)
Chain 1: Iteration: 30000 / 100000 [ 30%]  (Warmup)
Chain 1: Iteration: 40000 / 100000 [ 40%]  (Warmup)
Chain 1: Iteration: 50000 / 100000 [ 50%]  (Warmup)
Chain 1: Iteration: 50001 / 100000 [ 50%]  (Sampling)
Chain 1: Iteration: 60000 / 100000 [ 60%]  (Sampling)
Chain 1: Iteration: 70000 / 100000 [ 70%]  (Sampling)
Chain 1: Iteration: 80000 / 100000 [ 80%]  (Sampling)
Chain 1: Iteration: 90000 / 100000 [ 90%]  (Sampling)
Chain 1: Iteration: 100000 / 100000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.538 seconds (Warm-up)
Chain 1:                0.72 seconds (Sampling)
Chain 1:                1.258 seconds (Total)
Chain 1: 
fit
Inference for Stan model: anon_model.
1 chains, each with iter=1e+05; warmup=50000; thin=1; 
post-warmup draws per chain=50000, total post-warmup draws=50000.

                      mean se_mean   sd   2.5%    25%    50%    75%  97.5%
rate                  0.40    0.00 0.20   0.11   0.25   0.36   0.51   0.86
data_above_limit[1]   4.44    0.03 4.74   1.17   1.86   3.01   5.23  16.18
data_above_limit[2]   4.44    0.03 4.87   1.16   1.85   2.99   5.23  16.30
data_above_limit[3]   4.44    0.03 5.01   1.16   1.84   2.96   5.20  16.40
data_above_limit[4]   4.42    0.03 4.78   1.17   1.86   3.01   5.21  16.00
data_above_limit[5]   4.45    0.03 5.04   1.16   1.86   3.00   5.25  16.24
data_above_limit[6]   4.43    0.03 4.60   1.16   1.85   3.00   5.23  16.10
data_above_limit[7]   4.45    0.03 4.97   1.16   1.83   2.99   5.29  16.26
mean                  3.35    0.02 2.43   1.16   1.97   2.74   3.96   9.20
lp__                -19.27    0.02 2.25 -24.59 -20.55 -18.92 -17.61 -15.96
                    n_eff Rhat
rate                29917    1
data_above_limit[1] 31929    1
data_above_limit[2] 29991    1
data_above_limit[3] 30474    1
data_above_limit[4] 30375    1
data_above_limit[5] 29617    1
data_above_limit[6] 29820    1
data_above_limit[7] 33738    1
mean                17755    1
lp__                16351    1

Samples were drawn using NUTS(diag_e) at Mon Mar 18 23:08:34 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
suppressPackageStartupMessages(require(bayesplot))
suppressPackageStartupMessages(require(ggplot2))
mcmc_areas_ridges(fit, pars = c("mean")) + 
  theme_minimal() + 
  scale_x_continuous(limits = c(0, 10)) 

Back to the question of whether the radiation is greater than 1.5 R/s:

samples = extract(fit)$mean

sum(samples > 1.5) / length(samples)
[1] 0.90094
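
With the censoring modelled, the posterior probability that the radiation exceeds 1.5 R/s is about 90%: the opposite of what the naive average of about 1.01 R/s suggested. For contrast, a naive Bayesian analysis that treats the capped readings as exact exponential draws can be done in closed form, since the \({\mathrm{Exp}}(1/100)\) prior on the rate is a \({\mathrm{Gamma}}(1, 1/100)\) distribution, conjugate to the exponential likelihood. A sketch, not part of the original analysis:

# naive conjugate posterior on the rate, ignoring censoring:
# Gamma(1 + n, 1/100 + sum of the readings)
post_shape = 1 + length(measurements)
post_rate  = 1/100 + sum(measurements)

# P(mean > 1.5) = P(rate < 1/1.5) under the naive posterior
pgamma(1/1.5, shape = post_shape, rate = post_rate)

This probability is far smaller, because the naive posterior concentrates near the capped average; ignoring the censoring would wrongly suggest the reactor is likely intact.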