Censoring

Outline

Topics

  • Recognizing censoring
  • Modelling censoring

Rationale

Censoring is an example of how the data collection process shapes what we observe. We will see a case study where ignoring the data collection process literally cost many lives…

Example: Chernobyl, 1986

  • In 1986, the Chernobyl Nuclear Power Plant exploded.
  • Initially, the reactor crew chief, A. Akimov, assumed the core reactor was intact.

Data generation

  • The numbers used in this page are synthetic but inspired by a true story.
  • Let us not look at the data-generating code just yet; we will come back to it.
set.seed(1)

# detection limit: values higher than this stay at the limit
limit = 1.1 

n_measurements = 10

true_mean = 5.6

# data, if we were able to observe perfectly
y = rexp(n_measurements, 1.0/true_mean)

# number of measurements higher than the detection limit
n_above_limit = sum(y >= limit)
n_below_limit = sum(y < limit)

# subset of the measurements that are below the limit:
data_below_limit = y[y < limit]

# measurements: those higher than the limit stay at the limit
measurements = ifelse(y < limit, y, limit)

Measuring radiation levels

  • A dosimeter is a device measuring ionizing radiation.
    • Unit: roentgens per second (R/s).
  • Suppose that:
    • If ionizing radiation is greater than 1.5 R/s \(\Rightarrow\) we think the reactor is breached (radiation coming out of reactor core).
    • If ionizing radiation is smaller than 1.5 R/s \(\Rightarrow\) we think the reactor is not breached.
  • We send 10 workers each with a dosimeter.
  • The average over the 10 readings is:
mean(measurements)
[1] 1.012227
  • All good?
  • Let us look at the raw data and histogram.
suppressPackageStartupMessages(require("ggplot2"))
df = data.frame(measurements = measurements)

ggplot(df, aes(x = measurements)) +
  geom_histogram() + 
  geom_rug(alpha = 0.1) + 
  theme_minimal()

measurements
 [1] 1.1000000 1.1000000 0.8159577 0.7828535 1.1000000 1.1000000 1.1000000
 [8] 1.1000000 1.1000000 0.8234575

Question: is this concerning?

  1. No concerns, all good. Let’s go in the reactor to inspect.
  2. It is unusual that there are many repeated measured values, but this can happen when measuring discrete distributions.
  3. It is unusual that there are many repeated measured values, this is due to high noise.
  4. It is concerning that there are repeated measured values, this could be due to the dosimeters having a detection limit.
  5. None of the above.

The correct answer is 4: dosimeters can suffer from a saturation effect, i.e. if the radiation is stronger than the detection limit, the measurement can underestimate the true value (potentially by a huge amount).

Sadly, in reality the reactor crew chief A. Akimov picked option 1, and the effect was even more extreme: the detection limit of the available dosimeters was 0.001 R/s, whereas in the worst-hit region the true value was around 5.6 R/s. A lethal dose is around 500 R.

A. Akimov and his crew went in the reactor. Most, including Akimov, died from radiation exposure within 3 weeks.

Go back to the generating code, Section 2.1, to see how I generated the data.
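Since the data here are synthetic, we also have access to the uncensored values (the vector y in the generating code), which a real observer never sees. A quick check, reusing the variables defined in that code, makes the size of the underestimation explicit:

# compare the (unobservable) uncensored average to the censored readings
mean(y)             # average if the dosimeters had no detection limit
mean(measurements)  # average of what the dosimeters actually report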

Bayesian approach to censoring

  • What we have encountered in the Chernobyl example is known as censoring.
  • Solution: modelling the censoring process.

Mathematical description

  • Let \(L\) denote the detection limit (here \(L = 1.1\)).
  • Let \(X\) denote the unknown parameter (here, the rate of the exponential measurement distribution; its reciprocal \(1/X\) is the true mean we try to recover).
  • Let \(i\) denote the observation index (worker \(i \in \{1, 2, \dots, 10\}\) in our example).
  • Let \(H_i\) denote the measurement before censoring (i.e., what we would record with a perfect device immune to saturation effects).
  • Let \(C_i\) denote a binary censoring indicator (equal to one if the imperfect device saturates, zero otherwise).
  • Let \(Y_i = C_i L + (1 - C_i) H_i\) denote the recorded measurement.

In the Chernobyl example, we will use the following model:

\[\begin{align*} X &\sim {\mathrm{Exp}}(1/100) \\ H_i &\sim {\mathrm{Exp}}(X) \\ C_i &= \mathbb{1}[H_i \ge L], \\ Y_i &= C_i L + (1 - C_i) H_i. \end{align*}\]

The goal is to compute the posterior distribution of \(X\) given \(Y\), and in particular of the mean radiation level \(1/X\).
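Before turning to Stan, it is worth noting (an aside, not used in the rest of these notes) that for the exponential likelihood the censored observations can also be marginalized analytically: \(P(H_i \ge L \mid X) = e^{-XL}\), so each censored reading contributes a factor of \(e^{-XL}\) to the likelihood. A minimal R sketch of the resulting log-likelihood, reusing the variables from the data-generating code:

# log-likelihood of the censored data as a function of the rate X:
# uncensored values enter through their density, censored ones through
# the survival probability P(H_i >= L | X) = exp(-X * L)
censored_loglik = function(rate, below = data_below_limit, n_above = n_above_limit, L = limit) {
  sum(dexp(below, rate = rate, log = TRUE)) +
    n_above * pexp(L, rate = rate, lower.tail = FALSE, log.p = TRUE)
}

censored_loglik(1 / 5.6)   # the rate used to simulate the data
censored_loglik(1 / 1.1)   # a much larger rate (i.e., a much smaller mean)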

Stan implementation

suppressPackageStartupMessages(require(rstan))
data {
  int<lower=0> n_above_limit;                           // (1)
  int<lower=0> n_below_limit;                           // (2)
  real<lower=0> limit;
  vector<upper=limit>[n_below_limit] data_below_limit;  // (3)
  
}

parameters {
  real<lower=0> rate; 
  vector<lower=limit>[n_above_limit] data_above_limit;  // (4)
}

model {
  // prior
  rate ~ exponential(1.0/100);
  
  // likelihood
  data_above_limit ~ exponential(rate);
  data_below_limit ~ exponential(rate); 
}

generated quantities {
  real mean = 1.0/rate;
}
(1) The number of \(H_i\)’s above the detection limit.
(2) The number of \(H_i\)’s below the detection limit.
(3) The \(H_i\)’s below the limit are observed, so they go in the data block.
(4) The \(H_i\)’s above the limit are not observed, so they go in the parameters block (a better name would have been “latent block”).
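# note (assumption): chernobyl_naive below refers to the Stan model above,
# compiled beforehand (e.g., with rstan::stan_model()); the compilation step is not shown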
fit = sampling(
  chernobyl_naive,
  seed = 1,
  chains = 1,
  data = list(
            limit = limit,
            n_above_limit = n_above_limit, 
            n_below_limit = n_below_limit,
            data_below_limit = data_below_limit
          ),       
  iter = 100000                   
)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.6e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:     1 / 100000 [  0%]  (Warmup)
Chain 1: Iteration: 10000 / 100000 [ 10%]  (Warmup)
Chain 1: Iteration: 20000 / 100000 [ 20%]  (Warmup)
Chain 1: Iteration: 30000 / 100000 [ 30%]  (Warmup)
Chain 1: Iteration: 40000 / 100000 [ 40%]  (Warmup)
Chain 1: Iteration: 50000 / 100000 [ 50%]  (Warmup)
Chain 1: Iteration: 50001 / 100000 [ 50%]  (Sampling)
Chain 1: Iteration: 60000 / 100000 [ 60%]  (Sampling)
Chain 1: Iteration: 70000 / 100000 [ 70%]  (Sampling)
Chain 1: Iteration: 80000 / 100000 [ 80%]  (Sampling)
Chain 1: Iteration: 90000 / 100000 [ 90%]  (Sampling)
Chain 1: Iteration: 100000 / 100000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.538 seconds (Warm-up)
Chain 1:                0.72 seconds (Sampling)
Chain 1:                1.258 seconds (Total)
Chain 1: 
fit
Inference for Stan model: anon_model.
1 chains, each with iter=1e+05; warmup=50000; thin=1; 
post-warmup draws per chain=50000, total post-warmup draws=50000.

                      mean se_mean   sd   2.5%    25%    50%    75%  97.5%
rate                  0.40    0.00 0.20   0.11   0.25   0.36   0.51   0.86
data_above_limit[1]   4.44    0.03 4.74   1.17   1.86   3.01   5.23  16.18
data_above_limit[2]   4.44    0.03 4.87   1.16   1.85   2.99   5.23  16.30
data_above_limit[3]   4.44    0.03 5.01   1.16   1.84   2.96   5.20  16.40
data_above_limit[4]   4.42    0.03 4.78   1.17   1.86   3.01   5.21  16.00
data_above_limit[5]   4.45    0.03 5.04   1.16   1.86   3.00   5.25  16.24
data_above_limit[6]   4.43    0.03 4.60   1.16   1.85   3.00   5.23  16.10
data_above_limit[7]   4.45    0.03 4.97   1.16   1.83   2.99   5.29  16.26
mean                  3.35    0.02 2.43   1.16   1.97   2.74   3.96   9.20
lp__                -19.27    0.02 2.25 -24.59 -20.55 -18.92 -17.61 -15.96
                    n_eff Rhat
rate                29917    1
data_above_limit[1] 31929    1
data_above_limit[2] 29991    1
data_above_limit[3] 30474    1
data_above_limit[4] 30375    1
data_above_limit[5] 29617    1
data_above_limit[6] 29820    1
data_above_limit[7] 33738    1
mean                17755    1
lp__                16351    1

Samples were drawn using NUTS(diag_e) at Mon Mar 18 23:08:34 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
suppressPackageStartupMessages(require(bayesplot))
suppressPackageStartupMessages(require(ggplot2))
mcmc_areas_ridges(fit, pars = c("mean")) + 
  theme_minimal() + 
  scale_x_continuous(limits = c(0, 10)) 

Back to the question of whether the radiation is greater than 1.5 R/s:

samples = extract(fit)$mean

sum(samples > 1.5) / length(samples)
[1] 0.90094
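As a complementary summary (not part of the original question), a posterior credible interval for the mean radiation level can be read off the same draws:

# 95% posterior credible interval for the mean radiation level (R/s)
quantile(samples, probs = c(0.025, 0.975))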