Exercise 1: discrete probabilistic inference

Grading

Our priority with the weekly exercises is to provide timely feedback and an incentive to stay on top of the material so that lectures can be more effective.

We will select one or more questions to grade in more detail. For the other question(s), we will use the following participation-centric binary scheme:

  • 1 point if something reasonable was attempted,
  • 0 otherwise.

Goals

  • Refresh your memory of discrete probability (axioms, basic properties, conditioning, the discrete Bayes rule).
  • Introduce forward discrete simulation.
  • Review expectation and the law of large numbers.

Setup

This exercise is centered around the following scenario:

  • Imagine a bag with 3 coins, each with a different probability parameter \(p\).
  • Coin \(i\in \{0, 1, 2\}\) has bias \(i/2\); in other words:
    • First coin: bias is \(0/2 = 0\) (i.e. both sides are “heads”, \(p = 0\))
    • Second coin: bias is \(1/2 = 0.5\) (i.e. standard coin, \(p = 1/2\))
    • Third coin: bias is \(2/2 = 1\) (i.e. both sides are “tails”, \(p = 1\))

  • Consider the following two-step sampling process:
    • Step 1: pick one of the three coins, but do not look at it!
    • Step 2: flip the coin 4 times
  • Mathematically, this probability model can be written as follows: \[ \begin{align*} X &\sim {\mathrm{Unif}}\{0, 1, 2\} \\ Y_i | X &\sim {\mathrm{Bern}}(X/2), \quad i \in \{1, \dots, 4\}, \text{ independently} \end{align*} \tag{1}\]
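
For concreteness, a single draw from this model could look as follows in R (a minimal sketch; the variable names are illustrative, and Q.1 below asks you to turn such a simulation into a reusable function):

```r
# One draw from the model in Equation 1.
set.seed(1)
x <- sample(0:2, size = 1)              # Step 1: X ~ Unif{0, 1, 2}
y <- rbinom(4, size = 1, prob = x / 2)  # Step 2: Y_i | X ~ Bern(X/2), 4 flips
c(x, y)
```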

Q.1: sampling from a joint distribution

  1. Compute \(\mathbb{E}[(1+Y_1)^X]\) mathematically (with a precise mathematical derivation).
  2. Write an R function called forward_sample that samples (“simulates”) from the joint distribution of \((X, Y_1, \dots, Y_4)\). As a general practice, fix the seed, and submit both the code and the output (here, a single sample).
  3. How can your code and the law of large numbers be used to approximate \(\mathbb{E}[(1+Y_1)^X]\)? (One possible approach is sketched after this list.)
  4. Compare the approximation from your code with your answer in part 1.
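
A sketch of how parts 2 and 3 could fit together (the internals of forward_sample and the number of repetitions are illustrative, not prescribed):

```r
# Sketch: forward_sample wraps the two-step process from the Setup.
forward_sample <- function() {
  x <- sample(0:2, size = 1)            # Step 1: pick a coin uniformly
  y <- rbinom(4, size = 1, prob = x/2)  # Step 2: flip it 4 times
  c(x, y)                               # return (X, Y_1, ..., Y_4)
}

set.seed(1)
# Law of large numbers: the average of many i.i.d. realizations of
# (1 + Y_1)^X converges to E[(1 + Y_1)^X] as the number of draws grows.
g <- replicate(100000, { s <- forward_sample(); (1 + s[2])^s[1] })
mean(g)
```
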
Big idea

Part 4 of this question illustrates a big idea in this course:
strategies to validate inference, i.e., ensuring it is bug-free, both in the code and in the math. In a nutshell, this is possible thanks to theory: we use results that provide two ways to compute the same thing, and we verify that the two indeed agree.
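
Since the joint space here is finite, one concrete instance of this pattern is to compute the expectation from Q.1 a second way, by exhaustive enumeration, and check that the two routes agree (a sketch):

```r
# Second route to E[(1 + Y_1)^X]: exact enumeration over (X, Y_1),
# to be compared against the Monte Carlo estimate from Q.1.
exact <- 0
for (x in 0:2) {
  for (y1 in 0:1) {
    p_joint <- (1/3) * dbinom(y1, 1, x / 2)  # P(X = x) P(Y_1 = y1 | X = x)
    exact   <- exact + p_joint * (1 + y1)^x
  }
}
exact  # should match the simulation average up to Monte Carlo error
```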

Q.2: computing a conditional

Suppose now that you observe the outcome of the 4 coin flips, but not the type of coin that was picked. Say you observe “heads”, “heads”, “heads”, “heads”, encoded as [0, 0, 0, 0]. Given that observation, what is the probability that you picked the standard coin (i.e., the one with \(p = 1/2\))?

  1. Write mathematically: “Given you observe 4 heads, what is the probability that you picked the standard coin?”
  2. Compute the numerical value of the expression defined in part 1 (with a precise mathematical derivation).
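
Your derivation can be cross-checked numerically by enumerating Bayes' rule over the three coins (a sketch; recall from the Setup that “heads” is encoded as 0, so coin \(i\) shows heads with probability \(1 - i/2\)):

```r
# Posterior over the three coin types given four heads, by enumeration.
prior   <- rep(1/3, 3)           # uniform prior over the three coins
p_heads <- 1 - (0:2) / 2         # P(heads | coin i) = 1 - i/2
lik     <- p_heads^4             # P(four heads | coin i)
post    <- prior * lik / sum(prior * lik)
post[2]                          # P(standard coin | four heads)
```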

Q.3: non-uniform prior on coin types

We now modify the problem as follows: I stuffed the bag with 100 coins: 98 standard (fair) coins, 1 coin with only heads, and 1 coin with only tails. The rest is the same: pick one of the coins, flip it 4 times.

  1. Write the joint distribution of this modified model. Use the \(\sim\) notation as in Equation 1. Hint: use a Categorical distribution.
  2. Compute the probability that you picked one of the fair coins, given you see [0, 0, 0, 0].
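
The enumeration from Q.2 carries over; only the prior over coin types changes (a sketch):

```r
# Same Bayes-rule enumeration with the new, non-uniform prior.
prior   <- c(1, 98, 1) / 100     # heads-only, fair, tails-only coins
p_heads <- c(1, 1/2, 0)          # P(heads | coin type)
lik     <- p_heads^4
post    <- prior * lik / sum(prior * lik)
post[2]                          # P(fair coin | four heads)
```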

Q.4: a first posterior inference algorithm

We now generalize to having \(K + 1\) types of coins such that:

  • coin type \(k \in \{0, 1, \dots, K\}\) has bias \(k/K\)
  • the fraction of coins in the bag of type \(k\) is \(\rho_k\).

We consider the same observation as before: “you observe 4 heads”. We want to find, for each coin type \(k \in \{0, 1, \dots, K\}\), the conditional probability \(\pi_k\) that we picked coin type \(k\) from the bag given the observation.

  1. Write an R function called posterior_given_four_heads taking as input a vector \(\rho = (\rho_0, \rho_1, \dots, \rho_K)\) and returning \(\pi = (\pi_0, \pi_1, \dots, \pi_K)\) (one possible shape is sketched after this list).
  2. Test your code by making sure you can recover the answer in Q. 3 as a special case. Report what values of \(K\) and \(\rho\) you used.
  3. Show the output for \(\rho \propto (1, 2, 3, \dots, 10)\). Here \(\propto\) means “proportional to”; try to infer what it means in this context.
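
One possible shape for posterior_given_four_heads (a sketch; it assumes, as in the Setup, that coin type \(k\) shows heads with probability \(1 - k/K\)):

```r
# Sketch: pi_k is proportional to rho_k * P(four heads | coin type k).
posterior_given_four_heads <- function(rho) {
  K       <- length(rho) - 1     # coin types 0, 1, ..., K
  p_heads <- 1 - (0:K) / K       # P(heads | coin type k)
  unnorm  <- rho * p_heads^4
  unnorm / sum(unnorm)           # normalize so the entries sum to 1
}

# Part 2 check: K = 2 and rho = c(0.01, 0.98, 0.01) should recover Q.3.
posterior_given_four_heads(c(0.01, 0.98, 0.01))
# Part 3: passing the unnormalized vector 1:10 also works, because the
# function normalizes its output; that is one reading of rho ∝ (1, ..., 10).
posterior_given_four_heads(1:10)
```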

Q.5: generalizing observations

We now generalize Q. 4 as follows: instead of observing 4 “heads” out of 4 observations, say we observe n_heads out of n_observations, where n_heads and n_observations will be additional arguments passed into a new R function.

  1. Write the joint distribution of this modified model. Use the \(\sim\) notation as in Equation 1. Hint: use a Binomial distribution.
  2. Write an R function called posterior taking three input arguments in the following order: a vector \(\rho\) as in Q. 4, as well as two integers, n_heads and n_observations (a sketch appears after this list).
  3. Test your code by making sure you can recover the answer in Q. 3 as a special case.
  4. Show the output for \(\rho \propto (1, 2, 3, \dots, 10)\) and n_heads = 2 and n_observations = 10.
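
A sketch of one possible posterior, using the Binomial pmf for the likelihood (again assuming coin type \(k\) shows heads with probability \(1 - k/K\)):

```r
# Sketch: posterior over coin types given n_heads out of n_observations.
posterior <- function(rho, n_heads, n_observations) {
  K       <- length(rho) - 1
  p_heads <- 1 - (0:K) / K                             # P(heads | coin type k)
  lik     <- dbinom(n_heads, n_observations, p_heads)  # Binomial likelihood
  unnorm  <- rho * lik
  unnorm / sum(unnorm)
}

# Part 3 check: n_heads = n_observations = 4 should recover Q.3.
posterior(c(0.01, 0.98, 0.01), 4, 4)
# Part 4:
posterior(1:10, 2, 10)
```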