Exercise: probabilistic ranking

Original by Carl Rasmussen and Manon Kok for CUED course 4f13. This version adapted by Damon Wischik.

In this assignment, you'll use the (binary) outcomes of 1801 games from the 2011 ATP men's tennis singles season, played among 107 players, to compute probabilistic rankings of those players' skills.

What to submit. Your answers should contain an explanation of what you do, and 2–4 central commands to achieve it. The focus of your answer should be interpretation: explain what the numerical values and graphs you produce mean, and why they are as they are. The text of your answer to each question should be no more than a paragraph or two. Marks will be awarded based on the clarity and insight in your explanations.

DO NOT SUBMIT FULL SOURCE CODE, unless it is as an appendix. Do not repeat the question text in your answers. If you submit your answers as a Jupyter notebook, structure the notebook in two sections: a section at the top for the examiner to read with just your answers and trimmed code snippets, and a section at the bottom with all your working code.

Data import

The match data is provided as https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/tennis_data.mat. It contains a vector $W$ of length 107 whose $i$th entry is the name of player $i$; and an array $G$ of dimension $1801\times2$, one row per game, where the first column contains the identity of the player who won and the second column contains the identity of the player who lost. Note that this convention means that the variable $y_g$ (the game outcome) in the lecture notes is always $+1$, and can consequently be ignored. Some rows appear more than once (corresponding to two players having played each other several times with the same outcome).
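A minimal sketch of loading the file with scipy.io.loadmat. The field names 'W' and 'G', the cell-array unwrapping, and the shift from MATLAB's 1-based indices to Python's 0-based indices are assumptions based on the description above, not confirmed against the file itself:

```python
import numpy as np
import scipy.io

def load_tennis(path='tennis_data.mat'):
    # Assumed layout: 'W' a cell array of player names, 'G' an integer
    # array of (winner, loser) pairs using 1-based MATLAB indices.
    data = scipy.io.loadmat(path)
    W = np.array([cell[0] for cell in data['W'].ravel()])  # player names
    G = data['G'].astype(int) - 1  # convert to 0-based indices
    return W, G
```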

The appendix at the bottom of this document shows a crude ranking of the players, based on the fraction of their games that they win.

Question (a)

Complete the following code for Gibbs sampling, by adding the lines required to sample from the conditional distributions needed for Gibbs sampling for the ranking model discussed in lectures. Run the Gibbs sampler, e.g. for 1100 iterations. Plot some of the sampled player skills as a function of the Gibbs iteration (see the appendix for some tips on multiple plotting).

Does it look as though the Gibbs sampler is able to move around the posterior distribution? What are the burn-in and mixing times for the Gibbs sampler? It may be helpful to look at autocorrelation plots; see the appendix.
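The skeleton code referred to above is not reproduced here. Purely as orientation, here is one way the two conditional updates could look, assuming the lecture model in which skills are $w \sim N(0, pv\,I)$ and each game has a performance difference $t_g \sim N(w_{\text{winner}} - w_{\text{loser}}, 1)$ observed to be positive. The prior variance `pv`, the function name, and all implementation details are illustrative assumptions, not the course's reference code:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.stats import truncnorm

def gibbs_rank(G, M, num_iters=1100, pv=0.5, rng=None):
    # G: (num_games, 2) array of (winner, loser) indices; M: number of players.
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(M)
    samples = np.zeros((num_iters, M))
    # The precision of p(w | t) does not depend on t, so build it once:
    # prior precision I/pv plus one rank-1 term per game.
    Prec = np.eye(M) / pv
    for i, j in G:
        Prec[i, i] += 1; Prec[j, j] += 1
        Prec[i, j] -= 1; Prec[j, i] -= 1
    C = cho_factor(Prec)
    L = np.linalg.cholesky(Prec)
    for it in range(num_iters):
        # 1. Sample t_g | w: a unit-variance normal truncated to (0, inf).
        m = w[G[:, 0]] - w[G[:, 1]]
        t = truncnorm.rvs(-m, np.inf, loc=m, scale=1, random_state=rng)
        # 2. Sample w | t: Gaussian with precision Prec and mean Prec^{-1} r,
        # where r accumulates +t_g for each winner and -t_g for each loser.
        r = np.zeros(M)
        np.add.at(r, G[:, 0], t)
        np.add.at(r, G[:, 1], -t)
        mu = cho_solve(C, r)
        w = mu + np.linalg.solve(L.T, rng.standard_normal(M))
        samples[it] = w
    return samples
```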

Question (b)

Do inference in the model instead, by running message passing and expectation propagation, using the code below. For the same players you plotted in part (a), show the mean and standard deviation of their skill levels as a function of the iteration. How long does it take to converge?

(See the Python programming note in the appendix, about the yield statement and how to use gaussian_ep.)

Question (c)

Explain the concept of convergence for both the Gibbs sampler and the message passing algorithms. What type of object are we converging to in the two cases, and how do you judge convergence? For each of the players you plotted in (a), find the mean and standard deviation of their skill as found by the converged algorithm in (b), and annotate the plot from part (a) to show those values. Explain what you see.

Question (d)

Who is better out of Rafael Nadal (W[0]) and Novak Djokovic (W[15])? There are several ways we could answer this.

Calculate all these six probabilities. Explain how they relate to each other. Which answer is best? Why? Support your explanation with appropriate plots.

Question (e)

Jarrett (Biometrika, 1979) compiled a dataset of the dates of coal mining explosions, from the Colliery Year Book and Coal Trades Directory. An initial plot suggests that there was a step change in the frequency of explosions, some time between 1880 and 1900. Consider the model $$ \text{num.explosions in year }i \sim \operatorname{Poisson}(\lambda 1_{i\leq\theta} + \mu 1_{i>\theta}) $$ with prior distributions $$ \lambda \sim \operatorname{Gamma}(a,b)\,,\quad \mu \sim \operatorname{Gamma}(a,b)\,,\quad \theta \sim \operatorname{Uniform}[1851,1852,\dots,1961] $$ Use Gibbs sampling to estimate $\theta$. It's up to you to pick appropriate $a$ and $b$.

This question asks you to invent a Gibbs sampler from scratch. This model is much simpler than the ranking model studied in lectures. In your answer, you should explain how your Gibbs sampler works, but you shouldn't spend time worrying about burn-in and mixing time.

Appendix

Data import

Here is a crude ranking of the players, based on a simple ratio $$ \frac{\text{number of wins for player }i}{\text{number of games played by player }i}. $$
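This ratio can be computed in a few lines. A sketch (variable names follow the data description above; np.bincount with minlength handles players who never won):

```python
import numpy as np

def crude_ranking(W, G):
    # wins / games-played for each player, sorted best first
    M = len(W)
    wins = np.bincount(G[:, 0], minlength=M)
    played = wins + np.bincount(G[:, 1], minlength=M)
    ratio = wins / np.maximum(played, 1)  # avoid 0/0 for players with no games
    order = np.argsort(-ratio)
    return [(W[i], ratio[i]) for i in order]
```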

Question (a)

Some notes about the code in Question (a), and how it uses normally distributed random variables:

To plot many panels in a single figure, use matplotlib's plt.subplots, which returns a figure and an array of axes.
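For example, a minimal matplotlib sketch (illustrative only, not the original course snippet; the random-walk chains here are just stand-ins for sampled skills):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
chains = rng.standard_normal((4, 1100)).cumsum(axis=1)  # stand-in traces

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True)
for k, (ax, chain) in enumerate(zip(axes.ravel(), chains)):
    ax.plot(chain, linewidth=0.8)
    ax.set_title(f'player {k}')
fig.tight_layout()
fig.savefig('skills.png')
```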

To plot autocorrelations, use matplotlib's built-in Axes.acorr, or compute the autocorrelation directly with numpy.
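For example, using Axes.acorr (a sketch; the mean-subtraction and the maxlags choice are just reasonable defaults, and the white-noise chain is a stand-in for a sampled skill trace):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
chain = rng.standard_normal(1100)  # stand-in for one player's sampled skills

fig, ax = plt.subplots()
# acorr does not subtract the mean by default, so centre the chain first
lags, acf, _, _ = ax.acorr(chain - chain.mean(), maxlags=50, normed=True)
ax.set_xlabel('lag')
ax.set_ylabel('autocorrelation')
fig.savefig('acf.png')
```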

Question (b)

The gaussian_ep function includes a yield statement, which makes it a generator (a lazy iterator), so you can use it with e.g.

g = gaussian_ep(G, M)
for _ in range(10):
    print(next(g))
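If generators are unfamiliar: a function containing yield returns a lazy iterator, producing one value each time next() is called. A minimal example in the same spirit (illustrative only; nothing here comes from gaussian_ep itself):

```python
def running_mean(xs):
    # yields the mean of the first n items, one value per step,
    # the way gaussian_ep yields one state per EP sweep
    total = 0.0
    for n, x in enumerate(xs, start=1):
        total += x
        yield total / n

g = running_mean([1.0, 2.0, 3.0])
print(next(g))  # 1.0
print(next(g))  # 1.5
```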

Question (d)

To evaluate $\mathbb{P}(\operatorname{Normal}(\mu,\sigma^2)>x)$, you can use the appropriate scientific function directly,

scipy.stats.norm.sf(x, loc=μ, scale=σ)

or you can compute it using Monte Carlo integration,

np.mean(np.random.normal(size=10000, loc=μ, scale=σ) > x)

For this simple calculation there's no point in Monte Carlo integration, but the technique is useful in situations where there is no simple built-in scientific function.