Numerical computation with numpy

The main Python library for numerical work is NumPy. It’s vital to be familiar with it for almost any work in machine learning — the standard machine learning libraries like PyTorch need high performance numerical computation, and their libraries are designed to mimic NumPy.

Contents

1. Preamble

At the top of almost every piece of scientific computing work, we’ll import these standard modules. We’ll also give them short aliases so we can write e.g. np.foo rather than numpy.foo.

If you get an error message like

----> 1 import numpy as np
ModuleNotFoundError: No module named 'numpy'
it means you need to install the package. Install it from within Jupyter by running

!pip install numpy

The relevant packages have been preinstalled on hub.cl.cam.ac.uk, but on other platforms you may need to install packages yourself. If you're running on your own machine, you only need to do this once. If you're running on Google Colab, for example, you may need to do it over and over again, whenever the system allocates you a fresh server.

2. Vectorized thinking

We saw Python lists in section 1. We can of course store numbers in Python lists: a list to store a vector, a list of lists to store a matrix, and so on. For scientific computing, this is not a good idea.

The core skill is vectorized thinking, which means writing our code in terms of functions that operate on entire vectors (or arrays) at once.

2.1 A FIRST SESSION

All the elements of a numpy vector have to be the same type. It can be integer, or floating point, or boolean. To see the type of a vector, x.dtype. When you create an array, you can specify the dtype. To cast to another type, x.astype(int).

2.2 VECTORIZED LIBRARY ROUTINES

To be good at writing vectorized code, we need to know what sorts of calculations we can do on vectors. Here are some useful routines, and a full reference.

To create vectors,

For maths and statistics,

For random numbers, it's good practice to set the seed explicitly, so that your code is reproducible. That makes it much easier to debug. Call rng = np.random.default_rng(x) for some arbitrary integer seed x, and then use it to generate random values of various types:

Worked example. Here's an illustration: computing the correlation coefficient between two vectors $x$ and $y$, $$ \rho = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i(x_i-\bar{x})^2} \sqrt{\sum_i (y_i-\bar{y})^2}} $$ where $x$ and $y$ have the same length $N$, and $$ \bar{x} = \frac{1}{N} \sum_i x_i, \qquad \bar{y} = \frac{1}{N} \sum_i y_y. $$ Here are two implementations, one written in Python-style, one written in scientific computing style, to compute $\rho$. The latter is roughly 15 times faster. (The magic command %%time at the start of a cell makes the notebook print out the execution time.)

Although, if we really knew our way round numpy, we'd just use np.corrcoef

2.3 FOR LOOPS CONSIDERED HARMFUL

Vectorized thinking isn’t just for mathematical formulae — there are all sorts of programming constructs that can be vectorized also. In general, whenever you find yourself writing a for loop or a Python list comprehension, stop and see if you can vectorize your code. You’ll usually end up with something more flexible for scientific computing.

Functions for indexing:

General programming functions:

Cheating

Worked example. Suppose we want to sort a vector of strings by length, breaking ties alphabetically.

  1. Get a vector with the length of each string. Digging around the documentation for string functions, we find np.char.str_len.
  2. Work out how to put lengths in order, breaking ties alphabetically by names. Digging around the documentation for sorting, we find np.lexsort([y,x]), which returns sorting indexes like $\textsf{np.argsort(x)}$, but breaks ties in $\textsf{x}$ by another vector $\textsf{y}$. This is called lexicographical sorting.
  3. Pick out the names in the order specified by these indexes.

3. Arrays

NumPy supports matrices and higher-dimensional arrays. To enter a 2d array (i.e. a matrix) like $$ a = \begin{bmatrix} 2.2 & 3.7 & 9.1\\ −4 & 3.1 & 1.3 \end{bmatrix} $$ we type in

Use a.shape to find the dimensions of an array. In fact, vectors are nothing other than one-dimensional arrays, and their shape is a tuple of length 1. NumPy doesn’t have any concept of ‘row vector’ versus ‘column vector’ — it’s only 2d arrays that can be described in that way.

3.1 VECTORIZED OPERATIONS ON ARRAYS

Most of the standard vectorized operations can be applied to arrays, though now you have a choice about which dimension to apply them to.

3.2 ARRAY SLICING AND INDEXING

To refer to a subarray, we can use an extended version of Python’s slice notation.

There are two ways to refer to an arbitrary set of elements inside the array, both known as advanced indexing.

array([[0, 6, 0],
       [7, 0, 0],
       [0, 0, 8]])

For 1d vectors the only reshaping operations are slicing and concatenating, but for higher dimensional arrays there is a whole variety of reshaping functions such as stacking, tiling, transposing, etc. The most useful operation is adding a new dimension, for example to turn a one-dimensional vector into a column vector. The second most useful is stacking vectors to form an array.

array([[1],
       [2],
       [3]])
array([[1, 3, 5],
       [2, 4, 6]])

NumPy also has a powerful tool called broadcasting which generalizes ‘add a scalar to a vector’, and which is used a lot in more advanced array-manipulating code. It’s more advanced than we need for this course, but it’s used a lot in machine learning and it’s worth reading about. Here’s a simple example, normalizing a matrix so the columns sum to 1.

array([[0.6 , 0.25, 0.8 ],
       [0.4 , 0.75, 0.2 ]])

3.3 LINEAR ALGEBRA

In Easter term you will study linear algebra in the Maths for Natural Sciences course. If you want to try out the maths, you’ll find relevant functions in np.linalg and np.dual. In particular, see np.linalg.solve for solving systems of linear equations.

4. Simulation

Simulation is a mainstay of scientific computing. A common style with numpy is to predefine an vector or array to store the results, one row per timestep, and then iterate over timesteps gradually filling in the array. (This is the one case where for loops are appropriate.) Here’s an example, a differential equation simulation. A standard model for an epidemic is the SIR model, which specifies how the number of Susceptible, Infectious, and Recovered people evolves over time. For a total population of size $N$, $$ \begin{aligned} \frac{dS_t}{dt} &= -r_0 \gamma \frac{I_t S_t}{N}\\ \frac{dI_t}{dt} &= r_0 \gamma \frac{I_t S_t}{N} - \gamma I_t\\ \frac{dR_t}{dt} &= \gamma I_t, \end{aligned} $$ where $1/\gamma$ is the time it takes for an infections person to recover. (These differential equations are for a "friendly" epidemic which is non-lethal and in which you can't get reinfected. Can you see how to modify them to include deaths and reinfections?)

We might simulate this as follows.

This simulation is simple and naive.

What we have coded is called a discrete-time simulation, because time advances in fixed increments. In IA Algorithms you will study the ‘heap’ data structure, which is good simulation in which time is pegged to changes in state, called event-driven simulation.