The main Python library for numerical work is NumPy. It’s vital to be familiar with it for almost any work in machine learning — the standard machine learning libraries like PyTorch need high-performance numerical computation, and their interfaces are designed to mimic NumPy’s.
from IPython.display import YouTubeVideo
YouTubeVideo('HbbFrM2JhXc', width=560, height=315)
At the top of almost every piece of scientific computing work, we’ll import these standard modules. We’ll also give them short aliases so we can write e.g. np.foo rather than numpy.foo.
import numpy as np
import matplotlib.pyplot as plt
If you get an error message like
----> 1 import numpy as np
ModuleNotFoundError: No module named 'numpy'
it means you need to install the package. Install it from within Jupyter by running
!pip install numpy
The relevant packages have been preinstalled on hub.cl.cam.ac.uk, but on other platforms you may need to install packages yourself. If you're running on your own machine, you only need to do this once. If you're running on Google Colab, for example, you may need to do it over and over again, whenever the system allocates you a fresh server.
We saw Python lists in section 1. We can of course store numbers in Python lists: a list to store a vector, a list of lists to store a matrix, and so on. For scientific computing, this is not a good idea.
Python lists are flexible: they can store mixed data types e.g. integers mixed with strings and sublists and even functions. But this flexibility comes with a performance hit. For scientific computing, where we don’t need the flexibility, it’s better to use specialised classes for numeric vectors and arrays. When the machine knows what datatypes to expect, it can compute faster.
Mathematicians have lovely concise notation for vector algebra and matrices. By using specialised classes, we can use similar syntax in our code. This will be more concise and easier to debug than if we used regular Python lists for everything.
The core skill is vectorized thinking, which means writing our code in terms of functions that operate on entire vectors (or arrays) at once.
x = np.array([1,2,5,3,2]) # [1, 2, 5, 3, 2]
y = np.ones(5) # [1.,1.,1.,1.,1.]
z = np.arange(5) # [0, 1, 2, 3, 4]
# We can do maths on vectors, applying the operations elementwise
# We can mix vectors and scalars
(x + 1) * 2 # [4, 6, 12, 8, 6]
(x + y) * z # [0., 3.,12.,12.,12.]
x >= 3 # [False, False, True, True, False]
np.where(x>=3, x, 3) # [3, 3, 5, 3, 3]
# Python slicing and indexing notation works
x[:3] # [1, 2, 5]
x[x>=3] = -10 # x is now [1, 2, -10, -10, 2]
All the elements of a numpy vector have to be the same type: integer, floating point, boolean, etc. To see the type of a vector, use x.dtype. When you create an array, you can specify the dtype. To cast to another type, use x.astype(int).
(x>=2).dtype # bool
np.array([1,2], dtype=float) # [1.0, 2.0]
(x>=2).astype(int) # [0, 1, 0, 0, 1]
To be good at writing vectorized code, we need to know what sorts of calculations we can do on vectors. Here are some useful routines; see the NumPy reference documentation for the full list.
To create vectors:
- np.array([1,2,3]) creates a numpy vector from a Python list
- np.zeros(n), np.ones(n), np.full(n, fill_value)
- np.ones_like(a) creates a vector of ones with the same shape as a
- np.arange(n) creates $[0,1,\dots,n-1]$
- np.linspace(start,stop,n) creates n evenly-spaced points between start and stop inclusive, very useful for plotting
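For instance, a few of these creation routines in action (a quick sketch):

```python
import numpy as np

np.full(3, 7.5)              # array([7.5, 7.5, 7.5])
np.ones_like(np.arange(4))   # array([1, 1, 1, 1]), same shape and dtype as its argument
np.linspace(0, 1, 5)         # array([0.  , 0.25, 0.5 , 0.75, 1.  ])
```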
For maths and statistics:
- np.sin, np.exp, np.floor, …
- x @ y gives the dot product, and np.linalg.norm(x) is the norm
- np.sum, np.mean, and np.prod; also np.cumsum(x), which gives $[x_0, x_0 + x_1, x_0 + x_1 + x_2, \dots ]$
- np.min(x) and np.max(x) for the overall min and max; np.minimum(x,y) for the element-wise minimum

For random numbers, it's good practice to set the seed explicitly, so that your code is reproducible. That makes it much easier to debug. Call rng = np.random.default_rng(x) for some arbitrary integer seed x, and then use it to generate random values of various types:
- rng.random(n) for $n$ floating point values in the range $[0,1]$
- rng.choice(a,n) to pick $n$ elements at random from the list $a$

Worked example. Here's an illustration: computing the correlation coefficient between two vectors $x$ and $y$,
$$
\rho = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i(x_i-\bar{x})^2} \sqrt{\sum_i (y_i-\bar{y})^2}}
$$
where $x$ and $y$ have the same length $N$, and
$$
\bar{x} = \frac{1}{N} \sum_i x_i, \qquad
\bar{y} = \frac{1}{N} \sum_i y_i.
$$
Here are two implementations to compute $\rho$: one written in Python style, one in scientific computing style. The latter is roughly 15 times faster. (The magic command %%time at the start of a cell makes the notebook print out the execution time.) NumPy also provides a built-in function, np.corrcoef, which computes this directly.
# Set up some parameters
N = 10000000
rand_seed = 161803
# Assume numpy is already imported
# Also load the libraries for pure Python code
import math, random
%%time
# Python-style code
random.seed(rand_seed)
# Create two lists of random numbers, xs and ys
xs = [random.random() for i in range(N)]
ys = [xs[i] + random.random() for i in range(N)]
# Compute the various terms involved in the formula
xbar = sum(xs) / N # sum(list) is built into Python
ybar = sum(ys) / N
sxy = sum([(x-xbar)*(y-ybar) for x,y in zip(xs,ys)])
sxx = sum([(x-xbar)**2 for x in xs])
syy = sum([(y-ybar)**2 for y in ys])
ρ = sxy / math.sqrt(sxx) / math.sqrt(syy)
print(ρ)
%%time
# Vectorized code
rng = np.random.default_rng(rand_seed)
# Create two random vectors x and y
x = rng.random(N)
y = x + rng.random(N)
# Compute the terms in the formula. Note: @ means "dot product"
xbar = np.sum(x) / N
ybar = np.sum(y) / N
ρ = ((x-xbar) @ (y-ybar)) / np.sqrt(np.sum((x-xbar)**2) * np.sum((y-ybar)**2))
print(ρ)
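As a sanity check, the same quantity can be computed with NumPy's built-in np.corrcoef, which returns the 2×2 correlation matrix. A quick sketch, using a smaller N so it runs fast:

```python
import numpy as np

rng = np.random.default_rng(161803)
x = rng.random(100000)
y = x + rng.random(100000)

# Correlation via the formula
xbar, ybar = np.mean(x), np.mean(y)
ρ = ((x-xbar) @ (y-ybar)) / np.sqrt(np.sum((x-xbar)**2) * np.sum((y-ybar)**2))

# ρ is the off-diagonal entry of the correlation matrix
assert np.isclose(ρ, np.corrcoef(x, y)[0, 1])
```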
Vectorized thinking isn’t just for mathematical formulae — all sorts of programming constructs can also be vectorized. In general, whenever you find yourself writing a for loop or a Python list comprehension, stop and see if you can vectorize your code. You’ll usually end up with something more flexible for scientific computing.
Functions for indexing:
- x[:10] or x[10:] or x[:-3]
- np.where(b) gives a vector of indexes at which the boolean vector $b$ is True
- x[y>5] picks out the elements of x where y>5; equivalently, i = np.where(y>5); x[i]
- x[x<3] = x[x<3] + 10 assigns to selected elements
- np.concatenate([v1,v2]) concatenates two or more vectors

General programming functions:
- len(x) gives the length of a vector
- np.any, np.all, and other logic functions
- np.unique returns the unique elements of a vector
- np.isin(x,vals) returns a vector, the same shape as x, recording whether or not each element of x is in vals
- ~x is logical negation, the equivalent of Python's not x; also (x & y) and (x | y) both work
- np.count_nonzero(x) counts the number of entries where x is True or non-zero
- x.sort() sorts a vector in-place, and np.sort(x) creates a new vector which is a sorted version of x
- np.argsort(x) gives the vector of indexes that would put x in order, i.e. it produces an integer vector i such that x[i] is sorted; also see other sorting functions
- np.argmax and other search functions
- np.where(cond,x,y) is the vectorized version of Python's x if cond else y
- np.char.add(s1,s2), np.char.lower(s), and other string functions
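A few of these functions in action (a quick sketch):

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5])
np.unique(x)             # array([1, 3, 4, 5])
np.isin(x, [1, 5])       # array([False,  True, False,  True,  True])
np.count_nonzero(x > 2)  # 3
i = np.argsort(x)        # x[i] is array([1, 1, 3, 4, 5])
```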
Cheating: g = np.vectorize(f) is a vectorized version of an arbitrary Python function f, so you can call g(x) on a vector x. It gives you the nice vectorized syntax, but under the hood it's really a slow for loop, so avoid it if you can.

Worked example. Suppose we want to sort a vector of strings by length, breaking ties alphabetically. Two useful functions here: np.char.str_len gives the length of each string, and np.lexsort([y,x]) returns sorting indexes like np.argsort(x), but breaks ties in x by another vector y. This is called lexicographical sorting.

names = np.array(['alexis','chloe','guarav','shay','adrian','rebecca'])
lengths = np.char.str_len(names)
i = np.lexsort([names, lengths])
names[i]
NumPy supports matrices and higher-dimensional arrays. To enter a 2d array (i.e. a matrix) like $$ a = \begin{bmatrix} 2.2 & 3.7 & 9.1\\ -4 & 3.1 & 1.3 \end{bmatrix} $$ we type in
a = np.array([[2.2, 3.7, 9.1], [-4, 3.1, 1.3]])
Use a.shape
to find the dimensions of an array. In fact, vectors are nothing other than one-dimensional
arrays, and their shape is a tuple of length 1. NumPy doesn’t have any concept of ‘row vector’ versus
‘column vector’ — it’s only 2d arrays that can be described in that way.
a.shape # (2,3) — a matrix
np.array([5,6,4]).shape # (3,) — a vector
np.array([[5,6,4]]).shape # (1,3) — a row
np.array([[5],[6],[4]]).shape # (3,1) — a column
Most of the standard vectorized operations can be applied to arrays, though now you have a choice about which dimension to apply them to.
np.sum(a, axis=0) # vector of column sums
np.sum(a, axis=1) # vector of row sums
np.sum(a) # scalar with total sum
To refer to a subarray, we can use an extended version of Python’s slice notation.
a[:, :2] # all rows, first two columns
a[1, :] # second row (indexes start at 0), all columns
a[1] # another way to fetch the second row
a[:2, :2] = [[1,2],[3,4]] # assign to a submatrix
There are two ways to refer to an arbitrary set of elements inside the array, both known as advanced indexing.
a = np.zeros((3,3), dtype=int)
# Boolean advanced indexing
i = np.array([[False,True,False],[True,False,False],[False,False,True]])
a[i] = [6,7,8]
# Integer advanced indexing
ix,iy = [0,1,2], [1,0,2]
a[ix,iy] = [6,7,8]
# Both produce the same result.
array([[0, 6, 0],
       [7, 0, 0],
       [0, 0, 8]])
For 1d vectors the only reshaping operations are slicing and concatenating, but for higher dimensional arrays there is a whole variety of reshaping functions such as stacking, tiling, transposing, etc. The most useful operation is adding a new dimension, for example to turn a one-dimensional vector into a column vector. The second most useful is stacking vectors to form an array.
x = np.array([1,2,3])
x[:, None] # can also be written x[:, np.newaxis], or x[..., None]
array([[1],
       [2],
       [3]])
np.column_stack([[1,2], [3,4], [5,6]])
array([[1, 3, 5],
       [2, 4, 6]])
NumPy also has a powerful tool called broadcasting which generalizes ‘add a scalar to a vector’, and which is used a lot in more advanced array-manipulating code. It’s more advanced than we need for this course, but it’s used a lot in machine learning and it’s worth reading about. Here’s a simple example, normalizing a matrix so the columns sum to 1.
a = np.array([[3,2,8],[2,6,2]])
colsums = np.sum(a, axis=0)
a / colsums
array([[0.6 , 0.25, 0.8 ],
       [0.4 , 0.75, 0.2 ]])
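Broadcasting combines nicely with the newaxis trick from above. To normalize rows instead of columns, divide by a column of row sums; a quick sketch:

```python
import numpy as np

a = np.array([[3, 2, 8], [2, 6, 2]])
rowsums = np.sum(a, axis=1)   # shape (2,)
b = a / rowsums[:, None]      # rowsums[:, None] has shape (2,1), so it broadcasts across columns
# each row of b now sums to 1
```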
In Easter term you will study linear algebra in the Maths for Natural Sciences course. If you want to try out the maths, you’ll find relevant functions in np.linalg and np.dual. In particular, see np.linalg.solve for solving systems of linear equations.
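For instance, np.linalg.solve finds $x$ such that $Ax = b$. A minimal sketch, with a made-up system for illustration:

```python
import numpy as np

# Solve the system  2x + 3y = 8,  x - y = 1
A = np.array([[2.0, 3.0], [1.0, -1.0]])
b = np.array([8.0, 1.0])
x = np.linalg.solve(A, b)     # x is approximately [2.2, 1.2]
assert np.allclose(A @ x, b)  # check the solution satisfies the equations
```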
Simulation is a mainstay of scientific computing. A common style with numpy is to predefine a vector or array to store the results, one row per timestep, and then iterate over timesteps, gradually filling in the array. (This is the one case where for loops are appropriate.) Here’s an example, a differential equation simulation. A standard model for an epidemic is the SIR model, which specifies how the number of Susceptible, Infectious, and Recovered people evolves over time. For a total population of size $N$, $$ \begin{aligned} \frac{dS_t}{dt} &= -r_0 \gamma \frac{I_t S_t}{N}\\ \frac{dI_t}{dt} &= r_0 \gamma \frac{I_t S_t}{N} - \gamma I_t\\ \frac{dR_t}{dt} &= \gamma I_t, \end{aligned} $$ where $1/\gamma$ is the time it takes for an infectious person to recover. (These differential equations are for a "friendly" epidemic which is non-lethal and in which you can't get reinfected. Can you see how to modify them to include deaths and reinfections?)
We might simulate this as follows.
r0 = 1.5 # reproduction number
γ = 1/10 # recovery time is 10 days
T = 365 # total duration to simulate
dt = 3 # time step granularity of simulation
x0 = [99990.0, 10.0, 0.0] # initial state
# initialization
num_iterations = np.ceil(T/dt).astype(int)
res = np.zeros((num_iterations, 4)) # matrix to store t,S,I,R
res[0,1:] = x0
res[:,0] = np.arange(num_iterations) * dt
# loop
for i in range(1, num_iterations):
    S,I,R = res[i-1, 1:]
    N = S+I+R
    d = np.array([-r0*γ*I*S/N, r0*γ*I*S/N-γ*I, γ*I])
    res[i, 1:] = res[i-1, 1:] + d*dt
# plot the output (see section 3 for more about plotting)
fig,ax = plt.subplots()
for col,lbl in zip([1,2,3], ['S','I','R']):
    ax.plot(res[:,0], res[:,col], label=lbl)
plt.legend()
plt.show()
This simulation is simple and naive. What we have coded is called a discrete-time simulation, because time advances in fixed increments. In IA Algorithms you will study the ‘heap’ data structure, which is useful for simulations in which time is pegged to changes in state, called event-driven simulation.