The main Python library for numerical work is NumPy. It's useful in itself for fast and concise number-crunching. It's also useful to learn it as a basis for further work in machine learning, since deep learning libraries such as PyTorch are all patterned after NumPy.

Working with vectors

Handling vectors

Numpy lets us use maths-style notation for vector algebra and matrices. This is more concise and easier to debug than using Python-style ‘for’ loops and list comprehensions (and it usually runs faster too).

Modify this code to use numpy. First import the library:

import numpy as np

Next, replace the x and y lists with numpy vectors:

x = np.array([1, 3, 2.1, 5])
y = np.array([1, 1, 0, 1])

Finally, replace the list comprehension with simple vector algebra:

x + 2*y

x = [1, 3, 2.1, 5]
y = [1, 1, 0, 1]

[xi + 2*yi for xi,yi in zip(x,y)]

import numpy as np

x = np.array([1, 3, 2.1, 5])
y = np.array([1, 1, 0, 1])
x + 2*y

Creating vectors

How do we create vectors in the first place? We've just seen one way to create a vector, from a Python list:

np.array(list)

We can also create integer ranges (like Python's range):

np.arange(n)

or evenly-spaced floating point ranges, useful for plotting:

np.linspace(start,stop,n)

as well as constant vectors:

np.zeros(n)
np.ones(n)
np.full(n, fill_value)
np.ones_like(vec) # same size as vec

and vectors of random numbers:

np.random.uniform(lo, hi, n)

Create the integer vector [5,7,9,11,13,15] using np.arange. What's the difference between this and np.linspace(5,15,6)?

import numpy as np

import numpy as np

# create a vector of integers
5 + 2*np.arange(6)

# note: np.linspace creates a vector of floating point values, not integers

Types

All the elements of a numpy vector must be the same type.

This is less flexible than Python's “store anything you like” lists, but it's faster—when the machine knows what datatypes to expect, it can compute the answer more efficiently.

vec.dtype          # get the datatype
vec.astype(type)   # coerce to a new datatype
np.array(list, dtype=type)

The type can be one of Python's basic types

int, float, bool, str

or it can be one of numpy's more precise types, which may be useful for speed

np.float32, np.float64, np.int32, np.uint8, etc.

Create the integer vector [5,7,9,11,13,15] using np.linspace.

import numpy as np

import numpy as np

np.linspace(5,15,6).astype(int)

Vectors as lists

Some list-style operations:

len(vec)
np.concatenate([vec, vec, …])

In Python, list+list means list concatenation, but in numpy vec+vec means element-wise vector addition.

Create the vector [10,11,12,13,4,5,6,7] using two calls to np.arange.

import numpy as np

import numpy as np
np.concatenate([10 + np.arange(4), 4 + np.arange(4)])

Maths

All the standard maths operators work on vectors (+, -, *, /, %, //, **) and they apply to each element pointwise. Numpy also brings a new operator, @, for computing the dot product.

Numpy also comes with a large library of maths functions that work on vectors. Whenever we can, we should use numpy maths functions rather than Python. It runs faster, and it leads to more concise code.

Don't use exp on each element, use np.exp on the whole vector. Don't use sin on each element, use np.sin on the whole vector. And so on. Writing numpy code feels like the children's game Simon Says!

When we use the standard maths operators, behind the scenes Python actually calls numpy functions. In Python, any class can provide its own implementations for these operators, and the numpy vector class does just this.

Here are some useful functions:

maths functions
np.minimum(x,y) gives the elementwise mininimum, and np.min(x) gives the overall minimum
vector functions such as norm

The code on the right computes the correlation coefficient between two vectors. Rewrite it to use numpy vectorized operations.

import random, math
N=1000

x = [random.random() for i in range(N)]
y = [xi + 0.5 * random.random() for xi in x]

# Compute the correlation coefficient
xbar = sum(x) / N  # sum(list) is built into Python
ybar = sum(y) / N
sxy = sum([(xi-xbar)*(yi-ybar) for xi,yi in zip(x,y)])
sxx = sum([(xi-xbar)**2 for xi in x])
syy = sum([(yi-ybar)**2 for yi in y])
ρ = sxy / math.sqrt(sxx) / math.sqrt(syy)
ρ

import numpy as np
N=1000

x = np.random.uniform(size=N)
y = x + 0.5 * np.random.uniform(size=N)

# Compute the correlation coefficient
xbar = np.mean(x)
ybar = np.mean(y)
ρ = ((x-xbar) @ (y-ybar)) / (np.linalg.norm(x-xbar) * np.linalg.norm(y-ybar))
ρ

# Or the true guru answer: np.corrcoef(x,y)[0,1]

Exercise

The code on the right produces a random value d. Produce a million samples of d using vector algebra, and find the mean and standard deviation.

import math, random
x = random.uniform(-1,1)
y = random.uniform(-1,1)
d = math.sqrt(x**2 + y**2)

import numpy as np
n = 1000000
x = np.random.uniform(-1,1,n)
y = np.random.uniform(-1,1,n)
d = np.sqrt(x**2 + y**2)
(np.mean(d), np.std(d))

Logic

Vectors can be vectors of Booleans, not just vectors of numbers.

The standard logical operators & and | work on Boolean vectors. For not, use ~. There are also

np.all(vec)  # are all values True?
np.any(vec)  # are any values True?

The standard comparison operators (==, !=, >=, <=, >, <) also work on vectors, and return Boolean vectors. And there are additional functions to test for NaN or infinite floating point values, and to test for approximate equality with np.allclose.

Use Boolean vector algebra to find which elements of x are in the range [2,4].

import numpy as np
x = np.array([1, 2, 5, 3, 2])

# Replace with vector algebra
[xi >= 2 and xi <= 4 for xi in x]

import numpy as np
x = np.array([1, 2, 5, 3, 2])

(x >= 2) & (x <= 4)

Strings

Vectors can also be vectors of strings. Numpy has various string operations such as

np.char.add(strvec, strvec)
np.char.str_len(strvec)

Create the vector ["Attempt_1", "Attempt_2", … "Attempt_n"] for n=5.

As the name ‘numpy’ suggests, string operations aren't numpy's forte. I often fall back to Pythonic list comprehensions for more elaborate string processing.

import numpy as np
n=5

import numpy as np
n = 5
np.char.add("Attempt_", (1 + np.arange(n)).astype(str))

Indexing and slicing

The usual Python index and slice notation works on vectors, e.g.

x[i]           # get x[i]
x[i:j]         # get [x[i],…,x[j-1]]
x[i] = newval  # update ith element
x[i] += incval # increment ith element

Create the vector [10,11,12,13,4,5,6,7] using np.arange(n) with n=8.

import numpy as np
n=8

import numpy as np
n=8

x = np.arange(n)
x[:n//2] += 10
x

‘For’ considered harmful

Vectorized programming

There are two reasons for writing code with vectorized syntax rather than ‘for’ loops or list comprehensions:

Vectorized code is often easier to read and write, especially when it's implementing maths formulae.
Vectorized code often runs faster, since it can be executed entirely in fast numpy loops implemented in C, rather than in slow interpreted Python

Therefore, whenever we find ourselves writing a ‘for’ loop or a Python list comprehension, we should stop and see if we can vectorize our code.

Vectorization is often a brain teaser. The next pages will look at some common patterns. In most of them, the vectorized code is both more concise and faster than the Python equivalent. In the last few patterns, the vectorized code is trickier to understand — but that's the price we pay for speed.

x = [199, 200, 208, 210, 200, 207, 240, 269, 260, 263]

# Count the number of times the value increases
incs = 0
for i in range(1, len(x)):
    if x[i] > x[i-1]:
        incs += 1
incs

Counting

The code on the right uses a ‘for’ loop to count the number of times a sequence increases.

Describe what this line produces:

x[1:] >= x[:-1]

There's a nifty trick with Booleans which is handy for counting. If we apply a numerical function to a boolean vector, it will typecast it: False ↦ 0, True ↦ 1.

Using np.sum, count the number of times the value increases.

import numpy as np
x = np.array([199, 200, 208, 210, 200, 207, 240, 269, 260, 263])

# Count the number of times the value increases
incs = 0
for i in range(1, len(x)):
    if x[i] > x[i-1]:
        incs += 1
incs

import numpy as np
x = np.array([199, 200, 208, 210, 200, 207, 240, 269, 260, 263])

# (x[1:] >= x[:-1]) indicates whether each item is >= the preceding
np.sum(x[1:] >= x[:-1])

If-then expressions

One of the most useful pieces of vectorized logic is the numpy version of Python's x if cond else y. In numpy, x and y and cond are allowed to be vectors, and we write

np.where(cond, x, y)

The code on the right uses list comprehensions to truncate the elements of f to lie between -0.9 and +0.9. Replace the list comprehensions with vectorized operations.

This code includes a simple plot. In a later part of this course we'll learn much more about plotting.

import numpy as np
x = np.linspace(0,2*np.pi,200)[:-1]
f = np.sin(x)

# Replace these list comprehensions
θ = 0.9
f = [-θ if fi <= -θ else fi for fi in f]
f = [θ if fi >= θ else fi for fi in f]

import matplotlib.pyplot as plt
plt.plot(x, f)

import numpy as np
x = np.linspace(0,2*np.pi,200)[:-1]
f = np.sin(x)

θ = 0.9
f = np.where(f <= -θ, -θ, np.where(f >= θ, θ, f))
# Alternatively, f = np.maximum(-θ, np.minimum(θ, f))

import matplotlib.pyplot as plt
plt.plot(x, f)

Filtering (boolean indexing)

As well as the slice notation for picking out chunks of a vector, numpy also supports boolean indexing.

We can use this to get a new vector that's a filtered version of the original, or to modify a vector by updating a subset of its values.

x[boolvec]           # filtered vector
x[boolvec] = newvals  # update subset

Replace the ‘for’ loop on the right with its vectorized equivalent.

What's the difference between this exercise and the previous exercise using np.where? This exercise is imperative-style: we're modifying the f vector object in-place. The previous exercise was functional-style: we created a new vector object and gave it the name f.

import numpy as np
x = np.linspace(0,2*np.pi,200)[:-1]
f = np.sin(x)

# Replace this 'for' loop
θ = 0.9
for i in range(len(f)):
    if f[i] < -θ: f[i] = -θ
    if f[i] > θ: f[i] = θ

import matplotlib.pyplot as plt
plt.plot(x, f)

import numpy as np
x = np.linspace(0,2*np.pi,200)[:-1]
f = np.sin(x)

θ = 0.9
f[f < -θ] = -θ
f[f > θ] = θ

import matplotlib.pyplot as plt
plt.plot(x, f)

Sorting (integer indexing)

Numpy has a third form of indexing, integer indexing.

x[intvec]           # filtered vector
x[intvec] = newvals  # update subset

To find the indexes where a condition is true,

np.where(cond)[0]

Sorting and un-sorting

As with Python, we can sort a vector in-place with x.sort(), or we can create a new sorted vector leaving the original unchanged with np.sort(x).

But there's another way to sort: by finding the integer indexes i such that x[i] is sorted.

i = np.argsort(x)
x[i]

(There's also np.lexsort if we want to specify how to break ties.)

Indexed-based sorting is useful when we want to undo the sort, because of the nifty property that x[i][np.argsort(i)] == x.

The code on the right uses Python list comprehensions to compute the ranks of items in a list. Replace it with vectorized code, using integer-based sorting.

import numpy as np

# Replace this Python code by a vectorized form
def rank(x):
    original_order = range(len(x))
    xsorted = sorted(zip(x, original_order))
    ranks = [(oi,ri) for ri,(_,oi) in enumerate(xsorted)]
    ranks = sorted(ranks)  # sort by original_order
    return [ri for _,ri in ranks]

x = np.array(['cheddar','wensleydale', 'stilton', 'crowdie'])
[f"{c}={r}" for c,r in zip(x, rank(x))]

import numpy as np

def rank(x):
    i = np.argsort(x)
    r = np.arange(len(x))
    return r[np.argsort(i)]

x = np.array(['cheddar','wensleydale', 'stilton', 'crowdie'])
[f"{c}={r}" for c,r in zip(x, rank(x))]

Searching

To find the index of the minimum or maximum value (returning the first index if there are ties):

np.argmin(x)
np.argmax(x)

Using numpy's automatic casting from Booleans to integers, we can use this to find the first element in a list where some condition is true:

np.argmax(cond)

Given an integer vector x, strip off the leading and trailing zeros.

import numpy as np
x = np.array([0, 0, 0, 13, 12, 0, 5, 0, 8, 0, 0, 0, 0])

# Strip off the leading and trailing zeros

import numpy as np
x = np.array([0, 0, 0, 13, 12, 0, 5, 0, 8, 0, 0, 0, 0])

assert np.any(x != 0), "TODO: deal with all-zero or empty lists"
i = np.argmax(x != 0)
j = np.argmax(x[::-1] != 0)
x[i:-j]

Exercise

Find (approximately) the smallest x value in the range [1,10] for which x² < 0.01(e^x - 1).

import numpy as np
x = np.linspace(1, 10, 1000)
a = x**2
b = 0.01 * (np.exp(x) - 1)

# Find the smallest x for which a < b

import numpy as np
x = np.linspace(1, 10, 1000)
a = x**2
b = 0.01 * (np.exp(x) - 1)

# Find the smallest x for which a < b
x[np.argmax(a < b)]

Lookup

There are many lookup-style operations for which we could use a Python dict or a set.

Lookup operations aren't numpy's forte, and I would generally use the more expressive pandas library instead; but there are a few simple cases for which numpy has useful functions.

The np.unique function is self-explanatory. But it's worth reading the documentation, because it has all sorts of options for returning counts and indexes.

np.unique(x)

Numpy has a vectorized version of Python's x in y function, for testing whether an item x is present in the array y. The numpy version accepts a vector x and returns a boolean vector of the same length.

np.isin(x, y)

How many elements of x are weekdays?

import numpy as np
x = np.array(['Fri', 'Mon', 'Sat', 'Tue', 'Sat', 'Wed', 'Sun', 'Mon'])

import numpy as np
x = np.array(['Fri', 'Mon', 'Sat', 'Tue', 'Sat', 'Wed', 'Sun', 'Mon'])
np.sum(~np.isin(x, ['Sat','Sun']))

Advanced lookup *

In Python we can use a dictionary as a look-up table, as in the code on the right.

Numpy can do the same, but it's a bit ugly, and should only be used when there's a real need for speed. This is advanced numpy.

First create a list k with the keys from move, and another list v with the values. We can use list(move.keys()) amd list(move.values()) for this.

Next, use the following code to find the index in k for each entry of x:

ki = np.argsort(k)
ki[np.searchsorted(k, x, sorter=ki)]

Finally, use integer indexing to get the relevant values from move.

import numpy as np
x = np.array(['forward','up','up','down','forward','up','forward'])
move = {'forward':0, 'up':-1, 'down':1}

# Lookup the value from move for each item in x
[move[xi] for xi in x]

import numpy as np
x = np.array(['forward','up','up','down','forward','up','forward'])
move = {'forward':0, 'up':-1, 'down':1}

# Lookup the value from move for each item in x
k = list(move.keys())
v = list(move.values())
ki = np.argsort(k)
j = ki[np.searchsorted(k, x, sorter=ki)]
np.array(v)[j]

Iteration

There are some problems that are intrinsically iterative, where it's hard to see how to avoid a ‘for’ loop.

Numpy does however have one useful iterative functions built in. Given a vector x = [x0,x1,x1,…],

np.cumsum(x)  # returns [x0, x0+x1, x0+x1+x2, …]

Sometimes, if we're very cunning, we can turn an iteration into a vectorized operation by using np.cumsum in a clever way.

Some people use NaN (not a number) to indicate unknown values. The ffill function as shown on the right fills in NaN values, by propagating forwards the last non-NaN value. Can you implement it without a ‘for’ loop?

HINT: First define y = x[~np.isnan(x)]. Which entries of y should the function return?

import numpy as np

# Fill in NaN values by propagating the last non-NaN value forwards
def ffill(x):
    y = np.copy(x)
    for i in range(1, len(x)):
        if np.isnan(y[i]):
            y[i] = y[i-1]
    return y

x = np.array([1.0, np.nan, 5.3, 1.1, np.nan, 3.0, 2.1, np.nan, np.nan])
ffill(x)

import numpy as np

# Fill in NaN values by propagating the last non-NaN value forwards
def ffill(x):
    b = ~ np.isnan(x)
    b[0] = True
    return x[b][np.cumsum(b) - 1]
    
x = np.array([1.0, np.nan, 5.3, 1.1, np.nan, 3.0, 2.1, np.nan, np.nan])
ffill(x)

Advanced iteration *

The idea of np.cumsum is that it accumulates successive applications of the ‘addition’ operation. Numpy lets us accumulate other functions using accumulate:

cumprod = np.times.accumulate
cumprod(x) # returns [x[0], x[0]*x[1], …]

cummin = np.minimum.accumulate
cummin(x) # returns [x[0], min(x[0],x[1]), …]

Reimplement the function queue_sim using vectorized operations, and verify that your version gives the same answer as this Python version.

This function simulates a queue which serves C units of work per timestep, with an amount of work a_t arriving in timestep t. The queue size evolves according to the recursion

q_t+1 = max(q_t + a_t - C, 0)

It can be proved that another way to get the same answer is with the formula

q_t = q₀ + x_t - min(0, y_t)

where

x_t = (a₀ - C) + ⋯ + (a_t-1 - C)

and

y_t = min(q₀+x₁, q₀+x₂, …, q₀+x_t).

Therefore, to reimplement queue_sim using vectorized operations, we should

compute x = [x₁, x₂, …, x_t] using np.cumsum
compute y = [y₁, y₂, …, y_t] by accumulating np.minimum
compute q = [q₁, q₂, …, q_t]

import numpy as np

# Simulate a queue with service rate C and input a = [a[0],...,a[t-1]]
# Return q = [q[1],...,q[t]]
def queue_sim(q0, C, a):
    res = []
    q = q0
    for ai in a:
        q = max(q + ai - C, 0)
        res.append(q)
    return res

a = np.array([4, 1, 2, 8, 2, 3, 1])
C = 3
q0 = 1

print(queue_sim(q0, C, a))

import numpy as np

# Simulate a queue with service rate C and input a = [a[0],...,a[t-1]]
# Return q = [q[1],...,q[t]]
def queue_sim(q0, C, a):
    res = []
    q = q0
    for ai in a:
        q = max(q + ai - C, 0)
        res.append(q)
    return res

def queue_sim2(q0, C, a):
    x = np.cumsum(a - C)
    y = np.minimum.accumulate(q0 + x)
    q = q0 + x - np.minimum(0, y)
    return q

a = np.array([4, 1, 2, 8, 2, 3, 1])
C = 3
q0 = 1

print(queue_sim(q0, C, a))
print(queue_sim2(q0, C, a))

Pseudo-vectorization

Writing our code with vector algebra has two benefits: (1) the syntax is nice, and (2) it runs faster than Python ‘for’ loops, because numpy can just use its fast internal C routines and avoid the slow Python interpreter.

If we have a Python function that has no numpy equivalent, but we still want the benefit of nice vector syntax, we can vectorize our function.

def f(x):
    … # some function that operates on values

g = np.vectorize(f)
g(vec) # applies f to each element

Under the hood this is still a slow ‘for’ loop, so avoid it if you can!

import numpy as np
import re

def f(x): return re.sub(r'(.*)\\s+and.*', r'\\1', x)

g = np.vectorize(f)
x = ['chalk and cheese', 'baked beans and toast', 'pride and prejudice']
g(x)

Arrays

Why arrays?

Numpy supports matrices and higher-dimensional arrays.

To enter a 2d array (i.e. a matrix) like

	2.2	3.7	9.1
	-4	3.1	1.3

we type in the values as nested lists,

a = np.array([[2.2, 3.7, 9.1], [-4, 3.1, 1.3]])

Why learn about arrays?

Arrays will play only a very small role in this Scientific Computing course.

But if you go on to work with neural networks, or with graphics, or systems modelling, or any one of a whole range of areas of computer science, you'll need to use arrays. There isn't actually much more to them than vectors, so now is as good a time as any to learn how to handle them in numpy.

import numpy as np

a = np.array([[2.2, 3.7, 9.1], [-4, 3.1, 1.3]])
a

Array shapes

Every numpy array has a shape.

a.shape

For a matrix with m rows and n columns, the shape is (m,n).
For higher-dimensional arrays, the shape is a tuple with length equal to the number of dimensions.
A vector is just a 1d array. A vector of length n has shape (n,) — this is a tuple of length 1.

All of the usual array-creation methods can also be used to create arrarys: simply pass in an extra argument specifying the shape.

Note: for most numpy routines this extra argument is called shape, but for random number generation routines it's called size.

Create the following matrix, by passing the extra argument shape=(2,3) to np.ones.

	1	1	1
	1	1	1

import numpy as np

import numpy as np

# Note the type specifier, to return an integer array
np.ones(shape=(2,3), dtype=int)

Row and column vectors?

Numpy vectors are nothing other than one-dimensional arrays, and their shape is a tuple of length 1.

x = np.ones(n)  # vector with n items
x.shape         # returns (n,)

In maths notation we distinguish between row and column vectors, but numpy doesn't make this distinction. A vector is just a 1d array, and it's only 2d arrays that can be described as ‘row’ or ‘column’.

Create a random column vector with n elements.

import numpy as np
n = 4

import numpy as np
n = 4
np.random.uniform(size=(n,1))

Reshaping

For 1d vectors, the only meaningful ways to reshape them is by slicing and concatenating. But for higher-dimensional arrays there is a whole variety of reshaping functions such as stacking, tiling, etc. Here are four that are particularly useful.

Adding a dimension:

a[:, None]    # turn vector into row vector
a[..., None]  # add dim. to arbitrary array
a[None, ...]  # add first dimension

Stacking vectors:

np.column_stack([vec, vec, …])
np.row_stack([vec, vec, …])

Transposing:

a.T

Reshaping the existing elements into a new array:

a.reshape(shape)

The shape argument is allowed to have one entry equal to -1, meaning “make this dimension however big it needs to be to use up all the elements”.

How many ways can you think of to create the column vector [[1],[2],…,[n]]?

import numpy as np
n = 4
x = 1 + np.arange(n)

import numpy as np
n = 4
x = 1 + np.arange(n)

x[:, None]
x.reshape((-1,1))
np.row_stack([x]).T

Indexing and slicing

To refer to the item at row i column j (where indexes start at 0, as usual) use

a[i, j]

To refer to a subarray, we can use an extended version of Python's slice syntax:

a[:, :2]     # all rows, first two columns
a[1, :]      # second row, all columns
a[:2, :2]    # 2×2 submatrix at the top left

If we specify only the rows we want, then it will give us all the columns:

a[:2]    # first two rows

Assign new values into a subarray of a, to create the following matrix:

	1	0	2
	0	0	0
	3	0	4

import numpy as np
a = np.zeros(shape=(3,3), dtype=int)
# a[???] = ???
a

import numpy as np
a = np.zeros(shape=(3,3), dtype=int)
a[::2,::2] = np.array([[1,2],[3,4]])
a

Arrays as lists

Arrays can used in Python as if they were lists. When we iterate over an array a, we get a[0], a[1], ….

By the logic of numpy's slice syntax, if a is a matrix, then each a[i] is a row of the matrix; and if a is a d-dimensional matrix then each a[i] is a (d-1)-dimensional subarray.

Print out the rows of the a matrix, prefixed by "Row=" and the row number.

import numpy as np

a = np.arange(10).reshape((5,2))
a

import numpy as np

a = np.arange(10).reshape((5,2))
for i,row in enumerate(a):
    print(f"Row={i}: {row}")

Vectorized operations

Numpy's usual vectorized operations can be applied to arrays, though now we have to specify which dimensions we want to apply them to.

np.sum(a, axis=0)  # vec of column sums
np.sum(a, axis=1)  # vec of row sums
np.sum(a)          # grand total

Find the 4-vector of column sums for the matrix

1	2	3	4
5	6	7	8
9	10	11	12

import numpy as np

,
import numpy as np
a = np.arange(12).reshape((3,-1)) + 1
np.sum(a, axis=0)

Exercise

I have been given a greyscale image, in the form of an integer matrix. But it has a white border (value 255), which I would like to remove. The border might be different thicknesses on different sides. How can I detect the border and remove it?

255	255	255	255	255
255	255	255	255	255
255	255	255	255	255
255	7	7	255	255
255	3	93	6	255
255	255	255	255	255

Desired result:

	7	7	255
	3	93	6

HINT: Use the vectorized function np.all along one or other dimension of the array.

import numpy as np
w = 255
img = np.array([[w,w,w,w,w],
                [w,w,w,w,w],
                [w,w,w,w,w],
                [w,7,7,255,w],
                [w,3,93,6,w],
                [w,w,w,w,w]])

import numpy as np
w = 255
img = np.array([[w,w,w,w,w],
                [w,w,w,w,w],
                [w,w,w,w,w],
                [w,7,7,255,w],
                [w,3,93,6,w],
                [w,w,w,w,w]])

rw = np.all(img==w, axis=1)  # for each row: is it all w?
cw = np.all(img==w, axis=0)  # for each col: is it all w?
img[np.argmin(rw):-np.argmin(rw[::-1]), np.argmin(cw):-np.argmin(cw[::-1])]

Broadcasting *

Numpy has a powerful tool called broadcasting which generalizes ‘add a scalar to a vector”. It's more advanced than we need for this course, but it's used a lot in deep learning, so it's worth reading about.

The code on the right shows a simple example: normalizing a matrix so the columns sum to 1.

	3	2	8
	2	6	2

import numpy as np
a = np.array([[3,2,8], [2,6,2]])
colsums = np.sum(a, axis=0)
a / colsums

Advanced indexing

There are two ways to refer to an arbtrary group of items inside an array, both called advanced indexing.

Boolean indexing

a[mask]

This refers to the elements of a where the mask is True. The mask should have the same shape as a. (Although, as with slicing, we can omit trailing dimensions.)

Integer indexing

a[[i₁,i₂,…,i_n], [j₁,j₂,…,j_n]]

This refers to elements a[i₁,j₁], a[i₂,j₂], …, a[i_n,j_n]. The syntax extends to arrays with more than two dimensions.

A permutation matrix is a square matrix of 0s and 1s, where each row contains exactly one 1, and each column likewise. Using np.random.permutation(n) create a random n×n permutation matrix.

(NOTE: The function np.random.permutation(n) produces a random permutation of the integers [0,1,…,n-1].)

import numpy as np
n = 5

import numpy as np
n = 5
p = np.zeros((n,n), dtype=int)
p[np.arange(n), np.random.permutation(n)] = 1
p

Linear algebra *

For matrix multiplication, use the @ operator.

vec @ vec  # dot product
mat @ vec  # matrix × vector
mat @ mat  # matrix × matrix

In Easter term you will study linear algebra in the Maths for Natural Sciences course. All the functions you'll come across there can be found in np.linalg.

Simultaneous linear equations

A collection of m simultaneous linear equations in n unknowns can be written as A @ x = b, where A is an m×n matrix, x is the length-n vector of unknowns, and b is a length-m vector.

To solve for x, when m=n and we also know that the equations have a solution and that it is unique,

np.linalg.solve(A, b)

To get a solution for x otherwise,

np.linalg.lstsq(A, b)[0]

If the equations have multiple solutions, it will return one of them. If the equations have no solution, it will return a solution that makes A @ x as close to b as possible (in the ‘least squares’ sense).

Solve the equations \\begin{align} 3.1 x_1 + x_2 &= 0.5 \\\\ 2 x_1 - 1.5 x_2 &= 1 \\end{align}

import numpy as np

import numpy as np
A = [[3.1, 1], [2,-1.5]]
b = [0.5, 1]
np.linalg.solve(A, b)

Simulation

Basic simulation

Simulation is a mainstay of scientific computing.

A common pattern with numpy is to predefine a vector or array to store the results, one row per timestep, and then iterate over timesteps gradually filling in the array.

The code on the right shows an example, a simulator of predator–prey dynamics. Let $f_t$ be the population of foxes (predators) at time $t$, let $r_t$ be the population of rabbits (prey), and suppose they evolve according to \begin{align} \frac{d r_t}{dt} &= \alpha r_t - \beta r_t f_t\\ \frac{d f_t}{dr} &= -\gamma f_t + \delta r_t f_t \end{align} where $\alpha$, $\beta$, $\gamma$, and $\delta$ are constants. This model expresses the idea that rabbits will (if left unchecked) reproduce exponentially, that foxes will (if they get no food) die out, and that there are interactions between rabbits and foxes in which the rabbit gets eaten and the fox gets food needed for reproduction.

The simulator is a simple discrete-time approximation, based on \begin{align} r_{t+\Delta} &\approx r_t + \Delta \frac{d r_t}{d t}\\ f_{t+\Delta} &\approx f_t + \Delta \frac{d f_t}{d t} \end{align}

This simulator is simple and naive.

From a mathematical point of view it's naive, because there are much more sophisticated numerical methods for solving differential equations.
From a computer science point of view it's not good code, because it tangles together the iteration logic with the logging logic. It should really be rewritten with generators.
But from a scientific computing point of view, simulations like this are so easy to put together and learn from, that they are invaluable.

import numpy as np
import matplotlib.pyplot as plt

α,β,γ,δ = (0.11, .04, .04, .01)
Δ = 0.01   # timestep
T = 500    # total simulated time

res = np.zeros((round(T/Δ),2))  # preallocate array to store sim
res[0] = [10, 10]               # initial state r[0]=10, f[0]=10

for i in range(1, len(res)):
    r,f = res[i-1]
    dr,df = (α*r - β*r*f, -γ*f + δ*r*f) 
    res[i] = res[i-1] + Δ * np.array([dr,df])

# Note: the plotting code will be explained later in this course
fig,axes = plt.subplots(2,1)
for j,lbl,ax in zip([0,1], ['rabbits','foxes'], axes):
    ax.plot(np.arange(len(res))*Δ, res[:,j])
    ax.set_ylabel(lbl)
plt.show()

Monte Carlo simulation

A simulation which uses random numbers is called a Monte Carlo simulation (for the rather daft reason that the city of Monte Carlo is famous for gambling, and gambling is also random).

Arguably all simulators should be Monte Carlo, since real-world dynamics are invariably random. Let's inject some randomness into this predator-prey simulator.

Replace the update step with

noise = np.random.normal(loc=1, scale=0.002)
res[i] = res[i-1] * noise + Δ * np.array([dr,df])

In Monte Carlo simulation it's valuable to understand how much the outcome depends on random chance.

Wrap up the simulator into a function called runsim. This function should create an empty res array, run the simulator to fill it in, then return it. Then run it 5 times and store the result in a list,

runs = [runsim() for _ in range(5)]

Show all 5 runs, by replacing the line that calls ax.plot with

for res in runs:
    ax.plot(np.arange(len(res))*Δ, res[:,j])

How do we decide what sort of random variables to use and where to put them in the simulator? This is a great big area of study — it's what data science and machine learning are all about. You'll hone your modelling skills throughout the rest of your degree!

import numpy as np
import matplotlib.pyplot as plt

α,β,γ,δ = (0.11, .04, .04, .01)
Δ = 0.01   # timestep
T = 500    # total simulated time

res = np.zeros((round(T/Δ),2))
res[0] = [10, 10]

for i in range(1, len(res)):
    r,f = res[i-1]
    dr,df = (α*r - β*r*f, -γ*f + δ*r*f)
    res[i] = res[i-1] + Δ * np.array([dr,df])

# Note: the plotting code will be explained later in this course
fig,axes = plt.subplots(2,1)
for j,lbl,ax in zip([0,1], ['rabbits','foxes'], axes):
    ax.plot(np.arange(len(res))*Δ, res[:,j])
    ax.set_ylabel(lbl)
plt.show()

import numpy as np
import matplotlib.pyplot as plt

α,β,γ,δ = (0.11, .04, .04, .01)
Δ = 0.01   # timestep
T = 500    # total simulated time

def runsim():
    res = np.zeros((round(T/Δ),2))
    res[0] = [10, 10]
    for i in range(1, len(res)):
        r,f = res[i-1]
        dr,df = (α*r - β*r*f, -γ*f + δ*r*f) 
        noise = np.random.normal(loc=1, scale=0.002)
        res[i] = res[i-1] * noise + Δ * np.array([dr,df])
    return res

runs = [runsim() for _ in range(5)]

fig,axes = plt.subplots(2,1)
for j,lbl,ax in zip([0,1], ['rabbits','foxes'], axes):
    for res in runs:
        ax.plot(np.arange(len(res))*Δ, res[:,j])
    ax.set_ylabel(lbl)
plt.show()

Reproducibility

A Monte Carlo simulator gives different results each time we run it — that's the whole point of using randomness! But for reproducibility (and especially for debuggability) we'd like to be able to force it to pick the same random numbers each time.

In numpy the pattern is to create a ‘stream of randomness’, let's call it ω, initialized with an arbitrary seed. Thereafter use methods of ω to generate actual random numbers.

ω = np.random.default_rng(seed)  # seed an integer

ω.uniform(...)
ω.normal(...)
# and many other random variable distributions

Run this code several times. You should see a different output each time.

Modify runsim to accept an argument ω, and replace np.random.binomial by ω.binomial. Next, define ω using np.random.default_rng, using whatever seed takes your fancy, and call runsim(ω). You should now see the same output every time you run the code.

In probability theory, we talk about a sample space $\Omega$, the set of all possible outcomes of an experiment. A particular outcome is denoted $\omega\in\Omega$. A random variable $X$, for example the outcome of a roll of a die, is (formally speaking) a function $X:\Omega\to\{1,\dots,6\}$. This is why I like to use the symbol $\omega$ in my code to denote a ‘stream of randomness’.

import numpy as np

def runsim():
    return np.random.binomial(n=1, p=0.5, size=10)

runsim()

import numpy as np

def runsim(ω):
    return ω.binomial(n=1, p=0.5, size=10)

ω = np.random.default_rng(1618)
runsim(ω)

Next steps

Well done!

You've finished this tutorial about NumPy.

The next step is to embark on your own scientific investigations, using numpy to speed things up — to make your code more concise hence faster to type and easier to read, and also to make it run faster.

Tick 1 will ask you to implement a simulator of the COVID epidemic. You'll investigate two factors that impact the spread of the disease: vaccination rates, and social mingling.

import numpy as np

s = 'W eaed iloanlng!'
a = np.array(list(s)).reshape(4,4).T
''.join(a.reshape(-1))

255	255	255	255	255
255	255	255	255	255
255	255	255	255	255
255	7	7	255	255
255	3	93	6	255
255	255	255	255	255

255	255	255	255	255
255	255	255	255	255
255	255	255	255	255
255	7	7	255	255
255	3	93	6	255
255	255	255	255	255

255	255	255	255	255
255	255	255	255	255
255	255	255	255	255
255	7	7	255	255
255	3	93	6	255
255	255	255	255	255