NumPy
The main Python library for numerical work is NumPy. It's useful in itself for fast and concise number-crunching.
It's also useful to learn it as a basis for further work in machine learning, since deep learning libraries such as PyTorch
are all patterned after NumPy.
Working with vectors
Handling vectors
Numpy
lets us use maths-style notation for vector algebra and matrices. This is
more concise and easier to debug than using
Python-style ‘for’ loops and list comprehensions
(and it usually runs faster too).
Modify this code to use numpy. First import the library:
import numpy as np
Next, replace the x
and y
lists with
numpy vectors:
x = np.array([1, 3, 2.1, 5])
y = np.array([1, 1, 0, 1])
Finally, replace the list comprehension with simple vector algebra:
x + 2*y
x = [1, 3, 2.1, 5]
y = [1, 1, 0, 1]
[xi + 2*yi for xi,yi in zip(x,y)]
import numpy as np
x = np.array([1, 3, 2.1, 5])
y = np.array([1, 1, 0, 1])
x + 2*y
Creating vectors
How do we create vectors in the first place? We've just seen one way to create a vector,
from a Python list:
np.array(list
)
We can also create integer ranges (like Python's range
):
np.arange(n)
or evenly-spaced floating point ranges, useful for plotting:
np.linspace(start,stop,n)
as well as constant vectors:
np.zeros(n)
np.ones(n)
np.full(n, fill_value)
np.ones_like(vec
) # same size as vec
and vectors of random numbers:
np.random.uniform(lo, hi, n)
Create the integer vector [5,7,9,11,13,15] using np.arange
.
What's the difference between this and np.linspace(5,15,6)
?
import numpy as np
import numpy as np
# create a vector of integers
5 + 2*np.arange(6)
# note: np.linspace creates a vector of floating point values, not integers
Types
All the elements of a numpy vector must be the same type.
This is less
flexible than Python's “store anything you like” lists,
but it's faster—when the machine knows what datatypes to expect, it
can compute the answer more efficiently.
vec
.dtype # get the datatype
vec
.astype(type
) # coerce to a new datatype
np.array(list
, dtype=type
)
The type can be one of Python's basic types
or it can be one of numpy's more precise types, which may
be useful for speed
np.float32
, np.float64
, np.int32
, np.uint8
, etc.
Create the integer vector [5,7,9,11,13,15] using np.linspace
.
import numpy as np
import numpy as np
np.linspace(5,15,6).astype(int)
Vectors as lists
Some list-style operations:
len(vec
)
np.concatenate([vec
, vec
, …])
In Python, list
+list
means list concatenation,
but in numpy vec
+vec
means element-wise vector addition.
Create the vector [10,11,12,13,4,5,6,7] using two calls to np.arange
.
import numpy as np
import numpy as np
np.concatenate([10 + np.arange(4), 4 + np.arange(4)])
Maths
All the standard maths operators work on vectors (+
, -
, *
, /
,
%
, //
, **
) and they apply to each element pointwise.
Numpy also brings a new operator, @
, for computing the dot product.
Numpy also comes with a large library of maths functions that work on vectors. Whenever we can, we should use numpy maths functions rather than Python.
It runs faster, and it leads to more concise code.
Don't use exp
on each element, use np.exp
on the whole vector. Don't use sin
on each element, use np.sin
on
the whole vector. And so on.
Writing numpy code feels like the children's game Simon
Says!
When we use the standard maths operators, behind the scenes Python actually calls numpy functions.
In Python, any class can provide its own implementations for these operators, and the numpy vector class does just this.
Here are some useful functions:
The code on the right computes the correlation coefficient
between two vectors.
Rewrite it to use numpy vectorized operations.
import random, math
N=1000
x = [random.random() for i in range(N)]
y = [xi + 0.5 * random.random() for xi in x]
# Compute the correlation coefficient
xbar = sum(x) / N # sum(list) is built into Python
ybar = sum(y) / N
sxy = sum([(xi-xbar)*(yi-ybar) for xi,yi in zip(x,y)])
sxx = sum([(xi-xbar)**2 for xi in x])
syy = sum([(yi-ybar)**2 for yi in y])
ρ = sxy / math.sqrt(sxx) / math.sqrt(syy)
ρ
import numpy as np
N=1000
x = np.random.uniform(size=N)
y = x + 0.5 * np.random.uniform(size=N)
# Compute the correlation coefficient
xbar = np.mean(x)
ybar = np.mean(y)
ρ = ((x-xbar) @ (y-ybar)) / (np.linalg.norm(x-xbar) * np.linalg.norm(y-ybar))
ρ
# Or the true guru answer: np.corrcoef(x,y)[0,1]
Exercise
The code on the right produces a random value d
.
Produce a million samples of d
using vector algebra, and find the mean and
standard deviation.
import math, random
x = random.uniform(-1,1)
y = random.uniform(-1,1)
d = math.sqrt(x**2 + y**2)
import numpy as np
n = 1000000
x = np.random.uniform(-1,1,n)
y = np.random.uniform(-1,1,n)
d = np.sqrt(x**2 + y**2)
(np.mean(d), np.std(d))
Logic
Vectors can be vectors of Booleans, not just vectors of numbers.
The standard logical operators &
and |
work on Boolean vectors.
For not
, use ~
.
There are also
np.all(vec
) # are all values True?
np.any(vec
) # are any values True?
The standard comparison operators (==
, !=
,
>=
, <=
, >
, <
) also work on vectors,
and return Boolean vectors.
And there are additional
functions to test for NaN or infinite floating point values, and to test for approximate equality with
np.allclose
.
Use Boolean vector algebra to find which elements of x
are in the range [2,4].
import numpy as np
x = np.array([1, 2, 5, 3, 2])
# Replace with vector algebra
[xi >= 2 and xi <= 4 for xi in x]
import numpy as np
x = np.array([1, 2, 5, 3, 2])
(x >= 2) & (x <= 4)
Strings
Vectors can also be vectors of strings.
Numpy has various string operations
such as
np.char.add(strvec
, strvec
)
np.char.str_len(strvec
)
Create the vector ["Attempt_1", "Attempt_2", … "Attempt_n"] for n=5.
As the name ‘numpy’ suggests, string operations aren't numpy's forte.
I often fall back to Pythonic list comprehensions for more elaborate string processing.
import numpy as np
n=5
import numpy as np
n = 5
np.char.add("Attempt_", (1 + np.arange(n)).astype(str))
Indexing and slicing
The usual Python index and slice notation works on vectors, e.g.
x[i] # get x[i]
x[i:j] # get [x[i],…,x[j-1]]
x[i] = newval
# update ith element
x[i] += incval
# increment ith element
Create the vector [10,11,12,13,4,5,6,7] using np.arange(n)
with n=8.
import numpy as np
n=8
import numpy as np
n=8
x = np.arange(n)
x[:n//2] += 10
x
‘For’ considered harmful
Vectorized programming
There are two reasons for writing code with vectorized syntax rather than
‘for’ loops or list comprehensions:
- Vectorized code is often easier to read and write, especially when it's implementing maths formulae.
- Vectorized code often runs faster, since it can be executed entirely in fast numpy loops implemented in C, rather
than in slow interpreted Python
Therefore, whenever we find ourselves writing a ‘for’ loop or a Python list comprehension,
we should stop and see if we can vectorize our code.
Vectorization is often a brain teaser. The next pages will look at some common patterns.
In most of them, the vectorized code is both more concise and faster than the Python equivalent.
In the last few patterns, the vectorized code is trickier to understand — but that's the price we pay for speed.
x = [199, 200, 208, 210, 200, 207, 240, 269, 260, 263]
# Count the number of times the value increases
incs = 0
for i in range(1, len(x)):
if x[i] > x[i-1]:
incs += 1
incs
Counting
The code on the right uses a ‘for’ loop to count the number of times a sequence increases.
Describe what this line produces:
x[1:] >= x[:-1]
There's a nifty trick with Booleans which is handy for counting.
If we apply a numerical function to a boolean vector, it will typecast it:
False ↦ 0
, True ↦ 1
.
Using np.sum
, count the number of times the value increases.
import numpy as np
x = np.array([199, 200, 208, 210, 200, 207, 240, 269, 260, 263])
# Count the number of times the value increases
incs = 0
for i in range(1, len(x)):
if x[i] > x[i-1]:
incs += 1
incs
import numpy as np
x = np.array([199, 200, 208, 210, 200, 207, 240, 269, 260, 263])
# (x[1:] >= x[:-1]) indicates whether each item is >= the preceding
np.sum(x[1:] >= x[:-1])
If-then expressions
One of the most useful pieces of vectorized logic is the numpy version of Python's
x if cond
else y
. In numpy,
x
and y
and cond
are allowed to be vectors, and
we write
np.where(cond
, x, y)
The code on the right uses list comprehensions to
truncate the elements of f
to lie between -0.9 and +0.9.
Replace the list comprehensions with vectorized operations.
This code includes a simple plot. In a later part of this course we'll learn much
more about plotting.
import numpy as np
x = np.linspace(0,2*np.pi,200)[:-1]
f = np.sin(x)
# Replace these list comprehensions
θ = 0.9
f = [-θ if fi <= -θ else fi for fi in f]
f = [θ if fi >= θ else fi for fi in f]
import matplotlib.pyplot as plt
plt.plot(x, f)
import numpy as np
x = np.linspace(0,2*np.pi,200)[:-1]
f = np.sin(x)
θ = 0.9
f = np.where(f <= -θ, -θ, np.where(f >= θ, θ, f))
# Alternatively, f = np.maximum(-θ, np.minimum(θ, f))
import matplotlib.pyplot as plt
plt.plot(x, f)
Filtering (boolean indexing)
As well as the slice notation for picking out
chunks of a vector, numpy also supports boolean indexing.
We can use this to get a new vector that's a filtered version of the original,
or to modify a vector by updating a subset of its values.
x[boolvec
] # filtered vector
x[boolvec
] = newvals
# update subset
Replace the ‘for’ loop on the right with its vectorized equivalent.
What's the difference between this exercise and the previous exercise using np.where
?
This exercise is imperative-style: we're modifying the f
vector object in-place.
The previous exercise was functional-style: we created a new vector object and gave it the name f
.
import numpy as np
x = np.linspace(0,2*np.pi,200)[:-1]
f = np.sin(x)
# Replace this 'for' loop
θ = 0.9
for i in range(len(f)):
if f[i] < -θ: f[i] = -θ
if f[i] > θ: f[i] = θ
import matplotlib.pyplot as plt
plt.plot(x, f)
import numpy as np
x = np.linspace(0,2*np.pi,200)[:-1]
f = np.sin(x)
θ = 0.9
f[f < -θ] = -θ
f[f > θ] = θ
import matplotlib.pyplot as plt
plt.plot(x, f)
Sorting (integer indexing)
Numpy has a third form of indexing, integer indexing.
x[intvec
] # filtered vector
x[intvec
] = newvals
# update subset
To find the indexes where a condition is true,
np.where(cond
)[0]
Sorting and un-sorting
As with Python, we can sort a vector in-place with x.sort()
, or we can create
a new sorted vector leaving the original unchanged with np.sort(x)
.
But there's another way to sort: by finding the integer indexes i
such that
x[i]
is sorted.
i = np.argsort(x)
x[i]
(There's also np.lexsort
if we want to specify how to break ties.)
Indexed-based sorting is useful when we want to undo the sort, because of the nifty property that
x[i][np.argsort(i)] == x
.
The code on the right uses Python list comprehensions to compute the ranks of items in a list.
Replace it with vectorized code, using integer-based sorting.
import numpy as np
# Replace this Python code by a vectorized form
def rank(x):
original_order = range(len(x))
xsorted = sorted(zip(x, original_order))
ranks = [(oi,ri) for ri,(_,oi) in enumerate(xsorted)]
ranks = sorted(ranks) # sort by original_order
return [ri for _,ri in ranks]
x = np.array(['cheddar','wensleydale', 'stilton', 'crowdie'])
[f"{c}={r}" for c,r in zip(x, rank(x))]
import numpy as np
def rank(x):
i = np.argsort(x)
r = np.arange(len(x))
return r[np.argsort(i)]
x = np.array(['cheddar','wensleydale', 'stilton', 'crowdie'])
[f"{c}={r}" for c,r in zip(x, rank(x))]
Searching
To find the index of the minimum or maximum value (returning the first index if there are ties):
np.argmin(x)
np.argmax(x)
Using numpy's automatic casting from Booleans to integers, we can use this to find the first element
in a list where some condition is true:
np.argmax(cond
)
Given an integer vector x
, strip off the leading and trailing zeros.
import numpy as np
x = np.array([0, 0, 0, 13, 12, 0, 5, 0, 8, 0, 0, 0, 0])
# Strip off the leading and trailing zeros
import numpy as np
x = np.array([0, 0, 0, 13, 12, 0, 5, 0, 8, 0, 0, 0, 0])
assert np.any(x != 0), "TODO: deal with all-zero or empty lists"
i = np.argmax(x != 0)
j = np.argmax(x[::-1] != 0)
x[i:-j]
Exercise
Find (approximately) the smallest x value in the range [1,10] for which
x2 < 0.01(ex - 1).
import numpy as np
x = np.linspace(1, 10, 1000)
a = x**2
b = 0.01 * (np.exp(x) - 1)
# Find the smallest x for which a < b
import numpy as np
x = np.linspace(1, 10, 1000)
a = x**2
b = 0.01 * (np.exp(x) - 1)
# Find the smallest x for which a < b
x[np.argmax(a < b)]
Lookup
There are many lookup-style operations for which we could use a Python dict
or a set
.
Lookup operations aren't numpy's forte, and I would generally use
the more expressive pandas library instead;
but there are a few simple cases for which numpy has useful functions.
The np.unique
function is self-explanatory. But it's worth reading the documentation, because it has all sorts of options
for returning counts and indexes.
np.unique(x)
Numpy has a vectorized version of Python's x in y
function, for testing whether an item x
is present in the array y
. The numpy version accepts a vector x
and returns a boolean
vector of the same length.
np.isin(x, y)
How many elements of x
are weekdays?
import numpy as np
x = np.array(['Fri', 'Mon', 'Sat', 'Tue', 'Sat', 'Wed', 'Sun', 'Mon'])
import numpy as np
x = np.array(['Fri', 'Mon', 'Sat', 'Tue', 'Sat', 'Wed', 'Sun', 'Mon'])
np.sum(~np.isin(x, ['Sat','Sun']))
Advanced lookup *
In Python we can use a dictionary as a look-up table, as in the code on the right.
Numpy can do the same, but it's a bit ugly, and should only be used when there's a real need for speed.
This is advanced numpy.
First create a list k
with the keys from move
, and another list v
with the values.
We can use list(move.keys())
amd list(move.values())
for this.
Next, use the following code to find the index in k
for each entry of x
:
ki = np.argsort(k)
ki[np.searchsorted(k, x, sorter=ki)]
Finally, use integer indexing to get the relevant values from move
.
import numpy as np
x = np.array(['forward','up','up','down','forward','up','forward'])
move = {'forward':0, 'up':-1, 'down':1}
# Lookup the value from move for each item in x
[move[xi] for xi in x]
import numpy as np
x = np.array(['forward','up','up','down','forward','up','forward'])
move = {'forward':0, 'up':-1, 'down':1}
# Lookup the value from move for each item in x
k = list(move.keys())
v = list(move.values())
ki = np.argsort(k)
j = ki[np.searchsorted(k, x, sorter=ki)]
np.array(v)[j]
Iteration
There are some problems that are intrinsically iterative, where it's hard to see how to avoid a ‘for’ loop.
Numpy does however have one useful iterative functions built in. Given a vector x = [x0,x1,x1,…]
,
np.cumsum(x) # returns [x0, x0+x1, x0+x1+x2, …]
Sometimes, if we're very cunning, we can turn an iteration into a vectorized operation
by using np.cumsum
in a clever way.
Some people use NaN (not a number) to indicate unknown values.
The ffill
function as shown on the right fills in NaN values, by propagating forwards the last non-NaN value.
Can you implement it without a ‘for’ loop?
HINT: First define y = x[~np.isnan(x)]
. Which entries of y
should the function return?
import numpy as np
# Fill in NaN values by propagating the last non-NaN value forwards
def ffill(x):
y = np.copy(x)
for i in range(1, len(x)):
if np.isnan(y[i]):
y[i] = y[i-1]
return y
x = np.array([1.0, np.nan, 5.3, 1.1, np.nan, 3.0, 2.1, np.nan, np.nan])
ffill(x)
import numpy as np
# Fill in NaN values by propagating the last non-NaN value forwards
def ffill(x):
b = ~ np.isnan(x)
b[0] = True
return x[b][np.cumsum(b) - 1]
x = np.array([1.0, np.nan, 5.3, 1.1, np.nan, 3.0, 2.1, np.nan, np.nan])
ffill(x)
Advanced iteration *
The idea of np.cumsum
is that it accumulates successive applications of the
‘addition’ operation. Numpy lets us accumulate other functions using
accumulate
:
cumprod = np.times.accumulate
cumprod(x) # returns [x[0], x[0]*x[1], …]
cummin = np.minimum.accumulate
cummin(x) # returns [x[0], min(x[0],x[1]), …]
Reimplement the function queue_sim
using vectorized operations,
and verify that your version gives the same answer as this Python version.
This function simulates a queue which serves C units of work per timestep, with an amount of work
at arriving in timestep t. The queue size evolves
according to the recursion
qt+1 = max(qt + at - C, 0)
It can be proved that another way to get the same answer is with the formula
qt = q0 + xt - min(0, yt)
where
xt = (a0 - C) + ⋯ + (at-1 - C)
and
yt = min(q0+x1, q0+x2, …, q0+xt).
Therefore, to reimplement queue_sim
using vectorized operations, we should
- compute
x = [x1, x2, …, xt]
using np.cumsum
- compute
y = [y1, y2, …, yt]
by accumulating np.minimum
- compute
q = [q1, q2, …, qt]
import numpy as np
# Simulate a queue with service rate C and input a = [a[0],...,a[t-1]]
# Return q = [q[1],...,q[t]]
def queue_sim(q0, C, a):
res = []
q = q0
for ai in a:
q = max(q + ai - C, 0)
res.append(q)
return res
a = np.array([4, 1, 2, 8, 2, 3, 1])
C = 3
q0 = 1
print(queue_sim(q0, C, a))
import numpy as np
# Simulate a queue with service rate C and input a = [a[0],...,a[t-1]]
# Return q = [q[1],...,q[t]]
def queue_sim(q0, C, a):
res = []
q = q0
for ai in a:
q = max(q + ai - C, 0)
res.append(q)
return res
def queue_sim2(q0, C, a):
x = np.cumsum(a - C)
y = np.minimum.accumulate(q0 + x)
q = q0 + x - np.minimum(0, y)
return q
a = np.array([4, 1, 2, 8, 2, 3, 1])
C = 3
q0 = 1
print(queue_sim(q0, C, a))
print(queue_sim2(q0, C, a))
Pseudo-vectorization
Writing our code with vector algebra has two benefits: (1) the syntax is nice, and (2) it
runs faster than Python ‘for’ loops, because numpy can just use its fast internal C routines
and avoid the slow Python interpreter.
If we have a Python function that has no numpy equivalent, but we still want the benefit of nice
vector syntax, we can vectorize our function.
def f(x):
… # some function that operates on values
g = np.vectorize(f)
g(vec
) # applies f to each element
Under the hood this is still a slow ‘for’ loop, so avoid it if you can!
import numpy as np
import re
def f(x): return re.sub(r'(.*)\\s+and.*', r'\\1', x)
g = np.vectorize(f)
x = ['chalk and cheese', 'baked beans and toast', 'pride and prejudice']
g(x)
Arrays
Why arrays?
Numpy supports matrices and higher-dimensional arrays.
To enter a 2d array (i.e. a matrix) like
we type in the values as nested lists,
a = np.array([[2.2, 3.7, 9.1], [-4, 3.1, 1.3]])
Why learn about arrays?
Arrays will play only a very small role in this Scientific Computing course.
But if you go on to work with neural networks, or with graphics, or systems modelling, or any one of a
whole range of areas of computer science, you'll need to use arrays.
There isn't actually much more to them than vectors, so now is as good a time as any
to learn how to handle them in numpy.
import numpy as np
a = np.array([[2.2, 3.7, 9.1], [-4, 3.1, 1.3]])
a
Array shapes
Every numpy array has a shape.
a.shape
- For a matrix with m rows and n columns, the shape is (m,n).
- For higher-dimensional arrays, the shape is a tuple with length equal to the number of dimensions.
- A vector is just a 1d array. A vector of length n has shape (n,) — this is a tuple of length 1.
All of the usual array-creation methods can also be used to create
arrarys: simply pass in an extra argument specifying the shape.
Note: for most numpy routines this extra argument is called shape
,
but for random number generation routines it's called size
.
Create the following matrix, by passing the extra argument shape=(2,3)
to np.ones
.
import numpy as np
import numpy as np
# Note the type specifier, to return an integer array
np.ones(shape=(2,3), dtype=int)
Row and column vectors?
Numpy vectors are nothing other than one-dimensional arrays, and their shape
is a tuple of length 1.
x = np.ones(n) # vector with n items
x.shape # returns (n,)
In maths notation we distinguish between row and column vectors, but numpy
doesn't make this distinction. A vector is just a 1d array, and it's only 2d arrays that
can be described as ‘row’ or ‘column’.
Create a random column vector with n elements.
import numpy as np
n = 4
import numpy as np
n = 4
np.random.uniform(size=(n,1))
Reshaping
For 1d vectors, the only meaningful ways to reshape them is by slicing and concatenating.
But for higher-dimensional
arrays there is a whole variety of
reshaping functions
such as stacking, tiling, etc. Here are four that are particularly useful.
Adding a dimension:
a[:, None] # turn vector into row vector
a[..., None] # add dim. to arbitrary array
a[None, ...] # add first dimension
Stacking vectors:
np.column_stack([vec
, vec
, …])
np.row_stack([vec
, vec
, …])
Transposing:
a.T
Reshaping the existing elements into a new array:
a.reshape(shape
)
The shape
argument is allowed to have one entry equal to -1, meaning
“make this dimension however big it needs to be to use up all the elements”.
How many ways can you think of to create the column vector [[1],[2],…,[n]]
?
import numpy as np
n = 4
x = 1 + np.arange(n)
import numpy as np
n = 4
x = 1 + np.arange(n)
x[:, None]
x.reshape((-1,1))
np.row_stack([x]).T
Indexing and slicing
To refer to the item at row i column j (where indexes start at 0, as usual) use
a[i, j]
To refer to a subarray, we can use an extended version of Python's slice syntax:
a[:, :2] # all rows, first two columns
a[1, :] # second row, all columns
a[:2, :2] # 2×2 submatrix at the top left
If we specify only the rows we want, then it will give us all the columns:
a[:2] # first two rows
Assign new values into a subarray of a
, to create the following matrix:
import numpy as np
a = np.zeros(shape=(3,3), dtype=int)
# a[???] = ???
a
import numpy as np
a = np.zeros(shape=(3,3), dtype=int)
a[::2,::2] = np.array([[1,2],[3,4]])
a
Arrays as lists
Arrays can used in Python as if they were lists.
When we iterate over an array a
, we get a[0]
, a[1]
, ….
By the logic of numpy's slice syntax, if a
is a matrix, then each a[i]
is a row of the matrix;
and if a
is a d-dimensional matrix then each a[i]
is a (d-1)-dimensional subarray.
Print out the rows of the a
matrix, prefixed by "Row="
and the row number.
import numpy as np
a = np.arange(10).reshape((5,2))
a
import numpy as np
a = np.arange(10).reshape((5,2))
for i,row in enumerate(a):
print(f"Row={i}: {row}")
Vectorized operations
Numpy's usual vectorized operations can be applied to arrays, though
now we have to specify which dimensions we want to apply them to.
np.sum(a, axis=0) # vec of column sums
np.sum(a, axis=1) # vec of row sums
np.sum(a) # grand total
Find the 4-vector of column sums for the matrix
import numpy as np
,
import numpy as np
a = np.arange(12).reshape((3,-1)) + 1
np.sum(a, axis=0)
Exercise
I have been given a greyscale image, in the form of an integer matrix.
But it has a white border (value 255), which I would like to remove.
The border might be different thicknesses on different sides. How can I detect the border and
remove it?
|
255 | 255 | 255 | 255 | 255 |
|
255 | 255 | 255 | 255 | 255 |
255 | 255 | 255 | 255 | 255 |
255 | 7 | 7 | 255 | 255 |
255 | 3 | 93 | 6 | 255 |
255 | 255 | 255 | 255 | 255 |
HINT: Use the vectorized function np.all
along one or other dimension
of the array.
import numpy as np
w = 255
img = np.array([[w,w,w,w,w],
[w,w,w,w,w],
[w,w,w,w,w],
[w,7,7,255,w],
[w,3,93,6,w],
[w,w,w,w,w]])
import numpy as np
w = 255
img = np.array([[w,w,w,w,w],
[w,w,w,w,w],
[w,w,w,w,w],
[w,7,7,255,w],
[w,3,93,6,w],
[w,w,w,w,w]])
rw = np.all(img==w, axis=1) # for each row: is it all w?
cw = np.all(img==w, axis=0) # for each col: is it all w?
img[np.argmin(rw):-np.argmin(rw[::-1]), np.argmin(cw):-np.argmin(cw[::-1])]
Broadcasting *
Numpy has a powerful tool called
broadcasting
which generalizes ‘add a scalar to a vector”.
It's more advanced than we need for this course,
but it's used a lot in deep learning, so it's worth reading about.
The code on the right shows a simple example: normalizing a matrix so the columns sum to 1.
import numpy as np
a = np.array([[3,2,8], [2,6,2]])
colsums = np.sum(a, axis=0)
a / colsums
Advanced indexing
There are two ways to refer to an arbtrary group of items inside an array,
both called advanced indexing.
Boolean indexing
a[mask
]
This refers to the elements of a
where the mask is True
.
The mask should have the same shape as a
.
(Although, as with slicing, we can omit trailing dimensions.)
Integer indexing
a[[i1,i2,…,in], [j1,j2,…,jn]]
This refers to elements
a[i1,j1]
,
a[i2,j2]
,
…, a[in,jn]
.
The syntax extends to arrays with more than two dimensions.
A permutation matrix is a square matrix of 0s and 1s, where
each row contains exactly one 1, and each column likewise. Using
np.random.permutation(n)
create a random n×n
permutation matrix.
(NOTE: The function
np.random.permutation(n)
produces a random permutation of
the integers [0,1,…,n-1].)
import numpy as np
n = 5
import numpy as np
n = 5
p = np.zeros((n,n), dtype=int)
p[np.arange(n), np.random.permutation(n)] = 1
p
Linear algebra *
For matrix multiplication, use the @
operator.
vec
@ vec
# dot product
mat
@ vec
# matrix × vector
mat
@ mat
# matrix × matrix
In Easter term you will study linear algebra in the Maths for Natural Sciences course.
All the functions you'll come across there can be found in
np.linalg.
Simultaneous linear equations
A collection of m simultaneous linear equations in n unknowns can
be written as A @ x = b, where A is an m×n matrix,
x is the length-n vector of unknowns, and b is a length-m vector.
To solve for x, when m=n and we also know that
the equations have a solution and that it is unique,
np.linalg.solve(A, b)
To get a solution for x otherwise,
np.linalg.lstsq(A, b)[0]
If the equations have multiple solutions, it will return one of them.
If the equations have no solution, it will return a solution that makes
A @ x as close to b as possible (in the ‘least squares’ sense).
Solve the equations
\\begin{align}
3.1 x_1 + x_2 &= 0.5 \\\\
2 x_1 - 1.5 x_2 &= 1
\\end{align}
import numpy as np
import numpy as np
A = [[3.1, 1], [2,-1.5]]
b = [0.5, 1]
np.linalg.solve(A, b)
Simulation
Basic simulation
Simulation is a mainstay of scientific computing.
A common pattern with numpy is to predefine a vector or array to store the results,
one row per timestep, and then iterate over timesteps gradually filling in
the array.
The code on the right shows an example, a simulator of predator–prey dynamics.
Let $f_t$ be the population of foxes (predators) at time $t$, let $r_t$ be the population of rabbits (prey),
and suppose they evolve according to
\begin{align}
\frac{d r_t}{dt} &= \alpha r_t - \beta r_t f_t\\
\frac{d f_t}{dr} &= -\gamma f_t + \delta r_t f_t
\end{align}
where $\alpha$, $\beta$, $\gamma$, and $\delta$ are constants. This model expresses the idea that
rabbits will (if left unchecked) reproduce exponentially, that foxes will (if they get no food) die out,
and that there are interactions between rabbits and foxes in which the rabbit gets eaten
and the fox gets food needed for reproduction.
The simulator is a simple discrete-time approximation, based on
\begin{align}
r_{t+\Delta} &\approx r_t + \Delta \frac{d r_t}{d t}\\
f_{t+\Delta} &\approx f_t + \Delta \frac{d f_t}{d t}
\end{align}
This simulator is simple and naive.
- From a mathematical point of view it's naive, because there are much
more sophisticated numerical
methods for solving differential equations.
- From a computer science point of view it's not good code, because it tangles together the iteration logic with the logging logic.
It should really be rewritten with generators.
- But from a scientific computing point of view, simulations like this are so easy to put
together and learn from, that they are invaluable.
import numpy as np
import matplotlib.pyplot as plt
α,β,γ,δ = (0.11, .04, .04, .01)
Δ = 0.01 # timestep
T = 500 # total simulated time
res = np.zeros((round(T/Δ),2)) # preallocate array to store sim
res[0] = [10, 10] # initial state r[0]=10, f[0]=10
for i in range(1, len(res)):
r,f = res[i-1]
dr,df = (α*r - β*r*f, -γ*f + δ*r*f)
res[i] = res[i-1] + Δ * np.array([dr,df])
# Note: the plotting code will be explained later in this course
fig,axes = plt.subplots(2,1)
for j,lbl,ax in zip([0,1], ['rabbits','foxes'], axes):
ax.plot(np.arange(len(res))*Δ, res[:,j])
ax.set_ylabel(lbl)
plt.show()
Monte Carlo simulation
A simulation which uses random numbers is called a Monte Carlo simulation
(for the rather daft reason that the city of Monte Carlo is famous for gambling, and gambling is also random).
Arguably all simulators should be Monte Carlo, since real-world dynamics are invariably random.
Let's inject some randomness into this predator-prey simulator.
Replace the update step with
noise = np.random.normal(loc=1, scale=0.002)
res[i] = res[i-1] * noise + Δ * np.array([dr,df])
In Monte Carlo simulation it's valuable to understand how much the outcome
depends on random chance.
Wrap up the simulator into a function
called runsim
. This function should create an empty res
array, run the simulator to fill it in,
then return it.
Then run it 5 times and store the result in a list,
runs = [runsim() for _ in range(5)]
Show all 5 runs, by replacing the line that calls ax.plot
with
for res in runs:
ax.plot(np.arange(len(res))*Δ, res[:,j])
How do we decide what sort of random variables to use and where to put them in the simulator?
This is a great big area of study — it's what data science and machine learning are all about.
You'll hone your modelling skills throughout the rest of your degree!
import numpy as np
import matplotlib.pyplot as plt
α,β,γ,δ = (0.11, .04, .04, .01)
Δ = 0.01 # timestep
T = 500 # total simulated time
res = np.zeros((round(T/Δ),2))
res[0] = [10, 10]
for i in range(1, len(res)):
r,f = res[i-1]
dr,df = (α*r - β*r*f, -γ*f + δ*r*f)
res[i] = res[i-1] + Δ * np.array([dr,df])
# Note: the plotting code will be explained later in this course
fig,axes = plt.subplots(2,1)
for j,lbl,ax in zip([0,1], ['rabbits','foxes'], axes):
ax.plot(np.arange(len(res))*Δ, res[:,j])
ax.set_ylabel(lbl)
plt.show()
import numpy as np
import matplotlib.pyplot as plt
α,β,γ,δ = (0.11, .04, .04, .01)
Δ = 0.01 # timestep
T = 500 # total simulated time
def runsim():
res = np.zeros((round(T/Δ),2))
res[0] = [10, 10]
for i in range(1, len(res)):
r,f = res[i-1]
dr,df = (α*r - β*r*f, -γ*f + δ*r*f)
noise = np.random.normal(loc=1, scale=0.002)
res[i] = res[i-1] * noise + Δ * np.array([dr,df])
return res
runs = [runsim() for _ in range(5)]
fig,axes = plt.subplots(2,1)
for j,lbl,ax in zip([0,1], ['rabbits','foxes'], axes):
for res in runs:
ax.plot(np.arange(len(res))*Δ, res[:,j])
ax.set_ylabel(lbl)
plt.show()
Reproducibility
A Monte Carlo simulator gives different results each time we run it — that's the whole point
of using randomness! But for reproducibility (and especially for debuggability) we'd like to be able
to force it to pick the same random numbers each time.
In numpy the pattern is to create a ‘stream of randomness’, let's call it ω,
initialized with an arbitrary seed.
Thereafter use methods of ω to
generate actual random numbers.
ω = np.random.default_rng(seed
) # seed
an integer
ω.uniform(...)
ω.normal(...)
# and many other random variable distributions
Run this code several times. You should see a different output each time.
Modify runsim
to accept an argument ω
, and replace np.random.binomial
by ω.binomial
.
Next, define ω using np.random.default_rng
, using whatever seed takes your fancy,
and call runsim(ω)
. You should now see the same output every time you run the code.
In probability theory, we talk about a sample space $\Omega$, the set of all possible outcomes of an experiment.
A particular outcome is denoted $\omega\in\Omega$. A random variable $X$, for example the outcome of a roll of a die, is
(formally speaking) a function $X:\Omega\to\{1,\dots,6\}$. This is why I like to use the symbol
$\omega$ in my code to denote a ‘stream of randomness’.
import numpy as np
def runsim():
return np.random.binomial(n=1, p=0.5, size=10)
runsim()
import numpy as np
def runsim(ω):
return ω.binomial(n=1, p=0.5, size=10)
ω = np.random.default_rng(1618)
runsim(ω)
Next steps
Well done!
You've finished this tutorial about NumPy.
The next step is to embark on your own scientific investigations,
using numpy to speed things up — to make your code more concise hence faster to type and easier to read,
and also to make it run faster.
Tick 1 will ask you to implement a simulator of the COVID epidemic.
You'll investigate two factors that impact the spread of the disease: vaccination rates,
and social mingling.
import numpy as np
s = 'W eaed iloanlng!'
a = np.array(list(s)).reshape(4,4).T
''.join(a.reshape(-1))