Plotting data with matplotlib

Matplotlib is a powerful but unintuitive plotting library. The best way to learn Matplotlib is to browse through galleries until you find something you like, and copy it. But to make sense of what you read, you need to know the basic structure of a plot, which is laid out in section 1. Then, to customize your plots, you’ll need to make frequent use of Google, Stack Overflow, the Matplotlib gallery and, if things get desperate, the documentation. The rest of this notebook is a gallery: scroll through the pictures, and if you see something you like then read the code. The Python Graph Gallery is also a great source of inspiration, but most of its plots use a Matplotlib wrapper called seaborn, which is yet another thing to learn.

0. Preamble

Here are the standard imports for nearly any piece of data handling work:
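The import cell itself is missing from this copy of the notebook. A plausible version, assuming the conventional aliases for numpy, pandas and matplotlib (the exact set used originally is a guess), would be:

```python
# Assumed imports: the original cell is not shown, these are the usual conventions
import numpy as np                # numerical arrays
import pandas as pd               # dataframes
import matplotlib.pyplot as plt   # plotting interface
```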

The plots in this notebook are all based on the stop-and-search dataset explored in Notebook 3.

1. Code structure for plotting

Here is the general structure of plot code. I find it helpful to build up my plot step by step, adding pieces in the order listed here, and checking at each step what the plot looks like. If you add everything all in one go, chances are it won’t work and you won’t know which bit went wrong.

# First, prepare the data and put it into a dataframe

# Get the overall Figure object (used for some overall customization)
# and Axes object (used for the actual plotting)
# Set figure size and other style parameters
fig,ax = plt.subplots(figsize=(x,y), ...)

# 1. Draw data points / bars / curves etc. onto ax
# 2. Configure limits and colour scales
# 3. Add annotations, text, arrows, etc.
# 4. Configure the grid, tick location, tick labels and format
# 5. Legend, axis labels, titles

# Save as pdf or svg or png, depending on the destination
plt.savefig('myplot.pdf', transparent=True, bbox_inches='tight', pad_inches=0)
plt.show()

Here's a very simple example.
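The example cell is not reproduced here, so what follows is only a sketch of what such an example might look like. It follows the step order of the template above, and the dataframe contents are made up for illustration rather than taken from the stop-and-search dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up counts, standing in for the real data
df = pd.DataFrame({"outcome": ["nothing found", "warning", "arrest"],
                   "count": [120, 30, 15]})

# Figure and Axes objects
fig, ax = plt.subplots(figsize=(4, 3))

# 1. Draw the bars
ax.bar(df["outcome"], df["count"])
# 2. Configure limits
ax.set_ylim(0, 150)
# 5. Axis labels and title
ax.set_ylabel("number of stops")
ax.set_title("Outcomes (illustrative data)")
```

In a notebook the figure appears when the cell finishes; in a plain script you would add plt.show() at the end, as in the template.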

It's usually more interesting to produce plots consisting of one or more subplots. The code to produce this starts with

fig,(ax1,ax2,ax3) = plt.subplots(nrows=1,ncols=3)

which gives us three Axes objects, one for each subplot, which we can then draw on using ax1.bar, ax2.bar and so on. The full code is in the gallery below.

[Figure: multipanel plot]

You’ll also see plenty of code samples which use commands like plt.barh or plt.yticks. That’s old-style ‘stateful’ code, where matplotlib tries to work out which subplot you’re currently drawing on. It works fine if you only have one subplot, but it’s confusing when you have several. The Matplotlib documentation advises that for more complex plots you should get the Axes object first and then use ax.barh or ax.set_yticks.
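To make the contrast concrete, here is a small side-by-side sketch of the two styles drawing the same horizontal bar chart. The positions, values and names are invented for the illustration.

```python
import matplotlib.pyplot as plt

positions = [0, 1, 2]        # made-up bar positions
values = [3, 1, 2]           # made-up bar lengths
names = ["A", "B", "C"]      # made-up tick labels

# Stateful style: pyplot draws on whichever Axes is "current"
plt.figure()
plt.barh(positions, values)
plt.yticks(positions, names)
stateful_ax = plt.gca()      # the Axes that pyplot chose for us

# Object-oriented style: we say explicitly which Axes to draw on
fig, ax = plt.subplots()
ax.barh(positions, values)
ax.set_yticks(positions)
ax.set_yticklabels(names)
```

Both halves produce the same picture. The difference only starts to matter once there are several Axes and "current" becomes ambiguous.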

2.1 MULTIPANEL BAR CHART

Here’s the code behind our multipanel plot, shown above. Note the line

fig,(ax1,ax2,ax3) = plt.subplots(nrows=1,ncols=3, sharey=True)

which asks for three subplots in a row, and says that their $y$ scales are to be shared. Matplotlib picks the scales automatically to fit the objects drawn onto a subplot, and sharey=True means that all three subplots get the same scale, chosen to fit everything drawn on any of them. It also means that the tick labels are only shown on the leftmost of the three subplots.

You're computer scientists, so you should produce the three plots with a for loop, rather than by copying the plot command three times!
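The gallery code is not included in this copy, so here is a rough sketch of the loop idea. The panel titles, group names and counts are all hypothetical; the point is only the pattern of zipping the Axes objects with the per-panel data.

```python
import matplotlib.pyplot as plt

# Made-up counts per group for three hypothetical panels
panels = {
    "panel A": [5, 3, 8],
    "panel B": [2, 7, 4],
    "panel C": [6, 1, 9],
}
groups = ["x", "y", "z"]
positions = list(range(len(groups)))

# With nrows=1, ncols=3 the second return value is an array of three Axes
fig, axes = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(9, 3))

# One pass of the loop per subplot
for ax, (title, counts) in zip(axes, panels.items()):
    ax.bar(positions, counts)
    ax.set_xticks(positions)
    ax.set_xticklabels(groups)
    ax.set_title(title)

# The y axis is shared, so one label on the first panel is enough
axes[0].set_ylabel("count")
```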

2.2 HISTOGRAM AND DENSITY PLOT

This plot shows two graphics superimposed, a histogram (i.e. a bar chart based on binned counts), and a smooth curve for the density. To produce the smooth curve we can use a generic smoother such as scipy.stats.gaussian_kde, which takes the underlying data and returns a function, and then apply this function to evenly-spaced values along the $x$-axis to generate the points to be plotted.
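As a sketch of this recipe, the code below draws a histogram and overlays a gaussian_kde curve. The sample is randomly generated for illustration, and the bin count and grid size are arbitrary choices rather than values from the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=30, scale=10, size=500)   # made-up sample

fig, ax = plt.subplots()

# density=True rescales the bars to total area 1, so they are
# on the same vertical scale as the density curve
ax.hist(data, bins=30, density=True, alpha=0.4)

# gaussian_kde takes the raw data and returns a callable
kde = gaussian_kde(data)

# Evaluate that callable at evenly spaced points along the x axis
xs = np.linspace(data.min(), data.max(), 200)
ax.plot(xs, kde(xs))
```

Without density=True the histogram would show raw counts and the density curve would look flat by comparison.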

2.3 LINE PLOTS + LEGEND

There are several techniques being used in this example.

2.4 SCATTER PLOT + DISCRETE COLOUR SCALE

For scatter plots, use Axes.scatter. This lets you specify the marker, the size s, and the colour c. Here I'm iterating through the different police forces, and calling scatter each time. I chose an appropriate colour scale using

c = plt.get_cmap('Pastel1', n)

This gives a function that we can call to get actual colour values. In this case it's a discrete colour scale with values $c(0),\dots,c(n-1)$.
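A minimal sketch of the loop, using invented group names and random points in place of the police-force data, might look like this. It passes each colour through the color keyword rather than c, because a single RGBA tuple given as c can be mistaken for a sequence of values to be colour-mapped.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
forces = ["force A", "force B", "force C"]   # hypothetical names
n = len(forces)

# Discrete colour scale with n entries: c(0), ..., c(n-1)
c = plt.get_cmap("Pastel1", n)

fig, ax = plt.subplots()
for i, force in enumerate(forces):
    x = rng.normal(size=20)                  # made-up coordinates
    y = rng.normal(size=20)
    # One scatter call per group, each with its own colour and legend entry
    ax.scatter(x, y, s=30, marker="o", color=c(i), label=force)
ax.legend()
```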

2.5 MULTIPANEL PLOT AGAIN

Here is another multipanel plot, also called a facet plot or small multiples plot. According to the plotting guru Edward Tufte,

At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution.

We also showed a multipanel plot at the top of this notebook. There are actually two types of facet plot:

Two other things worth mentioning in this code.

2.6 HEATMAP + CONTINUOUS COLOUR SCALE

This plot uses Axes.imshow to draw a heatmap. This takes an array, and treats it as pixels to be coloured. We can tell it what colour scale to use with cmap, and control the limits with vmin and vmax, and show the scale with plt.colorbar.
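Here is a small sketch of these arguments in use. The array is random and its interpretation (hours by days) is made up. The extent and aspect settings are illustrative choices, and fig.colorbar is used as the Axes-based counterpart of plt.colorbar.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
counts = rng.poisson(lam=5, size=(24, 7))    # made-up: 24 hours x 7 days

fig, ax = plt.subplots()
im = ax.imshow(
    counts,
    cmap="viridis",          # colour scale
    vmin=0, vmax=15,         # limits of the colour scale
    extent=(0, 7, 24, 0),    # (left, right, bottom, top) in data coordinates
    aspect="auto",           # let the pixels be non-square
)
fig.colorbar(im, ax=ax, label="count")
ax.set_xlabel("day of week")
ax.set_ylabel("hour of day")
```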

It's a bit of a hassle working with imshow because matplotlib just sees the data as an array, and we have to tell it explicitly what the rows and columns mean, using the extent argument. If we have a full dataset, as we do here, it's much easier to use Axes.hist2d, which does the binning for us.
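For comparison, matplotlib's two-dimensional histogram method is Axes.hist2d, which takes the raw x and y values and does the binning itself. The sketch below uses randomly generated records in place of the real data, and the bin counts and ranges are assumptions chosen to suit that fake data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
day = rng.integers(0, 7, size=1000)      # made-up: one day-of-week per record
hour = rng.integers(0, 24, size=1000)    # made-up: one hour-of-day per record

fig, ax = plt.subplots()
# hist2d bins the raw records and draws the result; it returns the counts,
# the bin edges along each axis, and the image object for the colour bar
h, xedges, yedges, im = ax.hist2d(
    day, hour,
    bins=[7, 24],
    range=[[0, 7], [0, 24]],
    cmap="viridis",
)
fig.colorbar(im, ax=ax, label="count")
ax.set_xlabel("day of week")
ax.set_ylabel("hour of day")
```

Because hist2d knows the bin edges, the axes are labelled in data units without any need for extent.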

This plot suggests there's some issue with the data. It's worth investigating what's going on in that one day of the year!