Pandas is a rich expressive library for analyzing data. Plus it comes with a vast set of utilities. Once you're familiar with what pandas has to offer, you'll be able to do all sorts of data handling tasks by stringing together the right pandas commands.
from IPython.display import YouTubeVideo
YouTubeVideo('hKOkdgv6RkI', width=560, height=315)
At the top of almost every piece of data handling work, we’ll import these standard modules.
import numpy as np
import pandas
import matplotlib.pyplot as plt
The running example for this section is a dataset of stop-and-search records, made available by the
UK home office. As it’s a moderate-sized file (172MB) I like to download it to disk, so it’s fast to
reread it each time I restart the notebook. Here’s how we can fetch a file from a url, using the Unix
command-line tool wget
. (The exclamation mark !wget
in the code below is called a Jupyter magic, and it means “Treat this
line as though it were executed at the command prompt”. In IB Unix Tools you’ll learn more about
the Unix command line.)
# Execute a unix command to download a file (if it’s not already
# downloaded), and show download progress
import os.path
if os.path.exists('stop-and-search.csv'):
print("file already downloaded")
else:
!wget "https://www.cl.cam.ac.uk/teaching/2021/DataSci/data/stop-and-search.csv"
We almost always work with data in the form of a spreadsheet-like table, referred to as a dataframe. A dataframe is a collection of named columns. Each column has the same length, and all entries in a column have the same type, though different columns may have different types. Pandas uses numpy to store columns, so it’s reasonably fast.
Here’s how to load a dataframe from a file using pandas.read_csv
, and how to inspect it. (This
dataframe will be used as a running example in the rest of this notebook.) The Pandas library is full of
handy utilities like $\textsf{read_csv}$.
# Import a dataframe using the pandas library
stopsearch = pandas.read_csv('stop-and-search.csv')
# How many rows are there?
print(f"This dataset has {len(stopsearch)} rows")
# What are the columns?
print(stopsearch.columns)
# Display the first 5 rows. iloc[:5] means ”select the first five rows”
# (not all columns fit on this page)
stopsearch.iloc[:5]
Missing values (as in the third entry in the $\textsf{age_range}$ column above) are a fact of life in data science.
They should really be supported by Python itself, but they aren’t, so Pandas adopts its own conventions:
it uses either $\textsf{np.nan}$ (the IEEE floating point for ‘Not A Number’), or $\textsf{None}$ (the built-in Python value
commonly used to denote ‘no return value’). It’s best to use $\textsf{np.nan}$ when the underlying column is a
numpy vector of floating point values, but for other column types it doesn’t matter which is used. To
determine whether values are missing, use pandas.isna
.
pandas.isna(stopsearch.age_range[:5]) # returns [False,False,True,False,False]
sum(pandas.isna(stopsearch.age_range)) # count number of missing values
In my experience, around 70% of the time you spend working with data will be fighting to import it and clean it up. See the supplementary notebooks for a collection of recipes for web scraping, reading from a database, and parsing log files.
The simplest case is when you have a nice simple comma-separated value (CSV) file. A CSV file looks like this:
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
i.e. a header line, then one line per row of the data frame, with values separated by commas. We’ve
already seen how to import a CSV, using pandas.read_csv
. If your file is nearly a CSV but has some
quirks such as comments or a missing header row, experiment with that function’s 55 options. We can
use the same function to read CSV files from remote url:
url = 'https://www.cl.cam.ac.uk/teaching/2021/DataSci/data/iris.csv'
iris = pandas.read_csv(url)
To write a CSV file,
iris.to_csv('iris.csv', index=False)
To create a dataframe from scratch, pass in a dictionary of columns. You can optionally specify the column order you want with the columns argument.
iris = pandas.DataFrame({
'species': ['setosa', 'virginica', 'virginica', 'setosa', 'versicolor'],
'Petal.length': [1.0, 5.0, 5.8, 1.7, 4.2],
'Petal.width': [0.2, 1.9, 1.6, 0.5, 1.2]
},
columns = ['Petal.length', 'Petal.width', 'species'])
Or you can create a dataframe from a list of tuples. Now the columns argument is needed to say what the column names are.
iris = pandas.DataFrame([
('setosa', 1.0, 0.2), ('virginica', 5.0, 1.9), ('virginica', 5.8, 1.6),
('setosa', 1.7, 0.5), ('versicolor', 4.2, 1.2)
],
columns = ['species', 'Petal.length', 'Petal.width'])
Data frames have a triple identity:
We can access entire columns as though the dataframe is a dictionary.
stopsearch.columns # get a list of column names
stopsearch.keys() # … and another way to do the same
x = stopsearch['outcome'] # get a single column
x = stopsearch.outcome # … and another way to do the same
del stopsearch['location'] # delete a column
# add or modify a column
stopsearch['outcome_N'] = np.where(stopsearch.outcome == 'False', 0, 1)
The columns aren't just numpy vectors, they're pandas Series objects. They support many additional real-world data-handling operations that are missing from plain numpy. For example
Series.str.startswith
and many other vectorized string operationspandas.to_datetime
and then use the many datetime operationsSeries.replace
to replace elements, as shown below# list all the non-NA values, and their counts
stopsearch['age_range'].value_counts()
# replace all the string age-ranges with numbers
r = {'18-24':21.5, '10-17':14, '25-34':30, 'over 34':40, 'under 10':8}
stopsearch['age'] = stopsearch['age_range'].replace(r)
We can obtain a new dataframe by selecting a subset of rows and/or columns,
using .loc
.
stopsearch.loc[:, ['force','datetime','outcome']] # all rows, some cols
stopsearch[['force','datetime','outcome']] # … the same thing
stopsearch.loc[stopsearch.force=='cambridgeshire'] # some rows, all cols
stopsearch.loc[stopsearch.force=='cambridgeshire', # some rows, some cols
['force','datetime','outcome']]
If we want to select rows by row number, rather than by a boolean condition as above, we need .iloc.
stopsearch.iloc[:3] # the first 3 rows
stopsearch[:3] # … and another way to do the same
stopsearch.iloc[[0,3,5]] # select several rows
stopsearch.iloc[[5]] # returns a one-row dataframe
stopsearch.sample(4) # select 4 rows at random
Row and column selectors can be combined.
wantcols = ['force','datetime','outcome']
stopsearch.loc[stopsearch.force=='cambridgeshire', wantcols]
stopsearch[wantcols].iloc[:3]
stopsearch.loc[stopsearch.force=='cambridgeshire', wantcols].iloc[:3]
To pull out a single row as a tuple, or to pull out a single value as a scalar, there is different syntax.
stopsearch['force'].iat[5] # a scalar for the specified column and row
stopsearch.iloc[5] # a tuple containing the values for row 5
We can use these indexing operations to update a specific element in the dataframe—but I think it’s cleaner to modify data using dictionary indexing, replacing an entire column, rather than hacking at individual elements.
Sometimes pandas will tell us off, warning us that the operation may be inefficient.
stopsearch['outcome_N'][0] = 2
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
This means that we've taken a subset of the rows, and tried to set a variable. Pandas doesn't actually copy the dataframe when you take a subset of the rows, it shares the data with the original table. So, to suppress this warning, take a copy of the dataframe with df=stopsearch.copy()
, and then modify
the copy.
There is a third way to select rows from a dataframe, which in my experience is the source of endless confusion: selecting by row index. When you see a Pandas dataframe printed out, there is a column at the left without a column name. These aren’t row numbers, they are row indexes, which behave like the keys in a dictionary. In all the examples we’ve seen so far the indexes happen to be numbers, but they could be any other Python object. Row indexes are there for the same reason database tables have indexes: they’re vital for efficient lookup. But we won’t be using them in this course.
Pandas always keeps track of row indexes. Even when you pull out a single column, it doesn't give us a numpy vector, it gives us a Series object which is numpy-vector-plus-indexes.
And it always tries to match indexes.
This is usually not what we want, especially when we just want some plain simple arithmetic. I recommend that you generally use .values
when you are working with subsets of rows. This gives you the actual numpy vector behind the
column, not the confusing Series. We won’t be taking advantage of row
indexes in this course, but it’s worth knowing they exist so you can understand the cryptic errors and
error messages you will undoubtedly come across.
df = pandas.DataFrame({'x': [3,3,4,8,2]}, index=['a','b','c','d','e'])
x | |
---|---|
a | 3 |
b | 3 |
c | 4 |
d | 8 |
e | 2 |
# This looks like it should add [3,3,4] and [4,8,2] … but it doesn’t!
df['x'][:3] + df['x'][2:]
a NaN b NaN c 8.0 d NaN e NaN Name: x, dtype: float64
# To get the answer we were probably expecting,
df['x'][:3].values + df['x'][-3:].values
array([ 7, 11, 6])
The pattern behind much data processing is split-apply-combine-join: split your data into pieces, apply a transformation to each piece, combine the pieces, and join results from different datasets together. We could code this explicitly with a ‘for’ loop, but it would involve lots of boilerplate code — and I hope you have been persuaded by the NumPy notes that ‘for’ loops are considered harmful. Instead, let’s see how to do it with Pandas.
The following line of code performs a cross-tabulation: it splits the data into a separate dataframe
for each combination of officer-defined ethnicity and gender, applies the len
function to each subdataframe
to get the number of rows it contains, and combines the results into a single indexed object.
If we want to apply the function to a single column, pick it out after the groupby.
If we want to apply a more elaborate function, it's handy to define an anonymous function with lambda
.
# Select cambridgeshire records, then tabulate by ethnicity and gender
df = stopsearch.loc[stopsearch.force=='cambridgeshire']
x = df.groupby(['officer_defined_ethnicity', 'gender']).apply(len)
officer_defined_ethnicity gender Asian Female 7 Male 179 Other 1 Black Female 10 Male 257 Other Female 6 Male 28 White Female 253 Male 1465 Other 5 dtype: int64
# Apply the np.mean function to the 'age' column, separately for each gender
# (This assumes you've run the commands in 4.1 to define the age column, before defining df.)
df.groupby('gender')['age'].apply(lambda x: np.mean(x))
gender Female 24.656682 Male 24.628913 Other 24.750000 Name: age, dtype: float64
For this course, we will only apply functions that return simple Python values. It’s possible but more complicated to apply functions that return dataframes or Pandas columns or indexed arrays.
The groupby/apply commands have produced an indexed array. An indexed array is a cross between a normal numpy array and a dataframe. We access elements and sub-arrays by dimension, like a numpy array — but the indexes aren’t integer positions, they’re values from the underlying column. Also, the array might be ‘incomplete’, as in the example above which has no entry for $\textsf{['Black','Other']}$.
x.loc['Asian'] # select the sub-array of ethnicity Asian
x.loc[:, 'Other'] # select the sub-array of gender Other
x.loc[['Black','White']] # select two ethnicities, all genders
The index labels can be accessed with x.index.levels[0].values
and x.index.levels[1].values
.
To pretty-print an indexed array, use unstack. It will by default fill in any missing values with $\textsf{NaN}$
(not a number), and you can override this with $\textsf{fill_value}$.
x.unstack(fill_value=0)
gender | Female | Male | Other |
---|---|---|---|
officer_defined_ethnicity | |||
Asian | 7 | 179 | 1 |
Black | 10 | 257 | 0 |
Other | 6 | 28 | 0 |
White | 253 | 1465 | 5 |
There are two ways to convert an indexed array to a dataframe, depending on the shape of the dataframe you want to end up with.
# Convert an indexed array into a long-form dataframe
# -- the array values will become a column, and we can specify its name
x[['Black','White']].reset_index(name='count')
officer_defined_ethnicity | gender | count | |
---|---|---|---|
0 | Black | Female | 10 |
1 | Black | Male | 257 |
2 | White | Female | 253 |
3 | White | Male | 1465 |
4 | White | Other | 5 |
# Convert an indexed array into a wide-form dataframe.
# (I'm using the line-continuation character "\" so my code isn't a confusing one-liner)
x[['Black','White']].unstack(fill_value=0) \
.reset_index() \
.rename_axis(None, axis=1)
officer_defined_ethnicity | Female | Male | Other | |
---|---|---|---|---|
0 | Black | 10 | 257 | 0 |
1 | White | 253 | 1465 | 5 |
When you first start working with data, I recommend you do all your calculations on dataframes rather than indexed arrays. If you want to do calculations on an indexed array, first turn it into a dataframe. As you get deeper into working with data, you’ll discover that the skill in working with data is knowing which representation works best for your task, dataframe or indexed array. Also,
When processing data we often want to combine data at different levels of aggregation. For example, we might like to compare the frequency of false stops (i.e. where the police stopped someone and found nothing suspicious) across different ethnicities. Here’s how we prepare the first three columns … but what about the $n_\text{tot}$ column and $n/n_\text{tot}$?
df = stopsearch.loc[stopsearch.force=='cambridgeshire'].copy()
df['outcome'] = np.where(df.outcome == 'False', 'nothing', 'find')
x = df.groupby(['officer_defined_ethnicity', 'outcome']) \
.apply(len) \
.reset_index(name='n')
officer_defined_ethnicity | outcome | n | ntot | n / ntot | |
---|---|---|---|---|---|
0 | Asian | find | 116 | 192 | 60.4% |
1 | Asian | nothing | 76 | 39.6% | |
2 | Black | find | 170 | 270 | 63.0% |
3 | Black | nothing | 100 | 37.0% | |
4 | Other | find | 28 | 37 | 75.7% |
5 | Other | nothing | 9 | 24.3% | |
6 | White | find | 1060 | 1740 | 60.9% |
7 | White | nothing | 680 | 30.1% |
The database answer is to create a smaller table with two columns, $\textsf{officer_defined_ethnicity}$ and $n_\text{tot}$, and then to join this to $\textsf{x}$ using the key $\textsf{officer_defined_ethnicity}$.
y = x.groupby('officer_defined_ethnicity')['n'].apply(sum).reset_index(name='ntot')
officer_defined_ethnicity | ntot | |
---|---|---|
0 | Asian | 192 |
1 | Black | 270 |
2 | Other | 37 |
3 | White | 1740 |
z = x.merge(y, on='officer_defined_ethnicity')
p = z.n / z.ntot
z['percent_find'] = np.round(p * 100, 1)
# Also compute a margin for error; see IB Data Science for the theory
z['err'] = np.round(1.96 * np.sqrt(p*(1-p)/z.ntot) * 100, 1)
officer_defined_ethnicity | outcome | n | ntot | percent_find | err | |
---|---|---|---|---|---|---|
0 | Asian | find | 116 | 192 | 60.4 | 6.9 |
1 | Asian | nothing | 76 | 192 | 39.6 | 6.9 |
2 | Black | find | 170 | 270 | 63.0 | 5.8 |
3 | Black | nothing | 100 | 270 | 37.0 | 5.8 |
4 | Other | find | 28 | 37 | 75.7 | 13.8 |
5 | Other | nothing | 9 | 37 | 24.3 | 13.8 |
6 | White | find | 1060 | 1740 | 60.9 | 2.3 |
7 | White | nothing | 680 | 1740 | 39.1 | 2.3 |
This particular operation (groupby and apply, then merge the result back into the original table) is so common that pandas has a built-in way to do it, using df.groupby(...)[col].transform(f)
. This applies function $\textsf{f}$ to each chunk of $\textsf{col}$, and then reassembles the answer back into an array with the same index as $\textsf{df}$. We can then just put it into $\textsf{df}$ as an extra column.
x['ntot'] = x.groupby('officer_defined_ethnicity')['n'].transform(sum)
officer_defined_ethnicity | outcome | n | ntot | |
---|---|---|---|---|
0 | Asian | find | 116 | 192 |
1 | Asian | nothing | 76 | 192 |
2 | Black | find | 170 | 270 |
3 | Black | nothing | 100 | 270 |
4 | Other | find | 28 | 37 |
5 | Other | nothing | 9 | 37 |
6 | White | find | 1060 | 1740 |
7 | White | nothing | 680 | 1740 |
Pandas also lets us join indexed arrays on their common indices, and that can often streamline more advanced database-style joins; but that counts as more advanced Pandas usage than we will cover here.