Pandas
Pandas is a rich, expressive library for analyzing datasets, and it comes with a vast set of utilities.
Once we're familiar with what pandas has to offer, we can do all sorts of data-handling tasks by stringing
together just the right pandas commands.
Pandas is also a bit of a monster. It tries to be all things to all users, so there are many ways of
completing any given task — as we see whenever we ask ChatGPT or search stats.stackexchange.
This tutorial is an opinionated guide to how I think one should use pandas.
DataFrames
Loading and saving data
We almost always work with data in the form of a spreadsheet-like table, referred to as a DataFrame.
In the wild, data is often stored in csv files (‘comma-separated values’). These are plain-text files,
and they're a robust and open way to share datasets.
To load a csv file, either from a url or a local filename, use the function
pandas.read_csv(file)
This returns a DataFrame object. A DataFrame is a collection of named columns.
All columns have the same length, and within a column all entries have the same type. Pandas
stores columns as numpy vectors.
To save a DataFrame df to a csv file,
df.to_csv(filename, index=False)
Load the csv file https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv
and look at the DataFrame.
This file consists of police stop-and-search incidents, from data.police.uk,
and limited to Cambridgeshire Constabulary records for 2021.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch
Eyeballing
Eyeballing is when we look at a dataset just to get an idea of what it contains,
without any definite questions in mind.
Here are some eyeballing functions that we can use on a DataFrame df:
len(df)        # number of rows
df.iloc[i]     # returns row number i
df.iloc[slice] # returns a range of rows
df.sample(n)   # returns n rows chosen randomly
df.columns     # column names
df.colname     # returns a single column
How many rows and columns are there in the stopsearch dataset?
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
f"{len(stopsearch)} rows and {len(stopsearch.columns)} columns"
Missing values
Missing values are a fact of life in data science.
They should really be supported by Python itself, but they aren't, so pandas adopts its own
conventions: it uses either np.nan (the IEEE floating-point value for ‘Not A Number’) or
None (the built-in Python value).
To determine whether values are missing, regardless of which convention is being used, use
pandas.isna(x)
where x is a vector, for example a column from the dataframe.
This function returns a boolean vector, one item for each item of x.
There's a very handy function for filling in missing values:
x.fillna(replacement)
x.fillna(method='ffill') # carry forwards; newer pandas spells this x.ffill()
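As a minimal sketch with a made-up Series (using x.ffill(), which in newer pandas replaces the deprecated method='ffill' keyword):

```python
import numpy as np
import pandas

x = pandas.Series([1.0, np.nan, 3.0, np.nan])
filled = x.fillna(0.0)   # replace every missing value with a constant
carried = x.ffill()      # carry the last seen value forwards
```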
For each of the columns of stopsearch, report what percentage of items are missing.
import pandas
import numpy as np
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
for colname in stopsearch.columns:
    print(f"{colname}: {np.mean(pandas.isna(stopsearch[colname]))*100:.2f}% missing")
Creating and combining DataFrames
To create a DataFrame from scratch, pass in a dictionary of column-vectors. We can optionally specify
the column order we want with the columns argument.
df = pandas.DataFrame(
{'x': [x1,x2,…], 'y': [y1,y2,…], …},
columns=['y','x',…])
Or we can create a dataframe from a list of rows, each row represented by a tuple.
In this case we need to use the columns argument to specify what the column names are.
df = pandas.DataFrame(
[(x1,y1,…), (x2,y2,…), …],
columns=['x','y',…])
To stack two or more dataframes on top of each other, matching the columns by name,
pandas.concat([df1, df2])
(Older pandas also offered df1.append(df2), which did the same thing, but it was removed in pandas 2.0; use concat.)
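A minimal sketch with made-up dataframes, showing that concat matches columns by name even when they appear in different orders:

```python
import pandas

df1 = pandas.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
df2 = pandas.DataFrame({'y': ['c'], 'x': [3]})  # columns in a different order
# ignore_index=True gives the stacked frame a fresh 0,1,2,… index
stacked = pandas.concat([df1, df2], ignore_index=True)
```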
Create this dataframe:
species      | petal.length | petal.width
"setosa"     | 1.0          | 0.2
"virginica"  | 5.0          | 1.9
"virginica"  | 5.8          | 1.6
"setosa"     | 1.7          | 0.5
"versicolor" | 4.2          | 1.2
This data is from a famous dataset, the iris flower
dataset, introduced in 1936 by the pioneering statistician Ronald Fisher, from data collected by Edgar Anderson.
The discipline of data science and machine learning is propelled by the sharing of datasets.
import pandas
iris = pandas.DataFrame({
'species': ["setosa","virginica","virginica","setosa","versicolor"],
'petal.length': [1.0, 5.0, 5.8, 1.7, 4.2],
'petal.width': [0.2, 1.9, 1.6, 0.5, 1.2]})
iris
Dictionary of columns
Dataframes have a triple identity:
- they're like a dictionary, where the keys are column names and the values are entire column vectors
- they're like a database table, from which we can select subtables
- they're like arrays, and we can refer to rows by number
Dictionary of columns
df.keys()           # list of all column names
df['colname']       # returns a single column
df['colname'] = vec # create a new column
del df['colname']   # delete a column
We have previously seen another way to refer to columns, which is slightly less typing
(but slightly more ambiguous — imagine if one of our columns were called "columns"!)
df.columns # same as df.keys()
df.colname # same as df['colname']
Add a new column to stopsearch called outcomeN, an integer column, equal to 0 if
outcome=="A no further action disposal" and 1 otherwise.
import pandas
import numpy as np
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
# What are the values in the 'outcome' column?
np.unique(stopsearch.outcome)
# Create a new column outcomeN, with value 0 if outcome == "A no further action disposal"
stopsearch['outcomeN'] = np.where(stopsearch.outcome == "A no further action disposal", 0, 1)
Extracting subtables
We can obtain a new DataFrame by filtering the rows and/or selecting a subset of columns, using .loc:
df.loc[boolvec]
df[['col1','col2',…]]
df.loc[boolvec, ['col1','col2',…]]
Because of irritating rules about operator precedence, we need to wrap up logical conditions with
lots of brackets, e.g. df.loc[(df.x==y) & (~df.z)].
We can also select a subset of rows by row numbers, using .iloc:
df.iloc[slice]
df.iloc[[i1,i2,…]]
To sort a DataFrame according to the value of a column, or to pick a single arbitrary row for each level of a column,
df.sort_values('col')
df.drop_duplicates('col')
(These two functions also accept a list of column names rather than just a single column name.)
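A minimal sketch of both, on a made-up fragment of the iris data:

```python
import pandas

df = pandas.DataFrame({'species': ['setosa', 'virginica', 'setosa'],
                       'petal.length': [1.7, 5.0, 1.0]})
by_length = df.sort_values('petal.length')  # ascending by default
one_each = df.drop_duplicates('species')    # keeps the first row for each species
```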
From stopsearch extract a subtable consisting of the first 10 rows, and columns gender and outcome.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch[['gender','outcome']].iloc[:10]
Picking out rows and elements
To pull out a single specific value, row i of column colname,
df.colname.iat[i]
To pull out a single row,
r = df.iloc[i]
We can then access the values of this row using r.colname.
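A minimal sketch with a made-up dataframe:

```python
import pandas

df = pandas.DataFrame({'x': [10, 20, 30], 'y': ['a', 'b', 'c']})
v = df.x.iat[1]  # single value: row 1 of column x
r = df.iloc[2]   # single row, returned as a Series
```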
Modifying a DataFrame
We can add, replace, and remove columns by treating the dataframe as a dictionary:
df['colname'] = vec
del df['colname']
There's a shorthand for renaming columns:
dfnew = df.rename(columns={'old': 'new', …})
df.rename(columns={'old': 'new', …}, inplace=True)
We can replace items in a column:
df['colname'].loc[…] = newvals
df['colname'].iloc[…] = newvals
df['colname'].iat[i] = newval
(Recent pandas prefers the single-step form df.loc[…, 'colname'] = newvals, since chained
assignment like the above can trigger the warning described below.)
In stopsearch, the column self_defined_ethnicity has some missing values.
Replace them by "Not stated".
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
# replace missing values in self_defined_ethnicity by "Not stated"
stopsearch.loc[pandas.isna(stopsearch.self_defined_ethnicity), 'self_defined_ethnicity'] = 'Not stated'
SettingWithCopyWarning
Sometimes, when modifying a DataFrame, pandas tells us
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
This means we've taken a subset of the rows, and then tried to set or update a column. Pandas doesn't
actually copy the whole DataFrame when we extract a subtable; for efficiency, the original table and
the subtable share the same underlying data. But this sharing doesn't work if we want to modify values
in the subtable.
To avoid this warning, take a copy of the subtable and then modify the copy:
df2 = df.loc[…].copy()
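A minimal sketch with a made-up dataframe, showing that modifying the explicit copy is warning-free and leaves the original untouched:

```python
import pandas

df = pandas.DataFrame({'x': [1, 2, 3, 4]})
df2 = df.loc[df.x > 1].copy()  # explicit copy: modifying df2 won't touch df
df2['x'] = df2.x * 10          # no SettingWithCopyWarning
```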
Operations on columns
The column vectors in a DataFrame are Series objects.
Series objects are mostly just like numpy vectors. But they have many
extra real-world data-handling operations that are missing from plain numpy.
Some column operations I use a lot are
x.describe() # general summary
x.value_counts() # unique values & counts
x.sort_values() # sort
For many tasks you'll probably find that pandas has a built-in vectorized
function that can help. Please go and read the documentation around any functions mentioned in this tutorial —
many of them have options that might be useful.
Here are some examples. A single line of pandas code can do a lot, so for readability I like to split my lines.
stopsearch.location_latitude \
    .mean()

stopsearch.location_street_name \
    .str.contains('Parking')

pandas.to_datetime(stopsearch.datetime) \
    .dt.year \
    .min()
In the stopsearch dataset, the entries in self_defined_ethnicity all have the two-part form “A - B”.
Extract just the first part, using
.str.extract("(.*) -", expand=False)
and tabulate the result using value_counts.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch.self_defined_ethnicity.str.extract("(.*) -", expand=False).value_counts()
Exercise
Create a new column age, obtained from age_range according to the table below.
HINT: look up the documentation for replace
age_range | age
under 10  | 8
10-17     | 14
18-24     | 21.5
25-34     | 30
over 34   | 40
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch.age_range.unique()
r = {'under 10':8, '10-17':14, '18-24':21.5, '25-34':30, 'over 34':40}
stopsearch['age'] = stopsearch['age_range'].replace(r)
Exercise
Take the subset of rows where event_type is 0 or 1, then define a new column Δt
containing the time since the previous event.
HINT: look up isin and diff.
import pandas
df = pandas.DataFrame({'time': [1,2,5,4,6,8,3,7], 'event_type': [0,0,0,1,1,1,2,2]})
df = df.loc[df.event_type.isin([0,1])].sort_values('time')
df['Δt'] = df.time.diff()
df
Indexes and tabulation
Indexes
Pandas has an idiosyncratic approach to labelling items in a dataset. It's inspired by dictionary keys, and by database
indexes — but it generalizes these into something altogether more powerful, called an Index.
Indexes unify arrays and data tables, and they pervade nearly every piece of pandas data analysis.
Indexes for DataFrames
In fact we've already seen indexes. Every DataFrame comes with an index, a label for each row.
When we display a DataFrame, the index is shown on the left.
Evaluate the code below. How are the row labels displayed?
We can set the index when we create a DataFrame. If we don't set an index,
pandas will set it for us.
Add an extra argument to the pandas.DataFrame call, and evaluate to see the new index.
pandas.DataFrame(…, index=['a','b','c'])
We can select rows by label, using .loc:
df.loc[[label1, label2, …]]
Select the subtable consisting of rows 'c' and 'a'.
import pandas
df = pandas.DataFrame({'x': [1,2,3], 'y': ['yes','yes','no']})
df
import pandas
df = pandas.DataFrame({'x': [1,2,3], 'y': ['yes','yes','no']},
index=['a','b','c'])
df.loc[['c','a']]
Series
A Series is a numpy vector in which each item has a label. The labels are called the index.
# To create a new Series:
x = pandas.Series(vec, index=labelvec)
x.values       # the underlying numpy vector
x.index.values # the numpy vector of labels
When we get a column from a DataFrame, df.colname, what is returned is a Series
with the same index as the DataFrame.
Evaluate the code below, to see how a Series is displayed.
import pandas
df = pandas.DataFrame({'x': [3,2,4,8,2]}, index=['a','b','c','d','e'])
df.x
Automatic alignment
Pandas always tries its hardest to match indexes, whenever we combine two objects.
- Binary maths operations on Series are applied per label, not per position in the vector.
- When we add a new Series column to a DataFrame, the items are added per label, not per position in the column.
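A minimal sketch with made-up Series, showing that addition is matched by label even though the two indexes are in different orders:

```python
import pandas

a = pandas.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pandas.Series([30, 20, 10], index=['z', 'y', 'x'])
total = a + b  # added per label, not per position
```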
The code below looks like it should compute the differences between successive elements of x.
Run it, and see what it actually produces. Why doesn't it compute differences?
Replace the last line with df.x.diff(), which will actually do what we want.
import pandas
df = pandas.DataFrame({'x': [3,2,4,8,2]}, index=['a','b','c','d','e'])
df.x[1:] - df.x[:-1]
import pandas
df = pandas.DataFrame({'x': [3,2,4,8,2]}, index=['a','b','c','d','e'])
df.x.diff()
Tabulation
Tabulation is achieved by (1) splitting the dataset into groups, (2) counting the number of items in each group.
df.groupby(['col1','col2',…]).apply(len)
Tabulate the stopsearch dataset, by officer_defined_ethnicity and by gender.
Table = Series with multidimensional index
Pandas’ big clever idea is that a table can be thought of as a Series (i.e. a vector) with a multidimensional index
(one dimension per grouping variable).
This makes it basically a cross between an array and a dataframe:
- Like an array, we can access elements and subarrays by saying which indexes we want. But unlike an array
the indexes aren't integer positions: instead they're values of the grouping variables.
- Like a dataframe, and unlike an array, the multidimensional Series might be ‘ragged’ — it might not
have entries for every combination of its dimensions.
Let x be the result of your tabulation. Evaluate these:
x.loc['Asian']
x.loc[:, 'Female']
x.loc[['Black','White']]
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
x = stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len)
print("Full table:", x, sep="\n")
x.loc['Asian'] # the 1d subtable of ethnicity = Asian, all genders
x.loc[:, 'Female'] # the 1d subtable of all ethnicities, gender = Female
x.loc[['Black','White']] # the 2d subtable with two ethnicities, all genders
Exercise
In the code below, why does x have fewer items than y?
Sort x, largest count to smallest.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
x = stopsearch.groupby(['age_range']).apply(len)
y = stopsearch['age_range'].unique()
len(x), len(y) # why are these not equal?
# x doesn't include missing values, y does
# We can use groupby(..., dropna=False) to keep missing values.
# But the version of pandas in this tutorial is old, and doesn't support this :-(
x.sort_values(ascending=False)
Table → DataFrame
To convert a table x into a DataFrame,
x.reset_index(name='valname')
- There will be one row for every combination of grouping variables that's present in the dataset
- There will be one column per grouping variable, plus an extra column for the values in the table; we can specify
what name to use for this last column
In effect, reset_index takes the multidimensional index, turns it into columns,
puts these columns into a DataFrame along with the values, and then resets the DataFrame's index
to be just plain numbers. That's why it's called reset_index.
There's a sister command set_index, which takes columns of a DataFrame and converts them into the index.
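A minimal sketch of the round trip, on a made-up dataframe:

```python
import pandas

df = pandas.DataFrame({'k': ['a', 'b'], 'v': [1, 2]})
s = df.set_index('k')  # column k becomes the index
df2 = s.reset_index()  # and back to a plain column again
```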
Convert the table below into a DataFrame.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len)
stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len) \
    .reset_index(name='num_stops')
Table → array
To convert a table x into an array,
x.unstack(fill_value=val)
But what actually is an array? In pandas, an array is just a DataFrame.
The unstack command takes the final grouping variable, and pivots it to become column headings.
Evaluate the code below, to see the resulting DataFrame.
To get the underlying numpy array,
x.unstack().values
Pandas has a confusing notion that groups of columns can have names. I've never found this useful,
so I always delete the column-group name that unstack creates:
x.unstack().rename_axis(columns=None)
If the table was ‘ragged’, in other words if there were some missing combinations of grouping
variables, then unstack will by default fill in with NaN. We can override this by specifying fill_value.
Apply the following operations to the unstacked table. Explain what they do.
unstacked_table \
    .rename_axis(columns=None) \
    .reset_index()
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len) \
    .unstack(fill_value=0)
stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len) \
    .unstack(fill_value=0) \
    .rename_axis(columns=None) \
    .reset_index()
# unstack: convert table to array-form
# rename_axis: get rid of the useless column-group heading
# reset_index: turn officer_defined_ethnicity (the index) into a column
Which to use?
In Pandas there are always lots of ways to solve a given task, and
it can be quite bewildering to know when to use DataFrames and when to use tables.
When you start working with data, I recommend you do all your calculations on simple DataFrames,
and don't make any clever use of indexes.
If you create a table, turn it into a DataFrame straight away, either long-form (with reset_index)
or wide-form (with unstack followed by rename_axis and reset_index).
As you get deeper into working with data, you'll work out the situations where a clunky
groupby and .loc can be streamlined by using indexes.
Split-apply-merge
Transforming data
The pattern behind much data processing is split-apply-merge: split the DataFrame into pieces
according to specified grouping variables, apply a transformation to each piece, and merge the
results together. We could code this explicitly with ‘for’ loops, but it would involve lots and
lots of boilerplate code.
Pandas has several tools to streamline things:
- aggregate, to compute a summary for each piece; it returns a table with one item for
each combination of the grouping variables
- transform, to apply a transformation to each piece; it returns a table
with the same rows as the original DataFrame
- merge, which lets us combine DataFrames at different levels of aggregation
Aggregate
df.groupby(['groupcol',…]).apply(func)
df.groupby(['groupcol',…])['valcol'].apply(func)
These two operations split the DataFrame, apply an aggregation function to each piece, and
join the results together into a table.
- In the first version, the entire DataFrame df is split into sub-DataFrames, and func is applied to each piece.
- In the second version, it's the column df.valcol that's split into sub-Series, and func is applied to each piece.
- In both cases, func should return a simple Python value, not a pandas Series or DataFrame.
We've already seen the first version — it's how tabulation works, using
the aggregation function len to get the number of rows in each sub-DataFrame.
Find the mean age, separately for each gender.
We can find the mode (most frequent value) of x with x.mode().values[0].
Find the mode of age_range for each combination of officer_defined_ethnicity and gender.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
r = {'under 10':8, '10-17':14, '18-24':21.5, '25-34':30, 'over 34':40}
stopsearch['age'] = stopsearch['age_range'].replace(r)
# Find the mean age for each gender
# Find the mode of age_range for each officer_defined_ethnicity and gender
stopsearch.groupby('gender')['age'].apply(lambda x: x.mean())
stopsearch.groupby(['officer_defined_ethnicity','gender'])['age_range'] \
    .apply(lambda x: x.mode().values[0])
Exercise
Given the dataframe df, produce an array that has rows for A and columns for B,
and shows the values of X.
HINT: what function might you apply?
import pandas
df = pandas.DataFrame({'A': [0,0,0,1,1,1], 'B': [0,1,2,0,1,2], 'X': range(6)})
# Produce an array with rows for A and columns for B, showing values of X
df.groupby(['A','B'])['X'].apply(lambda x: x.values[0]).unstack()
# Or apply(lambda x: x.iloc[0])
# Or apply(lambda x: x.iat[0])
# Or apply(sum)
# The point is, we want to apply *some* function, and the function
# has to return a simple Python value.
Transform
df.groupby(['groupcol',…])['valcol'] \
    .transform(func)
This splits the column df.valcol into pieces, applies func to each piece, and joins the results
together into a column with the same index as df.
This means we can simply insert it into df as an extra column.
Here's how transform can be used to normalize values within groups:
Evaluate the code below. It tabulates the number of stops in each ethnic group, split by outcome.
Define a new column tot_stops, to be the total number of stops within each ethnic group.
df['tot_stops'] = df \
    .groupby('officer_defined_ethnicity') \
    ['num_stops'] \
    .transform(sum)
Define a new column percent_stops to be num_stops/tot_stops*100.
Even though we're using the function sum, which returns a single value per ethnic group,
transform has to return something with the same index as df, so it duplicates the sum
across both levels of outcome2.
import numpy as np
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
# Create a column outcome2 that simplifies outcomes into two values, nothing / find.
stopsearch['outcome2'] = np.where(stopsearch.outcome == "A no further action disposal", 'nothing', 'find')
# Count the number of outcomes of each type, per officer_defined_ethnicity
df = stopsearch.groupby(['officer_defined_ethnicity', 'outcome2']).apply(len).reset_index(name='num_stops')
df
df['tot_stops'] = df.groupby('officer_defined_ethnicity')['num_stops'].transform(sum)
df['percent_stops'] = df.num_stops / df.tot_stops * 100
Exercise
Add a new column Δt that gives, within each event_type, the time since the previous event.
import pandas
df = pandas.DataFrame({'time': [1,6,4,2,8,5,3,7], 'event_type': [0,1,1,0,1,0,2,2]})
# Within each event_type, let Δt be the time since the previous event
df['Δt'] = df.sort_values('time').groupby('event_type')['time'].transform(lambda x: x.diff())
df
# NOTE: the new column we're adding comes with its own index.
# Pandas matches the original DataFrame's index against the new column's index,
# so each Δt lands in the right row.
Apply
The apply function can be used for other types of split-apply-join operations, not only aggregation.
df.groupby(['groupcol',…]).apply(func)
df.groupby(['groupcol',…])['valcol'].apply(func)
- If func returns a DataFrame, then the result will be all these dataframes stacked on top of each other.
- If func returns a Series, then the result will be all these series stacked on top of each other — we can use this to return multiple values.
- Otherwise, the result is the simple aggregation that we've seen already.
Pick out the latest row for each event_type.
import pandas
df = pandas.DataFrame({'time': [1,6,4,2,8,5,3,7], 'event_type': [0,1,1,0,1,0,2,2]},
index=['a','b','c','d','e','f','g','h'])
# version 1: func returns a DataFrame, retaining the original DataFrame's index
df.sort_values('time').groupby('event_type').apply(lambda df: df.iloc[[-1]])
# version 2: func returns a single row of a DataFrame, losing the index
df.sort_values('time').groupby('event_type').apply(lambda df: df.iloc[-1])
# version 3: it's a common task, so there's a built-in
df.sort_values('time').groupby('event_type').tail(1)
Merge
It's easy to combine results at the same level of aggregation, i.e. when they all share the same index.
Pandas will automatically align them based on their indexes.
# stack the columns of two dataframes
pandas.concat([df1, df2], axis=1)
# add a new column
df['colname'] = vec
What if we want to combine two DataFrames at different levels of aggregation?
To combine two DataFrames based on common column values rather than indexes,
or to combine two DataFrames where the indexes have different dimensions,
we need a database-style merge:
df1.merge(df2, on='col')
df1.merge(df2, left_on='col1', right_on='col2')
For each officer_defined_ethnicity, what fraction of stop-and-search incidents led to the police
finding something?
Create a table with the number of stops in which outcome != "A no further action disposal",
indexed by ethnicity. Convert it to a DataFrame using reset_index.
Create another table with the total number of stops, indexed by ethnicity, and convert it to a
DataFrame also. Merge the two DataFrames on the officer_defined_ethnicity column. Hence compute
the fraction that we're after.
Previously we did this same analysis using transform. In pandas, there are always many ways to
answer a question!
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
# In what fraction of cases did the police find something, per officer_defined_ethnicity?
find = stopsearch.outcome != "A no further action disposal"
df1 = stopsearch.loc[find].groupby('officer_defined_ethnicity').apply(len).reset_index(name='find_stops')
df2 = stopsearch.groupby('officer_defined_ethnicity').apply(len).reset_index(name='tot_stops')
df = df1.merge(df2, on='officer_defined_ethnicity')
df['fr_find'] = df.find_stops / df.tot_stops
df
Merge how?
When we merge two dataframes A and B on a column x, we have a choice about what we want to end up with:
A.merge(B, on=mergecols, how='left')
A.merge(B, on=mergecols, how='inner') # the default
- The first version will keep all the x-values in A,
and if some x-values are missing in B it will fill in with missing values.
- The second version will keep only those x-values that appear in both A and B
- In both versions, a single row in A might be expanded into several rows, one for each row of B with a matching x value.
Be paranoid
It's easy to confuse ourselves with cavalier merging!
A common pattern is to let A be the dataframe we're working with, and B be extra information with which we want
to label the rows of A. Therefore,
- Merge left, since we don't want the data we're working with to silently disappear
- Double-check that len(A) is equal to the length of the merged dataframe, since we're only
expecting one B label per row of A
# No demo
# This tutorial is using an old version of pandas, which doesn't support 'how'
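As a sketch with made-up dataframes (assuming a current pandas installation, where merge's how argument is available):

```python
import pandas

A = pandas.DataFrame({'x': ['a', 'b', 'c'], 'n': [1, 2, 3]})
B = pandas.DataFrame({'x': ['a', 'b'], 'label': ['yes', 'no']})
left = A.merge(B, on='x', how='left')    # keeps every row of A; 'c' gets a NaN label
inner = A.merge(B, on='x', how='inner')  # keeps only x-values present in both
```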
Next steps
Real-world data
Well done!
You've finished this tutorial about pandas.
The next step is to use all the tools you've learnt to investigate real-world datasets.
You'll look at some simple datasets in Tick 3.
The watchword is fluency. Pandas lets us achieve a great deal by stringing together
a few commands. The more fluent we are in pandas-style, the more effectively we can investigate
datasets.
I build up my code incrementally: I apply one command, then I look at the output, then I string
on the next command, then look at the output, and so on. It's much easier to string together commands
when we can see the shape of the dataset we're working with.
A word of warning
Real-world datasets are full of mischief. You eyeball a dataset and assume it has one row per
country, and then you discover that some countries have multiple rows. You assume a column is
numeric, and then you discover that some entries are strings.
My data science coding is defensive, full of sanity checks and assert statements.
import pandas
df = pandas.DataFrame({'word': ['done','tutorial!','completing','for','Well','this'],
'seq': [1, 5, 3, 2, 0, 4]})
df.sort_values('seq')['word'].str.cat(sep=' ')