Pandas
Pandas is a rich, expressive library for analyzing datasets, and it comes with a vast set of utilities.
Once we're familiar with what pandas has to offer, we can do all sorts of data-handling tasks by stringing
together just the right pandas commands.
Pandas is also a bit of a monster. It tries to be all things to all users, so there are many ways of
completing any given task — as we see whenever we ask ChatGPT or search stats.stackexchange.
This tutorial is an opinionated guide to how I think one should use pandas.
DataFrames
Loading and saving data
We almost always work with data in the form of a spreadsheet-like table, referred to as a DataFrame.
In the wild, data is often stored in csv files (‘comma-separated values’). These are plain-text files,
and they're a robust and open way to share datasets.
To load a csv file, either from a url or a local filename, use the function
pandas.read_csv(file)
This returns a DataFrame object. A DataFrame is a collection of named columns.
All columns have the same length, and within a column all entries have the same type. Pandas
stores columns as numpy vectors.
To save a DataFrame df to a csv file,
df.to_csv(filename, index=False)
Load the csv file https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv
and look at the DataFrame.
This file consists of police stop-and-search incidents, from data.police.uk,
and limited to Cambridgeshire Constabulary records for 2021.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch
Eyeballing
Eyeballing is when we look at a dataset just to get an idea of what it contains,
without any definite questions in mind.
Here are some eyeballing functions that we can use on a DataFrame df:
len(df)        # number of rows
df.iloc[i]     # returns row number i
df.iloc[slice] # returns a range of rows
df.sample(n)   # returns n rows chosen randomly
df.columns     # column names
df.colname     # returns a single column
How many rows and columns are there in the stopsearch dataset?
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
f"{len(stopsearch)} rows and {len(stopsearch.columns)} columns"
Missing values
Missing values are a fact of life in data science.
They should really be supported by Python itself, but they aren't, so pandas adopts its own
conventions: it uses either np.nan (the IEEE floating-point value for ‘Not A Number’) or
None (the built-in Python value).
To determine whether values are missing, regardless of which convention is being used, use
pandas.isna(x)
where x is a vector, for example a column from the dataframe.
This function returns a boolean vector, one item for each item of x.
There's a very handy function for filling in missing values:
x.fillna(replacement)
x.fillna(method='ffill') # carry forwards; newer pandas spells this x.ffill()
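As a minimal sketch with a made-up Series (using x.ffill(), which in newer pandas replaces the deprecated method='ffill' keyword):

```python
import numpy as np
import pandas

x = pandas.Series([1.0, np.nan, 3.0, np.nan])
filled = x.fillna(0.0)   # replace every missing value with a constant
carried = x.ffill()      # carry the last seen value forwards
```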
For each of the columns of stopsearch, report what percentage of items are missing.
import pandas
import numpy as np
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
for colname in stopsearch.columns:
    print(f"{colname}: {np.mean(pandas.isna(stopsearch[colname]))*100:.2f}% missing")
Creating and combining DataFrames
To create a DataFrame from scratch, pass in a dictionary of column-vectors. We can optionally specify
the column order we want with the columns argument.
df = pandas.DataFrame(
{'x': [x1,x2,…], 'y': [y1,y2,…], …},
columns=['y','x',…])
Or we can create a dataframe from a list of rows, each row represented by a tuple.
In this case we need to use the columns argument to specify what the column names are.
df = pandas.DataFrame(
[(x1,y1,…), (x2,y2,…), …],
columns=['x','y',…])
To stack two or more dataframes on top of each other, matching the columns by name,
pandas.concat([df1, df2])
(Older pandas also offered df1.append(df2), which did the same thing, but it was removed in pandas 2.0; use concat.)
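A minimal sketch with made-up dataframes, showing that concat matches columns by name even when they appear in different orders:

```python
import pandas

df1 = pandas.DataFrame({'x': [1, 2], 'y': ['a', 'b']})
df2 = pandas.DataFrame({'y': ['c'], 'x': [3]})  # columns in a different order
# ignore_index=True gives the stacked frame a fresh 0,1,2,… index
stacked = pandas.concat([df1, df2], ignore_index=True)
```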
Create this dataframe:
species      | petal.length | petal.width
"setosa"     | 1.0          | 0.2
"virginica"  | 5.0          | 1.9
"virginica"  | 5.8          | 1.6
"setosa"     | 1.7          | 0.5
"versicolor" | 4.2          | 1.2
This data is from a famous dataset, the iris flower
dataset, introduced in 1936 by the pioneering statistician Ronald Fisher, from data collected by Edgar Anderson.
The discipline of data science and machine learning is propelled by the sharing of datasets.
import pandas
iris = pandas.DataFrame({
'species': ["setosa","virginica","virginica","setosa","versicolor"],
'petal.length': [1.0, 5.0, 5.8, 1.7, 4.2],
'petal.width': [0.2, 1.9, 1.6, 0.5, 1.2]})
iris
Dictionary of columns
Dataframes have a triple identity:
- they're like a dictionary, where the keys are column names and the values are entire column vectors
- they're like a database table, from which we can select subtables
- they're like arrays, and we can refer to rows by number
Dictionary of columns
df.keys()           # list of all column names
df['colname']       # returns a single column
df['colname'] = vec # create a new column
del df['colname']   # delete a column
We have previously seen another way to refer to columns, which is slightly less typing
(but slightly more ambiguous — imagine if one of our columns were called "columns"!)
df.columns # same as df.keys()
df.colname # same as df['colname']
Add a new column to stopsearch called outcomeN, an integer column, equal to 0 if
outcome=="A no further action disposal" and 1 otherwise.
import pandas
import numpy as np
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
# What are the values in the 'outcome' column?
np.unique(stopsearch.outcome)
# Create a new column outcomeN, with value 0 if outcome == "A no further action disposal"
stopsearch['outcomeN'] = np.where(stopsearch.outcome == "A no further action disposal", 0, 1)
Extracting subtables
We can obtain a new DataFrame by filtering the rows and/or selecting a subset of columns, using .loc:
df.loc[boolvec]
df[['col1','col2',…]]
df.loc[boolvec, ['col1','col2',…]]
Because of irritating rules about operator precedence, we need to wrap up logical conditions with
lots of brackets, e.g. df.loc[(df.x==y) & (~df.z)].
We can also select a subset of rows by row numbers, using .iloc:
df.iloc[slice]
df.iloc[[i1,i2,…]]
To sort a DataFrame according to the value of a column, or to pick a single arbitrary row for each level of a column,
df.sort_values('col')
df.drop_duplicates('col')
(These two functions also accept a list of column names rather than just a single column name.)
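A minimal sketch of both, on a made-up fragment of the iris data:

```python
import pandas

df = pandas.DataFrame({'species': ['setosa', 'virginica', 'setosa'],
                       'petal.length': [1.7, 5.0, 1.0]})
by_length = df.sort_values('petal.length')  # ascending by default
one_each = df.drop_duplicates('species')    # keeps the first row for each species
```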
From stopsearch extract a subtable consisting of the first 10 rows, and columns gender and outcome.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch[['gender','outcome']].iloc[:10]
Picking out rows and elements
To pull out a single specific value, row i of column colname,
df.colname.iat[i]
To pull out a single row,
r = df.iloc[i]
We can then access the values of this row using r.colname.
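A minimal sketch with a made-up dataframe:

```python
import pandas

df = pandas.DataFrame({'x': [10, 20, 30], 'y': ['a', 'b', 'c']})
v = df.x.iat[1]  # single value: row 1 of column x
r = df.iloc[2]   # single row, returned as a Series
```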
Modifying a DataFrame
We can add, replace, and remove columns by treating the dataframe as a dictionary:
df['colname'] = vec
del df['colname']
There's a shorthand for renaming columns:
dfnew = df.rename(columns={'old': 'new', …})
df.rename(columns={'old': 'new', …}, inplace=True)
We can replace items in a column:
df['colname'].loc[…] = newvals
df['colname'].iloc[…] = newvals
df['colname'].iat[i] = newval
(Recent pandas prefers the single-step form df.loc[…, 'colname'] = newvals, since chained
assignment like the above can trigger the warning described below.)
In stopsearch, the column self_defined_ethnicity has some missing values.
Replace them by "Not stated".
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
# replace missing values in self_defined_ethnicity by "Not stated"
stopsearch.loc[pandas.isna(stopsearch.self_defined_ethnicity), 'self_defined_ethnicity'] = 'Not stated'
SettingWithCopyWarning
Sometimes, when modifying a DataFrame, pandas tells us
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
This means we've taken a subset of the rows, and then tried to set or update a column. Pandas doesn't
actually copy the whole DataFrame when we extract a subtable; for efficiency, the original table and
the subtable share the same underlying data. But this sharing doesn't work if we want to modify values
in the subtable.
To avoid this warning, take a copy of the subtable and then modify the copy:
df2 = df.loc[…].copy()
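A minimal sketch with a made-up dataframe, showing that modifying the explicit copy is warning-free and leaves the original untouched:

```python
import pandas

df = pandas.DataFrame({'x': [1, 2, 3, 4]})
df2 = df.loc[df.x > 1].copy()  # explicit copy: modifying df2 won't touch df
df2['x'] = df2.x * 10          # no SettingWithCopyWarning
```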
Operations on columns
The column vectors in a DataFrame are Series objects.
Series objects are mostly just like numpy vectors. But they have many
extra real-world data-handling operations that are missing from plain numpy.
Some column operations I use a lot are
x.describe() # general summary
x.value_counts() # unique values & counts
x.sort_values() # sort
For many tasks you'll probably find that pandas has a built-in vectorized
function that can help. Please go and read the documentation around any functions mentioned in this tutorial —
many of them have options that might be useful.
Here are some examples. A single line of pandas code can do a lot, so for readability I like to split my lines.
stopsearch.location_latitude \
    .mean()

stopsearch.location_street_name \
    .str.contains('Parking')

pandas.to_datetime(stopsearch.datetime) \
    .dt.year \
    .min()
In the stopsearch dataset, the entries in self_defined_ethnicity all have the two-part form “A - B”.
Extract just the first part, using
.str.extract("(.*) -", expand=False)
and tabulate the result using value_counts.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch.self_defined_ethnicity.str.extract("(.*) -", expand=False).value_counts()
Exercise
Create a new column age, obtained from age_range according to the table below.
HINT: look up the documentation for replace
age_range | age
under 10  | 8
10-17     | 14
18-24     | 21.5
25-34     | 30
over 34   | 40
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch.age_range.unique()
r = {'under 10':8, '10-17':14, '18-24':21.5, '25-34':30, 'over 34':40}
stopsearch['age'] = stopsearch['age_range'].replace(r)
Exercise
Take the subset of rows where event_type is 0 or 1, then define a new column Δt
containing the time since the previous event.
HINT: look up isin and diff.
import pandas
df = pandas.DataFrame({'time': [1,2,5,4,6,8,3,7], 'event_type': [0,0,0,1,1,1,2,2]})
df = df.loc[df.event_type.isin([0,1])].sort_values('time')
df['Δt'] = df.time.diff()
df
Indexes and tabulation
Indexes
Pandas has an idiosyncratic approach to labelling items in a dataset. It's inspired by dictionary keys, and by database
indexes — but it generalizes these into something altogether more powerful, called an Index.
Indexes unify arrays and data tables, and they pervade nearly every piece of pandas data analysis.
Indexes for DataFrames
In fact we've already seen indexes. Every DataFrame comes with an index, a label for each row.
When we display a DataFrame, the index is shown on the left.
Evaluate the code below. How are the row labels displayed?
We can set the index when we create a DataFrame. If we don't set an index,
pandas will set it for us.
Add an extra argument to the pandas.DataFrame call, and evaluate to see the new index.
pandas.DataFrame(…, index=['a','b','c'])
We can select rows by label, using .loc:
df.loc[[label1, label2, …]]
Select the subtable consisting of rows 'c' and 'a'.
import pandas
df = pandas.DataFrame({'x': [1,2,3], 'y': ['yes','yes','no']})
df
import pandas
df = pandas.DataFrame({'x': [1,2,3], 'y': ['yes','yes','no']},
index=['a','b','c'])
df.loc[['c','a']]
Series
A Series is a numpy vector in which each item has a label. The labels are called the index.
# To create a new Series:
x = pandas.Series(vec, index=labelvec)
x.values       # the underlying numpy vector
x.index.values # the numpy vector of labels
When we get a column from a DataFrame, df.colname, what is returned is a Series
with the same index as the DataFrame.
Evaluate the code below, to see how a Series is displayed.
import pandas
df = pandas.DataFrame({'x': [3,2,4,8,2]}, index=['a','b','c','d','e'])
df.x
Automatic alignment
Pandas always tries its hardest to match indexes, whenever we combine two objects.
- Binary maths operations on Series are applied per label, not per position in the vector.
- When we add a new Series column to a DataFrame, the items are added per label, not per position in the column.
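A minimal sketch with made-up Series, showing that addition is matched by label even though the two indexes are in different orders:

```python
import pandas

a = pandas.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pandas.Series([30, 20, 10], index=['z', 'y', 'x'])
total = a + b  # added per label, not per position
```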
The code below looks like it should compute the differences between successive elements of x.
Run it, and see what it actually produces. Why doesn't it compute differences?
Replace the last line with df.x.diff(), which will actually do what we want.
import pandas
df = pandas.DataFrame({'x': [3,2,4,8,2]}, index=['a','b','c','d','e'])
df.x[1:] - df.x[:-1]
import pandas
df = pandas.DataFrame({'x': [3,2,4,8,2]}, index=['a','b','c','d','e'])
df.x.diff()
Tabulation
Tabulation is achieved by (1) splitting the dataset into groups, (2) counting the number of items in each group.
df.groupby(['col1','col2',…]).apply(len)
Tabulate the stopsearch dataset, by officer_defined_ethnicity and by gender.
Table = Series with multidimensional index
Pandas’ big clever idea is that a table can be thought of as a Series (i.e. a vector) with a multidimensional index
(one dimension per grouping variable).
This makes it basically a cross between an array and a dataframe:
- Like an array, we can access elements and subarrays by saying which indexes we want. But unlike an array
the indexes aren't integer positions: instead they're values of the grouping variables.
- Like a dataframe, and unlike an array, the multidimensional Series might be ‘ragged’ — it might not
have entries for every combination of its dimensions.
Let x be the result of your tabulation. Evaluate these:
x.loc['Asian']
x.loc[:, 'Female']
x.loc[['Black','White']]
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
x = stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len)
print("Full table:", x, sep="\n")
x.loc['Asian'] # the 1d subtable of ethnicity = Asian, all genders
x.loc[:, 'Female'] # the 1d subtable of all ethnicities, gender = Female
x.loc[['Black','White']] # the 2d subtable with two ethnicities, all genders
Exercise
In the code below, why does x have fewer items than y?
Sort x, largest count to smallest.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
x = stopsearch.groupby(['age_range']).apply(len)
y = stopsearch['age_range'].unique()
len(x), len(y) # why are these not equal?
# x doesn't include missing values, y does
# We can use groupby(..., dropna=False) to keep missing values.
# But the version of pandas in this tutorial is old, and doesn't support this :-(
x.sort_values(ascending=False)
Table → DataFrame
To convert a table x into a DataFrame,
x.reset_index(name='valname')
- There will be one row for every combination of grouping variables that's present in the dataset
- There will be one column per grouping variable, plus an extra column for the values in the table; we can specify
what name to use for this last column
In effect, reset_index takes the multidimensional index, turns it into columns,
puts these columns into a DataFrame along with the values, and then resets the DataFrame's index
to be just plain numbers. That's why it's called reset_index.
There's a sister command set_index, which takes columns of a DataFrame and converts them into the index.
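A minimal sketch of the round trip, on a made-up dataframe:

```python
import pandas

df = pandas.DataFrame({'k': ['a', 'b'], 'v': [1, 2]})
s = df.set_index('k')  # column k becomes the index
df2 = s.reset_index()  # and back to a plain column again
```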
Convert the table below into a DataFrame.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len)
stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len) \
    .reset_index(name='num_stops')
Table → array
To convert a table x into an array,
x.unstack(fill_value=val)
But what actually is an array? In pandas, an array is just a DataFrame.
The unstack command takes the final grouping variable, and pivots it to become column headings.
Evaluate the code below, to see the resulting DataFrame.
To get the underlying numpy array,
x.unstack().values
Pandas has a confusing notion that groups of columns can have names. I've never found this useful,
so I always delete the column-group name that unstack creates:
x.unstack().rename_axis(columns=None)
If the table was ‘ragged’, in other words if there were some missing combinations of grouping
variables, then unstack will by default fill in with NaN. We can override this by specifying fill_value.
Apply the following operations to the unstacked table. Explain what they do.
unstacked_table \
    .rename_axis(columns=None) \
    .reset_index()
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len) \
    .unstack(fill_value=0)
stopsearch.groupby(['officer_defined_ethnicity','gender']).apply(len) \
    .unstack(fill_value=0) \
    .rename_axis(columns=None) \
    .reset_index()
# unstack: convert table to array-form
# rename_axis: get rid of the useless column-group heading
# reset_index: turn officer_defined_ethnicity (the index) into a column
Which to use?
In Pandas there are always lots of ways to solve a given task, and
it can be quite bewildering to know when to use DataFrames and when to use tables.
When you start working with data, I recommend you do all your calculations on simple DataFrames,
and don't make any clever use of indexes.
If you create a table, turn it into a DataFrame straight away, either long-form (with reset_index)
or wide-form (with unstack followed by rename_axis and reset_index).
As you get deeper into working with data, you'll work out the situations where a clunky
groupby and .loc can be streamlined by using indexes.
Split-apply-merge
Transforming data
The pattern behind much data processing is split-apply-merge: split the DataFrame into pieces
according to specified grouping variables, apply a transformation to each piece, and merge the
results together. We could code this explicitly with ‘for’ loops, but it would involve lots and
lots of boilerplate code.
Pandas has several tools to streamline things:
- aggregate, to compute a summary for each piece; it returns a table with one item for
each combination of the grouping variables
- transform, to apply a transformation to each piece; it returns a table
with the same rows as the original DataFrame
- merge, which lets us combine DataFrames at different levels of aggregation
Aggregate
df.groupby(['groupcol',…]).apply(func)
df.groupby(['groupcol',…])['valcol'].apply(func)
These two operations split the DataFrame, apply an aggregation function to each piece, and
join the results together into a table.
- In the first version, the entire DataFrame df is split into sub-DataFrames, and func is applied to each piece.
- In the second version, it's the column df.valcol that's split into sub-Series, and func is applied to each piece.
- In both cases, func should return a simple Python value, not a pandas Series or DataFrame.
We've already seen the first version — it's how tabulation works, using
the aggregation function len to get the number of rows in each sub-DataFrame.
Find the mean age, separately for each gender.
We can find the mode (most frequent value) of x with x.mode().values[0].
Find the mode of age_range for each combination of officer_defined_ethnicity and gender.
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
r = {'under 10':8, '10-17':14, '18-24':21.5, '25-34':30, 'over 34':40}
stopsearch['age'] = stopsearch['age_range'].replace(r)
# Find the mean age for each gender
# Find the mode of age_range for each officer_defined_ethnicity and gender
stopsearch.groupby('gender')['age'].apply(lambda x: x.mean())
stopsearch.groupby(['officer_defined_ethnicity','gender'])['age_range'] \
    .apply(lambda x: x.mode().values[0])
Exercise
Given the dataframe df, produce an array that has rows for A and columns for B,
and shows the values of X.
HINT: what function might you apply?
import pandas
df = pandas.DataFrame({'A': [0,0,0,1,1,1], 'B': [0,1,2,0,1,2], 'X': range(6)})
# Produce an array with rows for A and columns for B, showing values of X
df.groupby(['A','B'])['X'].apply(lambda x: x.values[0]).unstack()
# Or apply(lambda x: x.iloc[0])
# Or apply(lambda x: x.iat[0])
# Or apply(sum)
# The point is, we want to apply *some* function, and the function
# has to return a simple Python value.
Transform
df.groupby(['groupcol',…])['valcol'] \
    .transform(func)
This splits the column df.valcol into pieces, applies func to each piece, and joins the results
together into a column with the same index as df.
This means we can simply insert it into df as an extra column.
Here's how transform can be used to normalize values within groups:
Evaluate the code below. It tabulates the number of stops in each ethnic group, split by outcome.
Define a new column tot_stops, to be the total number of stops within each ethnic group.
df['tot_stops'] = df \
    .groupby('officer_defined_ethnicity') \
    ['num_stops'] \
    .transform(sum)
Define a new column percent_stops to be num_stops/tot_stops*100.
Even though we're using the function sum, which returns a single value per ethnic group,
transform has to return something with the same index as df, so it duplicates the sum
across both levels of outcome2.
import numpy as np
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
# Create a column outcome2 that simplifies outcomes into two values, nothing / find.
stopsearch['outcome2'] = np.where(stopsearch.outcome == "A no further action disposal", 'nothing', 'find')
# Count the number of outcomes of each type, per officer_defined_ethnicity
df = stopsearch.groupby(['officer_defined_ethnicity', 'outcome2']).apply(len).reset_index(name='num_stops')
df
df['tot_stops'] = df.groupby('officer_defined_ethnicity')['num_stops'].transform(sum)
df['percent_stops'] = df.num_stops / df.tot_stops * 100
Exercise
Add a new column Δt that gives, within each event_type, the time since the previous event.
import pandas
df = pandas.DataFrame({'time': [1,6,4,2,8,5,3,7], 'event_type': [0,1,1,0,1,0,2,2]})
# Within each event_type, let Δt be the time since the previous event
df['Δt'] = df.sort_values('time').groupby('event_type')['time'].transform(lambda x: x.diff())
df
# NOTE: the new column we're adding comes with its own index.
# Pandas matches the original DataFrame's index against the new column's index,
# so each Δt lands in the right row.
Apply
The apply function can be used for other types of split-apply-join operations, not only aggregation.
df.groupby(['groupcol',…]).apply(func)
df.groupby(['groupcol',…])['valcol'].apply(func)
- If func returns a DataFrame, then the result will be all these dataframes stacked on top of each other.
- If func returns a Series, then the result will be all these series stacked on top of each other — we can use this to return multiple values.
- Otherwise, the result is the simple aggregation that we've seen already.
Pick out the latest row for each event_type.
import pandas
df = pandas.DataFrame({'time': [1,6,4,2,8,5,3,7], 'event_type': [0,1,1,0,1,0,2,2]},
index=['a','b','c','d','e','f','g','h'])
# version 1: func returns a DataFrame, retaining the original DataFrame's index
df.sort_values('time').groupby('event_type').apply(lambda df: df.iloc[[-1]])
# version 2: func returns a single row of a DataFrame, losing the index
df.sort_values('time').groupby('event_type').apply(lambda df: df.iloc[-1])
# version 3: it's a common task, so there's a built-in
df.sort_values('time').groupby('event_type').tail(1)
Merge
It's easy to combine results at the same level of aggregation, i.e. when they all share the same index.
Pandas will automatically align them based on their indexes.
# stack the columns of two dataframes
pandas.concat([df1, df2], axis=1)
# add a new column
df['colname'] = vec
What if we want to combine two DataFrames at different levels of aggregation?
To combine two DataFrames based on common column values rather than indexes,
or to combine two DataFrames where the indexes have different dimensions,
we need a database-style merge:
df1.merge(df2, on='col')
df1.merge(df2, left_on='col1', right_on='col2')
For each officer_defined_ethnicity, what fraction of stop-and-search incidents led to the police
finding something?
Create a table with the number of stops in which outcome != "A no further action disposal",
indexed by ethnicity. Convert it to a DataFrame using reset_index.
Create another table with the total number of stops, indexed by ethnicity, and convert it to a
DataFrame also. Merge the two DataFrames on the officer_defined_ethnicity column. Hence compute
the fraction that we're after.
Previously we did this same analysis using transform. In pandas, there are always many ways to
answer a question!
import pandas
url = 'https://www.cl.cam.ac.uk/teaching/current/DataSci/data/stopsearch_cam2021.csv'
stopsearch = pandas.read_csv(url)
# In what fraction of cases did the police find something, per officer_defined_ethnicity?
find = stopsearch.outcome != "A no further action disposal"
df1 = stopsearch.loc[find].groupby('officer_defined_ethnicity').apply(len).reset_index(name='find_stops')
df2 = stopsearch.groupby('officer_defined_ethnicity').apply(len).reset_index(name='tot_stops')
df = df1.merge(df2, on='officer_defined_ethnicity')
df['fr_find'] = df.find_stops / df.tot_stops
df
Merge how?
When we merge two dataframes A and B on a column x, we have a choice about what we want to end up with:
A.merge(B, on=mergecols, how='left')
A.merge(B, on=mergecols, how='inner') # the default
- The first version will keep all the x-values in A,
and if some x-values are missing in B it will fill in with missing values.
- The second version will keep only those x-values that appear in both A and B
- In both versions, a single row in A might be expanded into several rows, one for each row of B with a matching x value.
Be paranoid
It's easy to confuse ourselves with cavalier merging!
A common pattern is to let A be the dataframe we're working with, and B be extra information with which we want
to label the rows of A. Therefore,
- Merge left, since we don't want the data we're working with to silently disappear
- Double-check that len(A) is equal to the length of the merged dataframe, since we're only
expecting one B label per row of A
# No demo
# This tutorial is using an old version of pandas, which doesn't support 'how'
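As a sketch with made-up dataframes (assuming a current pandas installation, where merge's how argument is available):

```python
import pandas

A = pandas.DataFrame({'x': ['a', 'b', 'c'], 'n': [1, 2, 3]})
B = pandas.DataFrame({'x': ['a', 'b'], 'label': ['yes', 'no']})
left = A.merge(B, on='x', how='left')    # keeps every row of A; 'c' gets a NaN label
inner = A.merge(B, on='x', how='inner')  # keeps only x-values present in both
```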
Next steps
Real-world data
Well done!
You've finished this tutorial about pandas.
The next step is to use all the tools you've learnt to investigate real-world datasets.
You'll look at some simple datasets in Tick 3.
The watchword is fluency. Pandas lets us achieve a great deal by stringing together
a few commands. The more fluent we are in pandas-style, the more effectively we can investigate
datasets.
I build up my code incrementally: I apply one command, then I look at the output, then I string
on the next command, then look at the output, and so on. It's much easier to string together commands
when we can see the shape of the dataset we're working with.
A word of warning
Real-world datasets are full of mischief. You eyeball a dataset and assume it has one row per
country, and then you discover that some countries have multiple rows. You assume a column is
numeric, and then you discover that some entries are strings.
My data science coding is defensive, full of sanity checks and assert statements.
import pandas
df = pandas.DataFrame({'word': ['done','tutorial!','completing','for','Well','this'],
'seq': [1, 5, 3, 2, 0, 4]})
df.sort_values('seq')['word'].str.cat(sep=' ')