Handling data with pandas

Pandas is a rich, expressive library for analyzing data, and it comes with a vast set of utilities. Once you're familiar with what pandas has to offer, you'll be able to do all sorts of data-handling tasks by stringing together the right pandas commands.

Contents

1. Preamble
2. What data looks like
3. Importing, exporting, and creating dataframes
4. Selecting and modifying data
5. Tabulations and indexed arrays
6. Merging dataframes

1. Preamble

At the top of almost every piece of data handling work, we’ll import these standard modules.
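A typical preamble looks like this (the exact list varies by task; matplotlib is only needed if you plan to plot):

import pandas as pd               # dataframes and data handling
import numpy as np                # vectorized numerical computing
import matplotlib.pyplot as plt   # plotting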

The running example for this section is a dataset of stop-and-search records, made available by the UK Home Office. As it’s a moderate-sized file (172 MB) I like to download it to disk, so it’s fast to reread each time I restart the notebook. Here’s how we can fetch a file from a URL, using the Unix command-line tool wget. (The exclamation mark in !wget in the code below is called a Jupyter magic, and it means “Treat this line as though it were executed at the command prompt”. In IB Unix Tools you’ll learn more about the Unix command line.)
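Something along these lines (the URL here is a placeholder; substitute the actual Home Office download link):

!wget "https://example.com/stop-and-search.csv" -O stopsearch.csv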

2. What data looks like

We almost always work with data in the form of a spreadsheet-like table, referred to as a dataframe. A dataframe is a collection of named columns. Each column has the same length, and all entries in a column have the same type, though different columns may have different types. Pandas uses numpy to store columns, so it’s reasonably fast.

Here’s how to load a dataframe from a file using pandas.read_csv, and how to inspect it. (This dataframe will be used as a running example in the rest of this notebook.) The Pandas library is full of handy utilities like $\textsf{read_csv}$.
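For example, assuming the file was saved as stopsearch.csv, as in the wget sketch above:

stopsearch = pd.read_csv('stopsearch.csv')
stopsearch.head()     # show the first few rows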

2.1 MISSING VALUES

Missing values (as in the third entry in the $\textsf{age_range}$ column above) are a fact of life in data science. They should really be supported by Python itself, but they aren’t, so Pandas adopts its own conventions: it uses either $\textsf{np.nan}$ (the IEEE floating-point value for ‘Not A Number’) or $\textsf{None}$ (the built-in Python value commonly used to denote ‘no return value’). It’s best to use $\textsf{np.nan}$ when the underlying column is a numpy vector of floating-point values; for other column types it doesn’t matter which is used. To determine whether values are missing, use pandas.isna.
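For example:

pd.isna(stopsearch['age_range'])         # boolean Series, True where the value is missing
pd.isna(stopsearch['age_range']).sum()   # count of missing entries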

3. Importing, exporting, and creating dataframes

In my experience, around 70% of the time you spend working with data goes on fighting to import it and clean it up. See the supplementary notebooks for a collection of recipes for web scraping, reading from a database, and parsing log files.

The simplest case is when you have a nice simple comma-separated value (CSV) file. A CSV file looks like this:

"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"

i.e. a header line, then one line per row of the dataframe, with values separated by commas. We’ve already seen how to import a CSV, using pandas.read_csv. If your file is nearly a CSV but has some quirks such as comments or a missing header row, experiment with that function’s 55 options. We can use the same function to read a CSV file from a remote URL:
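For example, with a placeholder address standing in for wherever the file lives:

iris = pd.read_csv('https://example.com/iris.csv')   # read_csv accepts a URL directly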

To write a CSV file, use the dataframe’s to_csv method:
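For any dataframe df (the filename here is illustrative):

df.to_csv('myfile.csv', index=False)   # index=False omits the row-index column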

To create a dataframe from scratch, pass in a dictionary of columns. You can optionally specify the column order you want with the columns argument.
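A small example with made-up data:

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]},
                  columns=['age', 'name'])   # columns= fixes the column order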

Or you can create a dataframe from a list of tuples. Now the columns argument is needed to say what the column names are.
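Again with made-up data:

df = pd.DataFrame([('Alice', 25), ('Bob', 30)],
                  columns=['name', 'age'])   # columns= supplies the column names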

4. Selecting and modifying data

Data frames have a triple identity: we can treat them like a dictionary of columns, like a database table, or like an array.

4.1 LIKE A DICTIONARY

We can access entire columns as though the dataframe is a dictionary.
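For example, with the running stopsearch dataframe:

stopsearch['gender']     # read an entire column, like dictionary lookup
stopsearch['n'] = 1      # create a new column (a constant here, purely for illustration)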

The columns aren't just numpy vectors, they're pandas Series objects, which support many additional real-world data-handling operations that are missing from plain numpy. For example:
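A few examples, using columns from the running dataset:

stopsearch['gender'].value_counts()         # tabulate the distinct values
stopsearch['age_range'].fillna('unknown')   # fill in missing values
stopsearch['gender'].str.lower()            # vectorized string methods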

4.2 LIKE A DATABASE TABLE

We can obtain a new dataframe by selecting a subset of rows and/or columns, using .loc.
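A sketch, using columns from the running example:

stopsearch.loc[stopsearch['gender'] == 'Female',          # boolean row selector
               ['officer_defined_ethnicity', 'outcome']]  # columns to keep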

If we want to select rows by row number, rather than by a boolean condition as above, we need .iloc.
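For example:

stopsearch.iloc[:5]    # the first five rows, selected by position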

Row and column selectors can be combined.
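For example:

stopsearch.iloc[:5, :3]    # first five rows of the first three columns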

To pull out a single row as a tuple, or to pull out a single value as a scalar, there is different syntax.
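A sketch (the row labels here are the default integers):

stopsearch.iloc[0]            # a single row, as a Series (tuple(...) converts it to a tuple)
stopsearch.loc[0, 'gender']   # a single value, as a scalar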

We can use these indexing operations to update a specific element in the dataframe—but I think it’s cleaner to modify data using dictionary indexing, replacing an entire column, rather than hacking at individual elements.
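For example (the recoding here is purely illustrative):

# element-level update: works, but fiddly
stopsearch.loc[0, 'gender'] = 'Female'

# column-level update: cleaner, replaces the whole column at once
stopsearch['gender'] = stopsearch['gender'].fillna('unknown')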

Sometimes pandas will tell us off, warning us that the operation may be inefficient.

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

This means that we've taken a subset of the rows, and then tried to modify it. Pandas doesn't actually copy the data when you take a subset of the rows; it shares the data with the original table. So, to suppress this warning, take a copy of the dataframe with df=stopsearch.copy(), and then modify the copy.
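An illustration of the pattern that triggers the warning, and the fix (the flagged column is made up for the example):

# triggers SettingWithCopyWarning: the subset shares data with stopsearch
df = stopsearch[stopsearch['gender'] == 'Female']
df['flagged'] = True

# safe: take an explicit copy, then modify that
df = stopsearch[stopsearch['gender'] == 'Female'].copy()
df['flagged'] = True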

4.3 LIKE AN ARRAY

There is a third way to select rows from a dataframe, which in my experience is the source of endless confusion: selecting by row index. When you see a Pandas dataframe printed out, there is a column at the left without a column name. These aren’t row numbers, they are row indexes, which behave like the keys in a dictionary. In all the examples we’ve seen so far the indexes happen to be numbers, but they could be any other Python object. Row indexes are there for the same reason database tables have indexes: they’re vital for efficient lookup. But we won’t be using them in this course.

Pandas always keeps track of row indexes. Even when we pull out a single column, it doesn’t give us a numpy vector; it gives us a Series object, which is numpy-vector-plus-indexes. And it always tries to match indexes. This is usually not what we want, especially when we just want some plain simple arithmetic. I recommend that you generally use .values when you are working with subsets of rows: this gives you the actual numpy vector behind the column, not the confusing Series. We won’t be taking advantage of row indexes in this course, but it’s worth knowing they exist, so you can understand the cryptic error messages you will undoubtedly come across.
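For example (a sketch; the data values are chosen to match the output below):

df = pd.DataFrame({'x': [3, 3, 4, 8, 2]}, index=['a', 'b', 'c', 'd', 'e'])
df.x[:3] + df.x[2:]                  # Series arithmetic aligns on index labels
df.x[:3].values + df.x[2:].values    # numpy arithmetic works by position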

The dataframe df:

   x
a  3
b  3
c  4
d  8
e  2

Adding the two Series aligns on index labels, so only ‘c’ (the one label common to both slices) gets a value:

a    NaN
b    NaN
c    8.0
d    NaN
e    NaN
Name: x, dtype: float64

whereas adding the two .values works element by element:

array([ 7, 11,  6])

5. Tabulations and indexed arrays

The pattern behind much data processing is split-apply-combine-join: split your data into pieces, apply a transformation to each piece, combine the pieces, and join results from different datasets together. We could code this explicitly with a ‘for’ loop, but it would involve lots of boilerplate code — and I hope you have been persuaded by the NumPy notes that ‘for’ loops are considered harmful. Instead, let’s see how to do it with Pandas.

5.1 DATAFRAME → INDEXED ARRAY

The following line of code performs a cross-tabulation: it splits the data into a separate dataframe for each combination of officer-defined ethnicity and gender, applies the len function to each subdataframe to get the number of rows it contains, and combines the results into a single indexed object.
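Something along these lines (a sketch of the line in question):

x = stopsearch.groupby(['officer_defined_ethnicity', 'gender']).apply(len)
x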

officer_defined_ethnicity  gender
Asian                      Female       7
                           Male       179
                           Other        1
Black                      Female      10
                           Male       257
Other                      Female       6
                           Male        28
White                      Female     253
                           Male      1465
                           Other        5
dtype: int64

The same pattern works with other grouping columns and other functions. Here, for instance, is the mean age within each gender group, produced along the lines of stopsearch.groupby('gender')['age'].apply(np.mean) (assuming an age column has been derived):

gender
Female    24.656682
Male      24.628913
Other     24.750000
Name: age, dtype: float64

For this course, we will only apply functions that return simple Python values. It’s possible but more complicated to apply functions that return dataframes or Pandas columns or indexed arrays.

The groupby/apply commands have produced an indexed array. An indexed array is a cross between a normal numpy array and a dataframe. We access elements and sub-arrays by dimension, like a numpy array — but the indexes aren’t integer positions, they’re values from the underlying column. Also, the array might be ‘incomplete’, as in the example above which has no entry for $\textsf{['Black','Other']}$.
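Access looks like this (using the x computed above):

x['Asian']               # the sub-array for one value of the first index
x[('Asian', 'Female')]   # a single element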

The index labels can be accessed with x.index.levels[0].values and x.index.levels[1].values. To pretty-print an indexed array, use unstack. It will by default fill in any missing values with $\textsf{NaN}$ (not a number), and you can override this with $\textsf{fill_value}$.
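For our x (fill_value=0 here matches the table below):

x.unstack(fill_value=0)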

gender                     Female  Male  Other
officer_defined_ethnicity
Asian                           7   179      1
Black                          10   257      0
Other                           6    28      0
White                         253  1465      5

5.2 INDEXED ARRAY → DATAFRAME

There are two ways to convert an indexed array to a dataframe, depending on the shape of the dataframe you want to end up with.
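Sketches of the two conversions (the column name count is illustrative; the outputs below were evidently produced from a subset of the rows):

x.reset_index(name='count')             # long form: one row per index combination
x.unstack(fill_value=0).reset_index()   # wide form: one column per gender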

  officer_defined_ethnicity  gender  count
0                     Black  Female     10
1                     Black    Male    257
2                     White  Female    253
3                     White    Male   1465
4                     White   Other      5

and, in wide form,

  officer_defined_ethnicity  Female  Male  Other
0                     Black      10   257      0
1                     White     253  1465      5

When you first start working with data, I recommend you do all your calculations on dataframes rather than indexed arrays; if you want to do calculations on an indexed array, first turn it into a dataframe. As you get deeper into working with data, you’ll discover that the real skill is knowing which representation works best for your task, dataframe or indexed array.

6. Merging dataframes

When processing data we often want to combine data at different levels of aggregation. For example, we might like to compare the frequency of false stops (i.e. where the police stopped someone and found nothing suspicious) across different ethnicities. Here’s how we prepare the first three columns … but what about the $n_\text{tot}$ column and $n/n_\text{tot}$?
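A sketch of that first step (assuming the outcome column has been recoded to the two values 'find' and 'nothing'):

x = (stopsearch
     .groupby(['officer_defined_ethnicity', 'outcome'])
     .apply(len)
     .reset_index(name='n'))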

  officer_defined_ethnicity  outcome     n  ntot  n/ntot
0                     Asian     find   116   192   60.4%
1                     Asian  nothing    76         39.6%
2                     Black     find   170   270   63.0%
3                     Black  nothing   100         37.0%
4                     Other     find    28    37   75.7%
5                     Other  nothing     9         24.3%
6                     White     find  1060  1740   60.9%
7                     White  nothing   680         39.1%

6.1 WITH DATABASE-STYLE JOINS

The database answer is to create a smaller table with two columns, $\textsf{officer_defined_ethnicity}$ and $n_\text{tot}$, and then to join this to $\textsf{x}$ using the key $\textsf{officer_defined_ethnicity}$.
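A sketch of those two steps (the names ntot and y are illustrative; the last two lines are a guess at how percent_find and err were computed, since the err values match a 95% binomial confidence half-width):

ntot = (x.groupby('officer_defined_ethnicity')['n'].sum()
         .reset_index(name='ntot'))
y = pd.merge(x, ntot, on='officer_defined_ethnicity')

p = y['n'] / y['ntot']
y['percent_find'] = 100 * p
y['err'] = 100 * 1.96 * np.sqrt(p * (1 - p) / y['ntot'])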

  officer_defined_ethnicity  ntot
0                     Asian   192
1                     Black   270
2                     Other    37
3                     White  1740

After the join:

  officer_defined_ethnicity  outcome     n  ntot  percent_find   err
0                     Asian     find   116   192          60.4   6.9
1                     Asian  nothing    76   192          39.6   6.9
2                     Black     find   170   270          63.0   5.8
3                     Black  nothing   100   270          37.0   5.8
4                     Other     find    28    37          75.7  13.8
5                     Other  nothing     9    37          24.3  13.8
6                     White     find  1060  1740          60.9   2.3
7                     White  nothing   680  1740          39.1   2.3

6.2 WITH PANDAS INDEXING

This particular operation (groupby and apply, then merge the result back into the original table) is so common that pandas has a built-in way to do it, using df.groupby(...)[col].transform(f). This applies function $\textsf{f}$ to each chunk of $\textsf{col}$, and then reassembles the answer back into an array with the same index as $\textsf{df}$. We can then just put it into $\textsf{df}$ as an extra column.
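A sketch for the running example:

x['ntot'] = x.groupby('officer_defined_ethnicity')['n'].transform('sum')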

  officer_defined_ethnicity  outcome     n  ntot
0                     Asian     find   116   192
1                     Asian  nothing    76   192
2                     Black     find   170   270
3                     Black  nothing   100   270
4                     Other     find    28    37
5                     Other  nothing     9    37
6                     White     find  1060  1740
7                     White  nothing   680  1740

Pandas also lets us join indexed arrays on their common indexes, which can often streamline more advanced database-style joins; but that counts as more advanced Pandas usage than we will cover here.