Recipes for data import and cleanup

In my experience, around 75% of the time you spend working with data will be fighting to import it and clean it up. For the most part this is just general-purpose programming, but there are a few library routines that will save you from reinventing the wheel.

Treat this section as a collection of recipes and pointers to useful library routines. If you find yourself needing them, you should read the recipe, try it out, then look online for more information about the library functions it suggests.

Reading from a csv file

If our data is a simple comma-separated value (CSV) file, it's very easy to import with pandas.read_csv. We can specify either a filename or a url.
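
For example (the filename and url here are placeholders for your own data source):

import pandas as pd

# placeholder paths -- substitute your own file or url
df = pd.read_csv('mydata.csv')
df = pd.read_csv('https://example.com/mydata.csv')
df.head()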

If our file is nearly a CSV but has some quirks such as comments or a missing header row, there are plenty of options in pandas.read_csv or pandas.read_table. For extreme quirks we may need to use the raw Python csv.reader.
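
Here's a sketch of a few of the options I reach for most often (the option values are purely illustrative):

import pandas as pd

df = pd.read_csv('mydata.csv',
                 comment='#',               # skip comment lines
                 header=None,               # the file has no header row ...
                 names=['date', 'value'],   # ... so supply column names ourselves
                 skiprows=2,                # ignore the first two lines
                 na_values=['?'])           # treat '?' as missing data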

Reading from a string

If we have a string that we want to treat as a file, we can use io.StringIO:
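
import io
import pandas as pd

# a small made-up csv-formatted string, just for illustration
s = '''date,value
2017-01-01,3
2017-01-02,5'''

df = pd.read_csv(io.StringIO(s))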

Reading from an http request

If we want to read from a url but need more control over the http request, for example to send a POST request, modify the request headers, or read the response headers, we can use the requests library to fetch the data as a string.
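
For example (the url, payload, and headers below are made up; the general shape is what matters):

import io
import pandas as pd
import requests

resp = requests.post('https://example.com/api/data',          # made-up url
                     data={'format': 'csv'},                   # made-up payload
                     headers={'User-Agent': 'my-data-script'})
print(resp.status_code)
print(resp.headers.get('Content-Type'))     # inspect the response headers
df = pd.read_csv(io.StringIO(resp.text))    # then treat the body as a file, as above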

Parsing log files with regular expressions

A typical line from a web server log might look like this

207.46.13.169 - - [27/Aug/2017:06:52:11 +0000] "GET /marcus/essay/st&h2.html HTTP/1.1" 200 3881 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

where (according to the Apache web server documentation) the pieces are the client's IP address, the remote logname, the remote user, the timestamp of the request, the request line, the HTTP status code, the size of the response in bytes, the referer, and the user agent string.

To extract these pieces from a single line of the log file, the best tool is regular expressions, a mini-language for string matching that is common across many programming languages. The syntax is terse and takes a lot of practice. I like to start with a small string pattern and incrementally build it up, testing as I go.
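
Here's a sketch of that incremental style, applied to the log line above (the group names are my own choices):

import re

# the log line from above (user agent truncated for brevity)
line = '207.46.13.169 - - [27/Aug/2017:06:52:11 +0000] "GET /marcus/essay/st&h2.html HTTP/1.1" 200 3881 "-" "Mozilla/5.0 ..."'

# start small: just the IP address at the beginning of the line
m = re.match(r'(?P<ip>\S+)', line)
print(m.groupdict())

# ... then build up field by field, testing as we go
pattern = (r'(?P<ip>\S+) (?P<logname>\S+) (?P<user>\S+) '
           r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
           r'(?P<status>\d+) (?P<size>\S+)')
m = re.match(pattern, line)
print(m.groupdict())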

How do we extract the fields from a full log file? The vanilla Python code is

import re

# `pattern` is the regular expression built up above; `myfile` is the path to the log file
with open(myfile) as f:
    for line in f:
        m = re.match(pattern, line)
        # store the fields from m.groups() or m.groupdict() somewhere appropriate

Alternatively, numpy has a handy shortcut for reading in an entire file and splitting it via a regular expression:
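
I believe the function in question is numpy.fromregex; here's a sketch (the filename, field names, and dtypes are my own choices):

import numpy as np
import pandas as pd

# each group in the pattern becomes a field of a structured array
pattern = r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d+) (\S+)'
dtype = [('ip', 'U15'), ('time', 'U26'), ('request', 'U200'),
         ('status', int), ('size', 'U10')]
records = np.fromregex('access.log', pattern, dtype)
df = pd.DataFrame(records)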

Reading json from a web data service

More and more forward-thinking companies and government services make data available by simple web requests. Here is an example: stop-and-search data from data.police.uk.

The first step is to import the Python module for making web requests. When I'm developing data code I like to build it up in small steps, which means lots of repeated requests, so I also like to use another Python module which caches responses. This means I don't hammer the service unnecessarily.
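
The text above doesn't name the caching module; requests_cache is one module that works this way, so I'll assume it here:

import requests
import requests_cache

# cache responses in a local sqlite file, so repeated requests during
# development don't hit the service again
requests_cache.install_cache('police_api_cache')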

The API documentation tells us the URL for fetching a list of available data. Let's try it.
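
A sketch, assuming the availability endpoint is the crimes-street-dates url described in the API documentation:

import requests

response = requests.get('https://data.police.uk/api/crimes-street-dates')
response.status_code    # 200 means the request succeeded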

Now let's look at the body of the response. It's likely to be very long, so we'll only print out the first 300 characters.
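
# `response` is the object returned by requests.get above
print(response.text[:300])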

It looks like JSON, "JavaScript Object Notation", a common format for web data services.

It's easy to convert it into Python dictionaries and lists, with requests.get(...).json(). Now we can explore what it contains. (Alternatively, just read the web service documentation, if we trust it!)
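
For example, continuing with the response fetched above (I'm assuming the availability endpoint returns a json list, one entry per month):

available = response.json()
print(type(available), len(available))   # a list of dictionaries, one per month
print(available[0])                      # peek at the first entry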

Pandas has a pretty flexible command for converting nested JSON into nice sane dataframes, pandas.json_normalize.
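
Applied to the availability list we just fetched, something like:

import pandas as pd

availability = pd.json_normalize(available)   # columns depend on the json structure
availability.head()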

The police API documentation tells us how to request records for a given police force and month: using a url of the form

https://data.police.uk/api/stops-force?force=avon-and-somerset&date=2017-01

Let's fetch them all. For each item we'll fetch the data, turn it into a dataframe (again using pandas.json_normalize), and then we'll bind everything together.
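
Here's a sketch. The list of force/month pairs is illustrative; in practice we'd build it from the availability data above.

import requests
import pandas as pd

wanted = [('avon-and-somerset', '2017-01'), ('avon-and-somerset', '2017-02')]

frames = []
for force, month in wanted:
    r = requests.get('https://data.police.uk/api/stops-force',
                     params={'force': force, 'date': month})
    df = pd.json_normalize(r.json())
    df['force'], df['month'] = force, month   # remember where each row came from
    frames.append(df)

stopsearch = pd.concat(frames, ignore_index=True)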

Scraping a website with xpath

There are fascinating stories to be discovered from public data, and sometimes you have to work to scrape it from web pages. Here's an account by a BBC data journalist. We'll work with a very simple example: extracting results of the Oxford / Cambridge boat race from the Wikipedia table.

I recommend using XPath queries from the lxml module. XPath is a powerful mini-language for extracting data from hierarchical documents, with wide support across many programming languages — think of it as regular expressions but for html rather than plain text. If you want to scrape websites then it's worth finding a tutorial and learning XPath. For this course, we'll just see how to use XPath in Python.

The first step is to install lxml, which is not included with Python.

!pip install lxml

Now we'll fetch the web page and parse the contents. Most web pages are badly-formatted html (sections not properly closed, etc.), and lxml.html.fromstring makes a reasonable attempt to make sense of it.
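
For example (I'm assuming the page is the 'List of The Boat Race results' Wikipedia article; substitute whichever page you're scraping):

import requests
import lxml.html

r = requests.get('https://en.wikipedia.org/wiki/List_of_The_Boat_Race_results')
doc = lxml.html.fromstring(r.content)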

This gives us doc, the root <html> element, which we can inspect.

We want to pull out a particular element from the document, namely the table with boat race results, so we need to work out how to refer to it in XPath. The Chrome web browser has a handy tool to help with this.

1. Go to the page you're interested in, and click on … | More tools | Developer Tools.
2. Click on the element-selector button at the top left of the developer tools panel.
3. Go back to the web page, and click on a piece close to what you want to select. I clicked on the top-left cell of the table.
4. Go back to the developer tools window, and navigate to the exact element you want. Here, we want the table.
5. Right-click on the element and choose Copy | Copy XPath.

It gave me the XPath location "//*[@id="mw-content-text"]/div[1]/table[2]". Now we can extract the data.
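
For example (handing the element back to pandas via read_html is just one convenient way to turn it into a dataframe, and the table index in the XPath may change as the Wikipedia page is edited):

import io
import lxml.etree
import pandas as pd

# doc.xpath returns a list of matching elements; take the first match
table = doc.xpath('//*[@id="mw-content-text"]/div[1]/table[2]')[0]

# serialise the element back to html and let pandas parse it into a dataframe
results = pd.read_html(io.StringIO(lxml.etree.tostring(table, encoding='unicode')))[0]
results.head()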

You should consider the ethics of your web scraping. Here are some thoughts from Sophie Chou at the MIT Media Lab, and from the data journalist Naël Shiab.

Querying an SQL database

Once your data is in an SQL database, access is easy. Here's an example with PostgreSQL, which speaks its own dialect of SQL.

Databases are usually secured, and you need various credentials to log in. It's good practice to store our code in a repository, but bad practice to store credentials there too. Instead, we can store credentials in a separate file that isn't checked in to the repository. I like to store credentials in a JSON file, something like this:

{"user": "SPQR", "password": "TOPSECRET", "host": "***", "dbname": "***"}

which is easy to load into Python as a dictionary. Then I can use the fields of this dictionary as arguments to psycopg2.connect, to establish the connection.
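
A sketch, assuming the credentials file is called secret_creds.json:

import json
import psycopg2

# load the credentials from a file that is NOT checked in to the repository
with open('secret_creds.json') as f:
    creds = json.load(f)

conn = psycopg2.connect(**creds)   # user, password, host, dbname become keyword arguments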

We can run arbitrary SQL queries, and retrieve the results as a pandas dataframe. Pass in parameters with the %(name)s quoting mechanism, to keep ourselves safe from SQL injection attacks.
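
A sketch (the table and column names are made up):

import pandas as pd

sql = 'SELECT * FROM stopsearch WHERE force = %(force)s AND event_date >= %(start)s'
df = pd.read_sql(sql, conn,
                 params={'force': 'avon-and-somerset', 'start': '2017-01-01'})
df.head()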