In my experience, around 75% of the time you spend working with data will be fighting to import it and clean it up. For the most part this is just general-purpose programming, but there are a few library routines that will save you from reinventing the wheel.
Treat this section as a collection of recipes and pointers to useful library routines. If you find yourself needing them, you should read the recipe, try it out, then look online for more information about the library functions it suggests.
import numpy as np
import pandas
When our data is a simple comma-separated value (CSV) file, it's easy to import with pandas.read_csv. We can specify either a filename or a URL.
df = pandas.read_csv('data/iris.csv')
If our file is nearly a CSV but has some quirks, such as comments or a missing header row, there are plenty of options in pandas.read_csv or pandas.read_table. For extreme quirks we may need to use the raw Python csv.reader.
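For example, here is a hedged sketch of the kind of options that cope with comment lines and a missing header row (the filename and column names are hypothetical):
# Hypothetical file: '#' comment lines, no header row, occasional 'NA' entries.
df = pandas.read_csv('data/quirky.csv',
                     comment='#',            # ignore lines starting with '#'
                     header=None,            # the file has no header row...
                     names=['x', 'y', 'z'],  # ...so supply column names ourselves
                     skipinitialspace=True,  # tolerate spaces after the commas
                     na_values=['NA', '-'])  # strings to treat as missing values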
If we have a string that we want to treat as a file, we can use io.StringIO:
import io
data = '''
x,y
10,3
2,5
'''
df = pandas.read_csv(io.StringIO(data))
If we want to read from a URL but we want more control over the HTTP request, for example sending a POST request, modifying the request headers, or reading the response headers, we can use the requests library to fetch the data as a string.
import requests
import io
my_url = "https://www.cl.cam.ac.uk/teaching/current/DataSci/data/iris.csv"
data = requests.get(my_url).content.decode('utf8')
df = pandas.read_csv(io.StringIO(data))
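For instance, here is a hedged sketch of sending a custom request header and inspecting the response headers (the header value is just illustrative, and the POST URL is hypothetical):
# Send a custom User-Agent header and look at the response metadata.
resp = requests.get(my_url, headers={'User-Agent': 'my-data-script/0.1'})
print(resp.status_code)                  # e.g. 200
print(resp.headers.get('Content-Type'))  # a response header, e.g. 'text/csv'
# For services that expect a POST request, requests.post has the same shape:
# resp = requests.post('https://example.com/api/query', data={'q': 'iris'})
df = pandas.read_csv(io.StringIO(resp.content.decode('utf8')))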
A typical line from a web server log might look like this:
207.46.13.169 - - [27/Aug/2017:06:52:11 +0000] "GET /marcus/essay/st&h2.html HTTP/1.1" 200 3881 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
where (according to the Apache web server documentation) the pieces are
207.46.13.169
    The IP address that made the request
-
    The identity of the client; - means not available
-
    The userid of the logged-in user who made the request; - means not available
[27/Aug/2017:06:52:11 +0000]
    The time the request was received
"GET /marcus/essay/st&h2.html HTTP/1.1"
    The type of request, what was requested, and the connection type
200
    The HTTP status code (200 means OK)
3881
    The size of the object returned to the client, in bytes
-
    The referrer URL; - means not available
"Mozilla/5.0 (...)"
    The browser type. The substring bingbot here tells us that the request comes from Microsoft Bing's web crawler.
To extract these pieces from a single line of the log file, the best tool is regular expressions, a mini-language for string matching that is common across many programming languages. The syntax is terse and takes a lot of practice. I like to start with a small string pattern and incrementally build it up, testing as I go.
import re # standard Python module for regular expressions
s = """
207.46.13.169 - - [27/Aug/2017:06:52:11 +0000] "GET /marcus/essay/st&h2.html HTTP/1.1"
200 3881 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X)
AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53
(compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
"""
# First attempt: match the first two items in the log line.
# If my pattern is right, re.match returns an object.
# If my pattern is wrong, re.match returns None.
pattern_test = r'\s*(\S+)\s*(\S+)'
re.match(pattern_test, s)
<re.Match object; span=(0, 16), match='\n207.46.13.169 -'>
# This is the full pattern I built up to. Python lets us add verbose comments
# to the pattern, which is handy for remembering what your code does when you look
# at it the next morning.
pattern = r'''(?x) # flag saying that this pattern has comments
\s* # any whitespace at the start of the string
(?P<ip>\S+) # one or more non-space characters: the IP address
\s+ # one or more spaces
(?P<client>\S+) # the client identity
\s+
(?P<user>\S+) # the userid
\s+
\[(?P<t>[^\]]*)\] # [, then any number of not-] characters, then ]: the timestamp
\s+
"(?P<req>[^"]*)" # ", then any number of not-" characters, then ": the request string
\s+
(?P<status>\d+) # one or more numerical digits: the http status code
\s+
(?P<size>\d+) # one or more numerical digits: the size
\s+
"(?P<ref>[^"]*)" # the referrer URL
\s+
"(?P<ua>[^"]*)" # the user agent i.e. browser type
'''
m = re.match(pattern, s)
m.groupdict() # returns a dictionary of all the named sub-patterns
{'ip': '207.46.13.169', 'client': '-', 'user': '-', 't': '27/Aug/2017:06:52:11 +0000', 'req': 'GET /marcus/essay/st&h2.html HTTP/1.1', 'status': '200', 'size': '3881', 'ref': '-', 'ua': 'Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X)\nAppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53\n(compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)'}
How do we extract the fields from a full log file? The vanilla Python code is
with open(myfile) as f:
    for line in f:
        m = re.match(pattern, line)
        # store the fields from m.groups() or m.groupdict() somewhere appropriate
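One minimal way to store the fields, assuming we are happy to skip any lines that don't match the pattern, is to collect each match's groupdict into a list and build a dataframe from it:
records = []
with open('data/webaccess_short.log') as f:
    for line in f:
        m = re.match(pattern, line)
        if m is not None:                 # skip lines the pattern doesn't match
            records.append(m.groupdict())
df = pandas.DataFrame(records)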
Alternatively, numpy has a handy shortcut for reading in an entire file and splitting it via a regular expression:
# Split the file into an array, one row per line, one column per field in the pattern
t = [('ip',str,30), ('client',str,10), ('userid',str,10), ('timestamp',str,80),
     ('req',str,300), ('status',int), ('size',int),
     ('ref',str,300), ('uagent',str,300)]
df = np.fromregex('data/webaccess_short.log', pattern, dtype=t)
# Make a dictionary out of the columns, according to the named fields in the pattern
df = pandas.DataFrame({v: df[n] for (v,_),n in zip(re.compile(pattern).groupindex.items(), df.dtype.names)})
df.sample(4)
| | ip | client | user | t | req | status | size | ref | ua |
|---|---|---|---|---|---|---|---|---|---|
| 362 | 207.46.13.85 | - | - | 27/Aug/2017:08:26:01 +0000 | GET /damon/recipe/breadbutterpud2 HTTP/1.1 | 200 | 1635 | - | Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Ma... |
| 224 | 216.244.66.250 | - | - | 27/Aug/2017:08:02:49 +0000 | GET /robots.txt HTTP/1.1 | 200 | 359 | - | Mozilla/5.0 (compatible; DotBot/1.1; http://ww... |
| 142 | 86.183.91.158 | - | - | 27/Aug/2017:07:50:23 +0000 | GET /favicon.ico HTTP/1.1 | 404 | 503 | - | MobileSafari/602.1 CFNetwork/811.5.4 Darwin/16... |
| 243 | 73.246.74.104 | - | - | 27/Aug/2017:08:08:02 +0000 | GET /irene/cross/12-jesus-dies-on-the-cross.jp... | 200 | 77292 | https://www.bing.com/ | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... |
More and more forward-thinking companies and government services make data available by simple web requests. Here is an example: stop-and-search data from data.police.uk.
The first step is to import the Python module for making web requests. When I'm developing data code I like to build it up in small steps, which means lots of repeated requests, so I also like to use another Python module which caches responses. This means I don't hammer the service unnecessarily.
import requests, requests_cache
requests_cache.install_cache('data/stopsearch')
The API documentation tells us the URL for fetching a list of available data. Let's try it.
# Make the request. This should print out <Response [200]>, meaning successfully retrieved
AVAILABILITY_URL = 'https://data.police.uk/api/crimes-street-dates'
stations_resp = requests.get(AVAILABILITY_URL)
stations_resp
<Response [200]>
Now let's look at the body of the response. It's likely to be very long, so we'll only print out the first 300 characters.
stations_resp.text[:300]
'[{"date":"2023-09","stop-and-search":["avon-and-somerset","bedfordshire","btp","cambridgeshire","city-of-london","cleveland","cumbria","derbyshire","dorset","durham","essex","gloucestershire","hampshire","hertfordshire","kent","lancashire","leicestershire","lincolnshire","merseyside","norfolk","nort'
It looks like JSON, "JavaScript Object Notation", a common format for web data services. It's easy to convert it into Python dictionaries and lists with requests.get(...).json(). Now we can explore what it contains. (Alternatively, just read the web service documentation, if we trust it!)
x = requests.get(AVAILABILITY_URL).json()
print(type(x)) # x = [...]
print(type(x[0])) # x = [{...}, ...]
print(x[0].keys()) # x = [{date, stop-and-search}, ...]
print(x[0]) # x = [{date, stop-and-search:[area,area,...]}, ...]
<class 'list'>
<class 'dict'>
dict_keys(['date', 'stop-and-search'])
{'date': '2023-09', 'stop-and-search': ['avon-and-somerset', 'bedfordshire', 'btp', 'cambridgeshire', 'city-of-london', 'cleveland', 'cumbria', 'derbyshire', 'dorset', 'durham', 'essex', 'gloucestershire', 'hampshire', 'hertfordshire', 'kent', 'lancashire', 'leicestershire', 'lincolnshire', 'merseyside', 'norfolk', 'north-wales', 'north-yorkshire', 'northamptonshire', 'northumbria', 'south-yorkshire', 'staffordshire', 'suffolk', 'surrey', 'sussex', 'thames-valley', 'west-mercia', 'west-midlands', 'west-yorkshire']}
Pandas has a pretty flexible command for converting nested JSON into nice sane dataframes, pandas.json_normalize.
availability = pandas.json_normalize(x, 'stop-and-search', 'date').rename(columns={0:'force'})
availability
| | force | date |
|---|---|---|
| 0 | avon-and-somerset | 2023-09 |
| 1 | bedfordshire | 2023-09 |
| 2 | btp | 2023-09 |
| 3 | cambridgeshire | 2023-09 |
| 4 | city-of-london | 2023-09 |
| ... | ... | ... |
| 1411 | warwickshire | 2020-10 |
| 1412 | west-mercia | 2020-10 |
| 1413 | west-midlands | 2020-10 |
| 1414 | west-yorkshire | 2020-10 |
| 1415 | wiltshire | 2020-10 |
1416 rows × 2 columns
The police API documentation tells us how to request records for a given police force and month, using a URL of the form
https://data.police.uk/api/stops-force?force=avon-and-somerset&date=2017-01
Let's fetch them all. For each item we'll fetch the data, turn it into a dataframe (again using pandas.json_normalize), and then we'll bind everything together.
import urllib
BASE_URL = 'https://data.police.uk/api/'
STOPSDATA_URL = urllib.parse.urljoin(BASE_URL, 'stops-force')
availability['url'] = [STOPSDATA_URL + '?' + urllib.parse.urlencode({'force':f, 'date':d})
                       for f,d in zip(availability.force, availability.date)]

def get_dataframe(q):
    url,date,force = (q.url, q.date, q.force)
    print(url + " ...", end="\r")
    response = requests.get(url)
    x = response.json()
    df = pandas.json_normalize(x, sep='_')
    df.insert(0, 'month', date)
    df.insert(0, 'force', force)
    return df
# only fetch a little, for illustration purposes
df = [get_dataframe(r) for i,r in availability.iloc[:10].iterrows()]
stopsearch = pandas.concat(df, axis=0, ignore_index=True, sort=False)
https://data.police.uk/api/stops-force?force=durham&date=2023-09 ...
stopsearch.sample(3)
| | force | month | age_range | outcome | involved_person | self_defined_ethnicity | gender | legislation | outcome_linked_to_object_of_search | datetime | ... | officer_defined_ethnicity | type | operation_name | object_of_search | outcome_object_id | outcome_object_name | location_latitude | location_street_id | location_street_name | location_longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 720 | bedfordshire | 2023-09 | 18-24 | A no further action disposal | True | White - English/Welsh/Scottish/Northern Irish/... | Male | Misuse of Drugs Act 1971 (section 23) | None | 2023-09-15T22:03:42+00:00 | ... | White | Person search | None | Controlled drugs | bu-no-further-action | A no further action disposal | NaN | NaN | NaN | NaN |
| 332 | avon-and-somerset | 2023-09 | over 34 | | True | Other ethnic group - Not stated | Female | Police and Criminal Evidence Act 1984 (section 1) | None | 2023-09-25T00:00:00+00:00 | ... | White | Person search | None | Article for use in theft | | | 51.456677 | 2278864.0 | On or near Nightclub | -2.592162 |
| 3959 | durham | 2023-09 | 25-34 | A no further action disposal | True | White - English/Welsh/Scottish/Northern Irish/... | Male | Misuse of Drugs Act 1971 (section 23) | False | 2023-09-03T02:00:00+00:00 | ... | White | Person search | None | Controlled drugs | bu-no-further-action | A no further action disposal | 54.727183 | 2166006.0 | On or near Parking Area | -1.252132 |
3 rows × 23 columns
There are fascinating stories to be discovered from public data, and sometimes you have to work to scrape it from web pages. Here's an account by a BBC data journalist. We'll work with a very simple example: extracting results of the Oxford / Cambridge boat race from the Wikipedia table.
I recommend using XPath queries from the lxml module.
XPath is a powerful mini-language for extracting data from hierarchical documents, with wide support across many programming languages — think of it as regular expressions but for html rather than plain text. If you want to scrape websites then it's worth finding a tutorial and learning XPath. For this course, we'll just see how to use XPath in Python.
The first step is to install lxml, which is not included with Python.
!pip install lxml
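To get a feel for XPath before tackling the real page, here is a small sketch run against a made-up HTML snippet (both the snippet and the queries are purely illustrative):
import lxml.html

snippet = '''
<html><body>
  <table>
    <tr><td><a href="/a">first</a></td></tr>
    <tr><td><a href="/b">second</a></td></tr>
  </table>
</body></html>
'''
node = lxml.html.fromstring(snippet)
print(node.xpath('//a/text()'))        # text of every <a>: ['first', 'second']
print(node.xpath('//a/@href'))         # href attribute of every <a>: ['/a', '/b']
print(len(node.xpath('//table//tr')))  # number of <tr> rows inside the table: 2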
Now we'll fetch the web page and parse the contents. Most web pages are badly-formatted html (sections not properly closed, etc.), and lxml.html.fromstring makes a reasonable attempt to make sense of it.
import requests
import lxml.html
BOATRACE_URL = 'https://en.wikipedia.org/wiki/List_of_The_Boat_Race_results'
resp = requests.get(BOATRACE_URL)
doc = lxml.html.fromstring(resp.content)
This gives us doc, the root <html> element, which we can inspect.
print(doc.tag) # the type of element
print(len(doc)) # the number of children
print([n.tag for n in doc]) # tags of its children, <head> and <body>
print(doc.attrib) # get the attributes, e.g. <html class="client-nojs" lang="en" dir="ltr">
print(doc.text, doc.tail) # any text directly inside this element, and any text just after it
html
2
['head', 'body']
{'class': 'client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-typography-survey-disabled vector-toc-available', 'lang': 'en', 'dir': 'ltr'}
None
We want to pull out a particular element from the document, namely the table with boat race results, so we need to work out how to refer to it in XPath. The Chrome web browser has a handy tool to help with this. Go to the page you're interested in, click on … | More tools | Developer Tools, then click on the element-selector button at the top left and pick out the table on the page to see where it sits in the html.
# Pick out the table element.
# (XPath queries return lists, but I only want one item, hence the [0].)
table = doc.xpath('//*[@id="mw-content-text"]/div[1]/table[2]')[0]
# Get a list of all rows i.e. <tr> elements inside the table.
# Print one, to check things look OK.
rows = table.xpath('.//tr')
print(lxml.etree.tostring(rows[5], encoding='unicode'))
# Extract the timestamp and winner columns.
# The timestamp is in the second child, in a <span> element with a "data-sort-value" attribute.
# The winner is in the third child.
df = {'t': [row[1].xpath('.//span[@data-sort-value]')[0].text for row in rows[1:]],
      'winner': [row[2].text for row in rows[1:]]}
df = pandas.DataFrame(df)
df.iloc[:5]
<tr> <td><a href="/wiki/The_Boat_Race_1841" title="The Boat Race 1841">5</a></td> <td><span data-sort-value="000000001841-04-14-0000" style="white-space:nowrap">14 April 1841</span> ‡</td> <td style="background:#B7E1E4; color:#000;">Cambridge</td> <td>32:03</td> <td>22 lengths</td> <td>1</td> <td>4 </td></tr>
| | t | winner |
|---|---|---|
| 0 | 10 June 1829 | Oxford |
| 1 | 17 June 1836 | Cambridge |
| 2 | 3 April 1839 | Cambridge |
| 3 | 15 April 1840 | Cambridge |
| 4 | 14 April 1841 | Cambridge |
You should consider the ethics of your web scraping. Here are some thoughts from Sophie Chou at the MIT Media Lab, and from the data journalist Naël Shiab.
Once your data is in an SQL database, access is easy. Here's an example with PostgreSQL, which has its own dialect of SQL.
Databases are usually secured, and you need various credentials to log in. It's good practice to store our code in a repository, but bad practice to store credentials there too. Instead, we can store credentials in a separate file that isn't checked in to the repository. I like to store credentials in a JSON file, something like this:
{"user": "SPQR", "password": "TOPSECRET", "host": "***", "dbname": "***"}
which is easy to load into Python as a dictionary. Then I can use the fields of this dictionary as arguments to psycopg2.connect, to establish the connection.
import psycopg2 # module for connecting to a Postgresql database
import json # standard module for reading json files
creds = json.load(open('res/secret_creds.json'))
conn = psycopg2.connect(**creds)
We can run arbitrary SQL queries, and retrieve the results as a pandas dataframe. Pass in parameters with the %(name)s quoting mechanism, to keep ourselves safe from SQL injection attacks.
cmd = '''
SELECT *
FROM flood_stations AS s JOIN flood_measures AS m ON (m.station_uri = s.uri)
WHERE river = %(river)s OR town = %(town)s
LIMIT 3
'''
pandas.read_sql(cmd, conn, params={'river': 'River Cam', 'town': 'Cambridge'})
| | index | uri | label | id | catchment | river | town | lat | lng | index | measure_id | uri | station_uri | qualifier | parameter | period | unit | valuetype | low | high |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 345 | http://environment.data.gov.uk/flood-monitorin... | Great Chesterford | E21778 | Cam and Ely Ouse (Including South Level) | River Cam | Great Chesterford | 52.061730 | 0.194279 | 397 | 398 | http://environment.data.gov.uk/flood-monitorin... | http://environment.data.gov.uk/flood-monitorin... | Stage | Water Level | 900.0 | m | instantaneous | 0.109 | 0.333 |
| 1 | 800 | http://environment.data.gov.uk/flood-monitorin... | Weston Bampfylde | 52113 | Parrett, Brue and West Somerset Streams | River Cam | Weston Bampfylde | 51.023159 | -2.565568 | 918 | 919 | http://environment.data.gov.uk/flood-monitorin... | http://environment.data.gov.uk/flood-monitorin... | Stage | Water Level | 900.0 | m | instantaneous | 0.026 | 0.600 |
| 2 | 1272 | http://environment.data.gov.uk/flood-monitorin... | Cambridge Baits Bite | E60101 | Cam and Ely Ouse (Including South Level) | River Cam | Milton | 52.236542 | 0.176925 | 1454 | 1455 | http://environment.data.gov.uk/flood-monitorin... | http://environment.data.gov.uk/flood-monitorin... | Stage | Water Level | 900.0 | mASD | instantaneous | 0.218 | 0.294 |