Tick 2: pandas

We'll be working with data on the COVID-19 pandemic assembled by researchers at Johns Hopkins University. (For more about the dataset, please see An interactive web-based dashboard to track COVID-19 in real time by Dong, Du, and Gardner, The Lancet, May 2020.) For consistency of grading, we'll work with a snapshot taken on 13 December and available as https://www.cl.cam.ac.uk/teaching/2021/DataSci/data/covid_global_20201213.csv. If you want to re-run your analyses on up-to-date data, the code to retrieve it is given in the appendix.

This assignment tests your understanding of the pandas library. YOUR CODE MUST USE PANDAS OPERATIONS such as groupby to split and process the data. Do not use Python 'for' loops or list comprehensions (except in plotting code).

Pandas warmup exercises - not assessed

These are optional warmup exercises, to get you used to pandas. Use the following autograder settings:

import ucamcl
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('ex2')

For these exercises, there is no answer for you to submit. Instead, the autograder shows a model answer. Use it like this:

q = GRADER.fetch_question('p1')  # these warmup exercises are labelled p1, p2, ...
print(q)

Question (p1) from sections 4.1 and 5.1. Here are two ways to explore the contents of a column:

stopsearch['outcome'].unique()            # gets a list of unique values
stopsearch.groupby('outcome').apply(len)  # gets counts for each value

The first method returns a list of length 17, and the second returns a Series of length 16. Why are they different lengths? How can you get the counts for all 17 values? How can you sort the list of counts, largest to smallest?
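A minimal illustration of the discrepancy, using a made-up column with a missing value (the stop-and-search data isn't needed to see the effect):

```python
import pandas
import numpy as np

# Hypothetical toy column standing in for stopsearch['outcome'],
# with one NaN entry.
s = pandas.Series(['arrest', 'no action', np.nan, 'arrest'])

print(s.unique())                    # 3 distinct values, NaN included
print(s.groupby(s).apply(len))       # counts for only 2 values: groupby drops NaN
print(s.value_counts(dropna=False))  # counts for all 3 values, largest first
```

The groupby-based count silently excludes the NaN group, while unique keeps it; value_counts(dropna=False) counts every value and sorts descending.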

Question (p2) from sections 5.1 and 5.2. We can find the most common age-range in the entire dataset with

stopsearch['age_range'].mode().values[0]

(Note that mode() returns a Series object, and .values[0] gives us just the first item in the Series as a plain Python value.) How can we produce a table listing the most common age-range for each ethnicity and gender?
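One way to sketch the per-group version, on a toy frame whose column names merely imitate the stop-and-search data:

```python
import pandas

# Toy stand-in for the stop-and-search data (values are made up)
df = pandas.DataFrame({
    'ethnicity': ['White', 'White', 'White', 'Black', 'Black'],
    'gender':    ['Male', 'Male', 'Female', 'Male', 'Male'],
    'age_range': ['18-24', '18-24', '25-34', '10-17', '10-17']})

# One entry per (ethnicity, gender), giving the most common age_range in that group
ans = df.groupby(['ethnicity', 'gender'])['age_range'].apply(lambda s: s.mode().values[0])
print(ans)
```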

Question (p3) from sections 5.1 and 5.2. Given the dataframe

df = pandas.DataFrame({'A': [0,0,0,1,1,1], 'B': [0,1,2,0,1,2], 'X': range(6)})

how do you produce a table that has rows for A, columns for B, and shows values of X?
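One possible sketch, using pivot (groupby followed by unstack would also work):

```python
import pandas

df = pandas.DataFrame({'A': [0,0,0,1,1,1], 'B': [0,1,2,0,1,2], 'X': range(6)})

# Rows indexed by A, one column per value of B, cells taken from X
table = df.pivot(index='A', columns='B', values='X')
print(table)
```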

Question (p4). There are three different levels in the stopsearch['type'] column. Here are two different ways to extract three rows, one for each of the levels:

stopsearch.drop_duplicates('type')
stopsearch.groupby('type').tail(1)

Look up the help for these two functions, and explain the difference.
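A toy frame makes the difference visible: drop_duplicates keeps the first row seen for each value, whereas groupby(...).tail(1) keeps the last row in each group.

```python
import pandas

df = pandas.DataFrame({'type': ['a', 'a', 'b', 'b', 'c'], 'x': range(5)})

first = df.drop_duplicates('type')  # first row per value of 'type'
last = df.groupby('type').tail(1)   # last row per value of 'type'
print(first)
print(last)
```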

Question (p5). The columns in a pandas DataFrame are Series objects, and they have all sorts of useful methods. Using the methods isin and diff, take the subset of rows where event_type is 0 or 1, and then define a new column Δt containing the time since the preceding event.

df = pandas.DataFrame({'time': [1,2,3,4,5,6,7,8], 'event_type': [0,0,2,1,0,1,2,1]})

Question (p6) from section 6.2. With the dataframe from the last question, add a column Δte that gives, within each event_type, the time since the last event of that type. Finally, sort the data by event_type and time, using sort_values.
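A possible sketch for both questions, using the frame given above:

```python
import pandas

df = pandas.DataFrame({'time': [1,2,3,4,5,6,7,8], 'event_type': [0,0,2,1,0,1,2,1]})

# p5: keep event types 0 and 1, then take the time since the preceding kept event
sub = df[df['event_type'].isin([0, 1])].copy()
sub['Δt'] = sub['time'].diff()

# p6: within each event_type, time since that type's previous event
df['Δte'] = df.groupby('event_type')['time'].diff()
out = df.sort_values(['event_type', 'time'])
print(out)
```

In both cases diff leaves NaN where there is no preceding event, which is arguably the right answer.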

Questions 2a - worth 1 mark

The first step in any data science work is to find out what the dataset consists of, and to preprocess it into a form that you can work with. Typically, this takes around 70% of a data scientist's time.

GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('tick2a')

Question (q1). Import the CSV file and print out a few lines, choosing the lines at random. The file includes some rows that don't correspond to countries; for these rows Population is missing. Remove these rows, and store what's left as the data frame covid. How many rows are left?

num_rows = ...
GRADER.submit_answer(GRADER.fetch_question('q1'), num_rows)

Question (q2). Complete this data frame listing the unique (Country/Region, Province/State) pairs that you removed.

Country/Region Province/State
Diamond Princess NaN
ans = ... # your answer as a DataFrame
GRADER.submit_answer(GRADER.fetch_question('q2'), ans)

Question (q3). Complete this table listing the number of rows for each different Province/State linked to the UK. Include the case where Province/State is NaN.

Province/State
Anguilla 326
ans = ... # your answer as an Indexed Array
GRADER.submit_answer(GRADER.fetch_question('q3'), ans.reset_index(name='n'))

Question (q4). Take the subset of the table that refers to the UK and for which Province/State is NaN. (This subset excludes the UK's dependent territories).

  1. Convert the date strings to proper dates using pandas.to_datetime
  2. Define a column new_confirmed giving the number of newly confirmed cases each day, and similarly new_deaths.
  3. Define a column smooth_new_confirmed to be the average of new_confirmed over the past seven days.

This dataset happens to have one record for every country and every date, and so it's easy to find the number of newly confirmed cases by finding the difference between successive elements. In numpy we could do np.insert(x[1:]-x[:-1],0,x[0]). Pandas provides diff which does the same thing except for the first element -- but sometimes simplicity is more valuable than correctness of corner cases, and a diff-based answer is fine. For the rolling average, Pandas provides rolling.
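To see that the two approaches agree (up to the NaN in the first slot), here is a toy cumulative series; the rolling window is shortened to 3 to suit the toy data, where the real exercise would use 7:

```python
import numpy as np
import pandas

x = pandas.Series([0, 2, 5, 5, 9])  # toy cumulative counts

# numpy style: the first "new" value is the first cumulative value itself
new_np = np.insert(x.values[1:] - x.values[:-1], 0, x.values[0])

# pandas style: diff leaves NaN in the first slot; fillna recovers the corner case
new_pd = x.diff().fillna(x.iloc[0])

# rolling average; min_periods=1 so the first few entries still get a value
smooth = new_pd.rolling(3, min_periods=1).mean()
```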

Date Confirmed Deaths new_confirmed new_deaths smooth_new_confirmed
2020-01-30 0 0 0 0 0
2020-01-31 2 0 2 0 0.286
ans = ... # your answer as a DataFrame
q = GRADER.fetch_question('q4')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, submit)

Question (q5). Plot two bar charts, one showing the number of daily confirmed cases, the other showing the number of daily deaths. On your first bar chart, also plot the weekly-smoothed values.

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) making sure the x-axis ticks are legible. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.

Question (q6). Most countries in this dataset appear as Country/Region=x, Province/State=NaN, but there are a few, like the UK, that have non-NaN entries for Province/State. To see what the dataset contains, complete the following table. Here metropole is 1 if the country has a Province/State=NaN entry, 0 otherwise; and provinces is the number of distinct non-NaN values; and the table lists all seven countries for which metropole=0 or provinces>0.

Country/Region metropole provinces
Australia 0 8
Canada 1 13
ans = ... # your answer as a DataFrame
GRADER.submit_answer(GRADER.fetch_question('q6'), ans)

Question (q7). Upon closer inspection of these special-case countries, we conclude there are two types of special case:

Prepare a dataset covidc which fixes up the data for these special-case countries, and includes all the non-special-case countries as is. In addition,

Country/Region Date Confirmed Deaths Population
United Kingdom 2020-05-01 182270 27381 67886004
United Kingdom 2020-05-02 185491 27965 67886004
covidc = ... # your answer as a DataFrame
q = GRADER.fetch_question('q7')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = covidc.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, {'num_rows': len(covidc), 'details': submit})

Questions 2b - worth 1 mark

GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('tick2b')

Question (q8). We'd like to know the overall case fatality rate for each country. To get overall figures, we'll just look at the total confirmed cases and deaths for each country. Prepare a dataframe that lists this information.

Country/Region Confirmed Deaths
Australia 28030 908
Austria 319822 4415
ans = ... # your answer as a DataFrame
q = GRADER.fetch_question('q8')
submit = pandas.DataFrame(q.rows).merge(ans)[q.want_cols]
GRADER.submit_answer(q, submit)

Question (q9). The bar chart from question q5 suggests that cases were undercounted in the first wave of the pandemic. Take the first wave to be until 14 July inclusive, and the second wave to be 15 July to the most recent date, and complete the following table which lists the number of cases confirmed in each wave, and the number of deaths.

Country/Region Confirmed1 Deaths1 Confirmed2 Deaths2
France 199509 29929 2149286 27281
United Kingdom 291373 40899 1539583 23127

We'd like to be able to compare the spread of the disease in December versus March, and so we'd like case counts from the first and second waves to be comparable. We shall assume that true fatality rate (the fraction of infected people who die from the disease) is constant, and that the case counts from the first wave are undercounting. Add to your table a new column called case_multiplier, defined as $$ \textsf{case_multiplier} = \frac{\textsf{Deaths1}/\textsf{Confirmed1}}{\textsf{Deaths2}/\textsf{Confirmed2}} $$

ans = ... # your table as a DataFrame
q = GRADER.fetch_question('q9')
submit = pandas.DataFrame(q.rows).merge(ans)[q.want_cols]
GRADER.submit_answer(q, submit)

This is a huge simplification. There are differences in how countries define "death due to COVID", and there are surely differences in standard-of-care, and in demographics. Nonetheless, the plot below shows that many countries have broadly similar case fatality rates in the second wave, so case_multiplier is probably measuring something real; and anyway, if we only use it to compare the two waves within a single country, it doesn't matter that different countries use different definitions. It would be better to use a smoothly-varying function rather than just two levels (a multiplier for the first wave, a constant 1 for the second), but this takes more work.

Question (q10). For each country,

Country/Region Date Confirmed new_confirmed smooth_new_confirmed infected inc R
United Kingdom 2020-05-01 182270 44170.5 42392.5 436857.1 0.995 0.947
United Kingdom 2020-05-02 185491 30098.0 41689.0 433240.9 0.992 0.917
ans = ... # Your answer as a DataFrame
q = GRADER.fetch_question('q10')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, submit)

Question (q11). Plot $R$ against the number of infected people per 100k, one point per day, one subplot per country, showing the UK, Italy, Germany, and the USA. Annotate your plot with axhline to show $R=1$, and with axhspan to show the range of $R$ values that are in between "infections double every two weeks" and "infections halve every two weeks", that is, between $\Delta=2^{1/14}$ and $\Delta=0.5^{1/14}$.
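The band edges follow from solving $\Delta^{14}=2$ and $\Delta^{14}=1/2$. A small sketch of the numbers (the annotation calls are shown as comments, assuming a matplotlib axes object `ax`):

```python
# Daily growth factor Δ such that infections double, or halve, in 14 days
doubling = 2 ** (1/14)     # Δ^14 = 2, roughly 1.05
halving = 0.5 ** (1/14)    # Δ^14 = 1/2, roughly 0.95

# On a matplotlib axes `ax`, the annotations might look like:
# ax.axhline(1, linestyle='dashed')
# ax.axhspan(halving, doubling, alpha=0.2)
```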

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) using a log x-axis, (3) colour-coding the points to indicate what's recent versus what's historic. Annotating your plot with notable events is optional. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.

Investigation

More granular data for the UK is available at https://coronavirus.data.gov.uk/details/download. Repeat this exercise, but applied to your own region. The data includes details of number of tests, saving us the case_multiplier hack.

Appendix

Fetching the data

This data is from a Johns Hopkins University github repository. For more about the dataset, please see An interactive web-based dashboard to track COVID-19 in real time by Dong, Du, and Gardner, The Lancet, May 2020.

import numpy as np
import pandas

BASE_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data'
confirmed = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
countries = pandas.read_csv(f'{BASE_URL}/UID_ISO_FIPS_LookUp_Table.csv')

id_vars = ['Province/State','Country/Region','Lat','Long']
x1 = confirmed.melt(id_vars=id_vars, var_name='Date', value_name='Confirmed')
x2 = deaths.melt(id_vars=id_vars, var_name='Date', value_name='Deaths')
x3 = recovered.melt(id_vars=id_vars, var_name='Date', value_name='Recovered')
x = pandas.merge(x1,x2, how='outer', validate='1:1', on=id_vars + ['Date'])
x = pandas.merge(x, x3, how='outer', validate='1:1', on=id_vars + ['Date'])
for v in ['Confirmed','Deaths','Recovered']:
    x[v] = np.where(pandas.isna(x[v]), 0, x[v]).astype(int)
covid = x

covid = covid.merge(countries[['Country_Region','Province_State','Population','iso3']],
                    left_on = ['Country/Region','Province/State'],
                    right_on = ['Country_Region', 'Province_State'],
                    how = 'left')

covid.drop(columns=['Country_Region', 'Province_State'], inplace=True)

covid.to_csv('covid_global_latest.csv', index=False)

Snippets for plotting