Tick 2: pandas

We'll be working with data on the COVID-19 pandemic assembled by researchers at Johns Hopkins University. (For more about the dataset, please see An interactive web-based dashboard to track COVID-19 in real time by Dong, Du, and Gardner, The Lancet, May 2020.) For consistency of grading, we'll work with a snapshot taken on 13 December and available as https://www.cl.cam.ac.uk/teaching/2021/DataSci/data/covid_global_20201213.csv. If you want to re-run your analyses on up-to-date data, the code to retrieve it is given in the appendix.

This assignment tests your understanding of the pandas library. YOUR CODE MUST USE PANDAS OPERATIONS such as groupby to split and process the data. Do not use Python 'for' loops or list comprehensions (except in plotting code).

Pandas warmup exercises - not assessed

These are optional warmup exercises, to get you used to pandas. Use the following autograder settings:

import ucamcl
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('ex2')

For these exercises, there is no answer for you to submit. Instead, the autograder shows a model answer. Use it like this:

q = GRADER.fetch_question('p1')  # these warmup exercises are labelled p1, p2, ...
print(q)

Question (p1) from sections 4.1 and 5.1. Here are two ways to explore the contents of a column:

stopsearch['outcome'].unique()            # gets a list of unique values
stopsearch.groupby('outcome').apply(len)  # gets counts for each value

The first method returns a list of length 17, and the second returns a Series of length 16. Why are they different lengths? How can you get the counts for all 17 values? How can you sort the list of counts, largest to smallest?
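A minimal illustration of the discrepancy, using a made-up column with a missing value (the stop-and-search data isn't needed to see the effect):

```python
import pandas
import numpy as np

# Hypothetical toy column standing in for stopsearch['outcome'],
# with one NaN entry.
s = pandas.Series(['arrest', 'no action', np.nan, 'arrest'])

print(s.unique())                    # 3 distinct values, NaN included
print(s.groupby(s).apply(len))       # counts for only 2 values: groupby drops NaN
print(s.value_counts(dropna=False))  # counts for all 3 values, largest first
```

The groupby-based count silently excludes the NaN group, while unique keeps it; value_counts(dropna=False) counts every value and sorts descending.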

Question (p2) from sections 5.1 and 5.2. We can find the most common age-range in the entire dataset with

stopsearch['age_range'].mode().values[0]

(Note that mode() returns a Series object, and .values[0] gives us just the first item in the Series as a plain Python value.) How can we produce a table listing the most common age-range for each ethnicity and gender?
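One way to sketch the per-group version, on a toy frame whose column names merely imitate the stop-and-search data:

```python
import pandas

# Toy stand-in for the stop-and-search data (values are made up)
df = pandas.DataFrame({
    'ethnicity': ['White', 'White', 'White', 'Black', 'Black'],
    'gender':    ['Male', 'Male', 'Female', 'Male', 'Male'],
    'age_range': ['18-24', '18-24', '25-34', '10-17', '10-17']})

# One entry per (ethnicity, gender), giving the most common age_range in that group
ans = df.groupby(['ethnicity', 'gender'])['age_range'].apply(lambda s: s.mode().values[0])
print(ans)
```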

Question (p3) from sections 5.1 and 5.2. Given the dataframe

df = pandas.DataFrame({'A': [0,0,0,1,1,1], 'B': [0,1,2,0,1,2], 'X': range(6)})

how do you produce a table that has rows for A, columns for B, and shows values of X?
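One possible sketch, using pivot (groupby followed by unstack would also work):

```python
import pandas

df = pandas.DataFrame({'A': [0,0,0,1,1,1], 'B': [0,1,2,0,1,2], 'X': range(6)})

# Rows indexed by A, one column per value of B, cells taken from X
table = df.pivot(index='A', columns='B', values='X')
print(table)
```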

Question (p4). There are three different levels in the stopsearch['type'] column. Here are two different ways to extract three rows, one for each of the levels:

stopsearch.drop_duplicates('type')
stopsearch.groupby('type').tail(1)

Look up the help for these two functions, and explain the difference.
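A toy frame makes the difference visible: drop_duplicates keeps the first row seen for each value, whereas groupby(...).tail(1) keeps the last row in each group.

```python
import pandas

df = pandas.DataFrame({'type': ['a', 'a', 'b', 'b', 'c'], 'x': range(5)})

first = df.drop_duplicates('type')  # first row per value of 'type'
last = df.groupby('type').tail(1)   # last row per value of 'type'
print(first)
print(last)
```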

Question (p5). The columns in a pandas DataFrame are Series objects, and they have all sorts of useful methods. Using the methods isin and diff, take the subset of rows where event_type is 0 or 1, and then define a new column Δt containing the time since the preceding event.

df = pandas.DataFrame({'time': [1,2,3,4,5,6,7,8], 'event_type': [0,0,2,1,0,1,2,1]})

Question (p6) from section 6.2. With the dataframe from the last question, add a column Δte that gives, within each event_type, the time since the last event of that type. Finally, sort the data by event_type and time, using sort_values.
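A possible sketch for both questions, using the frame given above:

```python
import pandas

df = pandas.DataFrame({'time': [1,2,3,4,5,6,7,8], 'event_type': [0,0,2,1,0,1,2,1]})

# p5: keep event types 0 and 1, then take the time since the preceding kept event
sub = df[df['event_type'].isin([0, 1])].copy()
sub['Δt'] = sub['time'].diff()

# p6: within each event_type, time since that type's previous event
df['Δte'] = df.groupby('event_type')['time'].diff()
out = df.sort_values(['event_type', 'time'])
print(out)
```

In both cases diff leaves NaN where there is no preceding event, which is arguably the right answer.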

Questions 2a - worth 1 mark

The first step in any data science work is to find out what the dataset consists of, and to preprocess it into a form that you can work with. Typically, this takes around 70% of a data scientist's time.

GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('tick2a')

Question (q1). Import the CSV file and print out a few lines, choosing the lines at random. The file includes some rows that don't correspond to countries; for these rows Population is missing. Remove these rows, and store what's left as the data frame covid. How many rows are left?

num_rows = ...
GRADER.submit_answer(GRADER.fetch_question('q1'), num_rows)

Question (q2). Complete this data frame listing the unique (Country/Region, Province/State) pairs that you removed.

Country/Region Province/State
Diamond Princess NaN
ans = ... # your answer as a DataFrame
GRADER.submit_answer(GRADER.fetch_question('q2'), ans)

Question (q3). Complete this table listing the number of rows for each different Province/State linked to the UK. Include the case where Province/State is NaN.

Province/State
Anguilla 326
ans = ... # your answer as an Indexed Array
GRADER.submit_answer(GRADER.fetch_question('q3'), ans.reset_index(name='n'))

Question (q4). Take the subset of the table that refers to the UK and for which Province/State is NaN. (This subset excludes the UK's dependent territories).

  1. Convert the date strings to proper dates using pandas.to_datetime
  2. Define a column new_confirmed giving the number of newly confirmed cases each day, and similarly new_deaths.
  3. Define a column smooth_new_confirmed to be the average of new_confirmed over the past seven days.

This dataset happens to have one record for every country and every date, and so it's easy to find the number of newly confirmed cases by finding the difference between successive elements. In numpy we could do np.insert(x[1:]-x[:-1],0,x[0]). Pandas provides diff which does the same thing except for the first element -- but sometimes simplicity is more valuable than correctness of corner cases, and a diff-based answer is fine. For the rolling average, Pandas provides rolling.
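To see that the two approaches agree (up to the NaN in the first slot), here is a toy cumulative series; the rolling window is shortened to 3 to suit the toy data, where the real exercise would use 7:

```python
import numpy as np
import pandas

x = pandas.Series([0, 2, 5, 5, 9])  # toy cumulative counts

# numpy style: the first "new" value is the first cumulative value itself
new_np = np.insert(x.values[1:] - x.values[:-1], 0, x.values[0])

# pandas style: diff leaves NaN in the first slot; fillna recovers the corner case
new_pd = x.diff().fillna(x.iloc[0])

# rolling average; min_periods=1 so the first few entries still get a value
smooth = new_pd.rolling(3, min_periods=1).mean()
```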

Date Confirmed Deaths new_confirmed new_deaths smooth_new_confirmed
2020-01-30 0 0 0 0 0
2020-01-31 2 0 2 0 0.286
ans = ... # your answer as a DataFrame
q = GRADER.fetch_question('q4')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, submit)

Question (q5). Plot two bar charts, one showing the number of daily confirmed cases, the other showing the number of daily deaths. On your first bar chart, also plot the weekly-smoothed values.

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) making sure the x-axis ticks are legible. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.

Question (q6). Most countries in this dataset appear as Country/Region=x, Province/State=NaN, but there are a few, like the UK, that have non-NaN entries for Province/State. To see what the dataset contains, complete the following table. Here metropole is 1 if the country has a Province/State=NaN entry, 0 otherwise; and provinces is the number of distinct non-NaN values; and the table lists all seven countries for which metropole=0 or provinces>0.

Country/Region metropole provinces
Australia 0 8
Canada 1 13
ans = ... # your answer as a DataFrame
GRADER.submit_answer(GRADER.fetch_question('q6'), ans)

Question (q7). Upon closer inspection of these special-case countries, we conclude there are two types of special case:

Prepare a dataset covidc which fixes up the data for these special-case countries, and includes all the non-special-case countries as is. In addition,

Country/Region Date Confirmed Deaths Population
United Kingdom 2020-05-01 182270 27381 67886004
United Kingdom 2020-05-02 185491 27965 67886004
covidc = ... # your answer as a DataFrame
q = GRADER.fetch_question('q7')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = covidc.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, {'num_rows': len(covidc), 'details': submit})

Questions 2b - worth 1 mark

GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('tick2b')

Question (q8). We'd like to know the overall case fatality rate for each country. To get overall figures, we'll just look at the total confirmed cases and deaths for each country. Prepare a dataframe that lists this information.

Country/Region Confirmed Deaths
Australia 28030 908
Austria 319822 4415
ans = ... # your answer as a DataFrame
q = GRADER.fetch_question('q8')
submit = pandas.DataFrame(q.rows).merge(ans)[q.want_cols]
GRADER.submit_answer(q, submit)

Question (q9). The bar chart from question q5 suggests that cases were undercounted in the first wave of the pandemic. Take the first wave to be until 14 July inclusive, and the second wave to be 15 July to the most recent date, and complete the following table which lists the number of cases confirmed in each wave, and the number of deaths.

Country/Region Confirmed1 Deaths1 Confirmed2 Deaths2
France 199509 29929 2149286 27281
United Kingdom 291373 40899 1539583 23127

We'd like to be able to compare the spread of the disease in December versus March, and so we'd like case counts from the first and second waves to be comparable. We shall assume that true fatality rate (the fraction of infected people who die from the disease) is constant, and that the case counts from the first wave are undercounting. Add to your table a new column called case_multiplier, defined as $$ \textsf{case_multiplier} = \frac{\textsf{Deaths1}/\textsf{Confirmed1}}{\textsf{Deaths2}/\textsf{Confirmed2}} $$

ans = ... # your table as a DataFrame
q = GRADER.fetch_question('q9')
submit = pandas.DataFrame(q.rows).merge(ans)[q.want_cols]
GRADER.submit_answer(q, submit)

This is a huge simplification. There are differences in how countries define "death due to COVID", and there are surely differences in standard-of-care, and in demographics. Nonetheless, the plot below shows that many countries have broadly similar case fatality rates in the second wave, so case_multiplier is probably measuring something real; and anyway, if we only use it to compare the two waves within a single country, it doesn't matter that different countries use different definitions. It would be better to use a smoothly-varying function rather than just two levels (a multiplier for the first wave, a constant 1 for the second), but this takes more work.

Question (q10). For each country,

Country/Region Date Confirmed new_confirmed smooth_new_confirmed infected inc R
United Kingdom 2020-05-01 182270 44170.5 42392.5 436857.1 0.995 0.947
United Kingdom 2020-05-02 185491 30098.0 41689.0 433240.9 0.992 0.917
ans = ... # Your answer as a DataFrame
q = GRADER.fetch_question('q10')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, submit)

Question (q11). Plot $R$ against the number of infected people per 100k, one point per day, one subplot per country, showing the UK, Italy, Germany, and the USA. Annotate your plot with axhline to show $R=1$, and with axhspan to show the range of $R$ values that are in between "infections double every two weeks" and "infections halve every two weeks", that is, between $\Delta=2^{1/14}$ and $\Delta=0.5^{1/14}$.
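The band edges follow from solving $\Delta^{14}=2$ and $\Delta^{14}=1/2$. A small sketch of the numbers (the annotation calls are shown as comments, assuming a matplotlib axes object `ax`):

```python
# Daily growth factor Δ such that infections double, or halve, in 14 days
doubling = 2 ** (1/14)     # Δ^14 = 2, roughly 1.05
halving = 0.5 ** (1/14)    # Δ^14 = 1/2, roughly 0.95

# On a matplotlib axes `ax`, the annotations might look like:
# ax.axhline(1, linestyle='dashed')
# ax.axhspan(halving, doubling, alpha=0.2)
```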

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) using a log x-axis, (3) colour-coding the points to indicate what's recent versus what's historic. Annotating your plot with notable events is optional. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.

Investigation

More granular data for the UK is available at https://coronavirus.data.gov.uk/details/download. Repeat this exercise, but applied to your own region. The data includes details of number of tests, saving us the case_multiplier hack.

Appendix

Fetching the data

This data is from a Johns Hopkins University github repository. For more about the dataset, please see An interactive web-based dashboard to track COVID-19 in real time by Dong, Du, and Gardner, The Lancet, May 2020.

import numpy as np
import pandas

BASE_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data'
confirmed = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
countries = pandas.read_csv(f'{BASE_URL}/UID_ISO_FIPS_LookUp_Table.csv')

id_vars = ['Province/State','Country/Region','Lat','Long']
x1 = confirmed.melt(id_vars=id_vars, var_name='Date', value_name='Confirmed')
x2 = deaths.melt(id_vars=id_vars, var_name='Date', value_name='Deaths')
x3 = recovered.melt(id_vars=id_vars, var_name='Date', value_name='Recovered')
x = pandas.merge(x1,x2, how='outer', validate='1:1', on=id_vars + ['Date'])
x = pandas.merge(x, x3, how='outer', validate='1:1', on=id_vars + ['Date'])
for v in ['Confirmed','Deaths','Recovered']:
    x[v] = np.where(pandas.isna(x[v]), 0, x[v]).astype(int)
covid = x

covid = covid.merge(countries[['Country_Region','Province_State','Population','iso3']],
                    left_on = ['Country/Region','Province/State'],
                    right_on = ['Country_Region', 'Province_State'],
                    how = 'left')

covid.drop(columns=['Country_Region', 'Province_State'], inplace=True)

covid.to_csv('covid_global_latest.csv', index=False)

Snippets for plotting