Tick 2: pandas

We'll be working with data on the COVID-19 pandemic assembled by researchers at Johns Hopkins University. (For more about the dataset, please see "An interactive web-based dashboard to track COVID-19 in real time" by Dong, Du, and Gardner, The Lancet, May 2020.) For consistency of grading, we'll work with a snapshot taken on 10 December 2021 and available as https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/covid_global_20211210.csv. If you want to re-run your analyses on up-to-date data, the code to retrieve it is given in the appendix.

This assignment tests your understanding of the pandas library. YOUR CODE MUST USE PANDAS OPERATIONS such as groupby to split and process the data. Do not use Python 'for' loops or list comprehensions (except in plotting code).
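As a reminder, groupby works in the split-apply-combine style: split the rows into groups, apply an aggregation to each group, and combine the results. A minimal toy illustration (the data here is made up):

```python
import pandas

# toy data: three rows across two groups
df = pandas.DataFrame({'k': ['a', 'a', 'b'], 'v': [1, 2, 10]})

# split by k, sum v within each group, combine into a Series indexed by k
totals = df.groupby('k')['v'].sum()
print(totals['a'], totals['b'])  # 3 10
```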

Some questions ask you to prepare a DataFrame, others a table (i.e. a Series). It doesn't matter what order the rows and columns are in, only the contents.

Tips on how to structure your notebooks

It's a good habit to tidy up your notebook as you go. While I'm developing a solution I have processing and printouts and answer submission interspersed, but then I tidy it all up and clearly separate out the pieces into:

  1. Data import. Data is sacrosanct. It should be loaded in, then treated as immutable. I like to put all my data import routines right at the top, so that the reader can easily see the basis for my conclusions. Always cite your sources. My personal convention is to use ALLCAPS for raw imported data (the same convention as for global constants).

  2. Data processing. This is a block of code that defines all the dataframes I'll want to reuse. I give these dataframes meaningful names, and document the names. When I want to come back to the notebook and do further work, I can just run this block of code and not be slowed down by printouts, autograder submissions, and so on. This block of code could just as well live in a standalone .py file.

  3. Data analysis, and answer submission. Here I have one chunk of code per question. I create only throwaway dataframes here, never reusing them from question to question, so each chunk of code is independent of all the others. This makes it easier for me to dive in and work on one answer, without having to rerun the entire notebook. Also, I use generic variable names like df, so I'm not tempted to reuse the results. All I'll use is the named and documented processed data from section 2, and I won't modify them.
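An illustrative skeleton of that three-part layout (with an inline csv standing in for real data, and made-up names throughout):

```python
import io
import pandas

# 1. Data import — raw data in ALLCAPS, cited, treated as immutable
#    (an inline csv stands in for the real file here)
RAW = pandas.read_csv(io.StringIO("country,cases\nUK,5\nFR,3\n"))

# 2. Data processing — named, documented dataframes derived from RAW
cases_by_country = RAW.set_index('country')['cases']

# 3. Analysis — throwaway names like `df`, built only from section-2 objects
df = cases_by_country / cases_by_country.sum()
print(df['UK'])  # 0.625
```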

Tick 2a - worth 1 mark

import ucamcl
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp')
GRADER2a = GRADER.subsection('tick2a')

The first step in any data science work is to find out what the dataset consists of, and to preprocess it into a form that we can work with. Typically, this takes around 70% of a data scientist's time. Our overall goal in Tick 2a is to produce a dataframe covidc (the c stands for clean) with one row per country×date, and a second dataframe ukc which integrates UK vaccination data.

Question (q1). Import the csv file and print out a few lines, choosing the lines at random. The file includes some rows that don't correspond to countries; for these rows Population is missing. Remove these rows, and store what's left as the data frame covid. How many rows are left?

To submit your answer,

num_rows = ...
GRADER2a.submit_answer(GRADER2a.fetch_question('q1'), num_rows)
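For reference, `sample` draws random rows and `notna` tests for missing values. A sketch on a made-up miniature table (only the two column names come from the real csv; everything else is illustrative):

```python
import pandas

# stand-in for the real csv: two country rows and one non-country row
df = pandas.DataFrame({
    'Country/Region': ['France', 'Diamond Princess', 'France'],
    'Population': [67e6, float('nan'), 67e6],
})

print(df.sample(2))                   # inspect a few rows chosen at random
covid = df[df['Population'].notna()]  # keep only rows with a Population
print(len(covid))                     # 2
```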

Question (q2). Complete this data frame listing the unique (Country/Region, Province/State) pairs that you removed.

Country/Region Province/State
Diamond Princess NaN
ans = ... # your answer as a DataFrame
GRADER2a.submit_answer(GRADER2a.fetch_question('q2'), ans)

Question (q3). The dataset has columns for Country/Region as well as for Province/State. To get a sense of what these contain, let's look at the UK. Complete this table listing the number of rows for each different Province/State linked to the UK. Include the case where Province/State is NaN.

Province/State
Anguilla 688
ans = ... # your answer as a table
GRADER2a.submit_answer(GRADER2a.fetch_question('q3'), ans.reset_index(name='n'))

Question (q4). Most countries in our covid dataframe appear as Country/Region=x, Province/State=NaN, but there are a few, like the UK, that have non-NaN entries for Province/State. Let's look at all of them. Complete the following dataframe, which lists all eight such countries. Here metropole is 1 if the country has a Province/State=NaN entry and 0 otherwise; provinces is the number of distinct non-NaN values.

Country/Region metropole provinces
Australia 0 8
Canada 1 13
ans = ... # your answer as a DataFrame
GRADER2a.submit_answer(GRADER2a.fetch_question('q4'), ans)

Question (q5). Upon closer inspection of these special-case countries, we conclude there are two types of special case:

Prepare a dataset covidc which fixes up the data for these special-case countries, and includes all the non-special-case countries as is. In addition,

Country/Region Date Confirmed Deaths Population
Canada 2021-09-05 1522599 27071 38246108
Denmark 2021-07-28 314135 2547 5837213
covidc = ... # your answer as a DataFrame

q = GRADER2a.fetch_question('q5')
want_rows = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = covidc.merge(want_rows)[q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2a.submit_answer(q, {'num_rows': len(covidc), 'details': submit})

Question (q6). For the UK, we have an additional dataset: https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/covidvax_uk_20211209.csv, downloaded from coronavirus.data.gov.uk, with vaccination data. Create a dataframe consisting of the UK rows from covidc, with an additional column totvax2 giving the cumulative number of people who have received their second vaccination dose.

Date Confirmed Deaths Population totvax2
2020-12-10 1787783 63082 67886004 7199
2021-12-01 10276007 145140 67886004 42696015
ukc = ... # your answer as a DataFrame

q = GRADER2a.fetch_question('q6')
dates = pandas.to_datetime(q.dates)
submit = ukc.loc[ukc.Date.isin(dates), q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2a.submit_answer(q, {'num_rows': len(ukc), 'details': submit})
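The join itself can be sketched on toy data: a left join keeps every row of the left table, filling in NaN where the right table has no match. This assumes the vaccination csv has been reduced to a Date column plus a cumulative-second-dose column (the values and the single-column right table below are illustrative):

```python
import pandas

left = pandas.DataFrame({'Date': pandas.to_datetime(['2021-01-01', '2021-01-02']),
                         'Confirmed': [100, 110]})
vax = pandas.DataFrame({'Date': pandas.to_datetime(['2021-01-02']),
                        'totvax2': [5]})

# left join: keeps every row of `left`; totvax2 is NaN where vax has no row
merged = left.merge(vax, on='Date', how='left')
print(merged['totvax2'].tolist())  # [nan, 5.0]
```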

Tick 2b - worth 1 mark

GRADER2b = GRADER.subsection('tick2b')

Question (q7). Let's start by getting a general view of the data on cases and deaths. We'll look at the UK, since it's more comprehensible to start with a single country which we're familiar with. With the dataset ukc,

  1. Define a column new_confirmed giving the number of newly confirmed cases each day, and similarly new_deaths.
  2. Define a column smooth_new_confirmed to be the average of new_confirmed over the past 7 days.

This dataset happens to have one record for every date, so newly confirmed cases are just the differences between successive items. It's good practice to first check that there is one record per date, with an assert statement. For the rolling average, pandas provides rolling.

Date Confirmed Deaths new_confirmed new_deaths smooth_new_confirmed
2020-01-30 0 0 0 0 0
2020-01-31 2 0 2 0 0.2857
2020-02-01 2 0 0 0 0.2857
ans = ... # your answer as a DataFrame
q = GRADER2b.fetch_question('q7')
dates = pandas.to_datetime(q.dates)
submit = ans.loc[ans.Date.isin(dates), q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2b.submit_answer(q, submit)
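As a toy illustration of `diff` and `rolling`: note that the worked table above divides by 7 even before seven days have elapsed, i.e. it takes a trailing sum over up to seven rows and divides by a constant 7.

```python
import pandas

df = pandas.DataFrame({'Confirmed': [0, 2, 2, 5]})

# daily increments: difference between successive cumulative counts
# (diff leaves NaN in the first row, which we treat as 0)
df['new_confirmed'] = df['Confirmed'].diff().fillna(0).astype(int)

# trailing sum over up to 7 rows, divided by a constant 7
df['smooth'] = df['new_confirmed'].rolling(7, min_periods=1).sum() / 7
print(df['smooth'].round(4).tolist())  # [0.0, 0.2857, 0.2857, 0.7143]
```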

Question (q8). Plot two bar charts, one showing the number of daily confirmed cases, the other showing the number of daily deaths. On your first bar chart, also plot the weekly-smoothed values. Also plot the percent of the population that has been double-vaccinated.

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) making sure the x-axis ticks are legible. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.

Question (q9). The UK has experienced three waves. In wave1 (dates before 2020-08-01) there was under-reporting of cases, and no vaccine. In wave2 (dates before 2021-05-01) there was full reporting, but still hardly any vaccination. In wave3, vaccination is more widespread. Let's see how this impacts case fatality rate.

Split the data into these three waves. For each wave, count the number of cases and deaths, and compute the case fatality rate (deaths divided by cases). Also report the vaccination rate as a fraction of total population, at the end of the wave.

wave Confirmed Deaths vaxrate cfr
wave1 303181 41189 0 0.1359
ans = ... # your answer as a DataFrame
GRADER2b.submit_answer(GRADER2b.fetch_question('q9'), ans)
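One pandas-native way to label each row with its wave is `pandas.cut` on the Date column. A sketch on toy dates (the two interior cut-points come from the question; the outer edges are arbitrary bounds chosen wide enough to cover the data):

```python
import pandas

df = pandas.DataFrame({'Date': pandas.to_datetime(
    ['2020-05-01', '2020-12-01', '2021-08-01'])})

# bin each date into one of the three waves; right=False makes
# each interval [lo, hi), matching "dates before" the cut-point
edges = pandas.to_datetime(['2020-01-01', '2020-08-01', '2021-05-01', '2022-01-01'])
df['wave'] = pandas.cut(df['Date'], bins=edges, right=False,
                        labels=['wave1', 'wave2', 'wave3'])
print(df['wave'].tolist())  # ['wave1', 'wave2', 'wave3']
```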

Question (q10). We'd like to know the overall case fatality rate for each country. Prepare a dataframe that lists this information. Only use data from 2020-08-01 onwards, since case numbers before then are unreliable.

Case fatality rate gives us a hint about quality of healthcare, but it's a huge simplification because of differences in how countries define "death due to COVID" and in demographics. Nonetheless, it's a useful starting point.

Country/Region Confirmed Deaths cfr
France 7697355 87556 0.011375
United Kingdom 10357800 104946 0.010132
ans = ... # your answer as a DataFrame
q = GRADER2b.fetch_question('q10')
submit = ans.loc[ans['Country/Region'].isin(q.countries), q.cols]
GRADER2b.submit_answer(q, submit)
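Since Confirmed and Deaths are cumulative, a per-country figure comes from the latest value of each column rather than a sum (and you'll need to account for the 2020-08-01 cutoff, which this toy sketch ignores). A sketch of named aggregation plus a derived column, on made-up data:

```python
import pandas

df = pandas.DataFrame({'Country/Region': ['X', 'X', 'Y'],
                       'Confirmed': [10, 30, 5], 'Deaths': [1, 2, 0]})

# max of a cumulative column gives its latest value;
# adjust for the date cutoff as the question requires
ans = (df.groupby('Country/Region')
         .agg(Confirmed=('Confirmed', 'max'), Deaths=('Deaths', 'max'))
         .assign(cfr=lambda d: d.Deaths / d.Confirmed)
         .reset_index())
print(ans.cfr.round(4).tolist())  # [0.0667, 0.0]
```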

Question (q11). Plot the number of deaths and the number of confirmed cases for selected countries, using your answer from question (q10), on a log-log scale. Also plot guides to indicate case fatality rates of 0.5%, 1%, and 2%. Show these countries:

['Austria','Belgium','Czechia','Denmark','Finland','France','Germany',
 'Greece','Hungary','Ireland','Italy',
 'Netherlands','Norway','Poland','Portugal','Spain','Sweden','Switzerland','United Kingdom',
 'Australia','US','Canada','Japan','China','Russia','Ukraine','India']

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. Don't worry about fine-grained control of the plot, such as plot size or colour scheme or exact placement of labels.

Question (q12). For each country,

Your code must use pandas operations such as groupby to split and process the data. Do not use Python 'for' loops or list comprehensions.

Country/Region Date Confirmed new_confirmed smooth_new_confirmed infected inc R
United Kingdom 2021-10-01 7841625 34589 34305.4 333985.6 1.0114 1.114
United Kingdom 2021-10-02 7871014 29389 34254.4 337180.0 1.0096 1.096
ans = ... # Your answer as a DataFrame
q = GRADER2b.fetch_question('q12')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2b.submit_answer(q, submit)

Question (q13). Plot $R$ against the number of infected people per 100k, one point per day, one subplot per country, showing the UK, Germany, and the USA. Annotate your plot with axhline to show $R=1$, and with axhspan to show the range of $R$ values that are in between "infections double every two weeks" and "infections halve every two weeks", that is, between $\Delta=2^{1/14}$ and $\Delta=0.5^{1/14}$.

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) using a log x-axis, (3) colour-coding the points to indicate what's recent versus what's historic. Annotating your plot with notable events is optional. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.
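The two annotation calls are straightforward. A minimal matplotlib sketch (the band here is drawn at the two $\Delta$ values directly; substitute whatever $R$ values your q12 definitions imply for those growth rates):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe in scripts
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axhline(1, linestyle='--', color='black')    # the R = 1 guide
# shaded band between "halve in two weeks" and "double in two weeks"
ax.axhspan(0.5 ** (1 / 14), 2 ** (1 / 14), alpha=0.2)
ax.set_xscale('log')                            # log x-axis, per the spec
```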

Appendix

Fetching the data

Vaccination data is from coronavirus.data.gov.uk. Choose Area type: Nation, select the metric cumPeopleVaccinatedSecondDoseByVaccinationDate, and download a csv.

COVID data and country data are from a Johns Hopkins University GitHub repository. For more about the dataset, please see "An interactive web-based dashboard to track COVID-19 in real time" by Dong, Du, and Gardner, The Lancet, May 2020.

import pandas

BASE_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data'
confirmed = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
countries = pandas.read_csv(f'{BASE_URL}/UID_ISO_FIPS_LookUp_Table.csv')

id_vars = ['Province/State','Country/Region','Lat','Long']
x1 = confirmed.melt(id_vars=id_vars, var_name='Date', value_name='Confirmed')
x2 = deaths.melt(id_vars=id_vars, var_name='Date', value_name='Deaths')
x3 = recovered.melt(id_vars=id_vars, var_name='Date', value_name='Recovered')
x = pandas.merge(x1,x2, how='outer', validate='1:1', on=id_vars + ['Date'])
x = pandas.merge(x, x3, how='outer', validate='1:1', on=id_vars + ['Date'])
for v in ['Confirmed','Deaths','Recovered']: x[v] = x[v].fillna(0).astype(int)
covid = x

covid = covid.merge(countries[['Country_Region','Province_State','Population','iso3']],
                    left_on = ['Country/Region','Province/State'],
                    right_on = ['Country_Region', 'Province_State'],
                    how = 'left')

covid.drop(columns=['Country_Region', 'Province_State'], inplace=True)

covid.to_csv('covid_global_latest.csv', index=False)