Tick 2: pandas

We'll be working with data on the COVID-19 pandemic assembled by researchers at Johns Hopkins University. (For more about the dataset, please see "An interactive web-based dashboard to track COVID-19 in real time" by Dong, Du, and Gardner, The Lancet, May 2020.) For consistency of grading, we'll work with a snapshot taken on 10 December 2021 and available as https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/covid_global_20211210.csv. If you want to re-run your analyses on up-to-date data, the code to retrieve it is given in the appendix.

This assignment tests your understanding of the pandas library. YOUR CODE MUST USE PANDAS OPERATIONS such as groupby to split and process the data. Do not use Python 'for' loops or list comprehensions (except in plotting code).
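As a reminder, groupby works in the split-apply-combine style: split the rows into groups, apply an aggregation to each group, and combine the results. A minimal toy illustration (the data here is made up):

```python
import pandas

# toy data: three rows across two groups
df = pandas.DataFrame({'k': ['a', 'a', 'b'], 'v': [1, 2, 10]})

# split by k, sum v within each group, combine into a Series indexed by k
totals = df.groupby('k')['v'].sum()
print(totals['a'], totals['b'])  # 3 10
```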

Some questions ask you to prepare a DataFrame, others a table (i.e. a Series). It doesn't matter what order the rows and columns are in, only the contents.

Tips on how to structure your notebooks

It's a good habit to tidy up your notebook as you go. While I'm developing a solution I have processing and printouts and answer submission interspersed, but then I tidy it all up and clearly separate out the pieces into:

  1. Data import. Data is sacrosanct. It should be loaded in, then treated as immutable. I like to put all my data import routines right at the top, so that the reader can easily see the basis for my conclusions. Always cite your sources. My personal convention is to use ALLCAPS for raw imported data (the same convention as for global constants).

  2. Data processing. This is a block of code that defines all the dataframes I'll want to reuse. I give these dataframes meaningful names, and document the names. When I want to come back to the notebook and do further work, I can just run this block of code and not be slowed down by printouts, autograder submissions, and so on. This block of code could just as well live in a standalone .py file.

  3. Data analysis, and answer submission. Here I have one chunk of code per question. I create only throwaway dataframes here, never reusing them from question to question, so each chunk of code is independent of all the others. This makes it easier for me to dive in and work on one answer, without having to rerun the entire notebook. Also, I use generic variable names like df, so I'm not tempted to reuse the results. All I'll use is the named and documented processed data from section 2, and I won't modify them.
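An illustrative skeleton of that three-part layout (with an inline csv standing in for real data, and made-up names throughout):

```python
import io
import pandas

# 1. Data import — raw data in ALLCAPS, cited, treated as immutable
#    (an inline csv stands in for the real file here)
RAW = pandas.read_csv(io.StringIO("country,cases\nUK,5\nFR,3\n"))

# 2. Data processing — named, documented dataframes derived from RAW
cases_by_country = RAW.set_index('country')['cases']

# 3. Analysis — throwaway names like `df`, built only from section-2 objects
df = cases_by_country / cases_by_country.sum()
print(df['UK'])  # 0.625
```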

Tick 2a - worth 1 mark

import ucamcl
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp')
GRADER2a = GRADER.subsection('tick2a')

The first step in any data science work is to find out what the dataset consists of, and to preprocess it into a form that we can work with. Typically, this takes around 70% of a data scientist's time. Our overall goal in Tick 2a is to produce a dataframe covidc (the c stands for clean) with one row per country×date, and a second dataframe ukc which integrates UK vaccination data.

Question (q1). Import the csv file and print out a few lines, choosing the lines at random. The file includes some rows that don't correspond to countries; for these rows Population is missing. Remove these rows, and store what's left as the data frame covid. How many rows are left?

To submit your answer,

num_rows = ...
GRADER2a.submit_answer(GRADER2a.fetch_question('q1'), num_rows)
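For reference, `sample` draws random rows and `notna` tests for missing values. A sketch on a made-up miniature table (only the two column names come from the real csv; everything else is illustrative):

```python
import pandas

# stand-in for the real csv: two country rows and one non-country row
df = pandas.DataFrame({
    'Country/Region': ['France', 'Diamond Princess', 'France'],
    'Population': [67e6, float('nan'), 67e6],
})

print(df.sample(2))                   # inspect a few rows chosen at random
covid = df[df['Population'].notna()]  # keep only rows with a Population
print(len(covid))                     # 2
```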

Question (q2). Complete this data frame listing the unique (Country/Region, Province/State) pairs that you removed.

Country/Region Province/State
Diamond Princess NaN
ans = ... # your answer as a DataFrame
GRADER2a.submit_answer(GRADER2a.fetch_question('q2'), ans)

Question (q3). The dataset has columns for Country/Region as well as for Province/State. To get a sense of what these contain, let's look at the UK. Complete this table listing the number of rows for each different Province/State linked to the UK. Include the case where Province/State is NaN.

Province/State
Anguilla 688
ans = ... # your answer as a table
GRADER2a.submit_answer(GRADER2a.fetch_question('q3'), ans.reset_index(name='n'))

Question (q4). Most countries in our covid dataframe appear as Country/Region=x, Province/State=NaN, but there are a few, like the UK, that have non-NaN entries for Province/State. Let's look at all of them. Complete the following dataframe, which lists all eight such countries. Here metropole is 1 if the country has a Province/State=NaN entry and 0 otherwise; provinces is the number of distinct non-NaN values.

Country/Region metropole provinces
Australia 0 8
Canada 1 13
ans = ... # your answer as a DataFrame
GRADER2a.submit_answer(GRADER2a.fetch_question('q4'), ans)

Question (q5). Upon closer inspection of these special-case countries, we conclude there are two types of special case:

Prepare a dataset covidc which fixes up the data for these special-case countries, and includes all the non-special-case countries as is. In addition,

Country/Region Date Confirmed Deaths Population
Canada 2021-09-05 1522599 27071 38246108
Denmark 2021-07-28 314135 2547 5837213
covidc = ... # your answer as a DataFrame

q = GRADER2a.fetch_question('q5')
want_rows = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = covidc.merge(want_rows)[q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2a.submit_answer(q, {'num_rows': len(covidc), 'details': submit})

Question (q6). For the UK, we have an additional dataset: https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/covidvax_uk_20211209.csv, downloaded from coronavirus.data.gov.uk, with vaccination data. Create a dataframe consisting of the UK rows from covidc, with an additional column totvax2 giving the cumulative number of people who have received their second vaccination dose.

Date Confirmed Deaths Population totvax2
2020-12-10 1787783 63082 67886004 7199
2021-12-01 10276007 145140 67886004 42696015
ukc = ... # your answer as a DataFrame

q = GRADER2a.fetch_question('q6')
dates = pandas.to_datetime(q.dates)
submit = ukc.loc[ukc.Date.isin(dates), q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2a.submit_answer(q, {'num_rows': len(ukc), 'details': submit})
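The join itself can be sketched on toy data: a left join keeps every row of the left table, filling in NaN where the right table has no match. This assumes the vaccination csv has been reduced to a Date column plus a cumulative-second-dose column (the values and the single-column right table below are illustrative):

```python
import pandas

left = pandas.DataFrame({'Date': pandas.to_datetime(['2021-01-01', '2021-01-02']),
                         'Confirmed': [100, 110]})
vax = pandas.DataFrame({'Date': pandas.to_datetime(['2021-01-02']),
                        'totvax2': [5]})

# left join: keeps every row of `left`; totvax2 is NaN where vax has no row
merged = left.merge(vax, on='Date', how='left')
print(merged['totvax2'].tolist())  # [nan, 5.0]
```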

Tick 2b - worth 1 mark

GRADER2b = GRADER.subsection('tick2b')

Question (q7). Let's start by getting a general view of the data on cases and deaths. We'll look at the UK, since it's more comprehensible to start with a single country which we're familiar with. With the dataset ukc,

  1. Define a column new_confirmed giving the number of newly confirmed cases each day, and similarly new_deaths.
  2. Define a column smooth_new_confirmed to be the average of new_confirmed over the past 7 days.

This dataset happens to have one record for every date, so newly confirmed cases are just the differences between successive items. It's good practice to first check that there is one record per date, with an assert statement. For the rolling average, pandas provides rolling.

Date Confirmed Deaths new_confirmed new_deaths smooth_new_confirmed
2020-01-30 0 0 0 0 0
2020-01-31 2 0 2 0 0.2857
2020-02-01 2 0 0 0 0.2857
ans = ... # your answer as a DataFrame
q = GRADER2b.fetch_question('q7')
dates = pandas.to_datetime(q.dates)
submit = ans.loc[ans.Date.isin(dates), q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2b.submit_answer(q, submit)
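As a toy illustration of `diff` and `rolling`: note that the worked table above divides by 7 even before seven days have elapsed, i.e. it takes a trailing sum over up to seven rows and divides by a constant 7.

```python
import pandas

df = pandas.DataFrame({'Confirmed': [0, 2, 2, 5]})

# daily increments: difference between successive cumulative counts
# (diff leaves NaN in the first row, which we treat as 0)
df['new_confirmed'] = df['Confirmed'].diff().fillna(0).astype(int)

# trailing sum over up to 7 rows, divided by a constant 7
df['smooth'] = df['new_confirmed'].rolling(7, min_periods=1).sum() / 7
print(df['smooth'].round(4).tolist())  # [0.0, 0.2857, 0.2857, 0.7143]
```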

Question (q8). Plot two bar charts, one showing the number of daily confirmed cases, the other showing the number of daily deaths. On your first bar chart, also plot the weekly-smoothed values. Also plot the percent of the population that has been double-vaccinated.

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) making sure the x-axis ticks are legible. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.

Question (q9). The UK has experienced three waves. In wave1 (dates before 2020-08-01) there was under-reporting of cases, and no vaccine. In wave2 (dates before 2021-05-01) there was full reporting, but still hardly any vaccination. In wave3, vaccination is more widespread. Let's see how this impacts case fatality rate.

Split the data into these three waves. For each wave, count the number of cases and deaths, and compute the case fatality rate (deaths divided by cases). Also report the vaccination rate as a fraction of total population, at the end of the wave.

wave Confirmed Deaths vaxrate cfr
wave1 303181 41189 0 0.1359
ans = ... # your answer as a DataFrame
GRADER2b.submit_answer(GRADER2b.fetch_question('q9'), ans)
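One pandas-native way to label each row with its wave is `pandas.cut` on the Date column. A sketch on toy dates (the two interior cut-points come from the question; the outer edges are arbitrary bounds chosen wide enough to cover the data):

```python
import pandas

df = pandas.DataFrame({'Date': pandas.to_datetime(
    ['2020-05-01', '2020-12-01', '2021-08-01'])})

# bin each date into one of the three waves; right=False makes
# each interval [lo, hi), matching "dates before" the cut-point
edges = pandas.to_datetime(['2020-01-01', '2020-08-01', '2021-05-01', '2022-01-01'])
df['wave'] = pandas.cut(df['Date'], bins=edges, right=False,
                        labels=['wave1', 'wave2', 'wave3'])
print(df['wave'].tolist())  # ['wave1', 'wave2', 'wave3']
```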

Question (q10). We'd like to know the overall case fatality rate for each country. Prepare a dataframe that lists this information. Only use data from 2020-08-01 onwards, since case numbers before then are unreliable.

Case fatality rate gives us a hint about quality of healthcare, but it's a huge simplification because of differences in how countries define "death due to COVID" and in demographics. Nonetheless, it's a useful starting point.

Country/Region Confirmed Deaths cfr
France 7697355 87556 0.011375
United Kingdom 10357800 104946 0.010132
ans = ... # your answer as a DataFrame
q = GRADER2b.fetch_question('q10')
submit = ans.loc[ans['Country/Region'].isin(q.countries), q.cols]
GRADER2b.submit_answer(q, submit)
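Since Confirmed and Deaths are cumulative, a per-country figure comes from the latest value of each column rather than a sum (and you'll need to account for the 2020-08-01 cutoff, which this toy sketch ignores). A sketch of named aggregation plus a derived column, on made-up data:

```python
import pandas

df = pandas.DataFrame({'Country/Region': ['X', 'X', 'Y'],
                       'Confirmed': [10, 30, 5], 'Deaths': [1, 2, 0]})

# max of a cumulative column gives its latest value;
# adjust for the date cutoff as the question requires
ans = (df.groupby('Country/Region')
         .agg(Confirmed=('Confirmed', 'max'), Deaths=('Deaths', 'max'))
         .assign(cfr=lambda d: d.Deaths / d.Confirmed)
         .reset_index())
print(ans.cfr.round(4).tolist())  # [0.0667, 0.0]
```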

Question (q11). Plot the number of deaths and the number of confirmed cases for selected countries, using your answer from question (q10), on a log-log scale. Also plot guides to indicate case fatality rates of 0.5%, 1%, and 2%. Show these countries:

['Austria','Belgium','Czechia','Denmark','Finland','France','Germany',
 'Greece','Hungary','Ireland','Italy',
 'Netherlands','Norway','Poland','Portugal','Spain','Sweden','Switzerland','United Kingdom',
 'Australia','US','Canada','Japan','China','Russia','Ukraine','India']

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. Don't worry about fine-grained control of the plot, such as plot size or colour scheme or exact placement of labels.

Question (q12). For each country,

Your code must use pandas operations such as groupby to split and process the data. Do not use Python 'for' loops or list comprehensions.

Country/Region Date Confirmed new_confirmed smooth_new_confirmed infected inc R
United Kingdom 2021-10-01 7841625 34589 34305.4 333985.6 1.0114 1.114
United Kingdom 2021-10-02 7871014 29389 34254.4 337180.0 1.0096 1.096
ans = ... # Your answer as a DataFrame
q = GRADER2b.fetch_question('q12')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2b.submit_answer(q, submit)

Question (q13). Plot $R$ against the number of infected people per 100k, one point per day, one subplot per country, showing the UK, Germany, and the USA. Annotate your plot with axhline to show $R=1$, and with axhspan to show the range of $R$ values that are in between "infections double every two weeks" and "infections halve every two weeks", that is, between $\Delta=2^{1/14}$ and $\Delta=0.5^{1/14}$.

You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) using a log x-axis, (3) colour-coding the points to indicate what's recent versus what's historic. Annotating your plot with notable events is optional. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.
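The two annotation calls are straightforward. A minimal matplotlib sketch (the band here is drawn at the two $\Delta$ values directly; substitute whatever $R$ values your q12 definitions imply for those growth rates):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe in scripts
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axhline(1, linestyle='--', color='black')    # the R = 1 guide
# shaded band between "halve in two weeks" and "double in two weeks"
ax.axhspan(0.5 ** (1 / 14), 2 ** (1 / 14), alpha=0.2)
ax.set_xscale('log')                            # log x-axis, per the spec
```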

Appendix

Fetching the data

Vaccination data is from coronavirus.data.gov.uk. Choose Area type: Nation, select the metric cumPeopleVaccinatedSecondDoseByVaccinationDate, and download a csv.

COVID data and country data are from a Johns Hopkins University GitHub repository. For more about the dataset, please see "An interactive web-based dashboard to track COVID-19 in real time" by Dong, Du, and Gardner, The Lancet, May 2020.

import pandas

BASE_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data'
confirmed = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
countries = pandas.read_csv(f'{BASE_URL}/UID_ISO_FIPS_LookUp_Table.csv')

id_vars = ['Province/State','Country/Region','Lat','Long']
x1 = confirmed.melt(id_vars=id_vars, var_name='Date', value_name='Confirmed')
x2 = deaths.melt(id_vars=id_vars, var_name='Date', value_name='Deaths')
x3 = recovered.melt(id_vars=id_vars, var_name='Date', value_name='Recovered')
x = pandas.merge(x1,x2, how='outer', validate='1:1', on=id_vars + ['Date'])
x = pandas.merge(x, x3, how='outer', validate='1:1', on=id_vars + ['Date'])
for v in ['Confirmed','Deaths','Recovered']: x[v] = x[v].fillna(0).astype(int)
covid = x

covid = covid.merge(countries[['Country_Region','Province_State','Population','iso3']],
                    left_on = ['Country/Region','Province/State'],
                    right_on = ['Country_Region', 'Province_State'],
                    how = 'left')

covid.drop(columns=['Country_Region', 'Province_State'], inplace=True)

covid.to_csv('covid_global_latest.csv', index=False)