We'll be working with data on the COVID-19 pandemic assembled by researchers at Johns Hopkins university. (For more about the dataset, please see An interactive web-based dashboard to track COVID-19 in real time by Dong, Du, and Gardner, The Lancet, May 2020.) For consistency of grading, we'll work with a snapshot taken on 10 December and available as https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/covid_global_20211210.csv. If you want to re-run your analyses on up-to-date data, the code to retrieve it is given in the appendix.
Some questions ask you to prepare a DataFrame, others a table (i.e. a Series). It doesn't matter what order the rows and columns are in, only the contents.
It's a good habit to tidy up your notebook as you go. While I'm developing a solution I have processing and printouts and answer submission interspersed, but then I tidy it all up and clearly separate out the pieces into:
Data import. Data is sacrosanct. It should be loaded in, then treated as immutable. I like to put all my data import routines right at the top, so that the reader can easily see the basis for my conclusions. Always cite your sources. My personal convention is to use ALLCAPS for raw imported data (the same convention as for global constants).
Data processing. This is a block of code that defines all the dataframes I'll want to reuse. I give these dataframes meaningful names, and document the names. When I want to come back to the notebook and do further work, I can just run this block of code and not be slowed down by printouts and autograder etc. This block of code could just as well exist in a standaline .py file.
Data analysis, and answer submission.
Here I have a chunk of code, one for each question. I only create throwaway dataframes here, never reusing them from question to question, so each chunk of code is independent of all the others. This makes it easier for me to dive in and work on one answer, without having to rerun the entire notebook.
Also, I use generic variable names like df
, so I'm not tempted to reuse the results. All I'll use is the named and documented processed data from section 2, and I won't modify them.
import ucamcl
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp')
GRADER2a = GRADER.subsection('tick2a')
The first step in any data science work is to find out what the dataset consists of, and to preprocess it into a form that we can work with. Typically, this takes around 70% of a data scientist's time. Our overall goal in Tick 2a is to produce a dataframe covidc
(the c
stands for clean) with one row per country×date, and a second dataframe ukc
which integrates UK vaccination data.
Question (q1). Import the csv file and print out a few lines, choosing the lines at random. The file
includes some rows that don't correspond to countries; for these rows Population
is missing. Remove these rows, and store what's left as the data frame covid
. How many rows are left?
To submit your answer,
num_rows = ...
GRADER2a.submit_answer(GRADER2a.fetch_question('q1'), num_rows)
Question (q2). Complete this data frame listing the unique (Country/Region, Province/State)
pairs that you removed.
Country/Region | Province/State |
---|---|
Diamond Princess | NaN |
ans = ... # your answer as a DataFrame
GRADER2a.submit_answer(GRADER2a.fetch_question('q2'), ans)
Question (q3). The dataset has columns for Country/Region
as well as for Province/State
. To get a sense of what these contain, let's look at the UK. Complete this table listing the number of rows for each different Province/State
linked to the UK. Include the case where Province/State
is NaN.
Province/State | |
---|---|
Anguilla | 688 |
ans = ... # your answer as a table
GRADER2a.submit_answer(GRADER2a.fetch_question('q3'), ans.reset_index(name='n'))
Question (q4). Most countries our covid
dataframe appear as Country/Region=x, Province/State=NaN
, but there are a few, like the UK, that have non-NaN entries for Province/State
. Let's look at all of them. Complete the following dataframe, which lists all eight such countries. Here metropole
is 1 if the country has a Province/State=NaN
entry, 0 otherwise; and provinces
is the number of distinct non-NaN values.
Country/Region | metropole | provinces |
---|---|---|
Australia | 0 | 8 |
Canada | 1 | 13 |
ans = ... # your answer as a DataFrame
GRADER2a.submit_answer(GRADER2a.fetch_question('q4'), ans)
Question (q5). Upon closer inspection of these special-case countries, we conclude there are two types of special case:
Province/State=NaN
entries and discard the others.Prepare a dataset covidc
which fixes up the data for these special-case countries, and includes all the non-special-case countries as is. In addition,
Population
values and iso3
codes from
https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/covidcountries_20210831.csv, since the Population values in the original dataset don't account for the special-cased countriesCountry/Region
then by Date
Country/Region | Date | Confirmed | Deaths | Population |
---|---|---|---|---|
Canada | 2021-09-05 | 1522599 | 27071 | 38246108 |
Denmark | 2021-07-28 | 314135 | 2547 | 5837213 |
covidc = ... # your answer as a DataFrame
q = GRADER2a.fetch_question('q5')
want_rows = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = covidc.merge(want_rows)[q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2a.submit_answer(q, {'num_rows': len(covidc), 'details': submit})
Question (q6). For the UK, we have an additional dataset: https://www.cl.cam.ac.uk/teaching/2122/DataSci/data/covidvax_uk_20211209.csv, downloaded from coronavirus.data.gov, with vaccination data. Create a dataframe consisting of the UK rows from covidc
, with an additional column totvax2
giving the cumulative number of people who have received their second vaccination dose.
Date | Confirmed | Deaths | Population | totvax2 |
---|---|---|---|---|
2020-12-10 | 1787783 | 63082 | 67886004 | 7199 |
2021-12-01 | 10276007 | 145140 | 67886004 | 42696015 |
ukc = ... # your answer as a DataFrame
q = GRADER2a.fetch_question('q6')
dates = pandas.to_datetime(q.dates)
submit = ukc.loc[ukc.Date.isin(dates), q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2a.submit_answer(q, {'num_rows': len(ukc), 'details': submit})
GRADER2b = GRADER.subsection('tick2b')
Question (q7). Let's start by getting a general view of the data on cases and deaths. We'll look at the UK, since it's more comprehensible to start with a single country which we're familiar with.
With the dataset ukc
,
new_confirmed
giving the number of newly confirmed cases each day, and similarly new_deaths
.smooth_new_confirmed
to be the average of new_confirmed
over the past 7 days. This dataset happens to have one record for every date, so newly confirmed cases are just the differences between successive items.
It's good practice to first check that there is one record per date, with an assert
statement. For the rolling average,
pandas provides rolling
.
Date | Confirmed | Deaths | new_confirmed | new_deaths | smooth_new_confirmed |
---|---|---|---|---|---|
2020-01-30 | 0 | 0 | 0 | 0 | 0 |
2020-01-31 | 2 | 0 | 2 | 0 | 0.2857 |
2020-02-01 | 2 | 0 | 0 | 0 | 0.2857 |
ans = ... # your answer as a DataFrame
q = GRADER2b.fetch_question('q7')
dates = pandas.to_datetime(q.dates)
submit = ans.loc[ans.Date.isin(dates), q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2b.submit_answer(q, submit)
Question (q8). Plot two bar charts, one showing the number of daily confirmed cases, the other showing the number of daily deaths. On your first bar chart, also plot the weekly-smoothed values. Also plot the percent of the population that has been double-vaccinated.
You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) making sure the x-axis ticks are legible. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.
Question (q9). The UK has experienced three waves. In wave1
(dates before 2020-08-01) there was under-reporting of cases, and no vaccine. In wave2
(dates before 2021-05-01) there was full reporting, but still hardly any vaccination. In wave3
, vaccination is more widespread. Let's see how this impacts case fatality rate.
Split the data into these three waves. For each wave, count the number of cases and deaths, and compute the case fatality rate (deaths divided by cases). Also report the vaccination rate as a fraction of total population, at the end of the wave.
wave | Confirmed | Deaths | vaxrate | cfr |
---|---|---|---|---|
wave1 | 303181 | 41189 | 0 | 0.1359 |
ans = ... # your answer as a DataFrame
GRADER2b.submit_answer(GRADER2b.fetch_question('q9'), ans)
Question (q10). We'd like to know the overall case fatality rate for each country. Prepare a dataframe that lists this information. Only use data from 2020-08-01 onwards, since case numbers before then are unreliable.
Case fatality rate gives us a hint about quality of healthcare, but it's a huge simplification because of differences in how countries define "death due to COVID" and in demographics. Nonetheless, it's a useful starting point.
Country/Region | Confirmed | Deaths | cfr |
---|---|---|---|
France | 7697355 | 87556 | 0.011375 |
United Kingdom | 10357800 | 104946 | 0.010132 |
ans = ... # your answer as a DataFrame
q = GRADER2b.fetch_question('q10')
submit = ans.loc[ans['Country/Region'].isin(q.countries), q.cols]
GRADER2b.submit_answer(q, submit)
Question (q11). Plot the number of deaths and the number of confirmed cases for selected countries, using your answer from question (q10), on a log-log scale. Also plot guides to indicate case fatality rates of 0.5%, 1%, and 2%. Show these countries:
['Austria','Belgium','Czechia','Denmark','Finland','France','Germany',
'Greece','Hungary','Ireland','Italy',
'Netherlands','Norway','Poland','Portugal','Spain','Sweden','Switzerland','United Kingdom',
'Australia','US','Canada','Japan','China','Russia','Ukraine','India']
You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. Don't worry about fine-grained control of the plot, such as plot size or colour scheme or exact placement of labels.
Question (q12). For each country,
new_confirmed
cases and smooth_new_confirmed
as in question (q7). infected
which gives the total number of smooth_new_confirmed
over the past $d$ days, where $d=10$ is the average duration of an infection — this is a crude guess at the number of currently infected people. In maths notation, let $I_t=S_{t-d+1}+\cdots+S_t.$inc
which measures the relative change. In maths notation, let $\Delta_t=I_t/I_{t-1}$.R
which reports the daily $R$ number, which can be estimated using $R_t=d(\Delta_t-1)+1$.Country/Region | Date | Confirmed | new_confirmed | smooth_new_confirmed | infected | inc | R |
---|---|---|---|---|---|---|---|
United Kingdom | 2021-10-01 | 7841625 | 34589 | 34305.4 | 333985.6 | 1.0114 | 1.114 |
United Kingdom | 2021-10-02 | 7871014 | 29389 | 34254.4 | 337180.0 | 1.0096 | 1.096 |
ans = ... # Your answer as a DataFrame
q = GRADER2b.fetch_question('q12')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.cols].assign(Date=lambda df: df.Date.astype(str))
GRADER2b.submit_answer(q, submit)
Question (q13). Plot $R$ against against the number of infected people per 100k, one point per day, one subplot per country, showing the UK, Germany, and the USA. Annotate your plot with axhline to show $R=1$, and with axhspan to show the range of $R$ values that are in between "infections double every two weeks" and "infections halve every two weeks", that is, between $\Delta=2^{1/14}$ and $\Delta=0.5^{1/14}$.
You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) using a log x-axis, (3) colour-coding the points to indicate what's recent versus what's historic. Annotating your plot with notable events is optional. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.
Vaccination data is from coronavirus.data.gov.uk. Choose Area type: Nation, select the metric cumPeopleVaccinatedSecondDoseByVaccinationDate, and download a csv.
COVID data and country data is from a Johns Hopkins university github repository. For more about the dataset, please see An interactive web-based dashboard to track COVID-19 in real time30120-1) by Dong, Du, and Gardner, The Lancet, May 2020.
BASE_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data'
confirmed = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
countries = pandas.read_csv(f'{BASE_URL}/UID_ISO_FIPS_LookUp_Table.csv')
id_vars = ['Province/State','Country/Region','Lat','Long']
x1 = confirmed.melt(id_vars=id_vars, var_name='Date', value_name='Confirmed')
x2 = deaths.melt(id_vars=id_vars, var_name='Date', value_name='Deaths')
x3 = recovered.melt(id_vars=id_vars, var_name='Date', value_name='Recovered')
x = pandas.merge(x1,x2, how='outer', validate='1:1', on=id_vars + ['Date'])
x = pandas.merge(x, x3, how='outer', validate='1:1', on=id_vars + ['Date'])
for v in ['Confirmed','Deaths','Recovered']: x[v] = x[v].fillna(0).astype(int)
covid = x
covid = covid.merge(countries[['Country_Region','Province_State','Population','iso3']],
left_on = ['Country/Region','Province/State'],
right_on = ['Country_Region', 'Province_State'],
how = 'left')
covid.drop(columns=['Country_Region', 'Province_State'], inplace=True)
covid.to_csv('covid_global_latest.csv', index=False)