from IPython.display import YouTubeVideo
YouTubeVideo('RD_7YhsQ8hg', width=560, height=315)
We'll be working with data on the COVID-19 pandemic assembled by researchers at Johns Hopkins University. (For more about the dataset, please see "An interactive web-based dashboard to track COVID-19 in real time" by Dong, Du, and Gardner, The Lancet, May 2020.) For consistency of grading, we'll work with a snapshot taken on 13 December, available as https://www.cl.cam.ac.uk/teaching/2021/DataSci/data/covid_global_20201213.csv. If you want to re-run your analyses on up-to-date data, the code to retrieve it is given in the appendix.
These are optional warmup exercises, to get you used to pandas. Use the following autograder settings:
import ucamcl
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('ex2')
For these exercises, there is no answer for you to submit. Instead, the autograder shows a model answer. Use it like this:
q = GRADER.fetch_question('p1') # these warmup exercises are labelled p1, p2, ...
print(q)
Question (p1) from sections 4.1 and 5.1. Here are two ways to explore the contents of a column:
stopsearch['outcome'].unique() # gets a list of unique values
stopsearch.groupby('outcome').apply(len) # gets counts for each value
The first method returns a list of length 17, and the second returns a Series of length 16. Why are they different lengths? How can you get the counts for all 17 values? How can you sort the list of counts, largest to smallest?
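As a minimal illustration of the difference, on a made-up column rather than the stop-and-search data (the autograder's model answer may of course do it differently): groupby drops NaN keys by default, whereas unique() keeps them, and value_counts(dropna=False) is one way to count every value.
import numpy as np
import pandas
toy = pandas.DataFrame({'outcome': ['arrest', 'caution', np.nan, 'arrest']})
print(toy['outcome'].unique())                    # 3 values, nan included
print(toy.groupby('outcome').apply(len))          # 2 rows: the nan key is dropped
# counts for every value, nan included, sorted largest to smallest
print(toy['outcome'].value_counts(dropna=False).sort_values(ascending=False))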
Question (p2) from sections 5.1 and 5.2. We can find the most common age-range in the entire dataset with
stopsearch['age_range'].mode().values[0]
(Note that mode() returns a Series object, and .values[0] gives us just the first item in the Series as a plain Python value.)
How can we produce a table listing the most common age-range for each ethnicity and gender?
Question (p3) from sections 5.1 and 5.2. Given the dataframe
df = pandas.DataFrame({'A': [0,0,0,1,1,1], 'B': [0,1,2,0,1,2], 'X': range(6)})
how do you produce a table that has rows for A, columns for B, and shows values of X?
Question (p4). There are three different levels in the stopsearch['type'] column. Here are two different ways to extract three rows, one for each of the levels:
stopsearch.drop_duplicates('type')
stopsearch.groupby('type').tail(1)
Look up the help for these two functions, and explain the difference.
Question (p5). The columns in a pandas DataFrame are Series objects, and they have all sorts of useful methods. Using the methods isin and diff, take the subset of rows where event_type is 0 or 1, and then define a new column Δt containing the time since the preceding event.
df = pandas.DataFrame({'time': [1,2,3,4,5,6,7,8], 'event_type': [0,0,2,1,0,1,2,1]})
Question (p6) from section 6.2. With the dataframe from the last question, add a column Δte that gives, within each event_type, the time since the last event of that type. Finally, sort the data by event_type and time, using sort_values.
The first step in any data science work is to find out what the dataset consists of, and to preprocess it into a form that you can work with. Typically, this takes around 70% of a data scientist's time.
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('tick2a')
Question (q1). Import the CSV file and print out a few lines, choosing the lines at random. The file includes some rows that don't correspond to countries; for these rows Population is missing. Remove these rows, and store what's left as the data frame covid. How many rows are left?
num_rows = ...
GRADER.submit_answer(GRADER.fetch_question('q1'), num_rows)
Question (q2). Complete this data frame listing the unique (Country/Region, Province/State) pairs that you removed.
Country/Region | Province/State |
---|---|
Diamond Princess | NaN |
ans = ... # your answer as a DataFrame
GRADER.submit_answer(GRADER.fetch_question('q2'), ans)
Question (q3). Complete this table listing the number of rows for each different Province/State linked to the UK. Include the case where Province/State is NaN.
Province/State | |
---|---|
Anguilla | 326 |
ans = ... # your answer as an Indexed Array
GRADER.submit_answer(GRADER.fetch_question('q3'), ans.reset_index(name='n'))
Question (q4). Take the subset of the table that refers to the UK and for which Province/State is NaN. (This subset excludes the UK's dependent territories.)
- Convert the Date column to proper timestamps, using pandas.to_datetime.
- Add columns new_confirmed, giving the number of newly confirmed cases each day, and similarly new_deaths.
- Add a column smooth_new_confirmed to be the average of new_confirmed over the past seven days.

This dataset happens to have one record for every country and every date, and so it's easy to find the number of newly confirmed cases by finding the difference between successive elements. In numpy we could do np.insert(x[1:]-x[:-1],0,x[0]). Pandas provides diff, which does the same thing except for the first element -- but sometimes simplicity is more valuable than correctness of corner cases, and a diff-based answer is fine. For the rolling average, Pandas provides rolling; a toy illustration of both follows.
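As a minimal sketch of how diff and rolling behave, on a made-up cumulative series rather than the covid data (the min_periods setting below is an illustration, not necessarily the definition you want):
import pandas
x = pandas.Series([0, 2, 5, 5, 9, 14, 20, 27])    # a toy cumulative count
new = x.diff()                                    # difference of successive elements; first entry is NaN
new = x.diff().fillna(x.iloc[0])                  # one way to patch up that corner case
smooth = new.rolling(7, min_periods=1).mean()     # average over a window of the past seven values
print(pandas.DataFrame({'x': x, 'new': new, 'smooth': smooth}))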
Date | Confirmed | Deaths | new_confirmed | new_deaths | smooth_new_confirmed |
---|---|---|---|---|---|
2020-01-30 | 0 | 0 | 0 | 0 | 0 |
2020-01-31 | 2 | 0 | 2 | 0 | 0.286 |
ans = ... # your answer as a DataFrame
q = GRADER.fetch_question('q4')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, submit)
Question (q5). Plot two bar charts, one showing the number of daily confirmed cases, the other showing the number of daily deaths. On your first bar chart, also plot the weekly-smoothed values.
You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) making sure the x-axis ticks are legible. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.
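If it helps, here is a rough sketch of the plotting mechanics on invented data (the numbers and names below are made up, not taken from the covid dataset): sharex=True ties the subplots' x-axes together, and matplotlib.dates gives readable date ticks.
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas
dates = pandas.date_range('2020-01-30', periods=120, freq='D')   # toy dates
cases = np.random.default_rng(0).poisson(50, size=len(dates))    # toy daily counts
deaths = np.random.default_rng(1).poisson(5, size=len(dates))
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8,5), sharex=True) # shared x-axis
ax1.bar(dates, cases)
ax1.plot(dates, pandas.Series(cases).rolling(7, min_periods=1).mean().values, color='black')  # smoothed overlay
ax2.bar(dates, deaths)
ax1.set_ylabel('new confirmed')
ax2.set_ylabel('new deaths')
ax2.xaxis.set_major_locator(mdates.MonthLocator())        # legible ticks: one per month
ax2.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
plt.show()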
Question (q6). Most countries in this dataset appear as Country/Region=x, Province/State=NaN, but there are a few, like the UK, that have non-NaN entries for Province/State. To see what the dataset contains, complete the following table. Here metropole is 1 if the country has a Province/State=NaN entry and 0 otherwise; provinces is the number of distinct non-NaN Province/State values; and the table lists all seven countries for which metropole=0 or provinces>0.
Country/Region | metropole | provinces |
---|---|---|
Australia | 0 | 8 |
Canada | 1 | 13 |
ans = ... # your answer as a DataFrame
GRADER.submit_answer(GRADER.fetch_question('q6'), ans)
Question (q7). Upon closer inspection of these special-case countries, we conclude there are two types of special case:
- countries such as Australia that are broken down into Province/State rows and have no Province/State=NaN row; for these, country totals are obtained by summing over the provinces
- countries such as the UK, where the Province/State=NaN row is the metropole and the other rows are dependent territories; for these, keep the Province/State=NaN entries and discard the others

Prepare a dataset covidc which fixes up the data for these special-case countries, and includes all the non-special-case countries as is. In addition,
- bring in Population values and iso3 codes from https://www.cl.cam.ac.uk/teaching/2021/DataSci/data/covid_countries_20201213.csv, since the Population values in the original dataset don't account for the special-cased countries
- sort by Country/Region then by Date
Country/Region | Date | Confirmed | Deaths | Population |
---|---|---|---|---|
United Kingdom | 2020-05-01 | 182270 | 27381 | 67886004 |
United Kingdom | 2020-05-02 | 185491 | 27965 | 67886004 |
covidc = ... # your answer as a DataFrame
q = GRADER.fetch_question('q7')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = covidc.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, {'num_rows': len(covidc), 'details': submit})
GRADER = ucamcl.autograder('https://markmy.solutions', course='scicomp').subsection('tick2b')
Question (q8). We'd like to know the overall case fatality rate for each country. To get overall figures, we'll just look at the total confirmed cases and deaths for each country. Prepare a dataframe that lists this information.
Country/Region | Confirmed | Deaths |
---|---|---|
Australia | 28030 | 908 |
Austria | 319822 | 4415 |
ans = ... # your answer as a DataFrame
q = GRADER.fetch_question('q8')
submit = pandas.DataFrame(q.rows).merge(ans)[q.want_cols]
GRADER.submit_answer(q, submit)
Question (q9). The bar chart from question q5 suggests that cases were undercounted in the first wave of the pandemic. Take the first wave to be until 14 July inclusive, and the second wave to be 15 July to the most recent date, and complete the following table which lists the number of cases confirmed in each wave, and the number of deaths.
Country/Region | Confirmed1 | Deaths1 | Confirmed2 | Deaths2 |
---|---|---|---|---|
France | 199509 | 29929 | 2149286 | 27281 |
United Kingdom | 291373 | 40899 | 1539583 | 23127 |
We'd like to be able to compare the spread of the disease in December versus March, and so we'd like case counts from the first and second waves to be comparable. We shall assume that the true fatality rate (the fraction of infected people who die from the disease) is constant, and that the case counts from the first wave are undercounting. Add to your table a new column called case_multiplier, defined as
$$
\textsf{case\_multiplier} = \frac{\textsf{Deaths1}/\textsf{Confirmed1}}{\textsf{Deaths2}/\textsf{Confirmed2}}
$$
ans = ... # your table as a DataFrame
q = GRADER.fetch_question('q9')
submit = pandas.DataFrame(q.rows).merge(ans)[q.want_cols]
GRADER.submit_answer(q, submit)
This is a huge simplification. There are differences in how countries define "death due to COVID", and there are surely differences in standard of care and in demographics. Nonetheless, the plot below shows that many countries have broadly similar case fatality rates in the second wave, so case_multiplier is probably measuring something real; and anyway, if we only use it to compare the two waves within a single country, it doesn't matter that different countries use different definitions. It would be better to use a smoothly-varying function rather than just two levels (a multiplier for the first wave, a constant 1 for the second), but this takes more work.
Question (q10). For each country,
- compute the new_confirmed cases each day as in (q4), and multiply the first-wave counts by the corresponding case_multiplier. In maths notation, let $N_t=\alpha_t (C_t-C_{t-1})$, where $C_t$ is the number of confirmed cases recorded in the dataset for date $t$, $\alpha_t$ is the case multiplier for $t$ up to 14 July inclusive, and $\alpha_t=1$ for later $t$.
- compute the smooth_new_confirmed value as in (q4). In maths notation, let $S_t=(N_{t-6}+\cdots+N_t)/7$.
- add a column infected which gives the total number of smooth_new_confirmed over the past $d$ days, where $d=10$ is the average duration of an infection -- this is a crude guess at the number of currently infected people. In maths notation, let $I_t=S_{t-d+1}+\cdots+S_t$.
- add a column inc which measures the relative change. In maths notation, let $\Delta_t=I_t/I_{t-1}$.
- add a column R which reports the daily $R$ number, which can be calculated by $R_t=d(\Delta_t-1)+1$.

Country/Region | Date | Confirmed | new_confirmed | smooth_new_confirmed | infected | inc | R |
---|---|---|---|---|---|---|---|
United Kingdom | 2020-05-01 | 182270 | 44170.5 | 42392.5 | 436857.1 | 0.995 | 0.947 |
United Kingdom | 2020-05-02 | 185491 | 30098.0 | 41689.0 | 433240.9 | 0.992 | 0.917 |
ans = ... # Your answer as a DataFrame
q = GRADER.fetch_question('q10')
want = pandas.DataFrame(q.rows).assign(Date=lambda df: pandas.to_datetime(df.Date))
submit = ans.merge(want)[q.want_cols].assign(Date=lambda df: df.Date.astype(str))
GRADER.submit_answer(q, submit)
Question (q11). Plot $R$ against the number of infected people per 100k, one point per day, one subplot per country, showing the UK, Italy, Germany, and the USA. Annotate your plot with axhline to show $R=1$, and with axhspan to show the range of $R$ values that lie between "infections double every two weeks" and "infections halve every two weeks", that is, between $\Delta=2^{1/14}$ and $\Delta=0.5^{1/14}$.
You don't have to submit your plot to the autograder, but you must include it in your submitted notebook. Your plot should look something like this. In your plot, you should pay attention to (1) making sure the subplots share the same x-axis, (2) using a log x-axis, (3) colour-coding the points to indicate what's recent versus what's historic. Annotating your plot with notable events is optional. Don't worry about fine-grained control of the plot, such as plot size or colour scheme.
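A rough sketch of the axhline/axhspan annotations, on invented points (the conversion from $\Delta$ to $R$ uses the $R_t=d(\Delta_t-1)+1$ formula from q10 with $d=10$):
import matplotlib.pyplot as plt
import numpy as np
x = np.logspace(0, 3, 60)                          # toy "infected per 100k" values
R = 1 + 0.4*np.sin(np.linspace(0, 6, 60))          # toy R values
d = 10
fig, ax = plt.subplots()
ax.scatter(x, R, c=np.arange(len(x)), cmap='viridis')   # colour-code recent versus historic
ax.set_xscale('log')
ax.axhline(1, color='black', linewidth=.8)               # R = 1
# shade R values between "halve every two weeks" and "double every two weeks"
ax.axhspan(d*(0.5**(1/14) - 1) + 1, d*(2**(1/14) - 1) + 1, color='0.85', zorder=-1)
ax.set_xlabel('infected per 100k')
ax.set_ylabel('R')
plt.show()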
More granular data for the UK is available at https://coronavirus.data.gov.uk/details/download. Repeat this exercise, applied to your own region. The data includes details of the number of tests, saving us the case_multiplier hack.
This data is from a Johns Hopkins University GitHub repository. For more about the dataset, please see "An interactive web-based dashboard to track COVID-19 in real time" by Dong, Du, and Gardner, The Lancet, May 2020.
import numpy as np
import pandas

BASE_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data'
confirmed = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')
deaths = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
recovered = pandas.read_csv(f'{BASE_URL}/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')
countries = pandas.read_csv(f'{BASE_URL}/UID_ISO_FIPS_LookUp_Table.csv')
id_vars = ['Province/State','Country/Region','Lat','Long']
x1 = confirmed.melt(id_vars=id_vars, var_name='Date', value_name='Confirmed')
x2 = deaths.melt(id_vars=id_vars, var_name='Date', value_name='Deaths')
x3 = recovered.melt(id_vars=id_vars, var_name='Date', value_name='Recovered')
x = pandas.merge(x1,x2, how='outer', validate='1:1', on=id_vars + ['Date'])
x = pandas.merge(x, x3, how='outer', validate='1:1', on=id_vars + ['Date'])
for v in ['Confirmed','Deaths','Recovered']:
x[v] = np.where(pandas.isna(x[v]), 0, x[v]).astype(int)
covid = x
covid = covid.merge(countries[['Country_Region','Province_State','Population','iso3']],
left_on = ['Country/Region','Province/State'],
right_on = ['Country_Region', 'Province_State'],
how = 'left')
covid.drop(columns=['Country_Region', 'Province_State'], inplace=True)
covid.to_csv('covid_global_latest.csv', index=False)
# Question 7
want_countries = ['Austria','Belgium','Czechia','Denmark','Finland','France','Germany',
'Greece','Hungary','Ireland','Italy',
'Netherlands','Norway','Poland','Portugal','Spain','Sweden','Switzerland','United Kingdom',
'Australia','US','Canada','Japan','China','Russia','Ukraine','India']
fig,ax = plt.subplots(figsize=(8,4))
for c in want_countries:
df = covidc.loc[covidc['Country/Region']==c]
x,y = df.Confirmed / df.Population*100000, df.Deaths / df.Population*100000
ax.plot(x, y, label=c)
# note: ax.text can only plot one text label at a time
ax.text(x.iloc[-1], y.iloc[-1], c, ha='left', va='baseline')
ax.set_xlabel('confirmed/100k')
ax.set_ylabel('deaths/100k')
plt.show()
# Question 8
df = ... # dataframe for Question 8
df = df.loc[df['Country/Region'].isin(want_countries)]
fig,ax = plt.subplots(figsize=(8,6))
ax.scatter(df.Confirmed, df.Deaths)
for i in range(len(df)):
ax.text(df.Confirmed.iloc[i]*1.055, df.Deaths.iloc[i], df['Country/Region'].iloc[i], ha='left', va='center',
fontsize=7)
xx = np.power(10, np.linspace(4.5,7.4,100))
for cfr in [.01,.02,.04]:
ax.plot(xx, xx*cfr, color='0.6', linestyle='dashed')
ax.text(xx[-1], xx[-1]*cfr, f'cfr {cfr*100}%', ha='left')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('confirmed cases')
ax.set_ylabel('deaths')
plt.show()
# Question 9
df = ... # dataframe for Question 9
df = df.loc[df['Country/Region'].isin(want_countries)]
fig,ax = plt.subplots(figsize=(8,6))
ax.scatter(df.Confirmed1, df.Deaths1, color='cornflowerblue')
ax.scatter(df.Confirmed2, df.Deaths2, color='goldenrod')
for i in range(len(df)):
x1,y1 = df.Confirmed1.iloc[i], df.Deaths1.iloc[i]
x2,y2 = df.Confirmed2.iloc[i], df.Deaths2.iloc[i]
ax.plot([x1,x2],[y1,y2], color='black', alpha=.4, linewidth=.5, zorder=-1)
# the nudge *1.055 is just to offset the text label from the marker
ax.text(x2*1.055,y2, df['Country/Region'].iloc[i], ha='left', va='center', fontsize=7)
xx = np.power(10, np.linspace(4.5,7.4,100))
for cfr in [.01,.02,.04]:
ax.plot(xx, xx*cfr, color='0.6', linestyle='dashed')
ax.text(xx[-1], xx[-1]*cfr, f'cfr {cfr*100}%', ha='left')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('confirmed cases')
ax.set_ylabel('deaths')
plt.show()
# Question 12
# Data for special dates
import io
specials = pandas.read_csv(io.StringIO('''
Country/Region,Date,event
"United Kingdom",2020-03-23,"lockdown"
,2020-07-04,"open"
,2020-07-24,"facemasks"
,2020-09-14,"unsocial"
,2020-09-17,"local"
,2020-10-03,"local"
,2020-10-14,"local"
,2020-10-23,"local"
,2020-11-05,"lockdown2"
,2020-12-02,"open2"
'''))
specials['Date'] = pandas.to_datetime(specials['Date'])
specials['Country/Region'] = specials['Country/Region'].ffill()
# To show these specials as annotations:
# merge the specials dataframe with your full per-country-and-date dataframe,
# and then just plot the points where event isn't NaN.
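For instance, something along these lines (a sketch, not the model answer; the column names infected and R, and the dataframe and axes placeholders, are stand-ins for whatever you built in q10/q11):
df = ...            # your per-country-and-date dataframe, e.g. the q10 answer
annotated = df.merge(specials, on=['Country/Region','Date'], how='left')
marked = annotated.loc[~annotated['event'].isna()]
ax = ...            # the axes for the relevant country's subplot
ax.scatter(marked['infected'], marked['R'], color='red', zorder=3)
for _, row in marked.iterrows():
    ax.annotate(row['event'], (row['infected'], row['R']), fontsize=7)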