# Department of Computer Science and Technology

Course pages 2017–18

# Foundations of Data Science

### Lecture notes and handouts

• Notes for §0 and §1 [pdf] as handed out, including errors
• Notes for §2 [pdf] as handed out, including errors
• Notes for §3 [pdf] as handed out
• Notes for §4 [pdf] as handed out, including errors
• Notes for §5 [pdf] as handed out, including errors
• Full corrected notes [pdf]

### Errata

Here are the errors discovered in the notes so far. See the corrected notes [pdf] for the replacements. Minor typos are not listed here.

• Section 1.2 page 11. The solution for π_x is only valid for p < 1/2 (Thanks to A Student)
• Section 1.5 page 19. "Binomial … takes values in {0,1,…}" (Thanks to RJG)
• Section 1.5 page 19. The formula for P(X=r) for a binomial random variable is wrong.
• Section 1.6 page 22. "That doesn't mean that X and Y are independent" (Thanks to RJG)
• Section 2.1 page 26. The formula for std.dev(aX+b) is wrong (Thanks to RJG)
• Section 2.3 page 30. The formula with tan⁻¹(y/x) is wrong (Thanks to M. Bull)
• Section 2.3 page 31. The formula with "→ 1_{x≥μ}" should read "→ 1_{x≤μ}" (Thanks to RJG)
• Section 2.3 page 33. A factor of (y_1 − y_0) has gone missing at the bottom of the page.
• Section 4.3.1 page 51. The corrected notes have a more useful explicit definition of stationary distribution
• Section 4.3.4 page 60. The definition says "A state x is said to be periodic" but it should read "A state x is said to be aperiodic". (Thanks to RJG)
• Section 5.1.3 page 68. Not every function f can be written as claimed. (Thanks to RJG)
• Example sheet 2 question 5(b). Should say 1000 simulated values of Y, not 10000. (Thanks to RJG)
• Example sheet 3a question 9. The code for the random web surfer has an error. Line 5 should read `if len(neighbours)>0 and random.random()<=d`. Also, the formula only holds when all nodes have at least one neighbour.
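
To see the corrected line in context, here is a minimal sketch of a random web surfer; the graph-as-dict representation, function name, and toy graph are mine, not from the example sheet:

```python
import random
from collections import Counter

def surf(graph, steps, d=0.85, seed=0):
    """Simulate a random web surfer on `graph`, a dict mapping each node
    to a list of its out-neighbours, and return visit counts per node."""
    rng = random.Random(seed)
    nodes = list(graph)
    x = rng.choice(nodes)
    visits = Counter()
    for _ in range(steps):
        visits[x] += 1
        neighbours = graph[x]
        # The corrected condition: follow a link only if there is one to
        # follow, and then only with probability d; otherwise teleport
        # to a uniformly random node.
        if len(neighbours) > 0 and rng.random() <= d:
            x = rng.choice(neighbours)
        else:
            x = rng.choice(nodes)
    return visits

graph = {'a': ['b'], 'b': ['a', 'c'], 'c': []}
counts = surf(graph, 10_000)
```

Note that node `c` has no out-links, which is exactly the case the corrected guard handles: without the `len(neighbours) > 0` check the surfer would crash on a dangling node.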

### Supervisions and the exam

• Do I have to learn Python?
No. You can give your answers in any language, even pseudocode (unless your supervisor instructs you otherwise). But if you want internships etc. in data science and machine learning, I suggest you learn Python and numpy in your own time; the snippets I show in lectures may be useful.
• Do I have to answer all the questions on the example sheet?
No. You should spend the time indicated on the sheet. Save the rest of the example sheet for revision. (Most of the later questions on each sheet are short refresher questions.)
• What will exam questions look like?
They will mostly look like the long questions on the example sheets, except that
• I will give you formulae for standard distributions (since you won't have Wikipedia access in the exam)
• For complicated derivations, I will usually give you the answer and ask you to prove it. I will split questions into linked sub-pieces.
• I might ask you for pseudocode
• I might show you output plots or tables, and ask you to evaluate them.
Search the web for formulae and function definitions, but only when you know exactly what you're looking for. Do not rely on search engines to learn about methods, techniques, and approaches; you will almost certainly be misled.

What topics will come up in the exam? You should understand these 'set pieces' and be able to apply them to new scenarios:

• maximum likelihood estimation
• calculating the expectation and variance of sums of random variables
• applying the Normal approximation
• Bayesian posterior calculation
• Bootstrap resampling for confidence intervals
• Stationary distribution of a Markov chain
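
As an illustration of one of these set pieces, here is a sketch of a percentile bootstrap confidence interval; the function name, parameters, and sample data are my own, not taken from the course notes:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=1000,
                 level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `data`:
    resample with replacement, recompute the statistic each time, and
    read off the middle `level` fraction of the resampled values."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_resamples))
    lo_i = int(n_resamples * (1 - level) / 2)
    hi_i = int(n_resamples * (1 + level) / 2) - 1
    return stats[lo_i], stats[hi_i]

data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 2.0, 2.5]
lo, hi = bootstrap_ci(data)
```

The same function works for any statistic, not just the mean: pass e.g. `stat=statistics.median` to get a bootstrap interval for the median instead.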

You should also have picked up enough examples during the course to

• be comfortable doing a variety of probability calculations
—with independent random variables, or dependent random variables, or conditional probabilities
• be able to invent a probabilistic model
—choose parameters to express an idea, reason about causal models and descriptive models
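
As a quick example of a probability calculation with dependent random variables, here is a simulation (the dice setup is my own illustration) showing that linearity of expectation holds even when the variables are dependent:

```python
import random

rng = random.Random(1)
n = 100_000

# X is one die roll; Y = X + a second die roll, so X and Y are
# strongly dependent. Linearity of expectation nonetheless gives
# E[X + Y] = E[X] + E[Y] = 3.5 + 7 = 10.5.
# (Variance is another matter: Var(X + Y) = Var(2X + D) is NOT
# Var(X) + Var(Y), because of the dependence.)
total = 0.0
for _ in range(n):
    x = rng.randint(1, 6)
    y = x + rng.randint(1, 6)
    total += x + y
mean_sum = total / n
```

Running this, `mean_sum` comes out close to 10.5 despite the dependence between X and Y.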


### Code snippets

— at notebooks.azure.com