# Department of Computer Science and Technology

Course pages 2017–18

# Foundations of Data Science

### Lecture notes and handouts

• Notes for §0 and §1 [pdf] as handed out, including errors
• Notes for §2 [pdf] as handed out, including errors
• Notes for §3 [pdf] as handed out
• Notes for §4 [pdf] as handed out, including errors
• Notes for §5 [pdf] as handed out, including errors
• Full corrected notes [pdf]

### Errata

Here are the errors discovered in the notes so far. See the corrected notes [pdf] for the replacements. Minor typos are not listed here.

• Section 1.2 page 11. The solution for π_x is only valid for p < 1/2 (Thanks to A Student)
• Section 1.5 page 19. "Binomial … takes values in {0,1,…}" (Thanks to RJG)
• Section 1.5 page 19. The formula for P(X=r) for a binomial random variable is wrong.
• Section 1.6 page 22. "That doesn't mean that X and Y are independent" (Thanks to RJG)
• Section 2.1 page 26. The formula for std.dev(aX+b) is wrong (Thanks to RJG)
• Section 2.3 page 30. The formula with tan⁻¹(y/x) is wrong (Thanks to M. Bull)
• Section 2.3 page 31. The formula with "→ 1_{x≥μ}" should read "→ 1_{x≤μ}" (Thanks to RJG)
• Section 2.3 page 33. A factor of (y_1 − y_0) has gone missing at the bottom of the page.
• Section 4.3.1 page 51. The corrected notes have a more useful explicit definition of stationary distribution
• Section 4.3.4 page 60. The definition says "A state x is said to be periodic" but it should read "A state x is said to be aperiodic". (Thanks to RJG)
• Section 5.1.3 page 68. Not every function f can be written as claimed. (Thanks to RJG)
• Example sheet 2 question 5(b). Should say 1000 simulated values of Y, not 10000. (Thanks to RJG)
• Example sheet 3a question 9. The code for the random web surfer has an error. Line 5 should read `if len(neighbours)>0 and random.random()<=d`. Also, the formula only holds when all nodes have at least one neighbour.
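
To see the corrected line in context, here is a minimal sketch of a random web surfer; the graph-as-dict representation, function name, and toy graph are mine, not from the example sheet:

```python
import random
from collections import Counter

def surf(graph, steps, d=0.85, seed=0):
    """Simulate a random web surfer on `graph`, a dict mapping each node
    to a list of its out-neighbours, and return visit counts per node."""
    rng = random.Random(seed)
    nodes = list(graph)
    x = rng.choice(nodes)
    visits = Counter()
    for _ in range(steps):
        visits[x] += 1
        neighbours = graph[x]
        # The corrected condition: follow a link only if there is one to
        # follow, and then only with probability d; otherwise teleport
        # to a uniformly random node.
        if len(neighbours) > 0 and rng.random() <= d:
            x = rng.choice(neighbours)
        else:
            x = rng.choice(nodes)
    return visits

graph = {'a': ['b'], 'b': ['a', 'c'], 'c': []}
counts = surf(graph, 10_000)
```

Note that node `c` has no out-links, which is exactly the case the corrected guard handles: without the `len(neighbours) > 0` check the surfer would crash on a dangling node.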

### Supervisions and the exam

• Do I have to learn Python?
No. You can give your answers in any language, even pseudocode (unless your supervisor instructs you otherwise). But if you want internships etc. in data science and machine learning, I suggest you learn Python and numpy in your own time; the snippets I show in lectures may be useful.
• Do I have to answer all the questions on the example sheet?
No. You should spend the time indicated on the sheet. Save the rest of the example sheet for revision. (Most of the later questions on each sheet are short refresher questions.)
• What will exam questions look like?
They will mostly look like the long questions on the example sheets, except that
• I will give you formulae for standard distributions (since you won't have Wikipedia access in the exam)
• For complicated derivations, I will usually give you the answer and ask you to prove it. I will split questions into linked sub-pieces.
• I might ask you for pseudocode
• I might show you output plots or tables, and ask you to evaluate them.
Search the web for formulae and function definitions, but only when you know exactly what you're looking for. Do not rely on search engines to learn about methods, techniques, and approaches; you will almost certainly be misled.

What topics will come up in the exam? You should understand these 'set pieces' and be able to apply them to new scenarios:

• maximum likelihood estimation
• calculating the expectation and variance of sums of random variables
• applying the Normal approximation
• Bayesian posterior calculation
• Bootstrap resampling for confidence intervals
• Stationary distribution of a Markov chain
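
As an illustration of one of these set pieces, here is a sketch of a percentile bootstrap confidence interval; the function name, parameters, and sample data are my own, not taken from the course notes:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=1000,
                 level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `data`:
    resample with replacement, recompute the statistic each time, and
    read off the middle `level` fraction of the resampled values."""
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(data) for _ in range(len(data))])
        for _ in range(n_resamples))
    lo_i = int(n_resamples * (1 - level) / 2)
    hi_i = int(n_resamples * (1 + level) / 2) - 1
    return stats[lo_i], stats[hi_i]

data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.6, 2.0, 2.5]
lo, hi = bootstrap_ci(data)
```

The same function works for any statistic, not just the mean: pass e.g. `stat=statistics.median` to get a bootstrap interval for the median instead.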

You should also have picked up enough examples during the course to

• be comfortable doing a variety of probability calculations
—with independent random variables, or dependent random variables, or conditional probabilities
• be able to invent a probabilistic model
—choose parameters to express an idea, reason about causal models and descriptive models
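
As a quick example of a probability calculation with dependent random variables, here is a simulation (the dice setup is my own illustration) showing that linearity of expectation holds even when the variables are dependent:

```python
import random

rng = random.Random(1)
n = 100_000

# X is one die roll; Y = X + a second die roll, so X and Y are
# strongly dependent. Linearity of expectation nonetheless gives
# E[X + Y] = E[X] + E[Y] = 3.5 + 7 = 10.5.
# (Variance is another matter: Var(X + Y) = Var(2X + D) is NOT
# Var(X) + Var(Y), because of the dependence.)
total = 0.0
for _ in range(n):
    x = rng.randint(1, 6)
    y = x + rng.randint(1, 6)
    total += x + y
mean_sum = total / n
```

Running this, `mean_sum` comes out close to 10.5 despite the dependence between X and Y.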


### Code snippets

— at notebooks.azure.com