Course pages 2018–19
Foundations of Data Science
Principal lecturer: Dr Damon Wischik
Taken by: Part IB CST 50%, Part IB CST 75%
No. of lectures and practical classes: 12 + 4
Suggested hours of supervisions: 3
Prerequisite courses:
either Mathematics for Natural Sciences, or the equivalent from the
Maths Tripos
This course is a prerequisite for:
Part II Machine Learning and Bayesian Inference, Information Retrieval,
Quantum Computing, Natural Language Processing.
Aims
This course introduces fundamental tools for describing and reasoning about data. There are two themes: describing the behaviour of random systems; and making inferences based on data generated by such systems. The course will survey a wide range of models and tools, and it will emphasize how to design a model and what sorts of questions one might ask about it.
Lectures
- Likelihood.
Random variables. Random samples.
Maximum likelihood estimation, likelihood profile.
- Random variables.
Rules for expectation and variance.
Generating random variables. Empirical distribution.
Monte Carlo estimation; law of large numbers. Central limit theorem.
- Inference.
Estimation, confidence intervals, hypothesis testing, prediction.
Bootstrap. Bayesianism. Logistic regression, natural parameters.
- Feature spaces.
Vector spaces, bases, inner products, projection.
Model fitting as projection. Linear modelling.
Choice of features.
- Random processes.
Markov chains.
Stationarity and convergence.
Drift models.
Examples, including estimation and memory.
- Probabilistic modelling.
Independence; joint distributions.
Descriptive, discriminative, and causal models. Latent variable models. Random fields.
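As an illustration of the "Stationarity and convergence" topic under Random processes, the following is a minimal sketch, not course material. It assumes a hypothetical two-state discrete-time Markov chain whose transition probabilities are made up for the example, and the helper names `step` and `stationary` are likewise invented here. It repeatedly applies the transition matrix to a starting distribution until the distribution stops changing, which is one simple way to approximate an equilibrium distribution when the chain converges.

```python
def step(dist, P):
    """Advance one time step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, tol=1e-12, max_iter=10_000):
    """Iterate dist <- dist P from a uniform start until successive
    distributions differ by less than tol in every coordinate."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        new = step(dist, P)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    raise RuntimeError("distribution did not converge within max_iter steps")


# Hypothetical transition matrix (illustrative values only).
# Row i gives the probabilities of moving from state i to each state.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = stationary(P)
print(pi)
```

For a two-state chain with off-diagonal probabilities a (state 0 to 1) and b (state 1 to 0), the equilibrium distribution is (b/(a+b), a/(a+b)), so the result here can be checked by hand against (5/6, 1/6). Iteration is used rather than an eigenvector solver only to keep the sketch dependency-free.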
Objectives
At the end of the course students should
- be able to formulate basic probabilistic models, including
discrete-time Markov chains and linear models
- be familiar with common random variables and their uses, and with the
use of empirical distributions rather than formulae
- be able to use expectation and conditional expectation,
limit theorems, and equilibrium distributions
- understand different types of inference about noisy data, including
model fitting, hypothesis testing, and making predictions
- understand the fundamental properties of inner product spaces and orthonormal systems, and their application to model representation
Recommended reading
* F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester (2005). A modern introduction to probability and statistics: understanding why and how. Springer.
S.M. Ross (2002). Probability models for computer science. Harcourt / Academic Press.
M. Mitzenmacher & E. Upfal (2005). Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press.