Data Science
Principal lecturer: Dr Damon Wischik
Taken by: Part IB CST
Term: Michaelmas
Hours: 16 (16 lectures)
Format: In-person lectures
Suggested hours of supervisions: 4
Prerequisites: Mathematics for Natural Sciences
This course is a prerequisite for: Advanced Data Science, Computer Systems Modelling, Machine Learning and Bayesian Inference, Natural Language Processing, Quantum Computing
Exam: Paper 6, Questions 5 and 6
Aims
This course introduces fundamental tools for describing and reasoning about data. There are two themes: designing probability models to describe systems; and drawing conclusions based on data generated by such systems.
Lectures
- Specifying and fitting probability models. Random variables. Maximum likelihood estimation. Generative and supervised models. Goodness of fit.
- Feature spaces. Vector spaces, bases, inner products, projection. Linear models. Model fitting as projection. Design of features.
- Handling probability models. Handling pdf and cdf. Bayes’s rule. Monte Carlo estimation. Empirical distribution.
- Inference. Bayesianism. Frequentist confidence intervals, hypothesis testing. Bootstrap resampling.
- Random processes. Markov chains. Stationarity and drift analysis. Processes with memory. Learning a random process.
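As a rough illustration of how two of the listed topics fit together (maximum likelihood estimation and bootstrap resampling), the following is a minimal sketch and not course material. It assumes an Exponential model with a made-up true rate, simulated data, and hypothetical helper names (`fit_exponential_rate`, `bootstrap_interval`); it uses only the Python standard library.

```python
import random
import statistics


def fit_exponential_rate(samples):
    """Maximum likelihood estimate of an Exponential rate: 1 / sample mean."""
    return 1.0 / statistics.fmean(samples)


def bootstrap_interval(samples, estimator, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for `estimator`.

    Resamples from the empirical distribution (sampling with replacement),
    re-runs the estimator on each resample, and reads off the quantiles.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = sorted(
        estimator(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Simulated data; the true rate is an assumption made purely for illustration.
rng = random.Random(42)
true_rate = 2.0
data = [rng.expovariate(true_rate) for _ in range(500)]

rate_hat = fit_exponential_rate(data)
low, high = bootstrap_interval(data, fit_exponential_rate)
print(f"MLE of rate: {rate_hat:.3f}, 95% bootstrap interval: [{low:.3f}, {high:.3f}]")
```

The same `bootstrap_interval` helper works with any estimator that maps a list of samples to a number, since it only relies on resampling the empirical distribution rather than on a formula for the estimator's sampling distribution.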
Objectives
At the end of the course, students should:
- be able to formulate basic probabilistic models, including discrete-time Markov chains and linear models
- be familiar with common random variables and their uses, and with the use of empirical distributions rather than formulae
- understand different types of inference about noisy data, including model fitting, hypothesis testing, and making predictions
- understand the fundamental properties of inner product spaces and orthonormal systems, and their application to modelling
Recommended reading
* F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester (2005). A modern introduction to probability and statistics: understanding why and how. Springer.
S.M. Ross (2002). Probability models for computer science. Harcourt / Academic Press.
M. Mitzenmacher and E. Upfal (2005). Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press.