Department of Computer Science and Technology

Course pages 2022–23

Data Science

Principal lecturer: Dr Damon Wischik
Taken by: Part IB CST
Term: Michaelmas
Hours: 16 (16 lectures)
Format: In-person lectures
Suggested hours of supervisions: 4
Prerequisites: Mathematics for Natural Sciences
This course is a prerequisite for: Advanced Data Science, Computer Systems Modelling, Machine Learning and Bayesian Inference, Natural Language Processing, Quantum Computing
Exam: Paper 6 Question 5, 6

Aims

This course introduces fundamental tools for describing and reasoning about data. There are two themes: describing the behaviour of random systems; and making inferences based on data generated by such systems. The course will survey a wide range of models and tools, and it will emphasize how to design a model and what sorts of questions one might ask about it.

Lectures

  • Likelihood. Random variables. Random samples. Maximum likelihood estimation, likelihood profile.
  • Random variables. Rules for expectation and variance. Generating random variables. Empirical distribution. Monte Carlo estimation; law of large numbers. Central limit theorem.
  • Inference. Estimation, confidence intervals, hypothesis testing, prediction. Bootstrap. Bayesianism. Logistic regression, natural parameters.
  • Feature spaces. Vector spaces, bases, inner products, projection. Model fitting as projection. Linear modelling. Choice of features.
  • Random processes. Markov chains. Stationarity and convergence. Drift models. Examples, including estimation and memory.
  • Probabilistic modelling. Independence; joint distributions. Descriptive, discriminative, and causal models. Latent variable models. Random fields.
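
As an informal illustration of the first two lecture topics above (not part of the course materials): maximum likelihood estimation can be carried out by numerically maximising a log-likelihood, and Monte Carlo estimation replaces a formula with an average over a random sample, justified by the law of large numbers. The sketch below assumes Python with numpy and scipy; these library choices are illustrative, not a statement about how the course is taught.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)

    # A random sample from an exponential distribution with rate 0.5.
    data = rng.exponential(scale=2.0, size=1000)

    # Maximum likelihood estimation: minimise the negative log-likelihood
    # of the rate parameter numerically (here the MLE also has a closed
    # form, 1 / sample mean, which makes a useful check).
    def neg_log_lik(rate):
        return -(len(data) * np.log(rate) - rate * data.sum())

    mle = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded").x
    print("MLE of rate:", mle, " closed form:", 1 / data.mean())

    # Monte Carlo estimation: estimate P(X > 5) from the empirical
    # distribution rather than from a formula.
    print("P(X > 5) ~", np.mean(data > 5))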

Objectives

At the end of the course students should

  • be able to formulate basic probabilistic models, including discrete time Markov chains and linear models
  • be familiar with common random variables and their uses, and with the use of empirical distributions rather than formulae
  • be able to use expectation and conditional expectation, limit theorems, equilibrium distributions
  • understand different types of inference about noisy data, including model fitting, hypothesis testing, and making predictions
  • understand the fundamental properties of inner product spaces and orthonormal systems, and their application to model representation
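
To illustrate the last objective above (again an informal sketch, not course material, assuming numpy): least-squares fitting of a linear model can be viewed as orthogonal projection of the data vector onto the subspace spanned by the chosen feature vectors, so the residual is orthogonal to every feature.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic data: a quadratic signal plus noise.
    x = np.linspace(0, 1, 50)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, size=50)

    # Choice of features: each column of X is one feature vector, so the
    # fitted model lives in the subspace spanned by these columns.
    X = np.column_stack([np.ones_like(x), x, x**2])

    # Least-squares fitting = orthogonal projection of y onto that subspace.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta

    print("coefficients:", beta)
    # The residual is orthogonal to every feature vector (inner products ~ 0).
    print("residual . features:", X.T @ (y - y_hat))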

Recommended reading

  • F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä and L.E. Meester (2005). A modern introduction to probability and statistics: understanding why and how. Springer.

  • S.M. Ross (2002). Probability models for computer science. Harcourt / Academic Press.

  • M. Mitzenmacher and E. Upfal (2005). Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press.