Computer Laboratory

Course pages 2017–18

Foundations of Data Science

Principal lecturer: Dr Damon Wischik
Taken by: Part IB CST 50%, Part IB CST 75%

No. of lectures and practical classes: 12 + 4
Suggested hours of supervisions: 3
Prerequisite courses: either Mathematics for Natural Sciences, or the equivalent from the Maths Tripos
This course is a prerequisite for Part IB Formal Models of Language, and for the Part II courses Machine Learning and Bayesian Inference, Bioinformatics, Computer Systems Modelling, Information Theory, Quantum Computing, Natural Language Processing, and Advanced Graphics.

Aims

This course introduces fundamental tools for describing and reasoning about data. There are two themes: describing the behaviour of random systems; and making inferences based on data generated by such systems. The course will survey a wide range of models and tools, and it will emphasise how to design a model and what sorts of questions one might ask about it.

Lectures

  • Probabilistic models. Examples: random sample, graphical models, Markov models. Common random variables and their uses. Joint distributions, independence. Rules for expectation and variance.

  • Distributions of random variables. Generating random variables. Empirical distribution. Comparing distributions. Law of large numbers, central limit theorem.

  • Inference. Maximum likelihood estimation, likelihood profile. Bootstrap, hypothesis testing, confidence intervals for parameters and predictions. Bayesianism, point estimation, classification. Case study: training a naive Bayes classifier.

  • Feature spaces. Vector spaces, bases, inner products, projection. Model fitting as projection; linear modelling. Orthogonalisation and its application to linear models. Dimension reduction.

  • Random processes. Drift models. Markov chain convergence: notions and calculations. Examples. Notions of estimation and control. Examples of processes with memory.
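The distribution topics above lend themselves to short simulations. A minimal sketch (the numbers are illustrative) comparing the empirical distribution of sample means with what the law of large numbers and the central limit theorem predict:

```python
import random
import statistics

random.seed(1)

# Draw many sample means of n i.i.d. Uniform(0, 1) variables.
# By the CLT these are approximately Normal(0.5, 1/(12n)).
n, trials = 50, 10_000
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

# Law of large numbers: the empirical mean approaches the true mean 0.5.
print(statistics.fmean(means))
# CLT: the spread of sample means shrinks like 1/sqrt(n);
# here the predicted standard deviation is sqrt(1/(12*50)) ~ 0.041.
print(statistics.stdev(means))
```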
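The inference case study (training a naive Bayes classifier) can be sketched in a few lines. This is a hypothetical illustration with made-up data: class-conditional word probabilities are estimated by maximum likelihood with add-one smoothing, and classification picks the class with the highest log posterior.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training set: (set of word features, label).
train = [
    ({"free", "win"}, "spam"),
    ({"free", "meeting"}, "spam"),
    ({"meeting", "agenda"}, "ham"),
    ({"agenda", "minutes"}, "ham"),
]
vocab = {w for feats, _ in train for w in feats}

# Counts for the maximum-likelihood estimates.
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for feats, label in train:
    for w in feats:
        word_counts[label][w] += 1

def predict(feats):
    def log_posterior(label):
        lp = math.log(class_counts[label] / len(train))   # log prior
        for w in vocab:
            # Smoothed estimate of P(word present | class).
            p = (word_counts[label][w] + 1) / (class_counts[label] + 2)
            lp += math.log(p if w in feats else 1 - p)
        return lp
    return max(class_counts, key=log_posterior)

print(predict({"free", "win"}))        # classified as "spam"
print(predict({"agenda", "minutes"}))  # classified as "ham"
```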
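The feature-spaces view of model fitting as projection can also be made concrete. A hypothetical example with made-up data: fitting y ≈ a + bx by orthogonalising the basis {1, x} and projecting y onto it, so each coefficient is a simple inner product.

```python
# Least-squares fit of y ~ a + b*x as projection onto span{1, x}.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]    # exactly y = 1 + 2x, for illustration
ones = [1.0] * len(xs)

# Gram-Schmidt: orthogonalise x against the constant vector.
x_mean = dot(xs, ones) / dot(ones, ones)
x_orth = [xi - x_mean for xi in xs]

# Projection coefficients onto the (now orthogonal) basis.
b = dot(ys, x_orth) / dot(x_orth, x_orth)        # slope
a = dot(ys, ones) / dot(ones, ones) - b * x_mean  # intercept

print(a, b)   # recovers intercept 1.0 and slope 2.0
```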
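Markov chain convergence, from the random-processes lecture, can likewise be checked by simulation. A sketch with a made-up two-state transition matrix: the long-run fraction of time spent in each state approaches the equilibrium distribution.

```python
import random

random.seed(0)

# Two-state Markov chain; P[i][j] = probability of moving from i to j.
P = [[0.9, 0.1],
     [0.5, 0.5]]
# The equilibrium distribution pi solves pi = pi P:
# here pi = (5/6, 1/6).

state, visits = 0, [0, 0]
steps = 100_000
for _ in range(steps):
    visits[state] += 1
    state = 0 if random.random() < P[state][0] else 1

# Long-run occupancy fractions, close to 0.833 and 0.167.
print(visits[0] / steps, visits[1] / steps)
```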

Objectives

At the end of the course students should

  • be able to formulate basic probabilistic models, including discrete time Markov chains, graphical models, and linear models

  • be familiar with common random variables and their uses, and with the use of empirical distributions rather than formulae

  • be able to use expectation and conditional expectation, limit theorems, equilibrium distributions

  • understand different types of inference about noisy data, including model fitting, hypothesis testing, and making predictions

  • understand the fundamental properties of inner product spaces and orthonormal systems, and their application to model representation

Recommended reading

* F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester (2005). A modern introduction to probability and statistics: understanding why and how. Springer.

S.M. Ross (2002). Probability models for computer science. Harcourt / Academic Press.

M. Mitzenmacher & E. Upfal (2005). Probability and computing: randomized algorithms and probabilistic analysis. Cambridge University Press.