Computer Laboratory

Course pages 2015–16

Principles of Data Science

Principal lecturer: Dr Richard Gibbens
Taken by: MPhil ACS, Part III
Code: L120
Hours: 16
Prerequisites: Undergraduate-level mathematical knowledge of linear algebra, calculus, optimization, probability and statistics, together with some experience of at least one language or package for data analysis.

Aims

This module will introduce students to the principles of Data Science that underpin key tools and techniques used both to describe and to gain insights into the properties of often large and complex datasets. The approach taken in the module will combine the development of mathematical theory with case studies taken from real-world application domains such as communications networks and road transport networks. The case studies will also highlight the use of modern software packages, including R, for both statistical computation and the graphical visualisation of statistical properties and results.

Syllabus

  • Statistical Learning (4 lectures)
    • Introduction
    • Estimation, models and prediction accuracy
    • Notions of supervised and unsupervised learning
  • Linear Regression (3 lectures)
    • Simple linear regression
    • Multiple linear regression
    • Case studies
  • Classification (3 lectures)
    • Logistic regression
    • Linear discriminant analysis
    • Applications
  • Resampling Methods (3 lectures)
    • Cross-validation
    • The bootstrap method
    • Applications
  • Linear Model Selection and Regularization (3 lectures)
    • Approaches to subset selection
    • Shrinkage methods
    • Methods for dimension reduction
    • Applications

Objectives

On completion of this module, students should:

  • understand the ideas of statistical approaches to learning
  • understand how to approach answering statistical questions involving large and complex data sets
  • appreciate the range of basic techniques available to Data Scientists
  • be familiar with the use of statistical software for computation and for visualization

Course work

Course work will consist of two exercises.

  1. A literature survey of state-of-the-art research on one of the specified topics. The literature survey should be around 2500 words and be based on around 10–20 papers.
  2. A practical project investigating a specified data set, which will involve using software, such as R, for analysis and graphical visualization. The project will be assessed by a written project report of no more than 2500 words, with details of all software used (code and scripts) supplied as an additional appendix.

Assessment

  1. Literature survey (50% of final mark)
  2. Practical project (50% of final mark)

Recommended reading

This module will draw directly on Chapters 1–6 of the following for the core material on Statistical Learning.

Gareth James, Daniela Witten, Trevor Hastie & Robert Tibshirani (2014). An Introduction to Statistical Learning (with Applications in R). Springer (1st ed.). See the book website for much related information.