Course pages 2016–17

Principles of Data Science

Principal lecturer: Dr Richard Gibbens
Taken by: MPhil ACS, Part III
Code: L120
Hours: 16
Class limit: 15 students
Prerequisites: Undergraduate-level mathematical knowledge of linear algebra, calculus, optimization, probability and statistics, together with some experience of at least one language or package for data analysis.

Aims

This module will introduce students to the principles of Data Science that underpin key tools and techniques used both to describe and to gain insights into the properties of datasets that are often large and complex. The approach taken in the module will combine the development of mathematical theory with case studies taken from real-world application domains such as communications networks and road transport networks. The case studies will also highlight the use of modern software packages, including R, for both statistical computation and the graphical visualization of statistical properties and results.

Syllabus

Statistical Learning (4 lectures)
- Introduction
- Estimation, models and prediction accuracy
- Notions of supervised and unsupervised learning
Linear Regression (3 lectures)
- Simple linear regression
- Multiple linear regression
- Case studies
Classification (3 lectures)
- Logistic regression
- Linear discriminant analysis
- Applications
Resampling Methods (3 lectures)
- Cross-validation
- The bootstrap method
- Applications
Linear Model Selection and Regularization (3 lectures)
- Approaches to subset selection
- Shrinkage methods
- Methods for dimension reduction
- Applications

Objectives

On completion of this module, students should:

- understand the ideas of statistical approaches to learning
- understand how to approach answering statistical questions involving large and complex data sets
- appreciate the range of basic techniques available to Data Scientists together with their mathematical underpinnings
- be familiar with the use of the R statistical package for computation and for visualization

Course work

Course work will consist of two exercises.

- A literature survey on one of the specified topics. The literature survey should be around 2500 words and be based on around 10–20 papers.
- A practical project investigating a specified data set, which will involve using R for analysis and graphical visualization. The project will be assessed by a written project report of not more than 2500 words, with details of all software used (code and scripts) supplied as an additional appendix.

Assessment

- Literature survey (50% of final mark)
- Practical project (50% of final mark)

Computer Laboratory