# Computer Laboratory

Course pages 2016–17

# Principles of Data Science

Principal lecturer: Dr Richard Gibbens
Taken by: MPhil ACS, Part III
Code: L120
Hours: 16
Class limit: 15 students
Prerequisites: Undergraduate-level mathematical knowledge of linear algebra, calculus, optimization, probability and statistics, together with some experience of at least one language or package for data analysis.

## Aims

This module will introduce students to the principles of Data Science that underpin key tools and techniques used both to describe and to gain insights into the properties of often large and complex datasets. The approach taken in the module will combine the development of mathematical theory with case studies taken from real-world application domains such as communications networks and road transport networks. The case studies will also highlight the use of modern software packages, including R, both for statistical computation and for the graphical visualisation of statistical properties and results.

## Syllabus

• Statistical Learning (4 lectures)
  • Introduction
  • Estimation, models and prediction accuracy
  • Notions of supervised and unsupervised learning
• Linear Regression (3 lectures)
  • Simple linear regression
  • Multiple linear regression
  • Case studies
• Classification (3 lectures)
  • Logistic regression
  • Linear discriminant analysis
  • Applications
• Resampling Methods (3 lectures)
  • Cross-validation
  • The bootstrap method
  • Applications
• Linear Model Selection and Regularization (3 lectures)
  • Approaches to subset selection
  • Shrinkage methods
  • Methods for dimension reduction
  • Applications

## Objectives

On completion of this module, students should:

• understand the ideas of statistical approaches to learning
• understand how to approach answering statistical questions involving large and complex data sets
• appreciate the range of basic techniques available to Data Scientists together with their mathematical underpinnings
• be familiar with the use of the R statistical package for computation and for visualization

## Course work

Course work will consist of two exercises.

1. A literature survey on one of the specified topics. The literature survey should be around 2500 words and be based on around 10–20 papers.
2. A practical project investigating a specified data set which will involve using R software for analysis and graphical visualization. The project will be assessed by a written project report of not more than 2500 words with details of all software used (code and scripts) supplied as an additional appendix.

## Assessment

1. Literature survey (50% of final mark)
2. Practical project (50% of final mark)