Computer Laboratory

Course pages 2015–16

Machine Learning and Algorithms for Data Mining

Principal lecturers: Dr Mateja Jamnik, Dr Pietro Lio', Dr Thomas Sauerwald
Taken by: MPhil ACS, Part III
Code: L42
Hours: 16
Prerequisites: None, but familiarity with basic mathematics, artificial intelligence, algorithms and statistics is beneficial.

Aims

This module introduces students to the basic principles and methods of machine learning algorithms that are typically used for mining large data sets. In particular, we will look at algorithms for analysing networks, the fundamental principles of techniques such as decision trees and support vector machines, and finally, neural network architectures. Students will gain practical understanding through a coding exercise in which they implement one machine learning algorithm and apply it to a particular large data set.

Syllabus

  • Basics of Data Mining (3 lectures)
    • missing and noisy data
    • visual pattern recognition in big data
    • dimensionality reduction
  • Algorithms on Networks (4 lectures)
    • streaming algorithms: number of distinct elements, frequency estimation, count-min sketch
    • network algorithms: gossiping, random walks, load balancing
  • Support Vector Machines (3 lectures)
    • maximising the margin
    • deriving the margin
    • slack penalty for data that are not linearly separable
    • loss functions
    • common kernel functions
    • implementation of kernels
    • non-parametric SVM-based clustering
    • regression
    • multiclass SVM
    • application: e.g., OCR on handwriting, vision, hypertext
  • Decision Trees and Decision Support Systems (3 lectures)
    • classification tree algorithms (e.g., survival trees, clustering trees, linear splits, class prior, binary splits)
    • data integration and calibration (e.g., rank quality of data, how it is used, check consistency)
    • multi-layer networks
    • decision support systems
    • multivariate parameter evidence synthesis
    • recommender systems
    • application: e.g., health care and disease diagnosis, survival analysis
  • Neural Networks (3 lectures)
    • basic principles of self-organisation and supervised learning
    • representation aspects of neural networks, neural circuits, neurons
    • learning and neural coding
    • symbolic, semantic and cognitive architectures
    • application: e.g., neuroscience (EEG, MEG and EMG data)

Note that some content may vary, and the number of lectures per topic is provisional; the final plan will depend on the students' background and the number of students taking the course.

Objectives

On completion of this module, students should:

  • understand the issues involved in dealing with large amounts of data
  • understand the principles of a number of machine learning algorithms
  • be able to implement different machine learning algorithms and apply them to large data sets
  • know how to analyse large data sets
  • be familiar with potential applications of different algorithms
  • be able to critically analyse and evaluate a research area

Coursework

Coursework will consist of two practical exercises.

First, students will carry out a literature survey of state-of-the-art research on one of the provided topics (which may include algorithms on networks, support vector machines, decision support systems, or neural networks and their applications). The literature survey should be at most 2500 words long and based on approximately 10–20 papers.

Second, students will carry out a project in which they are given a large data set (which may come from a range of different types of data sets) and asked to implement a particular machine learning algorithm covered in the course. They will then analyse the provided data set using their own implementation and write a project report of at most 2500 words on the results of that analysis.

Assessment

  • literature survey on a chosen topic of at most 2500 words (50% of the final mark);
  • coding practical and written report on the practical of at most 2500 words (50% of the final mark).

Recommended reading

Leskovec, J., Rajaraman, A. & Ullman, J. (2014). Mining of Massive Datasets. The book is available online.
Bishop, C. (2007). Pattern Recognition and Machine Learning. Supporting material for the book is available online.

Additional relevant material and research papers will be suggested during lectures.