Department of Computer Science and Technology

Course pages 2017–18

Paper 3: Machine Learning and Real-world Data

This course is taken only by Part IA and Part IB Paper 3 students.

Lecturers: Professor A.A. Copestake, Dr H. Yannakoudakis and Dr P. Buttery

No. of lectures and practical classes: 16

Suggested hours of supervisions: 4

Prerequisite courses: NST Mathematics


Aims

This course introduces students to machine learning algorithms as used in real-world applications, and to the experimental methodology necessary to perform statistical analysis of large-scale data from unpredictable processes. Students will complete three extended practicals, as follows:

  • Statistical classification: determining movie review sentiment using Naive Bayes (7 sessions);
  • Sequence analysis: hidden Markov modelling and its application to a task from biology (predicting protein interactions with a cell membrane) (4 sessions);
  • Analysis of social networks, including detection of cliques and central nodes (5 sessions).


Syllabus

  • Topic One: Statistical Classification [7 sessions].
    Introduction to sentiment classification.
    Naive Bayes parameter estimation.
    Statistical laws of language.
    Statistical tests for classification tasks.
    Cross-validation and test sets.
    Uncertainty and human agreement.

  • Topic Two: Sequence Analysis [4 sessions].
    Hidden Markov Models (HMM) and HMM training.
    The Viterbi algorithm.
    Using an HMM in a biological application.

  • Topic Three: Social Networks [5 sessions].
    Properties of networks: degree, diameter.
    Betweenness Centrality.
    Clustering using betweenness centrality.

Objectives

By the end of the course students should be able to:

  • understand and program two simple supervised machine learning algorithms;
  • use these algorithms in statistically valid experiments, including the design of baselines, evaluation metrics, statistical testing of results, and provision against overtraining;
  • visualise the connectivity and centrality in large networks;
  • use clustering (i.e., a type of unsupervised machine learning) for detection of cliques in unstructured networks.

Recommended reading

Jurafsky, D. and Martin, J. (2008). Speech and language processing. Prentice Hall.

Easley, D. and Kleinberg, J. (2010). Networks, crowds, and markets: reasoning about a highly connected world. Cambridge University Press.