
Department of Computer Science and Technology



Course pages 2023–24

Machine Learning and Real-world Data

Principal lecturer: Prof Simone Teufel
Additional lecturer: Dr Andreas Vlachos
Taken by: Part IA CST
Term: Lent
Hours: 16
Format: In-person lectures
Suggested hours of supervisions: 4
This course is a prerequisite for: Advanced Data Science, Natural Language Processing
Exam: Paper 3, Questions 7, 8 and 9


This course introduces students to machine learning algorithms as used in real-world applications, and to the experimental methodology necessary to perform statistical analysis of large-scale data from unpredictable processes. Students will complete three extended practicals, as follows:

  • Statistical classification: Determining movie review sentiment using Naive Bayes (7 sessions);
  • Sequence analysis: Hidden Markov Modelling applied to a task from biology, namely predicting protein interactions with a cell membrane (4 sessions);
  • Analysis of social networks, including detection of cliques and central nodes (5 sessions).
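To give a flavour of the first practical, the following is a minimal sketch of a Naive Bayes sentiment classifier with add-one smoothing. It is illustrative only and is not the course's practical code: the function names, the `POS`/`NEG` labels and the toy reviews are all assumptions made for this example.

```python
import math
from collections import Counter


def train_naive_bayes(docs):
    """Estimate log priors and add-one-smoothed log likelihoods.

    docs is a list of (tokens, label) pairs (a hypothetical input format).
    """
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}

    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        # Add-one smoothing: every vocabulary word gets one extra count.
        total = sum(counts.values()) + len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_prior, log_likelihood


def classify(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy data, invented for illustration.
toy_reviews = [
    ("a great and moving film".split(), "POS"),
    ("great acting".split(), "POS"),
    ("a dull and boring film".split(), "NEG"),
    ("boring plot".split(), "NEG"),
]
log_prior, log_likelihood = train_naive_bayes(toy_reviews)
prediction = classify("great film".split(), log_prior, log_likelihood)
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for full-length reviews.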


  • Topic One: Statistical Classification [7 sessions].
    Introduction to sentiment classification.
    Naive Bayes parameter estimation.
    Statistical laws of language.
    Statistical tests for classification tasks.
    Cross-validation and test sets.
    Uncertainty and human agreement.
  • Topic Two: Sequence Analysis [4 sessions].
    Hidden Markov Models (HMM) and HMM training.
    The Viterbi algorithm.
    Using an HMM in a biological application.
  • Topic Three: Social Networks [5 sessions].
    Properties of networks: degree, diameter.
    Betweenness centrality.
    Clustering using betweenness centrality.
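As an illustration of Topic Two, the Viterbi algorithm can be sketched as below. This is a hedged sketch rather than the course's implementation: the two states (`membrane`, `outside`), the observation symbols (`h` for hydrophobic, `p` for polar) and all probabilities are invented for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # delta[t][s]: best log probability of any path ending in state s at time t.
    delta = [
        {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}
    ]
    backpointer = [{}]
    for t in range(1, len(observations)):
        delta.append({})
        backpointer.append({})
        for s in states:
            best_prev = max(
                states, key=lambda r: delta[t - 1][r] + math.log(trans_p[r][s])
            )
            delta[t][s] = (
                delta[t - 1][best_prev]
                + math.log(trans_p[best_prev][s])
                + math.log(emit_p[s][observations[t]])
            )
            backpointer[t][s] = best_prev

    # Follow the backpointers from the best final state.
    path = [max(states, key=lambda s: delta[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return path[::-1]


# Hypothetical toy model; all numbers are assumptions for illustration.
states = ["membrane", "outside"]
start_p = {"membrane": 0.5, "outside": 0.5}
trans_p = {
    "membrane": {"membrane": 0.8, "outside": 0.2},
    "outside": {"membrane": 0.2, "outside": 0.8},
}
emit_p = {
    "membrane": {"h": 0.9, "p": 0.1},
    "outside": {"h": 0.2, "p": 0.8},
}
path = viterbi(list("hhhppp"), states, start_p, trans_p, emit_p)
print(path)
```

The sketch assumes every probability is non-zero, so that taking logarithms is safe. A model with zero entries would need smoothing or explicit handling of impossible transitions.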


By the end of the course students should be able to:

  • understand and program two simple supervised machine learning algorithms;
  • use these algorithms in statistically valid experiments, including the design of baselines, evaluation metrics, statistical testing of results, and provision against overtraining;
  • visualise the connectivity and centrality in large networks;
  • use clustering (i.e., a type of unsupervised machine learning) for detection of cliques in unstructured networks.
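To make the notion of centrality concrete, here is a minimal sketch of node betweenness centrality for an unweighted, undirected graph, following the general shape of Brandes' algorithm. The adjacency-dictionary format and the toy graph are assumptions for this example, and the course practicals may organise the computation differently.

```python
from collections import deque


def betweenness_centrality(graph):
    """Node betweenness for an unweighted, undirected graph.

    graph maps each node to an iterable of its neighbours (assumed format).
    """
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {v: [] for v in graph}
        num_paths = {v: 0 for v in graph}
        num_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    num_paths[w] += num_paths[v]
                    predecessors[w].append(v)

        # Accumulate dependencies, visiting the farthest nodes first.
        dependency = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += num_paths[v] / num_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]

    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in centrality.items()}


# Toy graph: two triangles joined by the bridge 3-4 (invented for illustration).
toy_graph = {
    1: [2, 3],
    2: [1, 3],
    3: [1, 2, 4],
    4: [3, 5, 6],
    5: [4, 6],
    6: [4, 5],
}
scores = betweenness_centrality(toy_graph)
print(scores)
```

In the toy graph, nodes 3 and 4 sit on every shortest path between the two triangles, so they receive the highest scores. Betweenness-based clustering methods use the same idea to find the links that separate groups, typically by scoring edges rather than nodes.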

Recommended reading

Jurafsky, D. and Martin, J. (2008). Speech and language processing. Prentice Hall.

Easley, D. and Kleinberg, J. (2010). Networks, crowds, and markets: reasoning about a highly connected world. Cambridge University Press.