
Department of Computer Science and Technology


Course pages 2020–21

Machine Learning and Real-world Data

Principal lecturer: Prof Simone Teufel
Taken by: Part IA CST, Part IB CST 50%
Hours: 16
This course is a prerequisite for: Data Science: principles and practice, Natural Language Processing


This course introduces students to machine learning algorithms as used in real-world applications, and to the experimental methodology needed for statistical analysis of large-scale data from unpredictable processes. Students will complete three extended practicals:

  • Statistical classification: determining movie review sentiment using Naive Bayes (7 sessions);
  • Sequence analysis: hidden Markov modelling applied to a task from biology, predicting protein interactions with a cell membrane (4 sessions);
  • Analysis of social networks, including detection of cliques and central nodes (5 sessions).
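
As an illustration of the first practical, the following is a minimal sketch of a Naive Bayes sentiment classifier with add-one smoothing. It is not the course's reference implementation: the function names, the toy reviews and the choice of smoothing are assumptions made for this example only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Estimate log priors and add-one-smoothed log likelihoods.

    docs is a list of (tokens, label) pairs.
    """
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)

    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        # Add-one smoothing: every vocabulary word gets one extra count.
        total = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def classify(tokens, log_prior, log_likelihood, vocab):
    """Return the label with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in tokens if w in vocab
        )
    return max(scores, key=scores.get)


# Hypothetical toy data, invented for this sketch.
TOY_REVIEWS = [
    ("a great and moving film".split(), "pos"),
    ("great acting great script".split(), "pos"),
    ("a dull and boring film".split(), "neg"),
    ("boring script terrible acting".split(), "neg"),
]

model = train_naive_bayes(TOY_REVIEWS)
prediction = classify("a great script".split(), *model)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once reviews are longer than a few words.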


  • Topic One: Statistical Classification [7 sessions].
    Introduction to sentiment classification.
    Naive Bayes parameter estimation.
    Statistical laws of language.
    Statistical tests for classification tasks.
    Cross-validation and test sets.
    Uncertainty and human agreement.
  • Topic Two: Sequence Analysis [4 sessions].
    Hidden Markov Models (HMM) and HMM training.
    The Viterbi algorithm.
    Using an HMM in a biological application.
  • Topic Three: Social Networks [5 sessions].
    Properties of networks: degree, diameter.
    Betweenness Centrality.
    Clustering using betweenness centrality.
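
To make Topic Two more concrete, here is a minimal sketch of the Viterbi algorithm, which finds the most probable hidden state sequence of an HMM for a given observation sequence. The two-state model used below (a fair coin "F" and a loaded coin "L" emitting heads "H" or tails "T") and all of its probabilities are hypothetical, chosen only to keep the example small. The sketch also assumes that every probability is non-zero, so that logarithms are defined.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observations."""
    # best[t][s]: log probability of the best path ending in state s at time t.
    best = [{
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }]
    # back[t][s]: predecessor of s on that best path.
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical model parameters, invented for this sketch.
STATES = ["F", "L"]
START_P = {"F": 0.5, "L": 0.5}
TRANS_P = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMIT_P = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}

heads_path = viterbi(list("HHHH"), STATES, START_P, TRANS_P, EMIT_P)
tails_path = viterbi(list("TTTT"), STATES, START_P, TRANS_P, EMIT_P)
```

With these parameters a run of heads is best explained by the loaded coin throughout, and a run of tails by the fair coin throughout.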


By the end of the course students should be able to:

  • understand and program two simple supervised machine learning algorithms;
  • use these algorithms in statistically valid experiments, including the design of baselines, evaluation metrics, statistical testing of results, and provision against overtraining;
  • visualise the connectivity and centrality in large networks;
  • use clustering (i.e., a type of unsupervised machine learning) for detection of cliques in unstructured networks.
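
For the network objectives above, the sketch below computes node degree and betweenness centrality on a small undirected, unweighted graph. It uses Brandes' algorithm, one standard way to compute betweenness, which is not necessarily the method used in the course practical. The graph (two triangles joined by a single bridge edge) and the function names are assumptions made for this example.

```python
from collections import deque


def betweenness(graph):
    """Betweenness centrality for an undirected, unweighted graph.

    graph maps each node to a list of its neighbours.
    """
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from source.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths
        dist = {v: -1 for v in graph}   # -1 means not yet visited
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Accumulate dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]

    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in centrality.items()}


# Hypothetical toy network: triangles a-b-c and d-e-f joined by edge c-d.
TOY_GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}

degree = {v: len(neighbours) for v, neighbours in TOY_GRAPH.items()}
scores = betweenness(TOY_GRAPH)
```

In this toy network the two bridge endpoints, c and d, carry every shortest path between the triangles and so have the highest betweenness. Removing the highest-betweenness connections is the idea behind clustering by betweenness.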

Recommended reading

Jurafsky, D. and Martin, J. (2008). Speech and language processing. Prentice Hall.

Easley, D. and Kleinberg, J. (2010). Networks, crowds, and markets: reasoning about a highly connected world. Cambridge University Press.