Computer Laboratory

Course pages 2016–17

Machine Learning and Real-world Data

Principal lecturers: Dr Simone Teufel, Prof Ann Copestake
Taken by: Part IA CST 75%

No. of lectures and practical classes: 16
Suggested hours of supervisions: 4
Prerequisite courses: NST Mathematics


This course introduces students to machine learning algorithms as used in real-world applications, and to the experimental methodology needed for statistical processing of data from large-scale, unpredictable processes such as language, social networks or genetic data. Students will complete three extended practicals, as follows:

  • Statistical classification: determining a movie review’s sentiment using Naive Bayes (7 sessions)
  • Sequence analysis: detection of proteins in genetic data using Hidden Markov Models (4 sessions)
  • Network analysis: analysis of a social network, including detection of cliques and central nodes (5 sessions)
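As a rough illustration of the first practical, the following is a minimal sketch of a Naive Bayes sentiment classifier with add-one smoothing. It is not the course's own code: the function names, the POS/NEG labels and the toy reviews are all assumptions made for this example.

```python
import math
from collections import Counter


def train_naive_bayes(docs):
    """Estimate log priors and add-one smoothed log likelihoods.

    docs is a list of (tokens, label) pairs; all names here are illustrative.
    """
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)

    vocab = {w for counts in word_counts.values() for w in counts}
    log_priors = {c: math.log(n / len(docs)) for c, n in label_counts.items()}

    log_likelihoods = {}
    for c, counts in word_counts.items():
        # Add-one smoothing: every vocabulary word gets one extra count.
        total = sum(counts.values()) + len(vocab)
        log_likelihoods[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_priors, log_likelihoods, vocab


def classify(tokens, log_priors, log_likelihoods, vocab):
    """Return the label with the highest log posterior; unseen words are skipped."""
    scores = {
        c: log_priors[c] + sum(log_likelihoods[c][w] for w in tokens if w in vocab)
        for c in log_priors
    }
    return max(scores, key=scores.get)


# Toy training data, invented for this sketch.
TOY_REVIEWS = [
    ("a great and moving film".split(), "POS"),
    ("great acting great script".split(), "POS"),
    ("a dull and boring film".split(), "NEG"),
    ("boring plot dull acting".split(), "NEG"),
]

model = train_naive_bayes(TOY_REVIEWS)
print(classify("great film".split(), *model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word seen in only one class from driving the other class's probability to zero.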


  • Topic One: Statistical Classification [7 sessions].
    Introduction to Sentiment Classification.
    Naive Bayes Parameter Estimation.
    Statistical Laws of Language.
    Smoothing and Statistical Tests.
    Uncertainty and Human Agreement.

  • Topic Two: Sequence Analysis [4 sessions].
    Simple HMM Parameter Estimation.
    The Viterbi Algorithm.
    Random Baselines and Evaluation Metrics.
    Application to Protein Detection Data.

  • Topic Three: Network Analysis [5 sessions].
    Degree, Diameter, Visualisation.
    Random Networks and Small World Property.
    Betweenness Centrality.
    Clique Finding.
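
As a rough illustration of the first network analysis session, the following is a minimal sketch that computes node degrees and the diameter of a small undirected graph by breadth-first search. The adjacency-set representation, the function names and the toy network are assumptions made for this example, and the graph is assumed to be connected.

```python
from collections import deque


def degrees(graph):
    """Map each node to its number of neighbours."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def bfs_distances(graph, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter(graph):
    """Longest shortest path over all pairs of nodes (graph assumed connected)."""
    return max(max(bfs_distances(graph, s).values()) for s in graph)


# Toy undirected network, invented for this sketch: each edge appears in both
# endpoints' neighbour sets.
TOY_NETWORK = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

print(degrees(TOY_NETWORK))
print(diameter(TOY_NETWORK))
```

Running one breadth-first search per node is simple but costs time proportional to nodes times edges, so for large networks sampling or approximation may be preferable.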


By the end of the course students should be able to:

  • understand and program two simple supervised machine learning algorithms;
  • use these algorithms in statistically valid experiments, including the design of baselines, evaluation metrics, statistical testing of results, and provision against overtraining;
  • visualise and interpret examples of statistical laws of language;
  • visualise the connectivity and centrality in large networks;
  • use clustering (i.e., a type of unsupervised machine learning) for detection of cliques in unstructured networks.

Recommended reading

Jurafsky, D. & Martin, J. (2008). Speech and language processing. Prentice Hall.

Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press.

Easley, D. & Kleinberg, J. (2010). Networks, crowds, and markets: reasoning about a highly connected world. Cambridge University Press.