Course pages 2016–17

# Machine Learning and Real-world Data

**Principal lecturers:** Dr Simone Teufel, Prof Ann Copestake

**Taken by:** Part IA CST 75%

**Past exam questions**

No. of lectures and practical classes: 16

Suggested hours of supervisions: 4

Prerequisite courses: NST Mathematics

## Aims

This course introduces students to machine learning algorithms as used in real-world applications, and to the experimental methodology necessary to perform statistical processing of large-scale unpredictable processes such as language, social networks or genetic data. Students will perform 3 extended practicals, as follows:

- Statistical classification: Determining a movie review’s sentiment using Naive Bayes (7 sessions)
- Sequence Analysis: Detection of proteins in genetic data using Hidden Markov Modelling (4 sessions)
- Network analysis of a social network, including detection of cliques and central nodes (5 sessions)

## Syllabus

**Topic One: Statistical Classification [7 sessions].**

Introduction to Sentiment Classification.

Naive Bayes Parameter Estimation.

Statistical Laws of Language.

Smoothing and Statistical Tests.

Overtraining.

Uncertainty and Human Agreement.

**Topic Two: Sequence Analysis [4 sessions].**

Simple HMM Parameter Estimation.

The Viterbi Algorithm.

Random Baselines and Evaluation Metrics.

Application to Protein Detection Data.

**Topic Three: Network Analysis [5 sessions].**

Degree, Diameter, Visualisation.

Random Networks and Small World Property.

Betweenness Centrality.

Clique Finding.

## Objectives

By the end of the course, students should be able to:

- understand and program two simple supervised machine learning algorithms;
- use these algorithms in statistically valid experiments, including the design of baselines, evaluation metrics, statistical testing of results, and provision against overtraining;
- visualise and interpret examples of statistical laws of language;
- visualise the connectivity and centrality in large networks;
- use clustering (i.e., a type of unsupervised machine learning) for detection of cliques in unstructured networks.

## Recommended reading

Jurafsky, D. & Martin, J. (2008). *Speech and language
processing*. Prentice Hall.

Durbin, R., Eddy, S., Krogh, A. & Mitchison, G. (1998). *Biological
sequence analysis: probabilistic models of proteins and nucleic
acids*. Cambridge University Press.

Easley, D. & Kleinberg, J. (2010). *Networks, crowds, and markets:
reasoning about a highly connected world*. Cambridge University
Press.