Course pages 2012–13

Statistical Machine Translation

Principal lecturers: Dr Stephen Clark, Dr Bill Byrne
Taken by: MPhil ACS, Part III
Code: L102
Hours: 16 (10 lectures + extended practical work spanning 6 sessions)
Prerequisites: L100 Introduction to Natural Language Processing

Aims

This module provides an in-depth introduction to statistical machine translation (MT), the dominant approach to providing large-scale, robust translation applicable to many language pairs (and the approach currently used by Google).

Syllabus

Overview [2 lectures]: Translation as an economic, political, and cultural activity. Machine translation as a problem in natural language processing. Syntax and morphology in translation. Translation memories; example-based and rule-based MT. Interlingua.
Alignment: automatic translations in text [2 lectures]: Parallel texts and their role in building translation systems and measuring translation quality. Document and sentence alignment: models and algorithms. Word and phrase alignment: models and algorithms. Techniques for automatic measurement of alignment quality. Web crawling for parallel text.
Weighted finite state transducers: algorithms for natural language processing and MT [2 lectures]
SMT systems [4 lectures]: Extraction of translation rules from parallel text. Phrase-based, Hiero, syntax-based MT. Techniques for automatic measurement of translation quality. Minimum error rate training. Language models for SMT: simple back-off, MapReduce. MT system combination. Practical issues in SMT: true casing; source text pre-processing; handling morphology; system building procedure.

All lectures will be given by Dr Clark or Dr Byrne.

Objectives

On completion of this module, students should understand:

the role of parallel text in MT;
how alignment models can be estimated from parallel text;
how alignment models capture divergent language properties such as word order;
the use of WFSTs in translation and some other basic NLP tasks;
the extraction of translation rules from parallel text;
various phrase-based translation architectures, including Hiero;
parameter optimization procedures for SMT;
the role of language models in SMT;
the evaluation of SMT systems using automatic metrics;
system combination techniques for SMT.

Practical work

There will be two substantial practical exercises associated with this module.

Practical 1: 2 sessions. Parallel text, alignment models and WFSTs.
Practical 2: 4 sessions. SMT system construction and evaluation.

Assessment

Written report covering the practical work, worth 35% of the marks.
One final take-home test covering all the material, contributing 65% of the final mark. Questions set and marked by Dr Clark and Dr Byrne.

Computer Laboratory