Department of Computer Science and Technology

Course pages 2018–19

Advanced topics in machine learning and natural language processing

Organisation and Instructions

We will run 5 or 6 of the following topics and ask all students taking the module to rank ALL topics in order of preference. Please send your rankings to Ted Briscoe by noon on Friday 11th January 2019. We will assign students to topics in line with their preferences as far as possible.

Each student will attend 4 topics and each topic will consist of 4 sessions. A topic will typically consist of one preliminary lecture followed by 3 reading and discussion sessions, so that a typical topic can accommodate up to 6 students presenting a paper each, while still allowing at least 10 minutes of general discussion per session. Each student will be required to write an essay, or to undertake a short project and write a project report, on ONE of their chosen topics. The topic organiser will help you formulate the project or essay and will first-mark the assessed work; one module organiser will second-mark it. The assessed work will consist of a maximum of 5000 words.

Topic List

Do LSTMs learn syntax?

Description

It's over thirty years since Jeff Elman described the Simple Recurrent Network (SRN) and demonstrated that a connectionist (aka neural) network with recurrent hidden units could learn to approximate some aspects of natural language syntax, given enough exposure to controlled data. (Computational) linguists remained sceptical, partly because of the difficulty of training such networks, though some cognitive scientists continued to explore such models as (rough) approximations of how the brain might process language. The invention of gated Long Short-Term Memory (LSTM) recurrent networks 20 years ago added impetus to this research programme by solving the problem of vanishing gradients when training such networks to capture long-distance dependencies. However, it wasn't until the advent of very large datasets, powerful parallel processors (GPUs), and techniques such as pre-training word embeddings or the use of auxiliary loss functions that LSTMs, and variants such as Gated Recurrent Unit (GRU) networks, really took off in Computational Linguistics.
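For concreteness, here is one standard formulation of the LSTM cell (a commonly cited version; notation varies across papers, and this is not drawn from any particular reading):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate cell}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state}\\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```

The additive update of the cell state $c_t$ is the key point: when the forget gate is open, gradients can flow back through the $c_{t-1}$ path largely unattenuated, which is what mitigates vanishing gradients over long-distance dependencies.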

Today the best language models, (super)taggers, and sequential classifiers for a wide variety of tasks are based on (mostly bi-directional) LSTM models (try a search for 'LSTM' in the ACL Anthology). Nevertheless, what exactly these 'black boxes' learn remains contentious. In the overview, I'll describe the model(s) and the various approaches researchers have taken to exploring their learning capabilities and learnt representations. Then we'll look in detail at some of the recent papers addressing how much natural language 'syntax' is, or can be, acquired by such models.

Resources & Datasets

Background Reading

Readings

Project suggestions

A feasible mini-project would be to test a pre-trained LSTM on some (further) linguistic constructions, or to train and test one on a (variant) artificial language, broadly following one of the methods described in the readings. A minimal sketch of the first option follows.
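The sketch below scores a minimal pair of sentences with a language model, broadly in the style of the subject-verb agreement probing literature. Here `lm` and `vocab` are hypothetical placeholders, not artefacts from the readings:

```python
import torch

def sentence_logprob(lm, vocab, tokens):
    # `lm` is assumed to be a pre-trained PyTorch LSTM language model that
    # maps a (seq_len, 1) tensor of token ids to per-step next-token logits;
    # `vocab` maps token strings to ids. Both are hypothetical.
    ids = torch.tensor([vocab[t] for t in tokens]).unsqueeze(1)
    with torch.no_grad():
        logits, _ = lm(ids[:-1])                  # predict tokens 2..n
        logp = torch.log_softmax(logits, dim=-1)  # (seq_len-1, 1, |V|)
    targets = ids[1:, 0]
    steps = torch.arange(len(targets))
    return logp[steps, 0, targets].sum().item()   # total log-probability

# Minimal pair: a model that has learnt agreement should prefer `good`.
good = "the keys to the cabinet are on the table".split()
bad = "the keys to the cabinet is on the table".split()
# sentence_logprob(lm, vocab, good) > sentence_logprob(lm, vocab, bad) ?
```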

Imitation Learning

Imitation learning was initially proposed in robotics as a way to build better robots (Schaal, 1999). The connecting theme is to combine a reward function at the end of the action sequence with demonstrations of the task at hand by an expert. Since then it has been applied to a number of tasks that can be modelled as a sequence of actions taken by an agent, including video game agents, moving cameras to track players, and structured prediction for various tasks in natural language processing.

Over the years a number of algorithms have been proposed in the literature, but without the connections between the various approaches necessarily being made clear. The initial lecture will set out the criteria that will be used to examine the algorithms.
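For orientation, here is a schematic of one widely cited imitation-learning algorithm, DAgger (Ross et al., 2011). This is a sketch under assumed interfaces, not a reference implementation:

```python
def dagger(env, expert, learner, n_iters=10, episodes_per_iter=5):
    # Hypothetical interfaces: env.reset() -> state, env.step(action) ->
    # (state, done); expert(state) -> action; learner.fit(data) and
    # learner.act(state). The full algorithm also rolls in a mixture of
    # expert and learner actions, annealed towards the learner.
    dataset = []  # aggregated (state, expert_action) pairs
    for _ in range(n_iters):
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False
            while not done:
                # Record what the expert WOULD do in the states actually
                # visited by the current policy...
                dataset.append((state, expert(state)))
                # ...but follow the learner, so the training distribution
                # matches the states the learner induces at test time.
                state, done = env.step(learner.act(state))
        learner.fit(dataset)  # retrain on the aggregated dataset
    return learner
```

Note that DAgger uses only expert queries, with no reward signal; other algorithms in the readings combine demonstrations with an end-of-sequence reward, as described above.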

Each student will present a paper and corresponding algorithm from the list of papers below and may write a report testing it on a dataset of their choice.

Readings

Interpreting the black box: explainable neural networks

  • Proposer: Marek Rei

Neural networks are one of the most powerful classes of machine learning models, achieving state-of-the-art results on a wide range of benchmarks. A key aspect behind their success is the ability to discover representations that can capture relevant underlying structure in the training data. However, most of these architectures are known to be 'black box' models, as it is very difficult to infer why a neural model has made some specific prediction.

Information in a neural architecture generally passes through multiple non-linear layers and gets combined with millions of weights, making it extremely challenging to provide human-interpretable explanations or visualizations of the decision process. Recent work on adversarial examples has also shown that neural networks are often vulnerable to carefully constructed modifications of the inputs that are nevertheless imperceptible to humans, leading researchers to ask what these models are actually learning and how they can be improved.
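As a concrete example of such a modification, here is a minimal sketch of the fast gradient sign method (FGSM) of Goodfellow et al.; `model` is a placeholder for any differentiable classifier:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    # Perturb the input in the direction that increases the loss,
    # under a max-norm budget eps. A sketch, not a hardened attack.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # A small step along the sign of the input gradient is often enough
    # to flip the prediction while remaining imperceptible to humans.
    return (x + eps * x.grad.sign()).detach()
```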

Creating neural network architectures that are interpretable is an active research area, as such models would provide multiple benefits:

  • Data analysis. Knowing which information the model uses to make decisions can reveal patterns and regularities in the dataset, providing novel insight about the task that it is solving.
  • Model improvement. Understanding why the model makes specific incorrect decisions can inform us how to improve it and guide the model development.
  • Providing explanations. When automated systems are making potentially life-changing decisions, users will want to receive human-interpretable explanations for why these specific decisions were made.

The latest regulations also require that practical machine learning systems making decisions that can affect users be able to provide an explanation for their behaviour, making the need for interpretable models even more pressing.

In this module we will discuss different methods for interpreting the internal decisions of neural models, along with ways of explicitly designing architectures to be human-interpretable.
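As one simple representative of these methods, the sketch below computes a gradient-times-input saliency map, attributing a prediction to input features; `model` is again a placeholder for any differentiable classifier:

```python
import torch

def input_saliency(model, x, target_class):
    # Gradient x input: how strongly does each input feature locally
    # influence the score of the chosen class? A basic attribution
    # method; many refinements exist.
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]  # scalar class score (batch of 1)
    score.backward()
    return (x.grad * x).abs().squeeze(0)  # per-feature attribution
```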

  • Introductory Slides

Papers for student presentations

Variational inference

  • Proposer: Ryan Cotterell

In this module we will explore the foundations and modern applications of variational inference. At its core, variational inference is a trick for taking an intractable summation and replacing it with a tractable optimization problem. Our tour will start with the classic tutorial of Jordan et al. from 1999 and then we will work our way through a variety of more recent examples. Why study variational inference? From Bayesian neural networks to latent-variable modeling, variational techniques are omnipresent in ML and NLP. The roots of the technique can be traced back to statistical physics; indeed, they may be traced back to Richard Feynman himself. (See this illuminating blog post.) Want a quick overview? Jason Eisner surveys the landscape in this marvelously clear introduction to the subject.
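To make the 'trick' concrete: for any tractable distribution q(z), Jensen's inequality turns the intractable log marginal likelihood into a lower bound, the ELBO, to be maximized over q:

```latex
\log p(x) \;=\; \log \sum_{z} q(z)\,\frac{p(x,z)}{q(z)}
\;\ge\; \sum_{z} q(z)\,\log \frac{p(x,z)}{q(z)}
\;=\; \mathbb{E}_{q}\big[\log p(x,z)\big] + H(q) \;=:\; \mathcal{L}(q)
```

The gap between $\log p(x)$ and $\mathcal{L}(q)$ is exactly $\mathrm{KL}(q(z) \,\|\, p(z \mid x))$, so maximizing the ELBO over a tractable family of q's is the optimization problem that replaces the summation.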

Each student will present a paper from the list of papers below and will write a summary (think scribing) and, perhaps, implement it and test it on a dataset of their choice.

Readings

  • An Introduction to Variational Methods for Graphical Models. Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola and Lawrence K. Saul. Machine Learning, 1999.

    Why this paper? If you had to read one tutorial on variational inference, this is it. It's a classic, still relevant and generally amazing. Everything else on the reading list is a (relatively minor) extension of the math in this work.

  • The wake-sleep algorithm for unsupervised neural networks. Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey and Radford M. Neal. Science, 1995.

    Why this paper? This paper discusses an older algorithm, wake-sleep, that was a precursor to the now-famous variational autoencoder. Reading the paper also gives the proper historical perspective: what Hinton et al. termed a neural network in 1995 would almost always be called a directed graphical model with latent binary variables these days. In short, the term has evolved in an interesting way (at least, as far as I can tell). Moreover, their method introduces the concept of an inference network, which is quite hip these days in ML.

  • Latent Dirichlet Allocation. David M. Blei, Andrew Y. Ng and Michael I. Jordan. Journal of Machine Learning Research, 2003.

    Why this paper? The LDA paper launched a thousand spin-offs. I dare you to find a senior researcher who hasn't slipped one or more random variables into LDA and worked out an inference scheme. Here, we won't be interested in the model per se, but rather in the appendix! The journal version of this paper is such an excellent piece of scholarship that even the appendix is pedagogically amazing. Ready to meet the digamma function?

  • Joint Parsing and Alignment with Weakly Synchronized Grammars. David Burkett, John Blitzer and Dan Klein. NAACL, 2010.

    Why this paper? As an NLP person, I had to sneak something on NLP into this module. The paper is a great example of how to use variational inference in a structured NLP model. Specifically, the authors show how to use structured mean field for joint alignment and parsing. They also offer a longer tutorial on Structured Variational Inference, which they gave at ACL 2013.

  • Stochastic Variational Inference. Matthew D. Hoffman, David M. Blei, Chong Wang and John Paisley. Journal of Machine Learning Research, 2013.

    Why this paper? This journal paper is another pedagogical gem from Dave Blei's group. It goes over how stochastic approximation techniques can be used to speed up variational inference. The ideas here are very relevant for understanding the variational autoencoder, a popular model in many areas of ML.

  • Auto-Encoding Variational Bayes. Diederik P. Kingma and Max Welling. International Conference on Learning Representations, 2014.

    Why this paper? Who isn't talking about the variational autoencoder? This instant classic is a must-read for the module. However, by the time we get to this paper, I hope to have shown that it's a very clean combination of three basic ideas: variational inference (with an inference network à la wake-sleep), stochastic approximation and variance reduction. These existing techniques were crocheted together for great good and much hype (see the sketch below).
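To ground that claim, here is a minimal VAE sketch showing the three ingredients together: an inference network q(z|x), the reparameterization trick for low-variance gradients, and a decoder p(x|z). The layer sizes are illustrative only, and this is not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)     # inference network q(z|x)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # decoder p(x|z)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so
        # gradients flow through mu and sigma rather than the sampler.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def neg_elbo(x, x_logits, mu, logvar):
    # Reconstruction term plus the analytic KL from q(z|x) to N(0, I);
    # minimizing this is maximizing the ELBO.
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```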

NLP & ML for Speech

  • Proposers: Paula Buttery, Andrew Caines, Helen Yannakoudakis

Description

Writing and speech are very different. Far from being an impoverished version of writing, speech is in fact its own communication mode, with rules, conventions and a grammar of its own (Carter & McCarthy 2017). Natural language processing techniques assume a written input made up of well-formed, punctuated text. Speech first has to be transcribed, then reshaped and segmented to resemble something like written text. We discuss the challenges inherent to speech NLP, the state-of-the-art machine learning techniques which have been applied to these tasks, and how the problem could be decoupled from traditional NLP for writing.
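To illustrate the 'reshaping' step in the simplest possible terms, here is a toy sketch that strips filled pauses and immediate repetitions from a transcript; real systems also handle restarts, punctuation restoration and segmentation:

```python
def normalise_transcript(utterance):
    # Toy normalisation only: lowercase, drop filled pauses, drop
    # immediate word repetitions. Illustrative, not state of the art.
    filled = {"um", "uh", "er", "erm", "mm"}
    out = []
    for tok in utterance.lower().split():
        if tok in filled:
            continue
        if out and tok == out[-1]:  # "the the" style repetition
            continue
        out.append(tok)
    return " ".join(out)

# normalise_transcript("so um the the cat er sat on the mat")
# -> "so the cat sat on the mat"
```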

Resources & Datasets

Background Reading

Readings

Project suggestions

Projects could include a survey of current approaches, a focus on particular linguistic features of speech, or a proposal for amendments to existing NLP technology in order to better deal with speech data.

Deep learning and bioinformatics

  • Proposer: Pietro Lió

Bioinformatics is a vibrant field at the intersection of biology, statistics, and computer science. It uses statistical and computational methodologies to support experimental molecular biology, and it is in part responsible for the current successes and advances of biomedicine at the molecular level.

Readings

Autoencoders and generative models

  • Proposer: Damon Wischik
  • Where and when: Tue 5 March 2–3pm, Fri 8 March 3–5pm, Tue 12 March 2–3pm. Room SW01. (See MPhil/ACS timetable)

Autoencoders are a neural network architecture for learning "concepts". The dream is that we should be able to feed in an unlabelled dataset, e.g. a collection of pictures of faces, and the machine should figure out high-level concepts, e.g. "smiling" or "wears glasses", and it should then be able to extrapolate, e.g. turn a frown into a smile.

Mathematically, this has been seen as a problem of compressed representation (designing an encoder network to turn a high-dimensional datapoint into a low-dimensional codepoint), and of probabilistic generative modelling (designing a decoder network to turn a random low-dimensional codepoint into a new synthetic datapoint).
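A minimal sketch of these two maps (layer sizes illustrative, e.g. flattened 28x28 images; such a model would be trained by minimizing a reconstruction loss such as mean squared error):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        # Encoder: high-dimensional datapoint -> low-dimensional codepoint
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        # Decoder: codepoint -> reconstructed (or synthetic) datapoint
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, x_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))
```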

In this module, we will explore three different designs of autoencoders, and some applications.

Background reading

Readings

Some of these papers come with supplemental reading. This is to indicate which aspects of the paper matter most to posterity.