Department of Computer Science and Technology

Course pages 2018–19

Advanced topics in machine learning and natural language processing

Organisation and Instructions

We will run 5 or 6 of the following topics and ask all students taking the module to rank ALL topics in order of preference. Please send your rankings to Ted Briscoe by noon on Friday 11th January 2019. We will assign students to topics in line with their preferences as far as possible.

Each student will attend 4 topics and each topic will consist of 4 sessions. A topic will typically consist of one preliminary lecture followed by 3 reading and discussion sessions, so that a typical topic can accommodate up to 6 students presenting a paper each, while still allowing at least 10 minutes of general discussion per session. Each student will be required to write an essay, or to undertake a short project and write a project report, on ONE of their chosen topics. The topic organiser will help you formulate the project or essay and will first-mark the assessed work; one module organiser will second-mark it. The assessed work will consist of a maximum of 5000 words.

Topic List

Do LSTMs learn syntax?

Description

It's over thirty years since Jeff Elman described the Simple Recurrent Network (SRN) and demonstrated that a connectionist (aka neural) network with recurrent hidden units could learn to approximate some aspects of natural language syntax, given enough exposure to controlled data. (Computational) linguists remained sceptical, partly because of the difficulty of training such networks, though some cognitive scientists continued to explore such models as (rough) approximations of how the brain might process language. The invention of gated Long Short-Term Memory (LSTM) recurrent networks 20 years ago added impetus to this research programme by solving the problem of vanishing gradients when training such networks to capture long-distance dependencies. However, it wasn't until the advent of very large datasets, powerful parallel processors (GPUs), and techniques such as pre-training word embeddings or the use of auxiliary loss functions that LSTMs, and variants such as Gated Recurrent Unit (GRU) networks, really took off in Computational Linguistics.
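For concreteness, here is one standard formulation of the LSTM cell (a commonly cited version; notation varies across papers, and this is not drawn from any particular reading):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate cell}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state}\\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```

The additive update of the cell state $c_t$ is the key point: when the forget gate is open, gradients can flow back through the $c_{t-1}$ path largely unattenuated, which is what mitigates vanishing gradients over long-distance dependencies.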

Today the best language models, (super)taggers, and sequential classifiers for a wide variety of tasks are based on (mostly bi-directional) LSTM models (try a search for 'LSTM' in the ACL Anthology). Nevertheless, what exactly these 'black boxes' learn remains contentious. In the overview, I'll describe the model(s) and the various approaches researchers have taken to exploring their learning capabilities and learnt representations. Then we'll look in detail at some of the recent papers addressing how much natural language 'syntax' is, or can be, acquired by such models.

Resources & Datasets

Background Reading

Readings

Project suggestions

A feasible mini-project would be to test a pre-trained LSTM on some (further) linguistic constructions, or to train and test one on a (variant) artificial language, broadly following one of the methods described in the readings. A minimal sketch of the first option follows.
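The sketch below scores a minimal pair of sentences with a language model, broadly in the style of the subject-verb agreement probing literature. Here `lm` and `vocab` are hypothetical placeholders, not artefacts from the readings:

```python
import torch

def sentence_logprob(lm, vocab, tokens):
    # `lm` is assumed to be a pre-trained PyTorch LSTM language model that
    # maps a (seq_len, 1) tensor of token ids to per-step next-token logits;
    # `vocab` maps token strings to ids. Both are hypothetical.
    ids = torch.tensor([vocab[t] for t in tokens]).unsqueeze(1)
    with torch.no_grad():
        logits, _ = lm(ids[:-1])                  # predict tokens 2..n
        logp = torch.log_softmax(logits, dim=-1)  # (seq_len-1, 1, |V|)
    targets = ids[1:, 0]
    steps = torch.arange(len(targets))
    return logp[steps, 0, targets].sum().item()   # total log-probability

# Minimal pair: a model that has learnt agreement should prefer `good`.
good = "the keys to the cabinet are on the table".split()
bad = "the keys to the cabinet is on the table".split()
# sentence_logprob(lm, vocab, good) > sentence_logprob(lm, vocab, bad) ?
```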

Imitation Learning

Imitation learning was initially proposed in robotics as a way to build better robots (Schaal, 1999). The connecting theme is to combine a reward function at the end of the action sequence with demonstrations of the task at hand by an expert. Since then it has been applied to a number of tasks that can be modelled as a sequence of actions taken by an agent, including video game agents, moving cameras to track players, and structured prediction for various tasks in natural language processing.

Over the years a number of algorithms have been proposed in the literature, but without the connections between the various approaches necessarily being made clear. The initial lecture will set out the criteria that will be used to examine the algorithms.
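For orientation, here is a schematic of one widely cited imitation-learning algorithm, DAgger (Ross et al., 2011). This is a sketch under assumed interfaces, not a reference implementation:

```python
def dagger(env, expert, learner, n_iters=10, episodes_per_iter=5):
    # Hypothetical interfaces: env.reset() -> state, env.step(action) ->
    # (state, done); expert(state) -> action; learner.fit(data) and
    # learner.act(state). The full algorithm also rolls in a mixture of
    # expert and learner actions, annealed towards the learner.
    dataset = []  # aggregated (state, expert_action) pairs
    for _ in range(n_iters):
        for _ in range(episodes_per_iter):
            state, done = env.reset(), False
            while not done:
                # Record what the expert WOULD do in the states actually
                # visited by the current policy...
                dataset.append((state, expert(state)))
                # ...but follow the learner, so the training distribution
                # matches the states the learner induces at test time.
                state, done = env.step(learner.act(state))
        learner.fit(dataset)  # retrain on the aggregated dataset
    return learner
```

Note that DAgger uses only expert queries, with no reward signal; other algorithms in the readings combine demonstrations with an end-of-sequence reward, as described above.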

Each student will present a paper and corresponding algorithm from the list of papers below and may write a report testing it on a dataset of their choice.

Readings

Interpreting the black box: explainable neural networks

  • Proposer: Marek Rei

Neural networks are one of the most powerful classes of machine learning models, achieving state-of-the-art results on a wide range of benchmarks. A key aspect behind their success is the ability to discover representations that can capture relevant underlying structure in the training data. However, most of these architectures are known to be 'black box' models, as it is very difficult to infer why a neural model has made some specific prediction.

Information in a neural architecture generally passes through multiple non-linear layers and gets combined with millions of weights, making it extremely challenging to provide human-interpretable explanations or visualizations of the decision process. Recent work on adversarial examples has also shown that neural networks are often vulnerable to carefully constructed modifications of the inputs that are nevertheless imperceptible to humans, leading researchers to ask what these models are actually learning and how they can be improved.
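As a concrete example of such a modification, here is a minimal sketch of the fast gradient sign method (FGSM) of Goodfellow et al.; `model` is a placeholder for any differentiable classifier:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.01):
    # Perturb the input in the direction that increases the loss,
    # under a max-norm budget eps. A sketch, not a hardened attack.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # A small step along the sign of the input gradient is often enough
    # to flip the prediction while remaining imperceptible to humans.
    return (x + eps * x.grad.sign()).detach()
```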

Creating neural network architectures that are interpretable is an active research area, as such models would provide multiple benefits:

  • Data analysis. Knowing which information the model uses to make decisions can reveal patterns and regularities in the dataset, providing novel insight about the task that it is solving.
  • Model improvement. Understanding why the model makes specific incorrect decisions can inform us how to improve it and guide the model development.
  • Providing explanations. When automated systems are making potentially life-changing decisions, users will want to receive human-interpretable explanations for why these specific decisions were made.

The latest regulations also require that practical machine learning systems making decisions that can affect users be able to provide an explanation for their behaviour, making the need for interpretable models even more pressing.

In this module we will discuss different methods for interpreting the internal decisions of neural models, along with ways of explicitly designing architectures to be human-interpretable.
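As one simple representative of these methods, the sketch below computes a gradient-times-input saliency map, attributing a prediction to input features; `model` is again a placeholder for any differentiable classifier:

```python
import torch

def input_saliency(model, x, target_class):
    # Gradient x input: how strongly does each input feature locally
    # influence the score of the chosen class? A basic attribution
    # method; many refinements exist.
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]  # scalar class score (batch of 1)
    score.backward()
    return (x.grad * x).abs().squeeze(0)  # per-feature attribution
```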

  • Introductory Slides

Papers for student presentations

Variational inference

  • Proposer: Ryan Cotterell

In this module we will explore the foundations and modern applications of variational inference. At its core, variational inference is a trick for taking an intractable summation and replacing it with a tractable optimization problem. Our tour will start with the classic tutorial of Jordan et al. from 1999 and then we will work our way through a variety of more recent examples. Why study variational inference? From Bayesian neural networks to latent-variable modeling, variational techniques are omnipresent in ML and NLP. The roots of the technique can be traced back to statistical physics; indeed, they may be traced back to Richard Feynman himself. (See this illuminating blog post.) Want a quick overview? Jason Eisner surveys the landscape in this marvelously clear introduction to the subject.
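To make the 'trick' concrete: for any tractable distribution q(z), Jensen's inequality turns the intractable log marginal likelihood into a lower bound, the ELBO, to be maximized over q:

```latex
\log p(x) \;=\; \log \sum_{z} q(z)\,\frac{p(x,z)}{q(z)}
\;\ge\; \sum_{z} q(z)\,\log \frac{p(x,z)}{q(z)}
\;=\; \mathbb{E}_{q}\big[\log p(x,z)\big] + H(q) \;=:\; \mathcal{L}(q)
```

The gap between $\log p(x)$ and $\mathcal{L}(q)$ is exactly $\mathrm{KL}(q(z) \,\|\, p(z \mid x))$, so maximizing the ELBO over a tractable family of q's is the optimization problem that replaces the summation.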

Each student will present a paper from the list of papers below and will write a summary (think scribing) and, perhaps, implement it and test it on a dataset of their choice.

Readings

  • An Introduction to Variational Methods for Graphical Models. Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola and Lawrence K. Saul. Machine Learning, 1999.

    Why this paper? If you had to read one tutorial on variational inference, this is it. It's a classic, still relevant and generally amazing. Everything else on the reading list is a (relatively minor) extension of the math in this work.

  • The wake-sleep algorithm for unsupervised neural networks. Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey and Radford M. Neal. Science, 1995.

    Why this paper? This paper discusses an older algorithm, wake-sleep, that was a precursor to the now-famous variational autoencoder. Reading the paper also gives the proper historical perspective: what Hinton et al. termed a neural network in 1995 would almost always be called a directed graphical model with latent binary variables these days. In short, the term has evolved in an interesting way (at least, as far as I can tell). Moreover, their method introduces the concept of an inference network, which is quite hip these days in ML.

  • Latent Dirichlet Allocation. David M. Blei, Andrew Y. Ng and Michael I. Jordan. Journal of Machine Learning Research, 2003.

    Why this paper? The LDA paper launched a thousand spin-offs. I dare you to find a senior researcher who hasn't slipped one or more random variables into LDA and worked out an inference scheme. Here, we won't be interested in the model per se, but rather in the appendix! The journal version of this paper is such an excellent piece of scholarship that even the appendix is pedagogically amazing. Ready to meet the digamma function?

  • Joint Parsing and Alignment with Weakly Synchronized Grammars. David Burkett, John Blitzer and Dan Klein. NAACL, 2010.

    Why this paper? As an NLP person, I had to sneak something on NLP into this module. The paper is a great example of how to use variational inference in a structured NLP model. Specifically, the authors show how to use structured mean field for joint alignment and parsing. They also offer a longer tutorial on Structured Variational Inference, which they gave at ACL 2013.

  • Stochastic Variational Inference. Matthew D. Hoffman, David M. Blei, Chong Wang and John Paisley. Journal of Machine Learning Research, 2013.

    Why this paper? This journal paper is another pedagogical gem from Dave Blei's group. It goes over how stochastic approximation techniques can be used to speed up variational inference. The ideas here are very relevant for understanding the variational autoencoder, a popular model in many areas of ML.

  • Auto-Encoding Variational Bayes. Diederik P. Kingma and Max Welling. International Conference on Learning Representations, 2014.

    Why this paper? Who isn't talking about the variational autoencoder? This instant classic is a must-read for the module. However, by the time we get to this paper, I hope to have shown that it's a very clean combination of three basic ideas: variational inference (with an inference network à la wake-sleep), stochastic approximation and variance reduction. These existing techniques were crocheted together for great good and much hype (see the sketch below).
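To ground that claim, here is a minimal VAE sketch showing the three ingredients together: an inference network q(z|x), the reparameterization trick for low-variance gradients, and a decoder p(x|z). The layer sizes are illustrative only, and this is not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)     # inference network q(z|x)
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # decoder p(x|z)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), so
        # gradients flow through mu and sigma rather than the sampler.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def neg_elbo(x, x_logits, mu, logvar):
    # Reconstruction term plus the analytic KL from q(z|x) to N(0, I);
    # minimizing this is maximizing the ELBO.
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```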

NLP & ML for Speech

  • Proposers: Paula Buttery, Andrew Caines, Helen Yannakoudakis

Description

Writing and speech are very different. Far from being an impoverished version of writing, speech is in fact its own communication mode, with rules, conventions and a grammar of its own (Carter & McCarthy 2017). Natural language processing techniques assume a written input made up of well-formed, punctuated text. Speech first has to be transcribed, then reshaped and segmented to resemble something like written text. We discuss the challenges inherent to speech NLP, the state-of-the-art machine learning techniques which have been applied to these tasks, and how the problem could be decoupled from traditional NLP for writing.
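To illustrate the 'reshaping' step in the simplest possible terms, here is a toy sketch that strips filled pauses and immediate repetitions from a transcript; real systems also handle restarts, punctuation restoration and segmentation:

```python
def normalise_transcript(utterance):
    # Toy normalisation only: lowercase, drop filled pauses, drop
    # immediate word repetitions. Illustrative, not state of the art.
    filled = {"um", "uh", "er", "erm", "mm"}
    out = []
    for tok in utterance.lower().split():
        if tok in filled:
            continue
        if out and tok == out[-1]:  # "the the" style repetition
            continue
        out.append(tok)
    return " ".join(out)

# normalise_transcript("so um the the cat er sat on the mat")
# -> "so the cat sat on the mat"
```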

Resources & Datasets

Background Reading

Readings

Project suggestions

Projects could include a survey of current approaches, a focus on particular linguistic features of speech, or a proposal for amendments to existing NLP technology in order to better deal with speech data.

Deep learning and bioinformatics

  • Proposer: Pietro Lió

Bioinformatics is a vibrant field at the intersection of biology, statistics, and computer science. It uses statistical and computational methodologies to support experimental molecular biology, and it is in part responsible for the current successes and advances of biomedicine at the molecular level.

Readings

Autoencoders and generative models

  • Proposer: Damon Wischik
  • Where and when: Tue 5 March 2–3pm, Fri 8 March 3–5pm, Tue 12 March 2–3pm. Room SW01. (See MPhil/ACS timetable)

Autoencoders are a neural network architecture for learning "concepts". The dream is that we should be able to feed in an unlabelled dataset, e.g. a collection of pictures of faces, and the machine should figure out high-level concepts, e.g. "smiling" or "wears glasses", and it should then be able to extrapolate, e.g. turn a frown into a smile.

Mathematically, this has been seen as a problem of compressed representation (designing an encoder network to turn a high-dimensional datapoint into a low-dimensional codepoint), and of probabilistic generative modelling (designing a decoder network to turn a random low-dimensional codepoint into a new synthetic datapoint).
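A minimal sketch of these two maps (layer sizes illustrative, e.g. flattened 28x28 images; such a model would be trained by minimizing a reconstruction loss such as mean squared error):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        # Encoder: high-dimensional datapoint -> low-dimensional codepoint
        self.encoder = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        # Decoder: codepoint -> reconstructed (or synthetic) datapoint
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, x_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))
```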

In this module, we will explore three different designs of autoencoders, and some applications.

Background reading

Readings

Some of these papers come with supplemental reading. This is to indicate which aspects of the paper matter most to posterity.