Course pages 2018–19
Advanced topics in machine learning and natural language processing
Organisation and Instructions
We will run 5 or 6 of the following topics and ask all students taking the module to rank ALL topics in order of preference. Please send your rankings to Ted Briscoe by noon on Friday 11th January 2019. We will assign students to topics based on preferences as much as possible.
Each student will attend 4 topics, and each topic will consist of 4 sessions: typically one preliminary lecture followed by 3 reading and discussion sessions. A typical topic can therefore accommodate up to 6 students, each presenting a paper, while allowing at least 10 minutes of general discussion per session. Each student will be required to write an essay, or to undertake a short project and write a project report, on ONE of their chosen topics. The topic organiser will help you formulate a project or essay and will first-mark the assessed work; one module organiser will second-mark it. The assessed work will consist of a maximum of 5000 words.
Topic List
- Do LSTMs learn syntax?
- Imitation learning
- Interpreting the black box: explainable neural network models
- Variational inference
- NLP & ML for speech
- Deep learning and bioinformatics
- Autoencoders and generative models
Do LSTMs learn syntax?
- Proposer: Ted Briscoe
Description
It's over thirty years since Jeff Elman described the Simple Recurrent Network (SRN) and demonstrated that a connectionist (aka neural) network with recurrent hidden units could learn to approximate some aspects of natural language syntax, given enough exposure to controlled data. (Computational) Linguists remained sceptical, partly because of the difficulty of training such networks, though some Cognitive Scientists continued to explore such models as (rough) approximations of how the brain might process language. The invention of gated Long Short Term Memory (LSTM) recurrent networks 20 years ago added impetus to this research programme by solving the problem of vanishing gradients during the training of such networks to capture long-distance dependencies. However, it wasn't until the advent of very large datasets, powerful parallel processors (GPUs), and techniques such as pre-training word embeddings or the use of auxiliary loss functions, that LSTMs and variants such as Gated Recurrent Unit (GRU) networks really took off in Computational Linguistics.
Today the best language models, (super)taggers, and sequential classifiers for a wide variety of tasks are based on (mostly bi-directional) LSTM models (try a search for 'LSTM' in the ACL Anthology). Nevertheless, what exactly these 'black boxes' learn remains contentious. In the overview, I'll describe the model(s) and the various approaches researchers have taken to exploring their learning capabilities and learnt representations. Then we'll look in detail at some of the recent papers addressing how much natural language 'syntax' is or can be acquired by such models.
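The LSTM's gating mechanism can be summarised in a few lines. The sketch below is a single LSTM cell step in NumPy, purely for illustration (random weights, no training, and not any of the implementations used in the readings); the additive cell-state update `f * c_prev + i * g` is the part that mitigates vanishing gradients.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [x; h_prev] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # additive cell-state path
    h = o * np.tanh(c)                            # hidden state exposed downstream
    return h, c

# Run the cell over a toy sequence of length 5 with random weights.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x, h, c, W, b)
```

Because the forget gate multiplies the previous cell state rather than squashing it through a nonlinearity, gradients can flow along the cell state across many time steps, which is what lets the trained network capture long-distance dependencies.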
Resources & Datasets
- Introductory Slides (after lecture)
- Stephen Pulman's Extended Wheeler Lecture Slides
- See links in readings for code and datasets
Background Reading
- Recurrent Neural Networks
- Recurrent and Recursive Nets, Goodfellow et al. Deep Learning 2016
- Distributed Representations, Simple Recurrent Networks, and Grammatical Structure, Elman, Machine Learning, 1991
- Learning and development in neural networks: the importance of starting small, Elman, Cognition, 1993
- Recurrent Neural Networks as Weighted Language Recognizers, Chen et al., NAACL, 2018
Readings
- Colorless green recurrent networks dream hierarchically, Gulordava et al., NAACL, 2018. Presenter's Slides (after presentation)
- Targeted Syntactic Evaluation of Language Models, Marvin & Linzen, 2018. Presenter's Slides (after presentation)
- Do RNNs learn human-like abstract word order preferences?, Futrell & Levy, 2018. Presenter's Slides (after presentation)
- What do RNN Language Models Learn about Filler-Gap Dependencies?, Wilcox et al., EMNLP, 2018. Presenter's Slides (after presentation)
- Evaluating the Ability of LSTMs to Learn Context-Free Grammars, Sennhauser & Berwick, EMNLP, 2018. Presenter's Slides (after presentation)
- On Evaluating the Generalization of LSTM Models in Formal Languages, Suzgun et al., 2018. Presenter's Slides (after presentation)
Project suggestions
A feasible mini-project would be to test a pre-trained LSTM on some (further) linguistic constructions, or to train and test one on a (variant) artificial language, broadly following one of the methods described in the readings.
Imitation Learning
- Proposer: Andreas Vlachos
- Further details
Imitation learning was initially proposed in robotics as a way to teach robots by demonstration (Schaal, 1999). The connecting theme is to combine the reward function at the end of the action sequence with demonstrations of the task at hand by an expert. Since then it has been applied to a number of tasks which can be modelled as a sequence of actions taken by an agent, including video game agents, moving cameras to track players, and structured prediction for various tasks in natural language processing.
Over the years a number of algorithms have been proposed in the literature, but without the connections between the various approaches necessarily being made clear. The initial lecture will set out the criteria with which we will examine the algorithms.
Each student will present a paper and corresponding algorithm from the list of papers below and may write a report testing it on a dataset of their choice.
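As a preview of the style of algorithm we will examine, here is a schematic sketch of dataset aggregation (DAgger, from the Ross et al. reading): roll out the current policy, ask the expert to label the states it actually visits, aggregate, and retrain. The toy `expert`, `rollout`, and `train` functions below are invented for illustration (in particular, the toy state distribution is independent of the policy for simplicity).

```python
import random

def dagger(expert, rollout, train, n_iters=5):
    """Schematic DAgger loop:
    1. roll out the current policy to collect the states it visits,
    2. label those states with the expert's actions,
    3. aggregate into one dataset and retrain the policy on it."""
    data = []
    policy = train(data)  # initial policy (trained on nothing)
    for _ in range(n_iters):
        states = rollout(policy)                   # states visited by *our* policy
        data += [(s, expert(s)) for s in states]   # expert relabels them
        policy = train(data)                       # retrain on the aggregate
    return policy

# Toy instantiation (all names below are illustrative, not from the papers):
expert = lambda s: s % 2                 # "expert" labels the parity of a state

def rollout(policy, horizon=20):
    random.seed(0)                       # fixed seed keeps the toy deterministic
    return [random.randrange(100) for _ in range(horizon)]

def train(data):
    lookup = dict(data)                  # memorise the expert's labels
    return lambda s: lookup.get(s, 0)

policy = dagger(expert, rollout, train)
```

The point of the aggregation step is that the policy is trained on its *own* state distribution rather than only the expert's, which is what gives DAgger its no-regret guarantee relative to behaviour cloning.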
Readings
- Search-based Structured Prediction Hal Daumé III, John Langford and Daniel Marcu Machine Learning Journal (MLJ), 2009
- A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning Stephane Ross, Geoffrey J. Gordon, J. Andrew Bagnell Artificial Intelligence and Statistics Conference (AISTATS), 2011
- Learning to search better than your teacher Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III and John Langford International Conference on Machine Learning (ICML), 2015
- Sequence Level Training with Recurrent Neural Networks Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba International Conference on Learning Representations (ICLR), 2016
- Hierarchical Imitation and Reinforcement Learning Hoang M. Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue and Hal Daumé III International Conference on Machine Learning (ICML), 2018
- Residual Loss Prediction: Reinforcement Learning with no Incremental Feedback Hal Daumé III, John Langford and Amr Sharaf International Conference on Learning Representations (ICLR), 2018
Interpreting the black box: explainable neural networks
- Proposer: Marek Rei
Neural networks are one of the most powerful classes of machine learning models, achieving state-of-the-art results on a wide range of benchmarks. A key aspect behind their success is the ability to discover representations that can capture relevant underlying structure in the training data. However, most of these architectures are known to be 'black box' models, as it is very difficult to infer why a neural model has made some specific prediction.
Information in a neural architecture generally passes through multiple non-linear layers and gets combined via millions of weights, making it extremely challenging to provide human-interpretable explanations or visualizations of the decision process. Recent work on adversarial examples has also shown that neural networks are often vulnerable to carefully constructed modifications of the inputs that are imperceptible to humans, leading researchers to ask what these models are actually learning and how they can be improved.
Creating neural network architectures that are interpretable is an active research area, as such models would provide multiple benefits:
- Data analysis. Knowing which information the model uses to make decisions can reveal patterns and regularities in the dataset, providing novel insight about the task that it is solving.
- Model improvement. Understanding why the model makes specific incorrect decisions can inform us how to improve it and guide the model development.
- Providing explanations. When automated systems are making potentially life-changing decisions, users will want to receive human-interpretable explanations for why these specific decisions were made.
Recent regulations also require that practical machine learning models making decisions that can affect users be able to provide an explanation for their behaviour, making the need for interpretable models even more pressing.
In this module we will discuss different methods for interpreting the internal decisions of neural models, along with explicitly designing the architectures to be human-interpretable.
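One of the simplest families of interpretation methods uses the gradient of the prediction with respect to the input as a saliency map: features whose perturbation would move the prediction most are deemed most relevant. The sketch below illustrates this on a toy logistic-regression model (my own minimal example, not the method of any paper on the list, where the same idea is applied to deep networks).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saliency(w, b, x):
    """Gradient of the predicted probability w.r.t. the input features.
    For logistic regression p = sigmoid(w.x + b), this gradient is p(1-p) * w:
    a crude 'which features, if nudged, move the prediction most' map."""
    p = sigmoid(w @ x + b)
    return p * (1.0 - p) * w

w = np.array([2.0, -1.0, 0.0])   # toy weights: feature 2 is irrelevant
x = np.array([0.5, 0.5, 0.5])
s = saliency(w, 0.0, x)
```

For a deep network the gradient is obtained by backpropagation instead of in closed form, but the interpretation is the same, and methods like Grad-CAM on the reading list can be seen as structured refinements of this idea.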
Papers for student presentations
- Why Should I Trust You?: Explaining the predictions of any classifier (KDD 2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin.
- Generating visual explanations (ECCV 2016) Hendricks, Lisa Anne, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell.
- Show, attend and tell: Neural image caption generation with visual attention (ICML 2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio.
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization (ICCV 2017) Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra.
- Explainable Prediction of Medical Codes from Clinical Text (NAACL 2018) James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein.
- Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement (ACL 2018) Nina Poerner, Hinrich Schütze, and Benjamin Roth.
Variational inference
- Proposer: Ryan Cotterell
In this module we will explore the foundations and modern applications of variational inference. At its core, variational inference is a trick for taking an intractable summation and replacing it with a tractable optimization problem. Our tour will start with the classic tutorial of Jordan et al. from 1999 and then we will work our way through a variety of more recent examples. Why study variational inference? From Bayesian neural networks to latent-variable modeling, variational techniques are omnipresent in ML and NLP. The roots of the technique can be traced back to statistical physics—indeed, they may be traced back to Richard Feynman himself. (See this illuminating blog post.) Want a quick overview? Jason Eisner surveys the landscape in this marvelously clear introduction to the subject.
Each student will present a paper from the list of papers below and will write a summary (think scribing) and, perhaps, implement it and test it on a dataset of their choice.
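To make the "intractable summation becomes tractable optimization" view concrete, here is a minimal toy example of my own (not from the readings): for the conjugate model z ~ N(0, 1), x | z ~ N(z, 1) with observation x = 2, the exact posterior is N(1, 1/2), so maximising the closed-form ELBO over a Gaussian variational family should recover it.

```python
import numpy as np

def elbo(m, s, x=2.0):
    """Closed-form ELBO for the toy model z ~ N(0,1), x|z ~ N(z,1),
    with Gaussian variational family q(z) = N(m, s^2):
    ELBO = E_q[log p(x, z)] - E_q[log q(z)]  (additive constants dropped)."""
    expected_log_joint = -0.5 * (m**2 + s**2) - 0.5 * ((x - m)**2 + s**2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return expected_log_joint + entropy

# Grid-search the variational parameters; since the family contains the
# exact posterior N(1, 1/2), the best q should land near m = 1, s ≈ 0.707.
ms = np.linspace(0.0, 2.0, 201)
ss = np.linspace(0.1, 1.5, 141)
best_m, best_s = max(((m, s) for m in ms for s in ss), key=lambda p: elbo(*p))
```

In real applications the family does not contain the true posterior and the expectations are not available in closed form, which is where the stochastic and amortised techniques in the readings come in; but the objective being optimised is exactly this one.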
Readings
- An Introduction to Variational Methods for Graphical Models. Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola and Lawrence K. Saul. Machine Learning, 1999.
Why this paper? If you had to read one tutorial on variational inference, this would be it. It's a classic, still relevant, and generally amazing. Everything else on the reading list is a (relatively minor) extension of the math in this work.
- The wake-sleep algorithm for unsupervised neural networks. Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey and Radford M. Neal. Science, 1995.
Why this paper? This paper discusses an older algorithm, wake-sleep, that was a precursor to the now famous variational autoencoder. Reading the paper also gives the proper historical perspective: What Hinton et al. term a neural network in 1995 would almost always be called a directed graphical model with latent binary variables these days. In short, the term has evolved in an interesting way—at least, as far as I can tell. Moreover, their method introduces the concept of an inference network that is quite hip these days in ML.
- Latent Dirichlet Allocation. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Journal of Machine Learning Research, 2003.
Why this paper? The LDA paper launched a thousand spin-offs. I dare you to find a senior researcher who hasn't slipped one or more random variables into LDA and worked out an inference scheme. Here, we won't be interested in the model per se, but rather the appendix! The journal version of this paper is such an excellent piece of scholarship that even the appendix is pedagogically amazing. Ready to meet the digamma function?
- Joint Parsing and Alignment with Weakly Synchronized Grammars. David Burkett, John Blitzer and Dan Klein. NAACL, 2010.
Why this paper? As an NLP person, I had to sneak some NLP into this module. The paper is a great example of how to use variational inference in a structured NLP model. Specifically, they show how to use structured mean field for joint alignment and parsing. The authors also offer a longer tutorial on Structured Variational Inference, which they gave at ACL 2013.
- Stochastic Variational Inference. Matthew D. Hoffman, David M. Blei, Chong Wang and John Paisley. Journal of Machine Learning Research, 2013.
Why this paper? This journal paper is another pedagogical gem from Dave Blei's group. It goes over how stochastic approximation techniques can be used to speed up variational inference. The ideas here are very relevant for understanding the variational autoencoder, a popular model in many areas of ML.
- Auto-Encoding Variational Bayes. Diederik P. Kingma and Max Welling. International Conference on Learning Representations, 2014.
Why this paper? Who isn’t talking about the variational autoencoder? This instant classic is a must-read for the module. However, by the time we get to this paper, I hope to have shown that it’s a very clean combination of three basic ideas: variational inference (with an inference network à la wake-sleep), stochastic approximation and variance reduction. These existing techniques were crocheted together for great good and much hype.
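One of those ingredients, the reparameterisation trick that makes the VAE's sampling step differentiable, fits in two lines. This is my own toy NumPy illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_reparam(mu, sigma, n=100_000):
    """Reparameterisation trick: z = mu + sigma * eps with eps ~ N(0, 1).
    The randomness lives entirely in eps, so z is a deterministic,
    differentiable function of (mu, sigma) -- which is what lets the VAE
    backpropagate through its sampling step with low-variance gradients."""
    eps = rng.standard_normal(n)
    return mu + sigma * eps

z = sample_reparam(2.0, 0.5)
```

Contrast this with sampling z ~ N(mu, sigma²) directly, where the parameters are buried inside the random draw and one must fall back on higher-variance score-function (REINFORCE-style) gradient estimators.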
NLP & ML for Speech
- Proposers: Paula Buttery, Andrew Caines, Helen Yannakoudakis
Description
Writing and speech are very different. Far from being an impoverished version of writing, speech is in fact its own communication mode, with rules, conventions and a grammar of its own (Carter & McCarthy 2017). Natural language processing techniques assume a written input made up of well-formed, punctuated text. First, speech has to be transcribed; it must then be reshaped and segmented to resemble something like written text. We will discuss the challenges inherent in NLP for speech, the state-of-the-art machine learning techniques that have been applied to these tasks, and how the problem could be decoupled from traditional NLP for writing.
Resources & Datasets
- Introductory slides (after lecture)
- See links in readings for code and datasets
Background Reading
- Spoken Grammar: Where Are We and Where Are We Going?, Carter & McCarthy Applied Linguistics 2017
- Grammars of Spoken English: New Outcomes of Corpus‐Oriented Research, Leech, Language Learning, 2008
- Speech and Language Processing, Jurafsky & Martin, 2nd edition, Chapters 9 & 10
Readings
- Towards automatic assessment of spontaneous spoken English, Wang et al., Speech Communication 2018. Presenter's Slides
- Incremental dependency parsing and disfluency detection in spoken learner English, Moore et al., TSD 2015. Presenter's Slides
- Disfluency Detection using Auto-Correlational Neural Networks, Lou et al., EMNLP 2018. Presenter's Slides
- Parsing Speech: a Neural Approach to Integrating Lexical and Acoustic-Prosodic Information, Tran et al., NAACL 2018. Presenter's Slides (after presentation)
- Sentence Boundary Detection Based on Parallel Lexical and Acoustic Models, Che et al., INTERSPEECH 2016. Presenter's Slides (after presentation)
- Enriching ASR Lattices with POS Tags for Dependency Parsing, Stiefel & Vu, SCNLP 2017
- Speech- and Text-driven Features for Automated Scoring of English Speaking Tasks, Loukina et al., SCNLP 2017
Project suggestions
Projects could include a survey of current approaches, a focus on particular linguistic features of speech, or a proposal for amendments to existing NLP technology in order to better deal with speech data.
Deep learning and bioinformatics
- Proposer: Pietro Lió
Bioinformatics is a vibrant field at the intersection of biology, statistics, and computer science. It uses statistical and computational methodologies to support experimental molecular biology, and it is in part responsible for the current successes and advances of biomedicine at the molecular level.
Readings
- Learning to design RNA, Runge, Stoll, Falkner, and Hutter (ICLR 2019)
- Generative adversarial networks simulate gene expression and predict perturbations in single cells, Ghahramani, Wat, Luscombe (preprint, 2018)
- Parapred: antibody paratope prediction using convolutional and recurrent neural networks, Liberis, Veličković, Sormanni, Vendruscolo, Liò (Bioinformatics 2018)
- Multi-omics data integration using cross-modal neural networks, Bica, Veličković, Xiao, Liò (ESANN 2018)
- Using deep learning to model the hierarchical structure and function of a cell, Ma et al. (Nature Methods 2018)
- MolGAN: an implicit generative model for small molecular graphs, De Cao, Kipf (ICML 2018)
Autoencoders and generative models
- Proposer: Damon Wischik
- Where and when: Tue 5 March 2–3pm, Fri 8 March 3–5pm, Tue 12 March 2–3pm. Room SW01. (See MPhil/ACS timetable)
Autoencoders are a neural network architecture for learning "concepts". The dream is that we should be able to feed in an unlabelled dataset e.g. a collection of pictures of faces, and the machine should figure out high-level concepts e.g. "smiling" or "wears glasses", and it should then be able to extrapolate e.g. turn a frown into a smile.
Mathematically, this has been seen as a problem of compressed representation (designing an encoder network to turn a high-dimensional datapoint into a low-dimensional codepoint), and of probabilistic generative modelling (designing a decoder network to turn a random low-dimensional codepoint into a new synthetic datapoint).
In this module, we will explore three different designs of autoencoders, and some applications.
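As a deliberately minimal instance of the encoder/decoder view above, the sketch below trains a linear autoencoder by gradient descent on toy data that lie on a low-dimensional subspace. Everything here (dimensions, learning rate, squared-error loss, absence of nonlinearities) is my own illustrative choice, not one of the architectures we will read about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^5 that lie exactly on a 2-D subspace,
# so a 2-D code can in principle reconstruct them perfectly.
codes_true = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 5)) * 0.5
X = codes_true @ mix

E = rng.normal(scale=0.1, size=(5, 2))   # encoder: datapoint -> codepoint
D = rng.normal(scale=0.1, size=(2, 5))   # decoder: codepoint -> reconstruction
lr = 0.05
for _ in range(3000):
    Z = X @ E                            # low-dimensional codepoints
    R = Z @ D - X                        # reconstruction residual
    grad_D = Z.T @ R / len(X)            # gradient of the squared error w.r.t. D
    grad_E = X.T @ (R @ D.T) / len(X)    # gradient w.r.t. E (chain rule through Z)
    D -= lr * grad_D                     # constant factors folded into lr
    E -= lr * grad_E

err = np.mean((X @ E @ D - X) ** 2)      # final reconstruction error
```

A linear autoencoder like this can do no better than PCA; the papers below add nonlinearities, noise (denoising), sparsity, or probabilistic structure (VAEs) on top of exactly this encode-decode-reconstruct template.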
Background reading
- Introduction to autoencoders, a clear overview of the field
- Applied deep learning: autoencoders, a hands-on tutorial using Keras
- Chapter 14 of Deep Learning by Goodfellow, Bengio, and Courville (2016), a textbook with a thorough review of the historical development
- Introductory slides for this topic
Readings
Some of these papers come with supplemental reading. This is to indicate which aspects of the paper matter most to posterity.
- Sparse feature learning for deep belief networks, Ranzato, Boureau, LeCun (NIPS 2007).
- Unsupervised learning of video representations using LSTMs, Srivastava, Mansimov, Salakhutdinov (ICML 2015).
- Walkthrough and Keras implementation
- For more on RNNs and LSTMs, see the notes on Do LSTMs learn syntax?
- Slides by sat62
- Extracting and composing robust features with denoising autoencoders, Vincent, Larochelle, Bengio, Manzagol (ICML 2008).
- Deep autoencoders for collaborative filtering, a walkthrough and TensorFlow implementation of predicting movie recommendations with a denoising autoencoder
- Slides by zz362
- Auto-Encoding Variational Bayes, Kingma, Welling (ICLR 2013).
- See also the notes on Variational inference
- Slides by mfb37
- β-VAE: learning basic visual concepts with a constrained variational framework, Higgins et al. (ICLR 2017).
- a blog post explaining the idea
- Some neuroscience background: Early visual concept learning with unsupervised deep learning, Higgins et al. (2016). And a blog post explaining it.
- Presenter: dai24
- Grammar variational autoencoder, Kusner, Paige, Hernández-Lobato (ICML 2017).