Computer Laboratory

Course material 2010–11

Machine Learning for Language Processing

Principal lecturer: Prof Ted Briscoe
Additional lecturer: Dr Mark Gales
Taken by: MPhil ACS
Syllabus

Mark Gales' lectures:

Lectures 1 & 2
Lecture 3
Lecture 4
Lecture 5
Lecture 6
Lecture 7
Lecture 8

Instructions, Schedule and Reading List

See Syllabus for lecture details, assessment scheme, and deadlines

L = lecture, S = seminar (two presentations per seminar)

Please select three papers that you would like to present in order of preference by noon on 26/1 and email your selections to Ted.Briscoe@cl.cam.ac.uk. I will assign papers by 5pm.

Your presentations should be about 20 minutes. You should summarise the aims of the paper, explain the techniques used and experiments reported as accessibly as possible, and critically evaluate the work described. You may prepare handouts or slides, and use the data or overhead projector, the whiteboard, etc.

All students should read all the papers and come to each seminar prepared to discuss them after the presentations.

Assignment

You may write an essay on a topic related to the paper you present, or any of the course material. Alternatively you may undertake a small project on text classification using existing datasets and machine learning software and submit a project report. In both cases, your essay or report should not exceed 5000 words and will be due around the end of the first week of Easter Term.

You should discuss and agree your essay topic or project with us by email before the end of the Lent Term at the latest. Write a proposal of up to 500 words outlining the topic or project giving a preliminary reading list and indicating what resources you plan to use, if relevant. The first draft of your proposal should reach us by Monday, 7th March at the latest.

Your essay topic should involve an in-depth critical evaluation of a specific machine learning technique and its application to language processing or of a specific language processing task and machine learning techniques that have been applied to that task. Little credit will be given for summaries of papers. An example of a possible title/topic on named entity recognition might be "To what extent do we need sequential models to achieve accurate NER?" This essay might critically examine the claim made by Ratinov and Roth that NE recognition and classification can be done accurately by conditioning only on the class label assigned to the previous word(s) (as well as other invariant observed features of the context) without (Viterbi) decoding to find the most likely path of label assignments. In doing this, it might review the NER task definition and consider how dealing adequately with conjoined or otherwise complex NEs (Mazur and Dale) might affect their claims. It might also propose an experiment that would resolve the issue empirically and/or identify one that has been published that sheds some light on it.
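To make the contrast in the example topic concrete, the toy sketch below compares greedy left-to-right labelling (each label chosen given only the previously assigned label) with Viterbi decoding (the best-scoring path of labels overall). It is not the method of Ratinov and Roth or of any paper on the reading list: the two labels, the log-scores, and the function names are all invented for illustration, and a real tagger would compute the scores from features of the context.

```python
# Toy comparison of greedy left-to-right labelling and Viterbi decoding.
# Labels and log-scores are invented for illustration only.

LABELS = ["O", "NE"]

# EMISSION[i][y]: local log-score of label y at position i (hypothetical values).
EMISSION = [
    {"O": -0.6, "NE": -0.8},
    {"O": -2.0, "NE": -0.2},
]

# TRANSITION[(prev, y)]: log-score of label y following label prev (hypothetical values).
TRANSITION = {
    ("O", "O"): -0.3,
    ("O", "NE"): -2.0,
    ("NE", "O"): -1.5,
    ("NE", "NE"): -0.2,
}


def greedy_decode(emission, transition, labels):
    """Commit to the best label at each position given only the previous choice."""
    path = []
    prev = None
    for scores in emission:
        best = max(
            labels,
            key=lambda y: scores[y]
            + (transition[(prev, y)] if prev is not None else 0.0),
        )
        path.append(best)
        prev = best
    return path


def viterbi_decode(emission, transition, labels):
    """Find the highest-scoring label path overall by dynamic programming."""
    # best[y] holds (score, path) for the best path so far that ends in label y.
    best = {y: (emission[0][y], [y]) for y in labels}
    for scores in emission[1:]:
        new_best = {}
        for y in labels:
            prev_y = max(labels, key=lambda p: best[p][0] + transition[(p, y)])
            score = best[prev_y][0] + transition[(prev_y, y)] + scores[y]
            new_best[y] = (score, best[prev_y][1] + [y])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def path_score(path, emission, transition):
    """Total log-score of a complete label path."""
    total = emission[0][path[0]]
    for i in range(1, len(path)):
        total += transition[(path[i - 1], path[i])] + emission[i][path[i]]
    return total


if __name__ == "__main__":
    for name, decode in [("greedy", greedy_decode), ("viterbi", viterbi_decode)]:
        path = decode(EMISSION, TRANSITION, LABELS)
        print(name, path, round(path_score(path, EMISSION, TRANSITION), 2))
```

With these invented scores the two decoders disagree, because the greedy tagger commits to its first label before seeing the evidence at the second position. Whether such disagreements are frequent enough on real NER data to matter is the empirical question an essay on this topic might address.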

Suitable small projects will need to make use of existing labelled datasets and existing machine learning tools that are distributed and documented, so that they can be completed in reasonable time. Some examples of text classification tasks and datasets are: spam filtering (lingspam, genspam), sentiment of movie reviews ("sentiment polarity datasets", Pang), named entity recognition (conll shared task ner), hedge (scope) detection (conll shared task hedge scope), language identification (altw 2010 langid dataset), document topic classification (Reuters-21578), genre classification (genre collection repository), and many more. Some examples of (good) machine learning toolkits are SVMlight, WEKA, Mallet, and MinorThird. A project might replicate a published experiment but try different feature types or a different classifier, then describe the experiment and report results in a manner comparable to the relevant (short) paper.

Week 1: Mark Gales, 24/1 L, 26/1 L, Classification by ML

Week 2: Ted Briscoe, 31/1 S, 2/2 S, Document Topic Classification

Papers:

  • 1) McCallum & Nigam, A comparison of event models for naive bayes text classification, 1998
  • 2) Rennie, Shih et al. Tackling the poor assumptions of naive bayes text classifiers, ICML, 2003
  • 3&4) Lewis, Yang, Rose, Li, RCV1: A New Benchmark Collection for Text Categorization Research, JMLR, 2004

    Week 3: Mark Gales, 7/2 L, 9/2 L, Graphical Models 1 & 2

    Week 4: Ted Briscoe, 14/2 S, Spam Filtering; Mark Gales, 16/2 L, Graphical Models 3

    Papers:

  • 5) Sahami, Mehran et al, A bayesian approach to filtering junk email, AAAI, Wkshp on Text Classification, 1998
  • 6) Medlock, An adaptive, semi-structured language model approach to spam filtering on a new corpus, CEAS 2006

    Week 5: Mark Gales, 21/2 L, Graphical Models 4; Ted Briscoe, 23/2 S, NER 1

    Papers:

  • 7) Zhou & Su, Named Entity Recognition using an HMM-based Chunk Tagger, ACL02
  • 8) Ratinov & Roth, Design Challenges and Misconceptions in NER, CoNLL 2009

    Week 6: Ted Briscoe, 28/2 S, NER 2; Mark Gales, 2/3 L, SVMs

    Papers:

  • 9) Mazur & Dale, Disambiguating Conjunctions in Named Entities, 2005
  • 10) Vlachos, Tackling the BioCreative2 Gene Mention task with Conditional Random Fields and Syntactic Parsing, 2007

    Week 7: Ted Briscoe, 7/3 S, 9/3 S, Relation Extraction

    Papers:

  • 11) Aron Culotta and Jeffrey Sorensen, Dependency tree kernels for relation extraction, ACL04
  • 12) Pyysalo et al, A graph kernel for protein-protein interaction, BioNLP08
  • 13) Kate & Mooney, Joint Entity and Relation Extraction using Card-Pyramid Parsing, CoNLL10
  • 14) Mintz et al, Distant supervision for relation extraction without labeled data, ACL09

    Week 8: Mark Gales, 14/3 L, Clustering; Ted Briscoe, 16/3 S, Topic/Term Clustering

    Papers:

  • 15) Griffiths, Steyvers, Finding scientific topics, PNAS 2004
  • 16) Andrzejewski, Zhu, Latent Dirichlet Allocation with Topic-in-Set Knowledge, ACL09 Wkshp Semi-supervised Lrng for NLP