Computer Laboratory

Course pages 2014–15

Machine Learning for Language Processing

Mark Gales' lectures:

This year's lecture slides:

  • Lectures 1 and 2
  • Lecture 3
  • Lecture 4
  • Lecture 5
  • Lecture 6
  • Lecture 7
  • Lecture 8

    Instructions, Schedule and Reading List

    See Syllabus for lecture details and Assessment for assessment scheme

    L = lecture, S = seminar (two presentations per seminar)

    Please select three papers that you would like to present, in order of preference, and email your selections to Ted.Briscoe@cl.cam.ac.uk by noon on Friday 16th January. I will assign papers by 5pm that day. If you are only planning to audit the course, do not do this; instead, email me to let me know.

    There will be two presentations per 50-minute session. Each presentation should last about 15 minutes, allowing a further 5 minutes for questions, with 10 minutes at the end of each session for general discussion. You should summarise the paper briefly (remember that everyone will have read it), explicate any parts you found difficult or innovative, and critically evaluate the work described. For your evaluation you should consider questions such as: To what extent have the stated aims of the research been achieved? To what extent is the work replicable given the information provided? In what way does the work advance the state of the art? You may prepare slides and use the data projector, overhead projector, and/or whiteboard. You should liaise with your co-presenter to decide the order in which to make the presentations. The first presentation should briefly define the task; the other should not. You should have all slides for the session loaded onto a single laptop set up with the data projector by the beginning of each session.

    All students should read all the papers and come to all sessions prepared to discuss each paper after the presentations.

    Assignment

    You may write an essay on a topic related to the paper you present, or to any of the course material. Alternatively, you may undertake a small project on text classification using existing datasets and machine learning software, and then submit a project report. In both cases, your essay or report should not exceed 5000 words and will be due around the end of the first week of Easter Term.

    You should discuss and agree your essay topic or project with Ted.Briscoe@cl.cam.ac.uk by email after the division (week 4) of the Lent Term. Write a proposal of up to 500 words outlining the topic or project, giving a preliminary reading list, and indicating what resources you plan to use, if relevant. The first draft of your proposal should reach me by Friday, 20th February at the latest.

    Your essay topic should involve an in-depth critical evaluation of a specific machine learning technique and its application to language processing, or of a specific language processing task and the machine learning techniques that have been applied to that task. Little credit will be given for summaries of papers. An example of a possible title/topic on named entity recognition might be 'To what extent do we need sequential models to achieve accurate NER?' This essay might critically examine the claim made by Ratinov and Roth that NE recognition and classification can be done accurately by conditioning only on the class label assigned to the previous word(s) (as well as other invariant observed features of the context) without (Viterbi) decoding to find the most likely path of label assignments. In doing this, it might review the NER task definition and consider how dealing adequately with conjoined or otherwise complex NEs (Mazur and Dale) might affect their claims. It might also propose an experiment that would resolve the issue empirically and/or identify a published one that sheds some light on it.

    Suitable small projects will need to make use of existing labelled datasets and existing machine learning tools that are distributed and documented, so that they can be completed in reasonable time. Some examples of text classification tasks and datasets are: spam filtering (lingspam, genspam), sentiment of movie reviews ("sentiment polarity datasets", Pang), named entity recognition (CoNLL shared task NER), hedge (scope) detection (CoNLL shared task hedge scope), language identification (ALTW 2010 langid dataset), document topic classification (Reuters-21578), genre classification (genre collection repository), and many more. Some examples of (good) machine learning toolkits are SVMlight, Weka, and Mallet. A project might replicate a published experiment but try different feature types or a different classifier, and describe the experiment and report results in a manner comparable to the relevant (short) paper.
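    To give a concrete sense of the kind of pipeline such a project builds, here is a minimal, self-contained sketch of multinomial naive Bayes text classification with add-one smoothing (the model discussed in several of the Week 2 papers). The tiny training set is invented purely for illustration; a real project would use one of the datasets above and a documented toolkit such as Weka or Mallet.

    ```python
    import math
    from collections import Counter, defaultdict

    def train_nb(docs):
        """Train a multinomial naive Bayes model with add-one smoothing.
        docs: list of (label, text) pairs."""
        label_counts = Counter(label for label, _ in docs)
        word_counts = defaultdict(Counter)   # per-label word frequencies
        vocab = set()
        for label, text in docs:
            for word in text.lower().split():
                word_counts[label][word] += 1
                vocab.add(word)
        # Log priors: P(label) = count(label) / number of documents.
        priors = {l: math.log(c / len(docs)) for l, c in label_counts.items()}
        # Smoothed log likelihoods: P(word | label).
        cond = {}
        for label, counts in word_counts.items():
            total = sum(counts.values())
            cond[label] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                           for w in vocab}
        return priors, cond, vocab

    def classify(model, text):
        priors, cond, vocab = model
        scores = {}
        for label in priors:
            score = priors[label]
            for word in text.lower().split():
                if word in vocab:            # ignore out-of-vocabulary words
                    score += cond[label][word]
            scores[label] = score
        return max(scores, key=scores.get)

    # Toy data, invented for illustration only.
    train = [("spam", "win money now"), ("spam", "free money offer"),
             ("ham", "meeting agenda attached"), ("ham", "lunch meeting tomorrow")]
    model = train_nb(train)
    print(classify(model, "free money"))          # spam
    print(classify(model, "agenda for meeting"))  # ham
    ```

    A replication-style project would swap the toy data for a real corpus, vary the feature types (e.g. character n-grams instead of words) or the classifier, and report results in the same format as the paper being replicated.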

    Week 1: Mark Gales, 15/1 L, 19/1 L, Classification by ML

    Week 2: Ted Briscoe, 22/1 S, 26/1 S, Document Topic Classification

    Papers:

  • 1) McCallum & Nigam, A comparison of event models for naive Bayes text classification, 1998
  • 2) Rennie, Shih et al., Tackling the poor assumptions of naive Bayes text classifiers, ICML, 2003
  • 3) Rogati, Monica and Yang, Yiming, High-Performing Feature Selection for Text Classification, CIKM, 2002
  • 4) Gabrilovich, Evgeniy and Markovitch, Shaul, Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge, AAAI, 2006

    Week 3: Mark Gales, 29/1 L, 2/2 L, Graphical Models 1 & 2

    Week 4: Ted Briscoe, 5/2 S, Spam Filtering; Mark Gales, 9/2 L, Graphical Models 3

    Papers:

  • 5) Androutsopoulos et al., An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages, SIGIR, 2000
  • 6) Medlock, An adaptive, semi-structured language model approach to spam filtering on a new corpus, CEAS 2006

    Week 5: Mark Gales, 12/2 L, Graphical Models 4; Ted Briscoe, 16/2 S, NER 1

    Papers:

  • 7) Klein et al, Named Entity Recognition with Character-Level Models
  • 8) Ratinov & Roth, Design Challenges and Misconceptions in NER, CoNLL 2009

    Week 6: Mark Gales, 19/2 L, SVMs; Ted Briscoe, 23/2 S, RE1

    Papers:

  • 9) Aron Culotta and Jeffrey Sorensen, Dependency tree kernels for relation extraction, ACL04
  • 10) Greenwood and Stevenson, Improving Semi-Supervised Acquisition of Relation Extraction Patterns, 2006

    Week 7: Helen Yannakoudakis, 26/2 S, Ranking; Mark Gales, 2/3 L, Clustering

    Papers:

  • 11) Yannakoudakis et al, A New Dataset and Method for Automatically Grading ESOL Texts
  • 12) Chen, He, Automated Essay Scoring by Maximizing Human-machine Agreement

    Week 8: Ted Briscoe, 5/3 S, Topic Clustering

    Papers:

  • 13) Griffiths, Steyvers, Finding scientific topics, PNAS 2004
  • 14) Blunsom, Cohn, A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

    Ekaterina Kochmar, 9/3, Compositional Distributional Features for Classification

    Papers:

  • 15) Mitchell, Lapata, Vector-based Models of Semantic Composition
  • 16) Kochmar, Briscoe, Capturing Anomalies in the Choice of Content Words in Compositional Distributional Semantic Space