Department of Computer Science and Technology

Course pages 2018–19

Overview of Natural Language Processing

Assessment is by coursework as follows:

Students are given a corpus of movie reviews and write code that classifies the sentiment of each text as positive or negative. In the first task, students build two commonly used baselines (which should be comparable across students): one a reimplementation of a classic machine learning approach (Naive Bayes), the other based on SVM classification. In the second task (extension implementation), students improve on the baselines using document embeddings and perform an error analysis of the strengths and weaknesses of the approach.

Practical sessions: 31 October, 2--4pm, 7 November and 21 November, 9--11am, SW02.

Assessment is by two reports on the practical (on paper to Student admin):

  • First task report (20%, ticked, up to 1,000 words, excluding references) due on Friday 16 November 2018 at 12:00 noon.
  • Second task report (80%, 4,000 words, excluding references) due on Tuesday 15 January 2019 at 12:00 noon.

Your reports should include a word count and a pointer to your working code on the MPhil machines (your account).

Part 1 (First Practical Session)

Build a Naive Bayes Sentiment Classification System. Instructions are here.
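The practical's official instructions are linked above; purely as an illustration of the technique, here is a minimal multinomial Naive Bayes classifier with add-one (Laplace) smoothing in plain Python. The toy "reviews" and labels are placeholders, not the practical's data.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Train multinomial Naive Bayes with add-one (Laplace) smoothing.
    docs: list of token lists; labels: parallel list of class labels."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # per-class token frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    totals = {c: sum(word_counts[c].values()) for c in class_counts}
    def log_likelihood(c, w):
        # Smoothed estimate of P(w | c)
        return math.log((word_counts[c][w] + 1) / (totals[c] + len(vocab)))
    return priors, log_likelihood, vocab

def classify_nb(tokens, priors, log_likelihood, vocab):
    """Return the class with the highest posterior log-probability,
    ignoring tokens unseen in training."""
    scores = {c: priors[c] + sum(log_likelihood(c, w)
                                 for w in tokens if w in vocab)
              for c in priors}
    return max(scores, key=scores.get)

# Hypothetical toy reviews, for illustration only
docs = [["great", "film"], ["awful", "plot"],
        ["great", "acting"], ["boring", "awful"]]
labels = ["pos", "neg", "pos", "neg"]
model = train_nb(docs, labels)
print(classify_nb(["great", "plot"], *model))  # → pos
```

Working in log space avoids floating-point underflow when multiplying many small word probabilities.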

Slides and "How to write a report" slides.

  • Here are some slides from the course "MLRD" that explain overtraining and cross-validation.

Part 2 (Second Practical Session)

Upgrade to SVM classification and use doc2vec representations instead of bag of words. Instructions are here. The slides are here.
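To illustrate the classification half of this step only: the sketch below trains a linear SVM by stochastic sub-gradient descent on the hinge loss (the Pegasos algorithm). The 2-d vectors are stand-ins for doc2vec document embeddings, which in the practical would come from a library such as gensim; nothing here is the prescribed implementation.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Pegasos: stochastic sub-gradient descent on the hinge loss.
    X: list of feature vectors; y: labels in {-1, +1}; lam: regularisation."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)                    # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1 - eta * lam) * wj for wj in w]   # regularisation shrink
            if margin < 1:                           # hinge-loss violation
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Hypothetical 2-d "document vectors" standing in for doc2vec embeddings
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
print([predict(w, x) for x in X])  # → [1, 1, -1, -1]
```

In practice a library SVM (and a kernel, if wanted) replaces this hand-rolled trainer; the sketch is only meant to show what the decision boundary is fit to once each document is a dense vector.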

A clarification about how to use the validation corpus has been added to the Instructions and to the Slides. Executive summary: set parameters by training on 90% and testing on the validation corpus; then discard the trained models and perform cross-validation on the 90%, as you did before on 100%.
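The two-step protocol above might be sketched as follows; the index layout, fold count, and 90/10 split sizes are illustrative assumptions, not the course's actual corpus division.

```python
def cv_splits(indices, n_folds=10):
    """Yield (train, test) index splits for round-robin n-fold cross-validation."""
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(100))   # stand-in document indices
val = data[90:]           # held-out validation corpus
dev = data[:90]           # remaining 90%
# Step 1: tune hyper-parameters by training on dev, scoring on val (not shown).
# Step 2: discard the tuned models; report 10-fold cross-validation on dev only.
splits = list(cv_splits(dev))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # → 10 81 9
```

The point of step 2 is that the validation corpus is used once, for parameter selection, and never contributes to the reported cross-validation figures.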

Part 3 (Third Practical Session)

A better significance test (the permutation test) and some tips on how to perform an analysis of the embedding space. Slides are here.
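As a rough sketch of the idea (the slides linked above are authoritative): a two-sided Monte Carlo paired permutation test compares two systems by repeatedly swapping their per-document scores at random and asking how often a difference at least as large as the observed one arises. Encoding scores as 1 for a correct prediction and 0 otherwise is an illustrative choice here.

```python
import random

def permutation_test(scores_a, scores_b, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo paired permutation test.
    scores_a, scores_b: parallel per-document scores for two systems
    (e.g. 1 if the document was classified correctly, 0 otherwise).
    Returns an estimated p-value for the difference in mean score."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        sa = sb = 0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap this pair with probability 1/2
                a, b = b, a
            sa += a
            sb += b
        if abs(sa - sb) / n >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

Because only the labels within each document pair are exchanged, the test respects the pairing: both systems are always evaluated on the same documents, which makes it well suited to comparing two classifiers on one test set.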