Computer Laboratory

Course pages 2015–16

Overview of Natural Language Processing

Assessment is by coursework as follows:

In the practical, you are given a corpus of texts and write code that classifies the sentiment of each text as positive or negative. You will then test how much various natural language processing tools improve performance on the task.

To prepare yourselves for the task, please do the following two things before the first demonstrated session:

  • Please read the following paper: Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, Thumbs up? Sentiment Classification using Machine Learning Techniques, Proceedings of EMNLP 2002. Bo Pang et al. were the "inventors" of the movie review sentiment classification task, and the above paper was one of the first papers on the topic. The first version of your sentiment classifier will do something similar to Bo Pang's system, so please read it, and ask me questions about it in our first demonstrated practical.
  • Familiarize yourself with the data in /usr/groups/mphil/L90/data. There are 2000 movie reviews, split into two directories, NEG and POS. These, unsurprisingly, are negative and positive reviews; the data is balanced, with 1000 reviews in each directory.

    Please read *some* of the texts (of your choice) to understand the difficulties of the task. How might one go about classifying the texts?
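
As a starting point for thinking about that question, here is a minimal Python sketch that loads the reviews and trains a unigram Naive Bayes classifier with add-one smoothing. It is an illustration only, not the required design and not necessarily the feature set used in the paper. It assumes that each review is a separate file directly inside NEG or POS; the function names (load_reviews, tokenize, train_naive_bayes, classify) and the whitespace tokenisation are choices made for this example.

```python
import math
from collections import Counter
from pathlib import Path


def load_reviews(root):
    """Return a list of (text, label) pairs read from root/NEG and root/POS."""
    reviews = []
    for label in ("NEG", "POS"):
        # Assumption: one review per file, directly inside the label directory.
        for path in sorted(Path(root, label).iterdir()):
            if path.is_file():
                text = path.read_text(encoding="utf-8", errors="replace")
                reviews.append((text, label))
    return reviews


def tokenize(text):
    # Crude tokenisation (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def train_naive_bayes(reviews):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in reviews:
        tokens = tokenize(text)
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior under add-one smoothing."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        denominator = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((counts[token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With the course data, this could be used as model = train_naive_bayes(load_reviews("/usr/groups/mphil/L90/data")), although a real experiment would hold out part of the data for testing rather than train on all of it.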

The texts have been cleaned up as much as possible (automatically, then manually), but some noise might remain. [If you notice textual material that does not belong to the review itself, such as URLs or names of the reviewers, ratings, mentions of movie metadata such as the director or the running time, or any other material that should not be there, you would do me a great favour if you could email me with the file number and a description of the problem. Please ignore normal typos; I don't need to know about them.]
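
If you would like some help spotting one kind of leftover noise, the following Python sketch flags lines that contain URL-like or email-like strings. The two regular expressions and the function name find_noise are hypothetical choices for this example; they will miss ratings, reviewer names and movie metadata, so they do not replace reading the texts.

```python
import re

# Hypothetical patterns: deliberately simple, so they will miss some cases
# and may occasionally flag harmless text.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def find_noise(text):
    """Return (line_number, line) pairs for lines that look like URL or email residue."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if URL_RE.search(line) or EMAIL_RE.search(line):
            hits.append((number, line.strip()))
    return hits
```

Running find_noise over the text of each file and noting the file number of any hit gives a short list of candidates to check by hand.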

NEW: As preparation, please read the following instructions. The first part of the practical ("A Baseline System") will be demonstrated on Wednesday 18/11; the second part ("Extension system") will be demonstrated on Wednesday 2/12.

Slides introducing the task are here: Part 1 and Part 2

Simone Teufel, Nov. 2015