Computer Laboratory

Course pages 2016–17

Overview of Natural Language Processing

Assessment is by coursework as follows:

For the practical, you are given a corpus of texts and write code that classifies the sentiment of each text as positive or negative. You first build a simple baseline, a reimplementation of a classic paper, whose results should be comparable across students. You then demonstrate your understanding of the natural language processing tasks taught on this course by preparing your own "extension" implementation that (hopefully) improves over your baseline.

First demonstrated session: Friday 27/10, 11-1, SW02

There is no special preparation needed for the demonstrated session, but if you like, the following are things you can do now:

  • Please make sure you have read the paper introduced in Lecture 1 (cf. the Course materials page): Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, Thumbs up? Sentiment Classification using Machine Learning Techniques, Proceedings of EMNLP 2002. Pang et al. were the "inventors" of the movie review sentiment classification task, and the above paper was one of the first papers on the topic. The first version of your sentiment classifier will do something similar to their system. If you have questions about the paper, we should resolve them in our first demonstrated practical.
  • (Optional) Familiarize yourself with the data in /usr/groups/mphil/L90/data. There are 2000 movie reviews, split into two directories, NEG and POS. These, unsurprisingly, contain the negative and positive reviews; the data is balanced, with 1000 reviews in each class.

    Please read *some* of the texts (of your choice) to understand the difficulties of the task. How might one go about classifying the texts?
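
To start exploring the data, the reviews can be read into memory together with their labels. The following is only a minimal sketch: it assumes each review is a separate plain-text file stored directly inside the NEG and POS directories, which is not stated above, and the function name load_reviews is hypothetical.

```python
from pathlib import Path


def load_reviews(root):
    """Return a list of (text, label) pairs read from root/NEG and root/POS.

    Assumption: every regular file directly inside NEG or POS holds one review.
    """
    reviews = []
    for label in ("NEG", "POS"):
        # Sort the file names so that the order is reproducible across runs.
        for path in sorted((Path(root) / label).iterdir()):
            if path.is_file():
                # errors="replace" avoids a crash if the encoding is not UTF-8.
                text = path.read_text(encoding="utf-8", errors="replace")
                reviews.append((text, label))
    return reviews
```

Under these assumptions, load_reviews("/usr/groups/mphil/L90/data") would return 2000 pairs, the negative reviews first.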

Detailed instructions for how to build a Pang-style baseline can be found here. Slides describing the task can be found here.
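
For orientation only, one of the classifiers evaluated by Pang et al. is Naive Bayes over unigram features. The sketch below illustrates that idea under stated assumptions and is not a substitute for the detailed instructions: it assumes lowercased whitespace tokenisation and add-one smoothing, and the names tokenize, train_nb and classify_nb are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; other schemes are possible.
    return text.lower().split()


def train_nb(docs):
    """Train on a list of (text, label) pairs; return the model as a tuple."""
    label_counts = Counter()
    word_counts = {}
    for text, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    n_docs = sum(label_counts.values())
    log_priors = {lab: math.log(c / n_docs) for lab, c in label_counts.items()}
    totals = {lab: sum(c.values()) for lab, c in word_counts.items()}
    return log_priors, word_counts, totals, vocab


def classify_nb(model, text):
    """Return the label with the highest log posterior score for text."""
    log_priors, word_counts, totals, vocab = model
    v = len(vocab)
    scores = {}
    for label, prior in log_priors.items():
        score = prior
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen (word, label) pairs non-zero.
                score += math.log((word_counts[label][tok] + 1) / (totals[label] + v))
        scores[label] = score
    return max(scores, key=scores.get)
```

Scores are summed in log space to avoid numerical underflow on long reviews.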

How to write a good report: practice report on baseline, slides here and here.

Deadline for submission of report (on paper to Student admin): January 18, 12 noon. Your report should be at most 4000 words long and should include a word count and a pointer to your working code in your CS file space.

Simone Teufel, Jan 11, 2017