The practical concerns building a sentiment classifier for movie reviews and reporting your results in a scientific manner (i.e., as a paper).

To prepare yourselves for the task, please do the following two things by next Friday:

1. Familiarise yourself with the data in /usr/groups/mphil/L90/data. There are 2000 movie reviews, split into two directories, NEG and POS. These, unsurprisingly, are negative and positive reviews; the data is balanced, so there are 1000 of each. The texts have been tokenised and sentence-split. Please read *some* of the texts (of your choice) to understand the difficulties of the task. [If you notice textual material that does not belong to the review itself, such as URLs or names of the reviewers, ratings, mentions of movie metadata such as the director or the running time, or any other material that should not be there, you would do me a great favour if you could email me the file number (and tell me what the noise is). The texts have been cleaned up as much as possible, but some noise might remain.]

2. One baseline for the task is to machine-learn lexical terms which can distinguish between positive and negative reviews. Please read the following paper:

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP 2002.

www.cs.cornell.edu/home/llee/papers/sentiment.pdf

Bo Pang et al. were the "inventors" of the movie review sentiment classification task, and the above paper was one of the first papers on the topic. The first version of your sentiment classifier will do something similar to the system of Pang et al., so please familiarise yourself with it.
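
To make the two preparation steps more concrete, here is a minimal Python sketch of reading the NEG and POS directories and learning lexical terms from them. It is an illustration only, not the required implementation. Several things are assumptions: that each review is a single plain-text file whose tokens are separated by whitespace, and the function names load_reviews, train_naive_bayes and classify, which are made up for this example. The classifier is a Naive Bayes model over unigram presence features with add-one smoothing, which is one of the approaches Pang et al. evaluate; the practical itself may ask for something different.

```python
import math
from collections import Counter
from pathlib import Path


def load_reviews(root):
    """Return a list of (tokens, label) pairs read from root/NEG and root/POS.

    Assumption: one review per file, tokens separated by whitespace.
    """
    reviews = []
    for label in ("NEG", "POS"):
        for path in sorted(Path(root, label).iterdir()):
            if path.is_file():
                text = path.read_text(encoding="utf-8", errors="replace")
                reviews.append((text.split(), label))
    return reviews


def train_naive_bayes(reviews):
    """Estimate log priors and add-one smoothed log likelihoods per class.

    Each word is counted at most once per review (presence, not frequency).
    """
    doc_counts = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in reviews:
        present = set(tokens)
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(present)
        vocab |= present

    total_docs = sum(doc_counts.values())
    model = {}
    for label, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        model[label] = {
            "prior": math.log(doc_counts[label] / total_docs),
            "loglik": {w: math.log((counts[w] + 1) / denom) for w in vocab},
        }
    return model


def classify(model, tokens):
    """Return the label with the highest score; unseen words are ignored."""
    present = set(tokens)

    def score(label):
        params = model[label]
        return params["prior"] + sum(
            params["loglik"][w] for w in present if w in params["loglik"]
        )

    return max(model, key=score)
```

Under these assumptions, the data could be loaded with load_reviews("/usr/groups/mphil/L90/data"), a model trained on part of the result with train_naive_bayes, and held-out reviews labelled with classify. Any real evaluation would of course need a proper train/test split or cross-validation.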


Simone Teufel
Nov 2014