Course pages 2018–19

Natural Language Processing

Assessment is by coursework as follows:

Assignment 1, for 20% of the overall grade. Write a 500-word report on your SVM-based sentiment classification experiment. This assignment is ticked, i.e., Pass/Fail.
Assignment 2, for 40% of the overall grade. Write a 1000-word report on your Doc2Vec-based sentiment classification experiment.
Assignment 3, for 40% of the overall grade. Write a 1000-word report on your design for a text understanding question answering system.

Deadlines:

Assignment 1: 14 November, 4pm
Assignment 2: 30 November, 4pm
Assignment 3: 30 November, 4pm

Instructions for Practical

Assignment 1: Instructions Part 1
NOTE: The assignment states "replicate Pang et al. (2002) as closely as possible". What was meant was: do so only for those interventions explicitly stated in the instructions. Pang et al. do many things you are not expected to do, for instance MaxEnt, negation treatment, and experiments that exclude certain POS. Pang et al. also don't do some things that I would like you to do, namely stemming and trying a feature cutoff. Sorry if that wasn't clear. I have also added this warning to the instructions.
Strictly speaking, experimenting with feature cutoffs (i.e., doing a systematic search) is methodologically questionable, as you don't have a separate validation corpus. There is a danger of overtraining: a cutoff picked because it gives the best result is tuned to the very data it is evaluated on. The only thing you are (kind of) allowed to do is to choose a feature cutoff once (e.g. 2 or 3 or 4) before the experiment and then run the experiment only once with this feature cutoff. If you compare that to the full feature set (no cutoff), most people would probably judge that this is still OK. But we are moving into a grey area.
Assignment 2: Instructions Part 2
Assignment 3: Instructions Part 3
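The Assignment 1 note on feature cutoffs can be illustrated with a minimal Python sketch. It is not part of the official instructions: the names FEATURE_CUTOFF, select_features and apply_features, the value 4, and the toy documents are all hypothetical, and counting frequencies on the training portion only is an assumption of this sketch. The point is that the cutoff is fixed once, before any result is seen, and is compared only against the full feature set.

```python
from collections import Counter

# Hypothetical value, chosen ONCE before running the experiment (never tuned on results).
FEATURE_CUTOFF = 4


def select_features(train_docs, cutoff=FEATURE_CUTOFF):
    """Return the tokens that occur at least `cutoff` times in the training documents.

    `train_docs` is a list of token lists. Counting on the training portion only
    is an assumption of this sketch, not something stated in the instructions.
    """
    counts = Counter(token for doc in train_docs for token in doc)
    return {token for token, n in counts.items() if n >= cutoff}


def apply_features(doc, vocabulary):
    """Drop every token of `doc` that is not in the selected vocabulary."""
    return [token for token in doc if token in vocabulary]


# Toy data for illustration only (not the review corpus).
train_docs = [["good", "film", "good"], ["bad", "film"], ["good", "plot"]]

# A cutoff of 2 is used here only because the toy data is tiny.
vocabulary = select_features(train_docs, cutoff=2)
filtered = apply_features(["bad", "good", "film"], vocabulary)

# The comparison condition: cutoff 1 keeps every feature seen in training.
full_vocabulary = select_features(train_docs, cutoff=1)
```

Running the classifier once with `vocabulary` and once with `full_vocabulary` gives the single comparison described in the note; looping over several cutoff values and reporting the best one would be the systematic search the note warns against.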

Data etc for Practical

Slides for Part 1
Slides for Part 2
Slides for Part 3
Slides for "how to write a report" (please note that this is advice for longer reports; your mini-report is extremely short, so it needs no subsections).
Here is the paper you will replicate (some aspects of): Pang et al. (2002)
TOKENIZED DATA: NEG-token.tar and POS-token.tar
Here is some explanation from Siegel and Castellan (1988) about the sign test (pdf).
Here are the MLRD slides on crossvalidation
Assignment 3, Text 1 in pdf, in ASCII plain text and the output of the Stanford parser on Text 1
Assignment 3, Text 2 in png with its Questions, in ASCII plain text and the output of the Stanford parser on Text 2
You can email me and the demonstrators with questions: Guy (ga384), Gladys (whgt2), Tobias (tk534), Chris (ccd38).

Errata (NLP Practical)

Clarification added to instructions for Part 1 about "replication of Pang et al." meaning "replication of aspects that are mentioned in the instructions, not the entire paper".
Added as a footnote to the instructions for Part 1: is it methodologically OK to experiment with different feature frequency cutoffs?

Word limit for assignments changed to 500 words (Assignment 1) and 1000 words (Assignments 2 and 3). Because the written announcement came late, students will not be penalised for submitting reports of the old length of 300 words for Assignment 1.

Department of Computer Science and Technology