In Task 1 you were asked to manually classify four reviews as positive or negative. The file names of the reviews and the ground truth category were:
6848 - neutral,
9947 - neutral,
937 - negative,
1618 - positive.
Download the nuanced sentiment dataset, which contains reviews assigned to three classes: positive, negative, and neutral.
Open file 6220, which is neutral, and read it. Do you think this task has become harder or easier with the neutral category included?
Modify your Naive Bayes classifier from Task 2 so that it can handle three-class classification. Instead of Sentiment, use the NuancedSentiment provided. You can use the DataPreparation6.java provided to load the nuanced dataset.
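The generalisation from two classes to three mostly means replacing the pairwise comparison with an arg-max over all classes. A minimal sketch of that idea follows; the class name, method name, and the log-probability maps are illustrative assumptions, not names from the Task 2 code:

```java
import java.util.*;

// Sketch of a class-count-agnostic Naive Bayes decision rule.
// Works unchanged for 2, 3, or more classes.
public class MultiClassNB {
    // Picks the class c maximising log P(c) + sum over tokens of log P(w|c).
    public static String classify(List<String> tokens,
                                  Map<String, Double> logPrior,
                                  Map<String, Map<String, Double>> logLikelihood) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : logPrior.keySet()) {
            double score = logPrior.get(c);
            Map<String, Double> lik = logLikelihood.get(c);
            for (String w : tokens)
                score += lik.getOrDefault(w, 0.0); // unseen words ignored in this sketch
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> prior = new HashMap<>();
        prior.put("positive", Math.log(1.0 / 3));
        prior.put("negative", Math.log(1.0 / 3));
        prior.put("neutral", Math.log(1.0 / 3));
        Map<String, Map<String, Double>> lik = new HashMap<>();
        lik.put("positive", Map.of("great", Math.log(0.5)));
        lik.put("negative", Map.of("great", Math.log(0.05)));
        lik.put("neutral", Map.of("great", Math.log(0.1)));
        System.out.println(classify(Arrays.asList("great"), prior, lik));
    }
}
```

In your own classifier the priors and likelihoods will of course come from the training counts with smoothing, as in Task 2; only the final comparison needs to change.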
Test your system using 10-fold cross validation. What are the results?
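If you are building the folds yourself rather than reusing earlier code, one simple scheme is to assign every document a fold index and shuffle. A sketch, with illustrative names (your DataPreparation6.java may already offer something equivalent):

```java
import java.util.*;

// Sketch of balanced 10-fold assignment: item i is in fold foldOf(n,k,seed)[i];
// for fold f, the test set is the items assigned f and the rest are training data.
public class Folds {
    public static int[] foldOf(int n, int k, long seed) {
        int[] assign = new int[n];
        for (int i = 0; i < n; i++) assign[i] = i % k; // balanced fold sizes
        // Fisher-Yates shuffle so folds are random, not ordered by file position
        Random rng = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = assign[i]; assign[i] = assign[j]; assign[j] = t;
        }
        return assign;
    }

    public static void main(String[] args) {
        int[] f = foldOf(20, 10, 1L);
        int[] sizes = new int[10];
        for (int x : f) sizes[x]++;
        System.out.println(Arrays.toString(sizes));
    }
}
```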
Let’s return to the world where reviews are either positive or negative in order to investigate human agreement.
The file class_predictions.csv contains the group's judgements as they were added to the database in Task 1. You can load the file using the DataPreparation6.java provided.
Use the data to create an agreement table aggregating the group's judgements. That is, for each review calculate how many people said it was positive and how many said it was negative. How do your own predictions compare with those of the entire group?
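The aggregation itself is a matter of counting judgements per review id. A sketch of that step, assuming the loaded records reduce to (reviewId, judgement) pairs; the actual record layout produced by DataPreparation6.java may differ:

```java
import java.util.*;

// Sketch of building an agreement table from (reviewId, judgement) pairs.
public class AgreementTable {
    // Maps review id -> {positiveCount, negativeCount}.
    public static Map<String, int[]> aggregate(List<String[]> rows) {
        Map<String, int[]> table = new TreeMap<>();
        for (String[] row : rows) {
            int[] counts = table.computeIfAbsent(row[0], id -> new int[2]);
            if (row[1].equals("positive")) counts[0]++;
            else counts[1]++;
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"937", "negative"}, new String[]{"937", "negative"},
            new String[]{"937", "positive"}, new String[]{"1618", "positive"});
        for (Map.Entry<String, int[]> e : aggregate(rows).entrySet())
            System.out.println(e.getKey() + ": " + e.getValue()[0]
                + " positive, " + e.getValue()[1] + " negative");
    }
}
```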
Human judgement can be used as a kind of ground truth. If no definitive decision can be reached, we cannot expect the system to agree 100% with every human, but only as much as humans agree with each other. Write code to calculate Fleiss' kappa, given by the following formula:
\[\kappa = \frac{\bar{P_a}-\bar{P_e}}{1 - \bar{P_e}}\]
Here \(\bar{P_e}\) is the mean of the observed proportions of assignments to each class squared: \[\bar{P_e} = \sum_{j=1}^{k}{(\frac{1}{N} \sum_{i=1}^N{\frac{n_{ij}}{n_i}})^2}\]
and \(\bar{P_a}\) is the mean of the proportions of prediction pairs which are in agreement for all items: \[\bar{P_a} = \frac{1}{N} \sum_{i=1}^N{\frac{1}{n_i(n_i-1)} \sum_{j=1}^k{n_{ij}(n_{ij}-1)}}\]
where \(n_{ij}\) is the number of predictions that item \(i\) belongs to class \(j\) and \(n_i=\sum_{j=1}^k{n_{ij}}\) is the total number of predictions for item \(i\).
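A fairly direct transcription of these formulas might look like the following sketch (the class name and the counts-table representation are illustrative assumptions):

```java
// Sketch of Fleiss' kappa: counts[i][j] is the number of raters
// assigning item i to class j (n_ij in the formulas above).
public class FleissKappa {
    public static double kappa(int[][] counts) {
        int N = counts.length;        // number of items
        int k = counts[0].length;     // number of classes
        double[] pj = new double[k];  // accumulates sum over i of n_ij / n_i
        double paSum = 0.0;
        for (int i = 0; i < N; i++) {
            int ni = 0;
            for (int j = 0; j < k; j++) ni += counts[i][j];
            double agree = 0.0;
            for (int j = 0; j < k; j++) {
                agree += counts[i][j] * (counts[i][j] - 1);
                pj[j] += (double) counts[i][j] / ni;
            }
            paSum += agree / (ni * (ni - 1)); // per-item agreement proportion
        }
        double pa = paSum / N;                // mean observed agreement
        double pe = 0.0;                      // chance agreement
        for (int j = 0; j < k; j++) {
            double p = pj[j] / N;
            pe += p * p;
        }
        return (pa - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        // Perfect agreement on two items should give kappa = 1.
        System.out.println(kappa(new int[][]{{3, 0}, {0, 3}}));
    }
}
```

Simple sanity checks of the kind suggested above: perfect agreement gives \(\kappa = 1\), while a table such as {{2,2},{2,2}} (judgements split evenly) gives a negative \(\kappa\), since observed agreement falls below chance.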
You are strongly advised to make up some simple test examples to make sure your code is behaving as expected.
What \(\kappa\) score do you get for all four reviews taken together? What if you compare only reviews 1 and 2? What about 3 and 4 only?
Return to the 3-way classification case and investigate the behaviour of \(\kappa\) by creating artificial agents which behave in different ways. For instance:
Code some number of random agents, let them make choices for 50 examples, and calculate \(\kappa\). Repeat this exercise 100 times. Can you explain the answers you get, based on your knowledge of the agents' strategies?
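One way to set the experiment up is sketched below; the class name, the choice of 10 agents, and the compact kappa helper are all illustrative assumptions (you would normally reuse your own Fleiss' kappa implementation):

```java
import java.util.*;

// Sketch of the random-agent experiment: uniformly random raters,
// 50 items, 3 classes, repeated 100 times.
public class RandomAgents {
    // Compact Fleiss' kappa for counts[i][j] = raters putting item i in class j.
    static double kappa(int[][] c) {
        int N = c.length, k = c[0].length;
        double pa = 0, pe = 0;
        double[] pj = new double[k];
        for (int[] row : c) {
            int ni = 0;
            for (int x : row) ni += x;
            double agree = 0;
            for (int j = 0; j < k; j++) {
                agree += row[j] * (row[j] - 1);
                pj[j] += (double) row[j] / ni;
            }
            pa += agree / (ni * (ni - 1));
        }
        pa /= N;
        for (double p : pj) pe += (p / N) * (p / N);
        return (pa - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        Random rng = new Random(0);
        int agents = 10, items = 50, classes = 3;
        double sum = 0;
        for (int run = 0; run < 100; run++) {
            int[][] counts = new int[items][classes];
            for (int i = 0; i < items; i++)
                for (int a = 0; a < agents; a++)
                    counts[i][rng.nextInt(classes)]++; // uniform random choice
            sum += kappa(counts);
        }
        // Agents that choose at random agree only by chance,
        // so the mean kappa should hover near 0.
        System.out.println("mean kappa over 100 runs: " + sum / 100);
    }
}
```

Replacing the uniform choice with other strategies (an always-positive agent, a majority-copier, a mostly-reliable agent) is what makes the comparison interesting.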