This is the second part of tick 3.

The lexicon classifier you created in Task 1 does not use all the information available in the lexicon. In this step you should modify the simple classifier to account for the strength/magnitude of the sentiment expressed by each word. The tester assumes the use of weight 2 for strong sentiment.
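One way to picture the change is a weighted vote. The sketch below is only illustrative: the `Entry`, `Polarity` and `Strength` types are hypothetical stand-ins for however your lexicon is represented, weak words are assumed to count 1, and breaking ties in favour of the positive class is an assumption, not a requirement of the tester.

```java
import java.util.List;
import java.util.Map;

final class MagnitudeSketch {
    enum Polarity { POSITIVE, NEGATIVE }
    enum Strength { WEAK, STRONG }

    // Hypothetical lexicon entry; the real lexicon representation may differ.
    static final class Entry {
        final Polarity polarity;
        final Strength strength;

        Entry(Polarity polarity, Strength strength) {
            this.polarity = polarity;
            this.strength = strength;
        }
    }

    /** Sums signed weights over the tokens: strong words count 2, weak words 1. */
    static Polarity classify(List<String> tokens, Map<String, Entry> lexicon) {
        int score = 0;
        for (String token : tokens) {
            Entry entry = lexicon.get(token);
            if (entry == null) {
                continue; // word not in the lexicon: contributes nothing
            }
            int weight = (entry.strength == Strength.STRONG) ? 2 : 1;
            score += (entry.polarity == Polarity.POSITIVE) ? weight : -weight;
        }
        // Assumption: a score of exactly zero is labelled POSITIVE.
        return score >= 0 ? Polarity.POSITIVE : Polarity.NEGATIVE;
    }
}
```

With this weighting, a single strong positive word outweighs a single weak negative one, which the unweighted classifier would treat as a tie.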

To investigate whether the two systems are significantly different, we will use the **sign test**. In this case, the sign test is based on the binomial distribution.

Count all cases when system 1 is better than system 2, when system 2 is better than system 1, and when they are the same. Call these numbers Plus, Minus and Null. The sign test returns the probability of observing a difference at least as large as the one you measured if the **null hypothesis** (that the two systems are equally good) is true.

This probability is called the **p-value** and it can be calculated for the two-sided sign test using the following formula:

\[2\sum_{i=0}^{k}{n \choose i}q^i (1-q)^{n-i}\]

where \(n=2 \lceil \frac{Null}{2} \rceil + Plus + Minus\) is the total number of cases and \(k=\lceil \frac{Null}{2} \rceil + \min{\{Plus,Minus\}}\) is the number of cases with the less common sign. In this experiment, \(q=0.5\).

Here, we are dividing the null cases evenly between the two signs, rounding up if necessary (why is this better than rounding down? Hint: we are looking for statistical significance). We then use the formula as above. The sum is multiplied by two because this is a two-sided sign test: it tests for the significance of differences in either direction, so a difference in one particular direction is half as significant as it would be in a one-sided test.
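As a worked illustration with made-up counts (not results from any classifier), take Plus = 10, Minus = 5 and Null = 3. Then:

\[\left\lceil \tfrac{3}{2} \right\rceil = 2, \qquad n = 2 \cdot 2 + 10 + 5 = 19, \qquad k = 2 + \min\{10, 5\} = 7\]

\[2\sum_{i=0}^{7}{19 \choose i}\,0.5^{i}\,0.5^{19-i} = \frac{2 \cdot (1 + 19 + 171 + 969 + 3876 + 11628 + 27132 + 50388)}{2^{19}} = \frac{188368}{524288} \approx 0.359\]

With these invented counts the p-value is well above the commonly used 0.05 threshold, so the difference would not be considered significant at that level.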

The numbers you will calculate are large enough to cause overflow if you use the standard `double` or `int` types. Use the `BigInteger` class instead.
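A minimal sketch of the p-value computation, assuming \(q=0.5\) as in this experiment, is shown below. The class and method names are illustrative and are not the ones the tester expects. Because every term equals \({n \choose i}/2^n\) when \(q=0.5\), the binomial coefficients can be summed exactly with `BigInteger` and divided by \(2^n\) only at the end.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

final class SignTestSketch {
    /** Two-sided sign test p-value for q = 0.5, following the formula literally. */
    static double twoSidedPValue(int plus, int minus, int nulls) {
        int halfNull = (nulls + 1) / 2;             // ceil(Null / 2) for non-negative ints
        int n = 2 * halfNull + plus + minus;        // total number of cases
        int k = halfNull + Math.min(plus, minus);   // cases with the less common sign

        BigInteger sum = BigInteger.ZERO;
        BigInteger binom = BigInteger.ONE;          // C(n, 0)
        for (int i = 0; i <= k; i++) {
            sum = sum.add(binom);
            // C(n, i + 1) = C(n, i) * (n - i) / (i + 1); this division is exact.
            binom = binom.multiply(BigInteger.valueOf(n - i))
                         .divide(BigInteger.valueOf(i + 1));
        }

        BigDecimal numerator = new BigDecimal(sum.shiftLeft(1));               // 2 * sum
        BigDecimal denominator = new BigDecimal(BigInteger.ONE.shiftLeft(n));  // 2^n
        return numerator.divide(denominator, MathContext.DECIMAL64).doubleValue();
    }
}
```

The sketch does not clamp the result, so when Plus and Minus are equal the formula yields a value above 1 (for example, Plus = Minus = 1 and Null = 0 gives 1.5). Whether to cap it at 1 is a design choice; check what the tester expects.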

Is the difference in the scores obtained by your simple classifier and the magnitude classifier statistically significant? What about the magnitude and Naive Bayes classifiers? Make sure you compare the results on the same examples and that you don't test on the training set. As usual, please see Exercise4Tester.java as a guide for what is expected.
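Counting Plus, Minus and Null over exactly the same examples could look like the sketch below. It assumes, purely for illustration, that each example has a `String` identifier and that labels are stored as `Boolean` values (`true` for positive sentiment); your own code will probably use different types. Examples missing from either system's predictions are skipped, so only shared examples are compared.

```java
import java.util.Map;

final class SignCountsSketch {
    /** Returns {plus, minus, nulls} over the examples predicted by both systems. */
    static int[] count(Map<String, Boolean> truth,
                       Map<String, Boolean> predictedA,
                       Map<String, Boolean> predictedB) {
        int plus = 0;   // system A correct, system B wrong
        int minus = 0;  // system A wrong, system B correct
        int nulls = 0;  // both correct or both wrong
        for (Map.Entry<String, Boolean> example : truth.entrySet()) {
            String id = example.getKey();
            if (!predictedA.containsKey(id) || !predictedB.containsKey(id)) {
                continue; // not a shared example, so leave it out of the comparison
            }
            boolean aCorrect = predictedA.get(id).equals(example.getValue());
            boolean bCorrect = predictedB.get(id).equals(example.getValue());
            if (aCorrect && !bCorrect) {
                plus++;
            } else if (!aCorrect && bCorrect) {
                minus++;
            } else {
                nulls++;
            }
        }
        return new int[] {plus, minus, nulls};
    }
}
```

Skipping unshared examples silently is only one option; throwing an exception when the key sets differ would make mismatched test sets easier to spot.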

How does the p-value vary if you reduce the number of samples?

If you vary the smoothing parameter from Task 2 so that you add a constant smaller than 1 (e.g., add 0.5 rather than add 1), do you obtain better performance? You can carry out this experiment with various values for the smoothing parameter to try and optimize it. Why would it be invalid (strictly speaking) to check for significant improvements by using the sign test on your best result?