Tagging project notes
=====================

10-12-92
--------
Bibliography - see file biblio.


11-12-92
--------
Viterbi algorithm now working.
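For the record, the core of the Viterbi step is roughly the following. This is a minimal sketch, not the actual program: the function name, data layout and toy data are all illustrative, assuming a simple HMM with transition and emission tables and an unambiguous start tag.

```python
def viterbi(words, tags, trans, emit, start_tag):
    """Most likely tag sequence for `words` under a toy HMM.
    trans[(p, t)] = transition score p->t; emit[(t, w)] = score of
    word w under tag t. All data here is illustrative."""
    # paths: tag -> (score of best path ending in tag, that path)
    paths = {start_tag: (1.0, [start_tag])}
    for w in words:
        new_paths = {}
        for t in tags:
            e = emit.get((t, w), 0.0)
            if e == 0.0:
                continue
            # pick the best predecessor path for tag t at this word
            score, seq = max(
                ((s * trans.get((p, t), 0.0), q) for p, (s, q) in paths.items()),
                key=lambda x: x[0],
            )
            new_paths[t] = (score * e, seq + [t])
        paths = new_paths
    # best final path, with the artificial start tag stripped off
    return max(paths.values(), key=lambda x: x[0])[1][1:]
```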

Test results 5/95, using number option 'm', and specifying the following
classes as closed (others may be possible; this will do for the test):
! ( ) *' **' *- , . ... ; ? BE BED BEDZ BEG BEM BEN BER BEZ COLON DO DOD DOZ
EX

Total words: 1955
1697 known, 258 unknown, 826 ambiguous, 568 known ambiguous

Options		All	Known	Unknown	Ambig	Ambig/known		(%)
-		83	91	26	62	78
1		75	86	1.2	44	63
2		75	86	1.2	44	63
3		75	86	1.2	44	63
14		75	86	1.2	43	63
24		75	86	1.2	43	63
34		75	86	1.2	43	63
15		81	93	1.2	58	84
25		81	93	1.2	58	84
35		81	93	1.2	58	84
NB! Option codes now changed.

Comment: these figures are rather low because of the high number of unknown
words.

Also try a 5/100 test. Total words: 1955
1955 known, 0 unknown, 595 ambiguous, 595 known ambiguous

Options		All	Known	Unknown	Ambig	Ambig/known		(%)
-		94	94	0	79	79
1		93	93	0	77	77
2		93	93	0	77	77
3		93	93	0	77	77
14		93	93	0	77	77
24		93	93	0	77	77
34		93	93	0	77	77
15		97	97	0	89	89
25		97	97	0	89	89
35		97	97	0	89	89
NB! Option codes now changed.

All paths
---------
Implemented using forward scoring only. Here are the usual tests

5/95. Total words: 1955
1697 known, 258 unknown, 826 ambiguous, 568 known ambiguous

Options		All	Known	Unknown	Ambig	Ambig/known		(%)
-		84	93	24	65	84
1		84	93	21	64	84
2		81	94	0.78	59	85
3		81	94	0.78	59	85
14		82	91	21	59	76
24		75	86	0.78	42	61
34		75	86	0.78	42	61
15		85	94	25	66	85
25		82	94	0.39	59	86
35		82	94	0.39	59	86
NB! Option codes now changed.


5/100. Total words: 1955
1955 known, 0 unknown, 595 ambiguous, 595 known ambiguous

Options		All	Known	Unknown	Ambig	Ambig/known		(%)
-		96	96	0	86	86
1		96	96	0	87	87
2		97	97	0	89	89
3		97	97	0	89	89
14		93	93	0	78	78
24		90	90	0	68	68
34		90	90	0	68	68
15		96	96	0	88	88
25		97	97	0	90	90
35		97	97	0	90	90

SUMMARY: best overall success rates came from:
 5/95  v	-
 5/100 v	15/25/35
 5/95  a	-
 5/100 a	25/35


14-12-92
--------
(Notes made over the weekend)
1. All links is not the same as all paths. All paths becomes amazingly
intractable, but it might be interesting to implement anyway.

2. We could try to make the probabilities be real probabilities. The way to
do this is to get rid of the normalising factor on chains (or use it to keep
things within bounds but then undo it on output), and to make sure we use the
correct normalisation of the transitions, namely:
   total from i to j / total from i to any (Sharman).
This could be another option. See below on B-W for more on this.

3. Comparing to Sharman, a difference is that I do not need his "pi"
parameter, since I have an unambiguous initial state, namely the unambig word.

4. Denoting coltot by c and rowtot by r, the options for normalising the
transitions matrix at present are: t, t/c, t/c*c and t/c*r. This leaves t/r and
t/r*r as possibilities. From Sharman, the correct one is as given in (2)
above, which is t/r. In fact, I had got confused because of an erroneous
comment about what coltot and rowtot meant. This explains why option 5 made
such a difference. So get rid of these options and implement the correct version.
 (i) changed names of the variables to make it clearer.
 (ii) null option is now t/from_total. Also retain as options, in case we want
them: t (option 1), t/from_total (option 2), t/from_total * to_total (option
3).
 Comment: from_total and to_total should be the same, but we keep them
separate in case we add special cases later.

Example results with these new options. 5/100
Options		All	Known	Unknown	Ambig	Ambig/known		(%)
-		92.94	92.94	0	76.81	76.81
1		93.71	93.71	0	79.33	79.33
2		93.71	93.71	0	79.33	79.33
3		92.94	92.94	0	76.81	76.81
l		96.01	96.01	0	86.98	86.98
1l		95.81	95.81	0	86.22	86.22
2l		96.42	96.42	0	88.24	88.24
3l		96.52	96.52	0	88.57	88.57

Note that from the Sharman equations, the correct option out of 4/5/- is
always -, i.e. the one where we multiply the transition score by the base of
the tag we are going to.
CONCLUSION: in future, always just use option -: the others, though they may
make a difference to the results, are not the a priori correct ones. (Though
for tests, out of interest, I may still try the others.)

5. Next major thing is to implement the forward-backward algorithm, following
the equations given by Sharman. This can be a suboption to the all links case.
I think I should drop the best link option on the all links case, since it
doesn't really make any sense.

6. The step after 5 is to do Baum-Welch re-estimation of the parameters.
Again, see Sharman. We can do this in two ways: output a corpus and rebuild
the dictionary from it; or build the new data as we go along in the labeller,
and have options to output it as well as the corpus. The second one looks
better. Get the equations from Sharman. We need more than just the joint
probability, but we use the joint probability as the predictor.
 The re-estimation can be done at the output_words stage.
 We could also do it in the dictionary builder using exactly the code we have
already (* I think! *)
 The % correct can be used as a convergence measure. Perhaps this should be
made automatic. In fact, an iterate option could be added to the labeller, so
it keeps on relabelling the corpus until told to stop, or up to some number of
times, or until there is no change.
 Does normalisation of output scores matter for this? NO: because top and
bottom of the re-estimators have always been normalised by the same amount.
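The iterate option sketched in this point might look something like the loop below. Here relabel and score are hypothetical stand-ins for one B-W relabelling pass and the %-correct convergence measure; the real labeller would supply both.

```python
def iterate_labeller(relabel, score, max_iters=25, tol=0.0):
    """Keep relabelling until the %-correct measure stops changing,
    or until max_iters passes have run. `relabel` runs one pass and
    returns the new model (None input = start from initial data);
    `score` returns % correct for a model. Both are stand-ins."""
    history = []
    model = relabel(None)              # first pass from the initial data
    history.append(score(model))
    for _ in range(max_iters - 1):
        model = relabel(model)
        history.append(score(model))
        if abs(history[-1] - history[-2]) <= tol:
            break                      # no change: treat as converged
    return model, history
```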


REIMPLEMENTATION of algorithms following Sharman gives better results on
Viterbi than before. We get results as follows, considering only the "correct"
scoring options (i.e. not 1, 2 or 3 and not 4 or 5). Using option m:

Test		All	Known	Unknown	Ambig	Ambig/known		(%)
5/100 Viterbi	97.65	97.65	0	92.27	92.27
5/100 FB	97.60	97.60	0	92.10	92.10
5/95 Viterbi	86.09	94.28	32.17	69.49	86.44
5/95 FB		86.24	94.40	32.56	69.85	86.80

For comparison, we now do a 100/100 test with FB
		All	Known	Unknown	Ambig	Ambig/known		(%)
100/100 FB	98.11	98.11	0	94.40	94.40


15-12-92
--------
First attempt at a true BW run. Options "tsrmlb". 38587 words in total, 12992
ambiguous, none unknown.

First run (as above):
		All	Ambig	(%)
100/100 FB	98.11	94.40

iteration	All	Ambig	(%)	Change in all	ambig
2		97.96	93.94		0.15		0.46
3		97.05	91.23		0.91		2.71
4		96.01	88.16		1.04		3.07
5		95.75	87.38		0.26		0.78
10		94.03	82.26		1.72 (0.34)	5.12 (1.02)
15		93.59	80.79		0.44 (0.09)	1.47 (0.29)
20		93.28	80.04		0.31 (0.06)	0.75 (0.15)
25		92.89	78.88		0.39 (0.08)	1.16 (0.23)

COMMENTS:
1. These figures do seem to be coming to an equilibrium, albeit somewhat worse
than the starting point.
2. The actual scores assigned, and the values in the dictionary, etc. are
getting smaller and smaller. I wonder how other people have coped with this.
Sharman mentions it but does nothing about it. I suspect that some of the
computations have already gone down as far as noise level. I am not convinced
by the remark I made above about normalisation. This area needs rethinking.

COMMENT: as the code stands, dictionary values are renormalised on reading. I
wonder what happens if I take this out.

iteration	All	Ambig	(%)	Change in all	ambig
1		98.11	94.40
2		98.25	94.79		+0.14		+0.39
3		95.49	86.64		2.76		8.15
4		94.35	83.22		1.14		3.42
5		93.75	80.90		0.60		2.32
10		90.58	72.03		3.17 (0.63)	8.73 (1.75)
15		90.04	70.43		0.54 (0.10)	1.60 (0.32)
20		88.94	67.15		1.10 (0.22)	3.28 (0.66)
25		88.82	66.79		0.12 (0.02)	0.29 (0.06)

So again there is convergence, but to a lower value. I think there is a slight
inconsistency in Sharman, in that he claims the output probabilities of a word
should sum to one ("in each set...sum to one", p.7), but the re-estimation
does not ensure this. What makes the issue confusing is that one word may have
tagset {i,j,k} and another {i,j,l} say, in which case normalisations have
different effects: i and j have the same "relative normalisation", but k and l
do not. Since the effect percolates through to xi on the next iteration, this
potentially causes a problem.

If we label 5.txt against the ref and dat we now have (25 iterations) we
get: 88.44%/62.02%, which is not very encouraging. I think this calls for a
re-assessment of the technique. (The Cutting et al paper may be useful here.)

One more experiment to try is to actually make use of the pi values, even
though we have a single initial state. Set them to 1 in the dictionary
builder, and then use them as specified in Sharman. Revert to dictionary
normalisation for comparing the test results.

iteration	All	Ambig	(%)	Change in all	ambig
1		98.11	94.40
2		97.96	93.94		0.15		0.46
3		97.05	91.23		0.91		2.71
4		96.01	88.16		1.04		3.07
5		95.75	87.38		0.26		0.78
10		94.03	82.27		1.72 (0.34)	5.11 (1.02)

This is as before: i.e. pi makes no difference.

Cutting et al. have a method for ensuring numerical stability. Add this in,
with option 'c' to get it (since this is the name of the parameter they use).
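For reference, the scaling idea in Cutting et al. is roughly this: renormalise each alpha column to sum to one and remember the factor c_t, so that scores never underflow and log P(obs) is recoverable as the sum of log c_t. A sketch under assumed data structures, not the program's actual code:

```python
def forward_scaled(obs, states, trans, emit, init):
    """Forward pass with per-step scaling coefficients c_t
    (Cutting-style). Each alpha column is renormalised to sum to 1
    and the factor is kept, so scores never underflow.
    All data structures here are illustrative."""
    alphas, cs = [], []
    prev = {}
    for t, o in enumerate(obs):
        if t == 0:
            alpha = {s: init[s] * emit[(s, o)] for s in states}
        else:
            alpha = {s: emit[(s, o)] * sum(prev[p] * trans[(p, s)]
                                           for p in states)
                     for s in states}
        c = sum(alpha.values())            # scaling coefficient c_t
        alpha = {s: a / c for s, a in alpha.items()}
        alphas.append(alpha)
        cs.append(c)
        prev = alpha
    return alphas, cs                      # prod(cs) = P(obs)
```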

As an illustration of numerical stability adjustment, in the non-stabilised
version, after five iterations, we have a typical dictionary entry:
 's 3 6.13244e-10 29 9.98925e-23 53 5.67347e-52 95
 've 1 0.0162602 49
and a typical chain of words in the output:
     television                                         NN:1
     .                                                  .:4.78713e-08
     life                                               NN:1.75054e-10
 a   of                                                 IN:2.43414e-19
     Miss                                               NPT:2.43414e-19
     Nightingale                                        NP:0.742833
     .                                                  .:1.0038e-06


HOLD! REDO THIS STUFF, because I just noted that the change in results I am
listing below happened even before I applied the numerical stabilisation
option! ALSO try again without a priori dictionary normalisation.

(num.stab.)
Does it make a difference to the output?
...
iteration	All	Ambig	(%)	Change in all	ambig
1		98.11	94.40
2		98.36	95.13		+0.25		+0.73
3		97.23	91.79		1.13		3.34
4		95.99	88.08		1.24		3.71

I.e. there is a difference, but it isn't uniformly better or worse than
before.


16-12-92
--------
Changes to program: drop options 1-5. Do normalisation of dict scores in the
dictionary builder only. Numerical stability stuff added. Do a re-run of some
tests when this is done. Similarly, adjust transitions in dictionary builder.

Results after doing this (/c = with option c):
		All	Unknown	KAmbig	All/c	Unkn/c	KAmbig/c
5/100 FB	97.60	0	92.10	97.60	0	92.10
5/100 V		97.65	0	92.27	97.65	0	92.27
5/95  FB	86.29	32.56	86.97?
5/95  V		86.14	32.17	86.62

This is OK!

Iteration option built into the proggy. Now we can do some iteration tests.
Options: tlrmb(c)
Iteration	All	Ambig		All/~c	Ambig/~c
1		98.11	94.40		98.11	94.40
2		96.90	90.80		98.33	95.04
3		90.28	71.12		98.36	95.14
4		89.43	68.61		98.32	95.02
5		87.18	61.??	98.25	94.80
6		87.56	63.16		98.20	94.67
7		87.25	62.24		98.17	94.56
8		87.25	62.24		98.14	94.47
9		87.25	62.24		98.13	94.44
10		87.25	62.24		98.11	94.39

This shows that the process converges (pretty much), but the results are
rather worse than I was getting before. This could be because of (valid)
changes in the program, or it could be a consequence of option 'c' in some
way. Try it without 'c'.  Add them to the table above.

(Checked iteration option by running separately.)

Later: I think the high figures in the right hand columns are because there
are lots of zero score chains, which get thrown out altogether. So the
"ambiguity" seen by the program is much less than you might think! This is
corrected in tomorrow's entry.


17-12-92
--------
Attempting to correct the apparent problems which arise when you use the
numerical stabilisation. c_t+1 added to the xi calculation.

Try to construct a justification that what I have done is correct, where I
have worked over a fixed chain rather than over all time. All time can't
actually be done properly with F-B as it stands because, though you can do the
alphas, you can't do the betas. I think the algorithm is basically OK, because
the observational probability sorts things out. The only point is what happens
at the junction between the chains: what are the right alphas and betas here?
(Actually, I think I'm happy about that point after all.)

I have now added lots of things which prevent any alpha or beta scores, or
any of the re-estimated values, from becoming zero, by adding the minimum
double value in some calculations.
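The zero-flooring described here amounts to something like the following. The name floor_tiny is illustrative; the real code presumably does the equivalent in C with DBL_MIN.

```python
import sys

TINY = sys.float_info.min   # smallest positive normalised double

def floor_tiny(x):
    """Clamp a probability-like score away from exact zero, as done
    for alpha/beta scores and re-estimated values (illustrative)."""
    return x if x > TINY else TINY
```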

This means the non-stabilised performance is less good than it was for the
reasons noted above, but I think the result is more reasonably correct than it
was. So once again generate the figures (over 9 iterations)

Iteration	All	Ambig		All/~c	Ambig/~c
1		98.11	94.40		98.11	94.40
2		96.65	90.06		97.39	92.25
3		95.93	87.92		96.11	88.45
4		93.15	79.66		92.96	79.09
5		88.71	66.48		89.11	67.66
6		90.68	72.31		87.65	63.32
7		88.85	66.89		87.25	62.15
8		90.88	72.91		87.21	62.00
9		90.67	72.30		87.20	61.98

But I still don't believe the values. Look in the .ref and they are often
still ridiculously low, even if you have numerical stabilisation (e.g. single
tag proper nouns with scores of e-200, e-300). Without c you get NaNs (oh
dear!).

I think the problem may well be the splitting up into separate sequences
(models). Looking at Cutting, they say that they do this and then average the
results. So I think what I am doing wrong must be in the combination of
things. What I really want is to do the re-estimation (i.e. dividing by gamma
etc) at the end of each model, and to sum the results into the final arrays
from this. And then to average as they suggest, by the total number of models.
Try implementing this next.

I have just realised that I have been doing the wrong thing with the symbol
probabilities. In the equations, what is used is b(i,k), which is the
probability of generating output k from state i, where states can be
identified with tags and outputs with words. What I have been using is in fact
the probability of getting a tag given a word. Of course, we can use Bayes
theorem to relate them:
	b(i,k) = b(k,i) * p(k) / p(i)	i.e. p(w | t) = p(t | w) * p(w) / p(t)

So do this. For now, just completely drop the re-estimation stuff.

Doing this in dict, we get:
		All	Unknown	KAmbig
5/100 FB	98.98	0	96.64	better + same + better
5/100 V		98.93	0	96.47	better + same + better
5/95  FB	86.96	32.56	89.26	better + same + worse?
5/95  V		87.06	32.17	89.79	better + same + better

This is happy enough, so now we can go back to getting re-estimation right.

We can count the number of models, which is the number of times we call tag(),
and divide the summed transitions and pi values by this. But I am less sure
what to do with the new output probabilities. There is no need to do any
Bayesian stuff, because the re-estimation formula already gives us b(w | t),
i.e. the probability of outputting the word given the tag. The central
question really is what do we average by. But it may not matter, provided we
don't care what the absolute values of the outputs are.


Results with the latest version
	Non-c			C
	All	Ambig		All	Ambig
1	98.95	96.89		98.95	96.89
2	92.67	78.23		84.20	53.07
3	91.27	74.08		88.62	66.19
4	91.09	73.53		85.93	56.62
5	90.97	73.17		87.66	63.36
6	88.47	65.76
7	88.80	66.73
8	89.12	67.67
9	89.14	67.76
10	89.17	67.83


18-12-92
--------
After much faffing around, I've finally traced various bugs so we don't get
NaNs in the dictionary, Infinities in the transitions, etc.

Can get zeros in pi on second etc iterations. Presumably for tags which never
occurred at the start of a chain; change to TINY. This is why the performance
was stabilising.

What is the correct way to average the dictionary models? Given how short the
models are, it is very common to get 1 as the final result, because in a given
model, the word occurred only once with the tag, and we finally divide by the
total models, which is one per word occurrence. Although there are non-1
values in the dictionary, I am still not convinced the values are right. So I
propose to take the re_est, not divided by the models for the word, and divide
it by the total transitions we have for the label in the whole corpus. This
seems to keep the values reasonably good.

We can also look at the max and min values we find in various arrays and see
if they stay reasonably stable. (Pi values seem to plummet by about 10 orders
of magnitude per iteration (this is with c turned off); so put in a scaling in
adjust_parameters. Since it applies to all models equally, it is OK to do
this.) 
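The pi rescaling could be as simple as renormalising the vector once per iteration, as sketched below; since the same factor applies to every model, the relative values are untouched. (The real adjust_parameters presumably does the equivalent in C.)

```python
def rescale_pi(pi):
    """Renormalise the pi vector each iteration so its values stay in a
    reasonable range instead of plummeting by orders of magnitude.
    Dividing every entry by the same total leaves ratios unchanged."""
    total = sum(pi.values())
    return {t: v / total for t, v in pi.items()}
```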

I have been unable to get the numerical stabilisation as specified in
Cutting to work: it makes things just grow and grow. But I can't find the bug.
However, using the above change to stabilise pi, the pi, trans and dict
values all stay in a more or less constant range, and the results seem to come
out OK. So we will accept this.

The results, with options trlmb, are then as follows. (Note: 38587 words total,
12992 ambiguous = 34%).

Iter	1	2	3	4	5	6	7	8	9
All	98.95	95.93	93.19	92.65	92.33	92.17	91.99	91.94	91.89
Ambig	96.89	87.90	79.78	78.16	77.23	76.74	76.21	76.06	75.92

Iter	10	11	12	13	14	15	16	17	18
All	91.93	91.80	91.77	91.77	91.76	91.76	91.76	91.76	91.76
Ambig	75.75	75.63	75.56	75.55	75.52	75.54	75.52	75.52	75.52

The results have stabilised, i.e. B-W has converged as it ought to.

The bug whereby running for 9 iterations gave different results to running for
5 then running for 4 seems to have been fixed, from the latest test.

This looks better, though the results are lower than might be hoped. But that
could be just the training data.

Another change: go back to accumulating values across the whole corpus and
then dividing by gamma sums etc. Working through some examples of single and
multiple models by hand, I think this has to be right. Implemented with the
define global_ave to switch it on. It gives the following (with stability of
trans/dict/pi checked as before):

Iter	1	2	3	4	5	6	7	8	9
All	98.95	98.05	97.48	97.01	96.77	96.69	96.45	96.39	96.33
Ambig	96.89	94.20	92.53	91.13	90.41	89.91	89.46	89.28	89.11

Iter	10	11	12	13	14	15	16	17	18
All	96.30	96.27	96.18	96.02	95.95	95.89	95.74	95.69	95.46
Ambig	89.02	88.91	88.56	88.18	87.96	87.81	87.36	87.21	86.51

Although these results are still going down, they are rather better than
before. The lack of convergence may be a property of the size of the corpus.

AGENDA I (at the end of the week)
---------------------------------
DONE (ish) 5-1-93 Build tags dictionary from LOB and then do self learning.
DON'T 28-12-92 Make the changes for idioms and untagged LOB format.
DULL 22/12/92 QUESTION: what values are being used for the bases on unknown
words? ANSWER: It's always 1. Is this right?
DONE 28-12-92 Make the changes for parallel implementation.
DON'T Take ideas from Cutting 3.3, 3.4, 3.5.
DON'T Faster tokenising and dictionary lookup (Cutting suggests "tries" (Knuth)
for the latter).
DON'T Consider throwing out small frequency words from dict - data may not be
reliable.
UNIMPORTANT 2/1/93 Deal with capitalisation.
UNIMPORTANT 22/12/92 Using text for numbers (scores) in dict means a gradual
loss of accuracy.
DONE 21/12/92 Try using a bigger corpus.
DONE 21/12/92 Write up reasoning behind global_ave, and check through on pi's.
DONE 22/12/92 Try with models which are longer than at present: still anchored
by unambiguous words, but now allowing unambiguous words in the middle of a
chain.
DONE 21/12/92 Edit test corpora to get rid of things with ditto tags.
DON'T 22/12/92 Would be interesting to do a plot of performance against
ambiguity. As well as other plots.
SEE 22/12/92 Does the re-estimation process actually converge? If so, when? If
not, why not?

21-12-92
--------
A new week. Start by doing a bigger test than before. Take lobv7e, which is
about 2.5 times as big in the unedited form as lobv7c. The results (86171
words, 32619 ambiguous = 37%)

Iter	1	2	3	4	5	6	7	8	9
All	98.07	96.67	95.79	95.24	94.89	94.59	94.41	94.26	94.18
Ambig	94.89	91.20	88.88	87.43	86.49	85.71	85.23	84.94	84.63

Iter	10	11	12	13	14	15	16	17	18
All	94.04	93.96	93.91	93.87	93.85	93.80	93.77	93.75	93.73
Ambig	84.26	84.06	83.90	83.82	83.75	83.63	83.54	83.48	83.44

These are a bit lower than before, perhaps because of the slightly greater
degree of ambiguity. But they still seem reasonable.


The reasoning behind using global_ave is as follows. What it does is to take
all the separate models, form total xi and gamma values, and then divide them
once all the models have run, to do the re-estimation; as opposed to dividing
after each model and averaging. I decided that this was right by considering
a model:
	w1:{t1}  w2:{t2,t3} w3:{t1} w4:{t4,t5} w5:{t6}
which can be treated as one model, or as two split at word three. I worked
through the equation for the two cases, which give the same answer whichever
averaging scheme you use. However, if you set tags 1 and 4 to be the same, and
also tags 2 and 6, then only the global_ave scheme works. There is still a
difference between using a single model and two models, which appears
whichever averaging scheme you use, and that is pi, since in one case there
are two final states and in the other case only one.
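The global_ave scheme, i.e. accumulate xi and gamma over all the models and divide once at the end, can be sketched as below. The per-model dict layout is illustrative; the test shows a case where the per-model divide-then-average scheme would give a different (wrong) answer.

```python
def reestimate_transitions(models, tags):
    """global_ave: sum xi and gamma over ALL models first, then divide
    once, instead of dividing per model and averaging. Each model is a
    dict with 'xi'[(i,j)] and 'gamma'[i] sums (toy layout)."""
    xi_tot, gamma_tot = {}, {}
    for m in models:
        for key, v in m["xi"].items():
            xi_tot[key] = xi_tot.get(key, 0.0) + v
        for i, v in m["gamma"].items():
            gamma_tot[i] = gamma_tot.get(i, 0.0) + v
    return {
        (i, j): xi_tot.get((i, j), 0.0) / gamma_tot[i]
        for i in tags for j in tags if gamma_tot.get(i, 0.0) > 0
    }
```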


Tried taking the ditto tags out of lobv7c, replacing x_Y y_Y" by x-y_Y etc.
The only change not made was for the following, which looks like an error:
	a_JJ la_IN" Dietrich_NP
The results are (38422 words, 8562 ambig = 22%)
Iter	1	2	3	4	5	6	7	8	9
All	99.13	98.58	98.14	97.73	97.57	97.45	97.83	97.30	97.26
Ambig	96.11	93.61	91.64	89.83	89.09	88.55	88.25	87.89	87.71
(Much lower ambiguity; much better performance).
(This corpus is called 7c.txt.)

Similarly edited lobv7e as 7e.txt, so we now have a full set of results.
The results are (85743 words, 26003 ambig = 30%):
Iter	1	2	3	4	5	6	7	8	9
All	98.35	97.38	96.57	96.10	95.79	95.51	95.31	95.17	95.07
Ambig	94.56	91.35	88.71	87.13	86.12	85.19	84.54	84.06	83.74


Now on to the auto-learning stuff (various preliminary experiments done -
chopped results because I now have better ones). Define a number of options in
dict.

   1-2 - transition options:
	default: set transitions using score of "from" word.
	1: set all transitions to 1.
	2: set transitions using product of "from" and "to" scores.
   3 - dictionary score options:
	default: set dictionary values to sum of scores of values seen.
	3: set all dictionary values to 1.


Now we can do some tests. We first build a wordlist from 7c.txt and then use
it in the dictionary builder. Just do 6 iterations for an initial test.
(Always need options mu on dictionary builder in the following tests.)

Iter	All	Ambig	All	Ambig	All	Ambig
	------nil----	-----1-------	------2------
1	94.22	74.07	96.02	82.15	92.75	67.47
2	95.90	81.60	97.75	89.92	95.53	79.93
3	95.85	81.38	97.25	87.64	95.94	81.78
4	95.67	80.58	96.93	86.21	95.89	81.56
5	95.51	79.83	96.64	84.92	95.75	80.93
6	95.42	79.44	96.46	84.10	95.66	80.52

Iter	All	Ambig	All	Ambig	All	Ambig
	-----3-------	------13-----	-----23------
1	same as nil	89.93	54.79	95.34	79.07
2			94.41	74.92	95.12	78.10
3			94.16	73.81	95.07	77.89
4			93.90	72.16	95.02	77.67
5			93.79	72.12	94.93	77.25
6			93.36	70.19	94.92	77.19


22-12-92
--------
Convergence? If you plot a run over 18 iterations, such as that of 7e above,
then the performance looks very like e^(-x) where x is the iteration. So this
would imply convergence.

What is more of a worry is the fact that the results are declining, even with
auto-learning. It may be that this is correct, and it is a property of the
data; in this case, the convergence is to a maximum, but a sub-optimal one.
But it may also be that there is a problem with the program.


An experiment we could do is to build a dictionary from 7e and then label 7c
with iteration. I am not sure how well this will work out given that there
will be unknown words, but we could find out. Or we could chop out all
sentences containing unknown words.
(Suspend the work on this for now - VERY slow with unknown words.)

Now do an experiment on the size of the models. At the moment we have lots of
models because if there are two unambiguous words adjacent, this counts as a
model. As an experiment, try making a change, so that we only declare a
sequence a model if there was at least one ambiguous word in it. (This might
lead to stack overflows.) Test it on 7c. It gives the same results, so leave
the code in (with defines around it); it should make things a little faster.

While we are at it, check that skip words work. If so, we can use the labelled
LOB corpus as it stands. (Done: one point to note is that you must never rely
on from_1, to-1, < to, etc.: you must always draw on the next and prev fields.)
Checked on Viterbi as well.


AGENDA II
---------
(In addition to earlier entries)
DONE (ish) 5-1-93 Need to do some tests with corpora from different sources:
i.e. train on one, test on another. Can do this either in the manner of a 5/95
test (in effect), or by running the new corpus through the dictionary builder
first. In either case, some allowance for unknown words is needed. Really need
to try and write up all these different possibilities and think about what is
really best. Or we could do something in the re-estimation which adds unknown
words to the lexicon.
DONE 28-12-92 Write code for merging dictionaries and matrices, and for
outputting separate xi's and gammas. All this is needed for parallelisation
and merging.
DONE 2-1-93 Sorting out the user interfaces to these proggies might be a good
idea; use of filenames is a mess at present. Also sort out file structuring.
Build a library, then a series of orthogonal applications on top. Set up
makefile with a single target for them all as well.


28-12-92
--------
Over the Christmas break, I have done an extensive restructuring of the code,
to lessen the dependence on globals and make further work easier. This also
includes writing a dictionary merge program, as yet untested. To have some
confidence in these changes, it looks like a good idea to do some standard
tests.

Viterbi:			All	Unk	Ambig	Ambig/Known
5/100 (idioms not removed)	98.93		96.47
5/95 (idioms not removed)	87.06	32.56	71.79	89.61
100/100 (idioms removed: 7c)	99.13		96.10
(* Essentially as before *)

B-W iteration on 100/100
Iter	1	2	3	4
All	99.13	98.91	98.82	98.70
Ambig	96.11	95.12	94.69	94.18
Obs	0.010	0.0018	0.0018	0.0018

These are actually a little better than before. They were taken with
Ambig_needed defined out; what if it is brought back in?

Iter	1	2	3	4	5	6	7	8	9
All	99.13	98.91	98.82	98.70	98.63	98.59	98.56	98.54	98.53
Ambig	96.11	95.12	94.69	94.18	93.87	93.66	93.52	93.45	93.42
Obs	1.8e-5	2.2e-6	2.2e-6	...

So this gives the same performance as before. So I can only assume that my
code changes have eliminated some minor bug, such as rounding error. The next
test ought to be to do some self learning; but leave this for the moment
because it's too cold to do much today!


Aside: here's a comment I excised from a source file
    For the dictionary builder, we use this to adjust the scores using Bayes
    theorem. On entry, we have the number of occurrences of each tag for the
    word, which is equal to T = p(t | w) * n(w). What we want to get is the
    probability of the word given the tag, which is defined as:
     p(w | t) = p(t | w) * p(w) / p(t).
    where p(w) = total occurrences of this word / total of all words.
               = n(w) / sum_w(n(w))
    and   p(t) = total occurrences of this tag / total of all tags.
               = n(t) / sum_t(n(t))
    We can take sum_t(n(t)) = sum_w(n(w)), assuming the probabilities of the
    tags on a word sum to 1.
    Hence p(w | t) = p(t | w) * n(w) * sum_t(n(t))  =  T / n(t)
                     -----------------------------
                         sum_w(n(w)) * n(t)
    So all we need is the amount we already have and the total number of
    occurrences of the given tag. The latter can be found from the from_total
    values, so provided these have been set up we are ready.

    In fact, this normalisation will not be done here, but it means that we
    must output gamma values taken from the n(t) totals. [Changed from
    earlier versions.] So this function need do nothing at all.
It explains how the normalisation of dict scores is done.
(Later comment: to confirm this, this is just the same as the B-W
re-estimation of parameter b, which divides by total gamma values for a given
tag.)
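The collapse to T / n(t) in that comment can be checked with a few lines. A toy sketch: counts stands in for the dictionary builder's per-word tag counts, and the function converts them into p(word | tag) values exactly as the derivation says.

```python
def tag_counts_to_word_given_tag(counts):
    """Convert per-(word, tag) counts T = p(t|w) * n(w) into
    b(i,k) = p(w|t). Per the derivation above, with
    sum_t(n(t)) = sum_w(n(w)) the algebra collapses to T / n(t).
    `counts` maps (word, tag) -> occurrences (toy data)."""
    n_t = {}
    for (w, t), n in counts.items():
        n_t[t] = n_t.get(t, 0) + n     # n(t): total occurrences of tag
    return {(w, t): n / n_t[t] for (w, t), n in counts.items()}
```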

As the next step, integrate dict and label, so that when we go to having
things like FSMs, we can avoid duplicating the code. I think the first
thing is to write some sort of options handler.

2-1-93
------
Design the FSM - see separate document.

5-1-93
------
We now try doing various tests with wordlists. We will build a word list from
7e, then add all unknown words which occur in 7c, then label 7c vs the 7e
dictionary with re-estimation. The easiest way of adding the unknown words
would appear to be to build 7c.ref in the usual way, and the merge
dictionaries. This has the disadvantage of changing the list of labels you get
from 7e, but that doesn't really matter, since if we specify the "use
wordlist" option, then the dictionary gets initialised suitably anyway.

Comment: this creates a rather higher degree of ambiguity than previous tests:
11116 words out of 38422.

Comment: I just wasted a lot of time by running this in training mode. What we
really want to do is to see what would happen with an untagged corpus. I.e. we
are using the tags purely for evaluating the success. But I'll keep the
results anyway (from [[ to ]])

[[ start of bogus experiment

* All the results with option 1 are dodgy anyway, because I have just found a
bug. Fixed for the real experiment.

What we shall then want to do is an iterative run, with the first iteration
being training, and using a word list. There remain various options for
initialisation:
 no init code.
 init code 1: means that the training generates dict info but not trans info.
 init code 4: different method of calculating scores.
 init code 5: the two together (only affects dict).
(label 7c.txt dfoo.ref N B10 Ofoo.out lwS); see below for results. This gives
cases: no code, 1, 4, 5.

We can also specify option 2 on top of any of this, which has the effect of
initialising dictionary scores to 1/tag_max rather than 1/ntags. Cases: 2, 3,
6, 7.

Some of these results hit a peak and then decline. Sometimes there is more
than one peak; i.e. a peak, then a downswing, then an upswing and so on. So
as a larger experiment, try running each option for 50 iterations, and find
(1) the first peak, (2) the best peak and (3) if there is a final monotonic
downward trend, the peak just before it. Just note the "all" values. (Where
there is a downward trend, this is probably because of the limited amount of
training data.)

Bracketed if not a separate occurrence. If the "peak" is at 50, it hasn't
really been shown to be a peak.

Test	first	at	best	at	last	at
no	88.55	49	(88.55	49)	(88.55	49)
1	87.74	12	89.98	49	(89.98	49)
2	87.11	50	(87.11	50)	(87.11	50)
3	90.80	18	91.52	50	(91.52	50)
4	86.86	10	89.44	50	(89.44	50)
5	87.94	8	89.74	50	(89.74	50)
6	87.37	20	87.84	50	(87.84	50)
7	91.15	27	91.17	43	91.17	45

What are the best cases?
 3 = before training, set wordlist scores to uniform values; after training,
     set transitions to 1.
 7 = as 3, but use slightly different scoring (so that dictionary values come
     out slightly differently).
In both cases, we ignore the effect that the training has on the transitions;
that is training is purely for setting up the dictionary.

end of bogus experiment ]]



Now, we do it right: we use re-estimation with no training. Option 4
makes no difference (which is why the results above were kept: to show what
option 4 does). There are four cases: with init 1, 3, and in addition two more
where we use the transitions from 7e with init 0 or 2. (There isn't much point
in using options 0 or 2 on their own, since the data in the wordlist, if
treated as a dictionary, isn't all that meaningful.)

(label 7c.txt dfoo.ref N B50 ofoo.out wS I ...)
(label 7c.txt dfoo.ref t7e.dat N B50 ofoo.out wS I ...)

Test	first	at	best	at	last	at
1	88.48	50	89.06	100
3	84.89	35	87.29	92	87.29	92
no/7e	96.87	24	96.88	31	96.88	35	stable: 96.86 @ 36...
2/7e	97.00	3	97.11	39	97.11	45	stable: 97.10 @ 46

This shows that a good initial matrix makes a lot of difference, and that the
effect of dictionary initialisation seems to vary with whether we have an
initial matrix. The ones marked stable might eventually change a little, but
have probably come to noise level, after 50 iterations. To see what happens in
the other cases, extend the test to 100 iterations. (Change in results
recorded above.)


6-1-93
------
AGENDA. Further ideas to try
DONE Do some analysis of the kind of errors that occur (not much published on
this: claimed that N/Adj is a common one).
DON'T Consider just dropping co-ordination (restart the model? continue it?):
Ted says this helped.
DON'T Try having more uniform transitions to and from punctuation, if this
seems to cause errors.
DON'T Write a matrix editor/converter from readable format.
DONE Try num stab again.

For larger tests, add an option to skip words with ditto tags. This will give
slightly less good performance than really handling idioms, but it will do for
most testing. (Done).


It is claimed that lexical probabilities give about 90% of performance. Test
purely on the basis of the most frequent label to get a baseline.
Input corpus 7c.txt
With 7c.ref		96.02%
With 7ce.ref (7c + 7e)	95.34%
(* Comment: the "most frequent" test as it stands does not normalise with
gamma; see below for what happens if we add this. *)
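The most-frequent-label baseline amounts to this (a sketch, not the tagger's own code; no gamma normalisation, matching the test as run above):

```python
from collections import Counter, defaultdict

def most_frequent_baseline(train, test):
    """Tag each test word with its most frequent training tag and
    return the proportion tagged correctly.  `train` and `test` are
    lists of (word, tag) pairs.  Unknown test words are counted as
    wrong.  Illustrative sketch only."""
    counts = defaultdict(Counter)
    for word, tag in train:
        counts[word][tag] += 1
    correct = 0
    for word, tag in test:
        guess = counts[word].most_common(1)[0][0] if word in counts else None
        correct += guess == tag
    return correct / len(test)

# tiny made-up example: "time" is usually NN in training
acc = most_frequent_baseline(
    [("time", "NN"), ("time", "NN"), ("time", "VB"), ("flies", "VBZ")],
    [("time", "NN"), ("flies", "VBZ"), ("time", "VB")])
```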

We can also try tests of the sort done yesterday, but using a dictionary and
initialising the transitions matrix (rather than a word list). Use the 7ce
dictionary, with 100 iterations. The only interesting initialisation option is
1: (label 7c.txt NB100 d7ce.ref ofoo.out O9 I1)
	first	at
	97.77	2
Then, after some oscillation, ends up at 95.56%.

The only thing that is dodgy in this test is the normalisation, which is
1/tag_max; we should also try using correct gamma normalisation. Adding (yet)
another option for this, namely I8, the results are:
(label 7c.txt NB100 r7ce ofoo.out O9 I8)
	first	at	best	at	last	at
	96.26	33
Then ends up at 93.18%.

These results seem to suggest that transitions matrices rather than
dictionaries are important in getting a shove in the right direction.


7-1-93
------
Now we start doing some larger scale tests, by building dictionaries and
transitions from several separate parts of LOB and then merging them as
needed. Do this with corpus option 2 (skip ditto tags), and first make some
changes to the tag list to close some more categories, e.g. ditto tags.

Reminder: if you change the tags.map, then old dictionary and transitions
become out of date.

Changed mapping code so that the tags file now needn't and shouldn't list
ditto tags if skipping ditto (helps keep matrix sizes down a bit).

Also change to allow parsed numbers preceded by "*+" or "*-". This still lets
some sorts of number, such as "1960s", through. Also, make the anchor word and
tag be "^", the LOB sentence marker. We don't count it towards statistics,
however, but otherwise it is treated as word "^", tag "^". (These changes very
slightly alter the performance.)

To make dicts and transition matrices, define a command "build". Now we will
do this on all the lobtag files (first pass to find unknown tags, then a real
pass).

Corpus	Fsize	Words	Dsize	LWords	AWords	All	Ambig
lobv7a	1033282	<See Note>
lobv7b	 619965	 84581	 7756	 60657	19628	98.36	94.93
lobv7c	 387553	 53013	 7540	 38421	10307	98.94	96.04
lobv7d	 383954	 53309	 5238	 39113	15394	98.52	96.23
lobv7e	 854341	117620	10631	 85743	29809	97.96	94.13
lobv7f	1006264	137987	12217	100498	35866	98.36	95.40
lobv7g	1751205	<See Note>
lobv7h	 688833	<See Note>
lobv7j	1733347	232744	14095	168808	69502	98.40	96.13
lobv7k	 664645	 94324	 7841	 69138	27837	98.45	96.14
lobv7l	 553256	 79052	 6610	 57679	17461	98.45	94.89
lobv7m	 141212	 19999	 3181	 14491	 3294	98.97	95.48
lobv7n	 675191	 97376	 7626	 70928	25717	98.42	95.64
lobv7p	 672541	 96734	 6502	 70607	22635	98.49	95.51
lobv7r	 204452	 28608	 4440	 20936	 5926	98.77	95.65

Notes:
1. Fsize is the file size in bytes. Words includes skipped words. Dsize is the
dictionary size. LWords is the number of words labelled and AWords the
ambiguous ones. All and Ambig are the success rates against own dictionary and
transitions matrix. The figures are using the FB algorithm. Viterbi gives
about the same performance but sometimes with slightly different errors (e.g.
on 7r, exactly same performance, but about 6 differences in the errors).
2. Have to exclude "g" because it contains a formatting error which makes the
input routines fall over, namely "millions_rough_IN" at G58 93. Have to
exclude "a" because of the error "plum_NN_RB" at A23 7. Have to exclude "h",
because of "new_JJH22 122" at H22 105. There are some errors in the other
corpora, such as missing tags, but they are recoverable (at some cost in
accuracy).
3. Comment: these are all quite a lot lower than the 7c test with idioms
allowed for. Maybe the analysis of errors will show this as a major source.


8-1-93
------
For an analysis of errors in the above, see "Errors". Do this just on corpora
b and l, since we will be looking at these below.

11-1-93
-------
Build a big dictionary (30600 entries) and transition matrix from b, c, d, e,
f and j. We will also want to build one which has the unknown words from l.
Call these btoj and btojl (32366 entries), respectively.

Test the corpora in a variety of ways from these dictionaries.
		LWords	AWords	All	Ambig
btoj	b	60657	32991	97.80	95.96
btojl	b	60657	32991	97.80	95.96
btojl	l	57679	30623	96.32	94.25

		LWords	UWords	AWords	All	KAll	Unk	Ambig	KAmbig
btoj	l	57679	3065	33587	92.70	96.02	33.46	88.53	94.06


Now we can proceed to tests based on B-W reestimation. In each case, we will
do 50 iterations, and record the first, best and last peaks and the
stabilisation value if any.



13-1-93
-------
(Moved to using adder. Unfortunately it uses the opposite byte order! So all
the transitions files have to be rebuilt first. Do this first. Then will have
to stick with adder!)

We now want to explore a variety of dictionary and transition options, to
study the effects of having different kinds of available information. From
Docn, the options are:
 I n	do initialisation as specified by n. Some options may be combined;
	add the codes to achieve this effect. These options affect the initial
	normalisation of values. Denoting the raw frequency from a dictionary
	entry (i.e. the number of occurrences of a tag on a word) by d, the
	raw transition as t, and the raw gamma for a tag as g, we normally use
	d/g and t/g (taking the appropriate g for the tag of d and t). The
	option allows some other possibilities. We will use n for the number
	of tags on a word and T for tag_max: the number of tags defined. The
	same applies to pi values as to transition values. The range of
	options represents different degrees of smoothing the information,
	which might be wanted if it is not reliable with respect to the corpus
	to be tagged. These options do not apply if training (option l).
   0	Dict d/g, Trans t/g (default)
   +1   Dict d/n
   +2   Dict d/T
   +4   Trans t/T
   +8	Set all dict values to 1 first (in place of d)
   +16  Set all trans values to 1 first (in place of t)
	If options 1 or 2 and option 4 are in effect, no transitions file need
	be specified. If neither 1 nor 2, or not 4, there must be one.

We want to try the following initialisation codes:
	t/g	t/T	1/g	1/T
d/g	0	4	16	20
d/n	1	5	17	21
d/T	2	6	18	22
1/g	8	12	24	28
1/n	9	13	25	29
1/T	10	14	26	30

Add a further option: +32 (replaces +1 or +2), which divides by the total
score on the word. So we have the following further cases:
	t/g	t/T	1/g	1/T
d/s	32	36	48	52
1/s	40	44	56	60

Some of these will be equivalent (which is probably just as well!)
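The option codes above are bit flags, so a small decoder shows which normalisation a given code means (a sketch; the symbolic d/t/1 and g/n/T/s names follow the tables above, the function itself is mine):

```python
def decode_init_option(n):
    """Decode an I-option code into (dict, trans) normalisations as
    symbolic strings, following the flag meanings above:
    +1 dict d/n, +2 dict d/T, +4 trans t/T, +8 dict numerator 1,
    +16 trans numerator 1, +32 dict denominator s (total word score).
    Illustrative sketch, not the tagger's source."""
    dict_num = "1" if n & 8 else "d"
    trans_num = "1" if n & 16 else "t"
    if n & 32:
        dict_den = "s"   # total score on the word
    elif n & 1:
        dict_den = "n"   # number of tags on the word
    elif n & 2:
        dict_den = "T"   # tag_max, number of tags defined
    else:
        dict_den = "g"   # gamma for the tag
    trans_den = "T" if n & 4 else "g"
    return f"{dict_num}/{dict_den}", f"{trans_num}/{trans_den}"
```

This reproduces the tables: e.g. code 21 is d/n with 1/T, and code 44 is 1/s with t/T.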

This gives a total of 24 tests on each corpus. Run 30 iterations of each. Also
add a baseline test of picking the most frequent tag (bF and lF below).

Results (in performance order): corpus b

Test	d	t	best	at	comments
b0	d/g	t/g	97.80	1	non-iterative
b1,2	d/n,T	t/g	97.48	2	= b4(d/g,t/T), = b32(d/s,t/g)
b5,6	d/n,T	t/T	97.09	4	= b36(d/s,t/T)
b21,22	d/n,T	1/T	96.72	2	= b52(d/s,1/T)
b13,14	1/n,T	t/T	96.28	7	= b44(1/s,t/T)
b9,10	1/n,T	t/g	96.18	4	= b12(1/g,t/T), = b40(1/s,t/g)
b8	1/g	t/g	95.01	6	stable at 94.76
bF	-	-	93.98	-
b17,18	d/n,T	1/g	91.74	21	= b20(d/g,1/T), = b48(d/s,1/g)
b29,30	1/n,T	1/T	86.99	13	= b60(1/s,1/T)
b16	d/g	1/g	86.61	30
b24	1/g	1/g	79.81	30
b25,26	1/n,T	1/g	79.80	30	= b28(1/g,1/T), = b56(1/s,1/g)

There is no difference between using options 1 and 2. In fact, this is what
you would expect, since all they do is to change the output probabilities of
all tags on a given word by the same factor FOR THAT WORD. But since we only
ever use these output probabilities for a word considered independently of
others in the dictionary, this doesn't affect the actual performance. For this
reason the lines are commoned together above. For the l test, don't do the
extra tests at all.

Obviously also, if you divide one of d and t by T and the other by g, the
effect is the same as the other way round. So we can common up some more
tests: 1,2,4; 9,10,12; 17,18,20; 25,26,28.
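The invariance argument can be checked on a toy example: scaling every tag on a given word by the same factor leaves the chosen path unchanged, because the factor multiplies every candidate path's score equally. A brute-force path scorer (illustrative only, with made-up scores) makes this concrete:

```python
from itertools import product

def best_path(dict_scores, trans):
    """Pick the tag sequence maximising the product of per-word tag
    scores and transition scores.  Brute force over all paths;
    purely illustrative."""
    words = list(dict_scores)
    best, best_score = None, -1.0
    for path in product(*(dict_scores[w].keys() for w in words)):
        score = 1.0
        for i, tag in enumerate(path):
            score *= dict_scores[words[i]][tag]
            if i > 0:
                score *= trans.get((path[i - 1], tag), 0.0)
        if score > best_score:
            best, best_score = path, score
    return best

dict_scores = {"the": {"ATI": 1.0},
               "fish": {"NN": 0.6, "VB": 0.4}}
trans = {("ATI", "NN"): 0.8, ("ATI", "VB"): 0.1}
p1 = best_path(dict_scores, trans)
# scale every tag on "fish" by the same factor: the choice is unchanged
dict_scores["fish"] = {t: s * 0.01 for t, s in dict_scores["fish"].items()}
p2 = best_path(dict_scores, trans)
```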

Results: corpus l
Test	d	t	best	at	comments
l0	d/g	t/g	96.32	1	non-iterative
l32	d/s	t/g	96.27	3
l1,2	d/n,T	t/g	96.26	3	= l4 (d/g,t/T)
l36	d/s	t/T	95.97	6
l9,10	1/n,T	t/g	95.52	6	= l12(1/g,t/T)
l40	1/s	t/g	95.49	6
l13,14	1/n,T	t/T	95.37	8
l44	1/s	t/T	95.33	9
l21,22	d/n,T	1/T	95.11	3	= l52(d/s,1/T)
l8	1/g	t/g	95.00	6
l5,6	d/n,T	t/T	94.69	16
l17,18	d/n,T	1/g	91.73	5	= l20(d/g,1/T), = l48(d/s,1/g)
lF	-	-	91.31	-
l29,30	1/n,T	1/T	89.14	25	= l60(1/s,1/T)
l16	d/g	1/g	87.96	22
l25,26	1/n,T	1/g	85.03	30	= l28(1/g,1/T), = l56(1/s,1/g)
l24	1/g	1/g	79.25	30


18-1-93
-------
Analysis of the iteration tests. First we want to say roughly what the
various cases mean. Take the dictionary first. In decreasing quality of
information:

D1 (d/g)	Perfect information.
D2 (d/s)	Have relative tag frequencies on each word, and their relative
magnitudes are reliable.
D3 (d/n,d/T)	Have relative tag frequencies on each word, but we don't know
how these tag frequencies relate to overall tag occurrences (i.e. have P(tag |
word), but can't recover P(word | tag)).
D4 (1/g)	Uniform frequencies across word, but know some overall
information about tag occurrences.
D5 (1/s)	No information about differences in tags across a word, but
affects absolute values of tags on a word (and hence affects re-estimation
because it makes some models have a lower observation probability than
others).
D6 (1/n,1/T)	Minimum information - just know what tags occur on each word.

For the transitions, we can similarly say:
T1 (t/g)	Perfect information.
T2 (t/T)	The right distribution, but the absolute values are less
reliable. (E.g. if we were given a prenormalised array.)
T3 (1/g)	Information in a distribution matching overall tag
distribution, but otherwise smooth.
T4 (1/T)	Minimum information - completely smooth.

A useful measure of quality is whether a result is above or below the line,
i.e. performs better than just picking the most frequent tag.
Comment: most cases above the line converge quite quickly, most below it
converge more slowly.

First, we will construct orderings of one of the above parameters under each
of the others.
		Corpus b				Corpus l
	T1	T2	T3	T4		T1	T2	T3	T4
	D1	D1	D2,D3	D2,D3		D1	D1	D2,D3	D2,D3
	D2,D3	D2,D3	D1	D1		D2	D2	D1	D1
	D5,D6	D5,D6	D4	D5,D6		D3	D4	D5,D6	D5,D6
	D4	D4	D5,D6	D4		D6	D6	D4	D4
						D5	D5		
						D4	D3

		Corpus b				Corpus l
	D1  D2  D3  D4  D5  D6			D1  D2  D3  D4  D5  D6
	T1  T1  T1  T2  T2  T2			T1  T1  T1  T2  T1  T1
	T2  T2  T2  T1  T1  T1			T2  T2  T4  T1  T2  T2
        T4  T4  T4  T3  T4  T4			T4  T4  T2  T4  T4  T4
        T3  T3  T3  T4  T3  T3  		T3  T3  T3  T3  T3  T3

Comment: there are some patterns in this, but I think they are not strong
enough to really say much useful.

The approach of trying to put some sort of sensible ordering on this doesn't
appear to be very productive. Instead let's do two things:
(1) Repeat the tests using two other parts of the corpus to see if we get
similar results. In doing this, miss out some of the cases (at least
initially) to save time. To get comparable sizes, use corpus c for one and m+r
for the other; m+r is 35427 words. Just as we formed btojl.ref above, here we
form btojmr.ref (32248 entries). See below for the results.
(2) What do these results mean for unsupervised training?
(a) Suppose all we had was an MRD with no frequency information, i.e. the 1/n
case. Then the "l" results suggest that to get above the line we must have at
least some transition information (t/T) even if it is imperfect.
 (a1) If we had some information about overall tag frequencies, i.e. case 1/g,
then the same applies.
(b) Suppose we had an MRD with relative frequency information, i.e. the d/n
case. Then from "l", we can make do with minimal transition information.
 (b1) If we had some information about overall tag frequencies, i.e. case d/g,
then the same applies, except that we should not use 1/g for the transitions.
Comment: in all cases, using 1/g for the transitions looks like a bad idea.


Result of c/mr tests. (Tests equivalent to others are again not carried out.)
--------------------
Test	d	t	best	at	comments
c0	d/g	t/g	98.30	1	non-iterative
c1,2	d/n,T	t/g	98.05	3	= c4(d/g,t/T)
c32	d/s	t/g	98.05	3
c5,6	d/n,T	t/T	97.70	5
c36	d/s	t/T	97.70	5
c21,22	d/n,T	1/T	97.25	3	
c52	d/s	1/T	97.25	3
c13,14	1/n,T	t/T	96.85	17
c44	1/s	t/T	96.85	17
c9,10	1/n,T	t/g	96.71	4	= c12(1/g,t/T)
c40	1/s	t/g	96.71	4
c8	1/g	t/g
cF	-	-	94.68	-
c17,18	d/n,T	1/g	92.36	28	= c20(d/g,1/T)
c48	d/s	1/g	92.36	28
c29,30	1/n,T	1/T	87.21	16
c60	1/s	1/T	87.21	16
c16	d/g	1/g	86.83	30
c25,26	1/n,T	1/g	81.02	23	= c28(1/g,1/T)
c56	1/s	1/g	81.02	23
c24	1/g	1/g	79.04	30

Test	d	t	best	at	comments
mr0	d/g	t/g	96.57	1	non-iterative
mr32	d/s	t/g	96.31	2
mr1,2	d/n,T	t/g	96.31	2	= mr4 (d/g,t/T)
mr36	d/s	t/T	95.96	9
mr21,22	d/n,T	1/T	95.39	3
mr52	d/s	1/T	95.39	3
mr9,10	1/n,T	t/g	95.38	5	= mr12(1/g,t/T)
mr40	1/s	t/g	95.36	5
mr13,14	1/n,T	t/T	95.36	9
mr44	1/s	t/T	95.35	9
mr8	1/g	t/g	94.79	22
mr5,6	d/n,T	t/T	94.45	30
mrF	-	-	92.47	-
mr48	d/s	1/g	91.57	16
mr17,18	d/n,T	1/g	91.57	16	= mr20(d/g,1/T)
mr29,30	1/n,T	1/T	88.92	30
mr60	1/s	1/T	88.92	30
mr16	d/g	1/g	87.32	29
mr25,26	1/n,T	1/g	84.66	30	= mr28(1/g, 1/T)
mr56	1/s	1/g	84.66	30
mr24	1/g	1/g	79.24	30

Comparative orderings, omitting the +1/+2 difference. Equal ones bracketed.
Omit F.
	b		c		l		mr
	0		0		0		0
	[1		[1		32		[32
	4		4		[1		1
	32]		32]		4]		4]	
	[5		[5		36		36
	36]		36		[9		[21
	[21		[21		12]		52]
	52]		52]		40		[9
	[13		[13		13		12]
	44]		44]		44		[40
	[9		[9		[21		13]
	12		12		52]		44
	40]		40]		8		8
	8		8		5		5
	[17		[17		[17		[17
	20		20		20		20
	48]		48]		48]		48]
	[29		[29		[29		[29
	60]		60]		60]		60]
	16		16		16		16
	24		[25		[25		[25
	[25		28		28		28
	28		56		56]		56]
	56]		24]		24		24
i.e. reasonable similarity.


DONE Bug in O9 - only gives stats. Tidy up output options generally.
DON'T Improve the quality of the corpus reading routines.
DON'T Make the default line format be somewhat less than 50 (save a bit of
space!); or do something for tabs.
MOVED Add something to merge in unknown words during reestimation.
DONE Junk old source files which are no longer needed.
MOVED Maybe change file format to include actual tags in dictionaries (to make
them easier to modify by hand).
DON'T Improve interfaces and tidy headers (e.g. output routines are in a funny
place).
DONE Add an option to make the processing unit be a sentence marked by anchors.
DON'T Make consistency check better: maybe include tags file at start of trans
and dict files.
DONE Test with just bands for freq's in dict; and/or trans.


From meeting with Ted, 19-1-93
 He sees there as being three ways of working with tagging stuff (e.g.
thinking about how to resolve "to" as an Inf marker or as a prep).
1. Thresholding (de Marcken, sgp): keep several tags and use thresholds to
throw some out. Problem: can't really find useful thresholds (Steve's result)
- either throw out too much or keep too much in.
2. Named path (Kupiec). If you have Det N P Det N followed by something that
could be V3sg or Npl, then you see the P Det N and recognise it as a PP. On
returning from PP recognition, the NP recogniser then goes on to the VP and
does head-verb agreement. Problem is lack of modularity.
3. Underspecification technique. Use "architags" = tags which are implicitly
ambiguous between possibilities. E.g. TO for "to", AS for "as". Then if you see
TO followed by VP, you resolve as inf. marker; if followed by NP, as prep.
Problem: only works for certain error types.

We want to head towards a mixture of 2 and 3. Marcus and Magerman (sp?)
propose that in case 2, phrase boundaries correlate with lower probability
than transitions within a phrase, and you might recognise them that way.
Practical experience of Ted's is that this does not work.

One approach is inserting phrase markers as extra tags: see Church paper and
also another paper in applied ACL 88 by Eva Ejerhed.

My FSM approach looks like being a good thing to follow. Really after fairly
flat structuring. Need to consider whether recursion is a problem, and whether
it should be allowed. Can probably do an ordering on which things you look
for, e.g. find AP then NP then PP then VP. So if you have an NP with a PP in
it, you actually find an NP up to its head, then the PP.

Should analyse major error classes more, in particular to see what errors
there are of class 2. Look at the major and det errors which are frequent;
develop a tool for this.


21-1-93
-------
Fiddling with the output format reveals that there was a small bug in the
error counting (places where an unambiguous word at the start of a model got
the wrong tag). It doesn't make a huge difference to the values, but the tests
should be re-run (in the background) with the aim of getting the right values.
Note that the missed errors before would not have shown up on any self test.

Added an option to label sentences at a time; the idea is to make the error
checking easier. However, there is one way in which it is not straightforward.
It appears as follows: there was a word for which the only tag was NP and one
for which the only tag was PP2 following it. Without the sentence option,
these were in two separate models. With it, they are in the same one. But the
transition strength between them is zero, and hence all of the alpha values
(going forwards) or beta values (going backwards) collapse. This could be
fixed in the labelling algorithm, but I think that it is really an issue of
output. So drop the option as a labelling option, and instead add an output
option which delimits sentences. This means that we will get all of them, not
just the ones with errors in.

Tool to help with error analysis written. Now we go on to do a more
comprehensive analysis of certain errors in corpora b and l.

COMMENT Noticed in a stack dump that gamma on the first and last elements of a
model is always 1, rather than alpha * beta. Does it matter? It doesn't on the
last one, and it is correct on the first, because it allows for the
observation probability. So it is OK.

26-1-93
-------
Try "time flies like an arrow", etc.
Some sample funny sentences:
Correct					Tagger
they_PP3AS can_MD fish_VB		they_PP3AS can_MD fish_NNS
time_NN flies_VBZ like_?? an_AT arrow_NN	... like_IN ...
fruit_NN flies_NN like_VB a_AT banana_NN
				fruit_NN flies_NN like_IN a_AT banana_NN
("banana" was unknown in btojmr).
fish_NNS are_BER plentiful_JJ		(correct)
the_ATI fish_NN is_BEZ plentiful_JJ	(correct)


A loose end is to look at the numerical stabilisation. So do this.
Checked the code. Test on non-iterative self test of v7b: gave same
performance as standard run (except for a different average observation
probability). However, it still does not work when B-W re-estimation is in
use. I think this is because we accumulate totals for gamma, and the new
transition and dictionary values separately, and then combine them together at
the end, and this means that the values are strictly incomparable. So the
values you end up with in the matrix don't necessarily make a lot of sense.
However, if you don't keep gamma separate then there is the problem of how to
combine the re-estimates from the separate models; the sort of thing I looked
at using the "total_models[tag]" array in an earlier version.

28-1-93
-------
The FSM stuff is well on the way to being done. However, I think it is
necessary to stop and consider how the overall architecture of the labeller
will work now. I think what we want to do is, rather than having a stack made
up of single stack levels, to allow each stack entry to point to a list of
"next entries". In addition, we want to be able to have levels of the stack
which are essentially phrasal, in the sense that they have a tag of their own,
and contain a range of stack entries within them.

The first step is just to add the hooks for this to the labeller without
worrying about how the algorithm will work, and in particular how parallel
stacks like this get compared. So do this, with the standard labelling
algorithm just applying on the first such branch, and ignoring all the others.

(Of course, this means that the stack dump routine will be rather harder to
sort out: make it do just the first branch as well.)

Perhaps a better way to do this is by having phrasal hypotheses rather than
parallel stack entries. The entry would have a tag for the phrase as a whole
and a second hypothesis structure in case there are tagging algorithms working
within the phrase. You still have one structure per word: the ones after the
first don't specify any phrasal tag, but just indicate that the phrase is
continuing (or they say something like a ditto tag). What then about having a
sequence of words and using different labelling algorithms on them? One way
of doing this would be to mark each hypothesis with flags indicating what it
is relevant to. You just then ignore the things that aren't relevant in any
given algorithm.

(* Not sure which is best: see the code for what I actually end up with. *)

After playing around with this for a while, it really is getting into a mess. I
think it is important to rethink the data structures, bearing in mind exactly
how we are going to use them.

Basic point: we have a sequence of words, which each has an array of tags and
their scores, taken from either the dictionary or some source of information
for unknown words. The basic algorithms we apply to this are F-B and Viterbi.

Consider an example: word 1 has hyp Det, 2 has hyp Det, A, N, 3 has hyp N.
We can have an NP over the lot, over word 1 to word 2 and over word 2 to word
3. So the trellis looks like this:

     |--- NP (Det N)---------|
     |                       |
     |           |--- A -----|
     |           |           |
. ---|--- Det ---|--- N -----|--- N ---|
     |           |           |         |
     |           |--- Det ---|         |--- .
     |           |                     |
     |           |--- NP (Det N) ------|
     |                                 |
     |--- NP (Det A N) ----------------|

So it looks as if the way to implement this is to have information about each
link (e.g. alpha, beta), and for each HYPOTHESIS where the next hypothesis is.
Also, it will probably help if each hypothesis has a pointer back to sp.

(To save copying, I think you do something like doing top level actions first
and then if you pick that hypothesis doing lower level ones. I'm not sure how
this will interact with re-estimation yet, but it's probably not a problem:
can always have original and modified arrays temporarily. In fact, this needs
some thinking about generally, but we will sort it out. It may call for some
extra actions for FSMs.)

29-1-93
-------
Retests done: training, most freq, FB, Viterbi, BW

Comment: "most frequent" does not normalise with gamma. What happens if we add
this?
b-F without normalisation	93.98/88.94
b-F with normalisation		89.24/80.22
I.e. much better without, as before.

Comment: I also seem to have brought about a change in the average observation
probability. Not sure what has caused this. It is also worth rechecking the
num-stab stuff with this lot of changes, since I found a small bug in it
(though it doesn't fix the problem).


1-2-93
------

AGENDA
------
(Summarised from some things collected over the last couple of weeks)
MOVED Re-do the iterative tests to allow for some bugs found more recently
(effect is probably minor).
MOVED Add changes for retaining formatting and allowing markup.
DON'T (Long-term) Possible idea: trigrams are too hard because of getting
enough data. But you might be able to do something where as well as using
adjacent words, you also use a word and the next word but one, etc. Probably
this wouldn't work that well on its own, but combined with the simple bigram
model it might do well. (Also longer sequences.)
DON'T Notes from Ted: Bidirectional machines, e.g. anchor on "and" and look in
both directions to do co-ordination. Another idea: run tagger, run machines,
run tagger again on machine outputs. Allow machines to trigger on tags with
given prob, or trigger if the highest probability tags is X, and so on. Also,
initial tagging, pass through all tags with greater than some threshold, with
the threshold set up to get the error rate below some particular value.
DONE Consider changing standard file extensions to move away from esprit
stuff.
DONE Sort out numerical stabilisation by working through an example or two by
hand.
DOING Try other corpora and also perhaps corpora from widely different sources
to the build ones.
DONE Continue with FSM work.
DONE Add hooks for testing collapsed categories (i.e. where several are lumped
together in an automatic way).
DON'T Could we do something for automatically finding best dictionary
thresholds: e.g. see how much difference between the categories you need to
predict the correct cases, and then flatten out to that to try to avoid making
incorrect predictions.
DOCUMENT Error recovery could be a lot better in various places.
LATER Semantics of FSMs: what should block/force etc do, if more than one of
them applies. Do they all operate on the same tags, or do they each make a
copy. Also how to implement "phrasal" things? Also, complex interactions
between machines are likely to be very confusing.
MOVED. Get de Rose's thesis.
LATER How to construct FSMs in a principled way? Rather than just as hacks to
fix problems.
DONE As part of the "practical experience" thread, get the tagger to output
probability distributions for (1) observational probability (a) on all
sequences (b) on all sequences containing/not containing errors; (2)
probability of the chosen tag, (a) for all, (b) for correct/incorrect. The
latter relative to the total tag probability.
DONE (some) Collect some results from the above.
DON'T Abstract/summarise the papers in the two corpus books from the library.
IGNORE Comment that will show up with fsms is that earlier I noted if you
stick two models together and the unambig-unambig transition in the middle is
low, then you get low values throughout. So the FSM would always dominate
there. It has implications for reestimation. (a) Where do such low transitions
come from? (b) Should I consider hacking them?
DONE Change the program to make F-B really be an Option (even if not entered
from the command line), rather than just the default.
DONE Add a second dictionary option, entries from which are treated as skip
words when seen with the given tag (can then use this to test the effect of
missing out certain sorts of punctuation).

Numerical stability fixed - it was a bug, in that B~ and not B^ (in Cutting's
terms) was being used in the re-estimation. Fixed and tested on 5 iterations of
v7b starting from its own ref and dat - gave the same results in the
stabilised and unstabilised cases.
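The stabilised forward pass can be sketched as follows (Rabiner/Cutting-style scaling; this is my illustration of the idea, not the tagger's code). Each alpha column is normalised to sum to 1 and the scale factor kept, so the same factors can be applied consistently in the backward pass; mixing scaled and unscaled quantities in re-estimation was exactly the kind of bug described above. The product of the scale factors recovers the observation probability.

```python
def scaled_forward(pi, A, B, obs):
    """Forward pass with per-step scaling.  pi: initial probabilities,
    A: transition matrix (lists), B: per-state dicts of output
    probabilities, obs: observation sequence.  Returns the scaled
    alpha columns and the scale factors; the product of the scale
    factors equals the total observation probability."""
    n = len(pi)
    alphas, scales = [], []
    prev = None
    for t, o in enumerate(obs):
        if t == 0:
            alpha = [pi[s] * B[s][o] for s in range(n)]
        else:
            alpha = [sum(prev[r] * A[r][s] for r in range(n)) * B[s][o]
                     for s in range(n)]
        c = sum(alpha)                  # scale factor for this step
        alpha = [a / c for a in alpha]  # normalise the column
        alphas.append(alpha)
        scales.append(c)
        prev = alpha
    return alphas, scales

# a tiny two-state example (all numbers made up)
alphas, scales = scaled_forward(
    [0.5, 0.5],
    [[0.9, 0.1], [0.1, 0.9]],
    [{"x": 0.8, "y": 0.2}, {"x": 0.3, "y": 0.7}],
    ["x", "y"])
```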

2-2-93
------
Looked at some probability distributions from self testing v7b.

Added a sentence level labelling option, which only tags when a second anchor
is hit. This is for some of the punctuation testing. But before doing that,
let's record what happens with it. Self test v7b
base case				98.36/94.93 OP 0.062
with anchoring (a)			98.36/94.93 OP 1.4e-10
with anchoring, stabilised (ab)		98.36/94.93 OP 1

4-2-93
------
Revised way of getting probability distributions: switch on O256 which
includes O+, T+, R+, O-, T- and R- lines in the output. Then get the category
you want. E.g. for correct observational probability, go
	fgrep "O+" file-from-label | awk -f aprog | sort | uniq -c
where aprog contains:
	{ print $2 }
Note that the output file does not give results in numerical order, but it is
suitable for passing to gnuplot.


5-2-93
------
It's time to re-hack, re-think and re-state the main data structure, into the
form needed to represent phrasal hypotheses. The basic building block is a
*hypothesis list*, which is a linked list of hypotheses, each having a
tag, a base score, a pointer back to the dictionary, and information specific
to the tagging algorithm, such as alpha and beta values for FB. A hypothesis
list is associated with a word. The pointer to the dictionary may be NULL, in
the case of phrasal tags.

We then also have a top level hypothesis structure. There is a linked list of
them for each word, i.e. each level of the stack. The top level structure has
a hypothesis list; this is either the list taken from the dictionary or a list
of phrasal tags. The former always appears as the first top level hypothesis;
the latter allows for different phrasal tags spanning the same words. There is
additional specific information: numerical stability values etc for FB; for
phrases, a skip flag and a pointer to a list of top level hypotheses for the
first word, which allows recursion. The chain of successors across them
represents the span of the phrase.
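A rough sketch of these structures (field names are mine; the real thing is presumably C structs, see struct-diag):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hypothesis:
    """One tag hypothesis: tag, base score, pointer back to the
    dictionary (None for phrasal tags), plus algorithm-specific
    slots such as alpha/beta for FB."""
    tag: str
    score: float
    dict_entry: Optional[object] = None
    alpha: float = 0.0
    beta: float = 0.0

@dataclass
class TopHypothesis:
    """Top level structure, a linked list per word (stack level).
    Holds either the dictionary hypothesis list or a list of phrasal
    tags.  For phrases, `inner` points at the top level hypotheses of
    the first word (allowing recursion) and `succ` chains across the
    phrase's span."""
    hyps: List[Hypothesis]
    phrasal: bool = False
    skip: bool = False
    inner: Optional["TopHypothesis"] = None
    succ: Optional["TopHypothesis"] = None

# "the fish" with an NP phrasal hypothesis spanning both words
the_level = TopHypothesis([Hypothesis("ATI", 1.0, dict_entry="the")])
fish_level = TopHypothesis([Hypothesis("NN", 0.6, "fish"),
                            Hypothesis("VB", 0.4, "fish")])
the_level.succ = fish_level
np_level = TopHypothesis([Hypothesis("NP", 1.0)], phrasal=True,
                         inner=the_level)
```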

DON'T have a look at the I/O code in the concordance proggies.

8-2-93
------
Small refinement to the FSM semantics: do not move machines on skipped words
(because skipped words are principally for markup etc, and the use of them on
idioms is really a functional hack in the current version).

Think about score integration (hardest part conceptually).
Must have phrasal thing in initial training somehow (otherwise it'll never
get started).


9-2-93
------
Substantial change to the data structures now I know what I want. See
struct-diag for a picture.


10-2-93
-------
Recap on performance statistics, for future reference.

Self tests on v7b (word totals are 'correct'):
	All	%	Ambig	%	Skipped		Obs prob.
FB	59661	98.36	18632	94.93	20964		0.00665576
V	59659	98.35	18630	94.92	20964		-
mostf	57488	94.78	16459	83.85	20964		-

Comment: mostf is rather higher than before. But I think it's OK.
Comment: Viterbi is also different from FB before. But tracing through the
stack, I think this is OK. As noted in earlier tests, it does happen.

DON'T allow creating a "phrase" which has no phrasal tag (esp. for idioms)
(In fact, for idioms, you DO want a phrasal tag)

11-2-93
-------
As the next step towards FSM integration, add to the routines, so that phrasal
hypotheses are explored as well as non-phrasal ones. But for the time being,
leave the problem of making sensible scores. Also, postpone addressing the
output problem.

Let's outline roughly how this is to work. At present, in each of the
algorithms, we look at a stack level, deal with all its hypotheses, and then
look to the successor/predecessor stack level. What we wish to do is the same,
except that after looking at all the base hypotheses, we look at all the
phrasal hypotheses that start/end at the given place. We never need to look
within a phrasal hypothesis: that will all have been dealt with by the %tag
action (or whatever). So in fact this looks quite easy. To repeat the point,
you never (yet) need to actually look through a phrasal hypothesis, only to
look at the things that start and end at a hypothesis.

This may be easiest implemented by writing iterators that work over all
hypotheses starting/ending at a stack level, of whatever sort.
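Such iterators might look like the following minimal sketch (the `Level` structure and its field names are invented here for illustration; the real stack-level structures will differ):

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Hypothetical stand-in for a stack level: its base (word) hypotheses plus
# the phrasal hypotheses that start and end at this position.
@dataclass
class Level:
    base: List[str] = field(default_factory=list)
    phrases_starting: List[str] = field(default_factory=list)
    phrases_ending: List[str] = field(default_factory=list)

def hyps_starting(level: Level) -> Iterator[str]:
    """All hypotheses, of whatever sort, that start at this level."""
    yield from level.base
    yield from level.phrases_starting

def hyps_ending(level: Level) -> Iterator[str]:
    """All hypotheses, of whatever sort, that end at this level."""
    yield from level.base
    yield from level.phrases_ending
```

The point of the iterator form is that the algorithms never need to know whether a hypothesis is base or phrasal; they just walk everything attached to the level.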

Deciding on the scoring. One important point in a number of places is whether
we take scores from just base + forward looking hypotheses or just base +
backwards or base + both. Look at the cases:
(1) Training: do base + both. Generally this won't actually matter, but it
would in the variant that allows multiple tags in the training data.
(2) Most freq: Just look at forward ones. But if you pick a forward looking
hypothesis, then jump to the end of it. Generally this will also be
irrelevant.
(3) Viterbi is essentially a forward pass, so we look at hypotheses going
forward from t and backward from t-1. Same reasoning applies in forward pass
of FB, and parallel reasoning in the backward pass.
(4) For the gamma part of FB, just take forward looking hypotheses. In fact,
we could just as well take backward looking ones. We have accumulated suitable
alpha and beta values, and they can either be calculated at the point where
the phrasal hypothesis starts or at the point where it ends: it just doesn't
matter which.
(5) By a similar reasoning, in choosing the best hypothesis on FB, just look
at forward ones. Then also by reasoning similar to (2), if you pick a phrasal
hypothesis, skip over it. Something similar is also needed in chaining through
Viterbi.
(6) For output, you can just recurse on phrases. Outputting all tags is less
obvious: just do something which goes over them all and forget the matter of
structure and of what the phrases contain. The format for phrases is not
really settled, nor is how to display scores. But these are minor issues.
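Point (5), skipping over a chosen phrasal hypothesis when chaining through the result, can be sketched as follows. `chosen` and `spans` are hypothetical stand-ins for the real structures: `chosen` maps a position to the hypothesis chosen there, and `spans` maps a phrase to its (start, end) positions.

```python
def chain_chosen(chosen, spans, length):
    """Walk the chosen hypotheses left to right; when the choice at
    position t is a phrasal hypothesis, jump to the position just past
    its span rather than stepping through its interior."""
    t, out = 0, []
    while t < length:
        h = chosen[t]
        out.append(h)
        if h in spans:
            t = spans[h][1] + 1   # skip over the phrase
        else:
            t += 1
    return out
```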


STRANDS
-------
What am I trying to do at present?
1. FSMs.
 a. get them working
 b. get them integrated
 c. sort out how to score them
 d. define some and see what extra facilities are needed
 e. use them to correct problems seen in the error analysis
2. Error analysis
 a. get distribution information on errors
 b. devise predictors of errors
 c. use this to modify 1c
 d. look for regularities in errors
 e. use this in 1e
3. Punctuation
 a. collect further statistics on a larger sample on punctuation
 b. do further oracular tests
 c. use this in 1e
4. Other practical experience
 a. experiment with other corpora
 b. experiment with other tagsets on the same corpora
 c. make useful statements about auto-learning

12-2-93
-------
A scrappy day. I think the stack level stuff is integrated (bar scoring), but
I have yet to test any of it. Got a readable dump of btoj.dat for 3a. Started
file Punct2 for recording this.

16-2-93
-------
Continuing with the above. In fact, first create all.dat - a transition matrix
for the whole thing, and all.ref.

Deem the punctuation analysis done for the time being: really it's just redone
the work before with some different transitions etc. Moved into the file
Punct, with the older version renamed Punct-old.

17-2-93
-------
I think we might want something which will say that if a particular phrase
matches, then the base hypotheses it corresponds to are to be skipped. The
problem is in getting the semantics of this right: what happens if there are
other machines running, or other spans covering the same data. Think this
through. The major reason is for idioms.

Also make skip and phrasal tests be macros. At least add a flag to skip the
base hypotheses in tagging, even if the full semantics of the above isn't yet
worked out.  Hooks added for this, but no current FSM action to make it
possible.

*** work out the semantics of this

Issues to sort out in fsm stuff:
DONE (1) may_tag: what if there is a single hypothesis (which allows it) and
all machines have finished, but we still have one or more phrasal hypotheses
ending at the unambiguous word. Where this matters is in deciding between the
hypotheses, especially for Viterbi.
SOLUTION: trigger may_tag if there is a single hypothesis and nothing starts
or ends here and there are no machines running.
DONE (* This may still need further work *) (2) choosing in FB is done by
looking at gamma values; this includes the gamma value on a phrase. (a) Is
this ever set up? YES: just as for any other hypothesis. (b) If we choose a
phrase, we must surely then recursively do choosing on its constituents, and
skip to the end of it. NO: chosen points to the phrase; sub-tagging will have
set chosen on the span. But the output routine must do the recursion. (c) the
same argument as (b) must apply for Viterbi.
OK (added consistency check) (3) in choosing, we must never allow a skipped
base hypothesis. I think this is so at present.
DONE (from nexthyp machinery) (4) output routines must be changed to make all
the skip tests, esp. in all tags output.
IGNORE (5) the problem of low transitions, i.e. transitions between
unambiguous words, in the middle of a model.
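The may_tag trigger resolved in (1) above amounts to a simple conjunction. A sketch, with the argument names invented here:

```python
def may_tag(n_hypotheses, phrases_starting, phrases_ending, machines_running):
    """Tag immediately only when the word is unambiguous and no phrasal
    activity touches it: a single hypothesis, nothing starting or ending
    here, and no machines running."""
    return (n_hypotheses == 1
            and not phrases_starting
            and not phrases_ending
            and not machines_running)
```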


24-2-93
-------
Chat with Ted about the state of the work, with conclusions as follows:
1. FSMs
a. continue with integration.
b. do some tests.
c. use treebank and SEC as initial sources of information and for comparison.
Note that treebank allows more than one tree in some places - correct if any
match.
d. Person at engineering (Nick Weg...?) probably has code in C for reading
treebank.

2. Practical experience
a. go ahead with attempting to find heuristics to predict when the tagger has
made an error.
b. get performance statistics using this (e.g. ignore output when it predicts
an error and see what performance you then get).
c. possible measures:
 i. absolute score of word
 ii. relative score of word
 iii. score of word relative to next best
 iv. observation probability
 v. observation probability weighted by length of chain
 vi. absolute score etc weighted by observation probability
 vii. something using transition probabilities?

3. Tagset
Comment: my performance on ambiguous words is regarded as unusually high by
Ted, even though the overall performance is typical. This may be a consequence
of the tagset.
a. use reduced version of LOB tagset
b. do some tests using treebank, which has a smaller tagset


25-2-93
-------
A first attempt at FSMs is now integrated. There are still some issues about
scoring but the basic framework is there. As an experiment, take a small
corpus, and edit it by hand, and make some FSMs. We will construct something
that contains a few idioms and a few noun phrases. To test execution, we will
make the phrasal tags be the same as some of the base tags, so we are sure to
get some transitions.

For defining FSMs for a real test, defining FSMs from treebank looks sensible.
(* Want to see what happens with some overlapping machines? *)

FSMs tested on some simple cases. So now we go on to looking at treebank, both
for its own sake, and for its use in FSMs.

Location of material: /usr/groups/corpora/treebank/postexts. Link to directory
penn. There are several subdirectories for different sources of text. Note that
the corpora sometimes specify more than one tag on a word. As a simplification
(possibly not a very good one), just take the first. Tagset in penn.map. Add a
new input option for the treebank.

The format appears to be as follows (by examination rather than fiat):
1. first few lines of file are copyright notice. Recognise these by lines
starting *x*.
2. [ ... ] enclose phrases. May also get [ and ] with tags: when used for
phrases, there is a space after the [ or ]; treat them the way skip words were
treated in LOB.
3. Words are followed by / and the tag. Multiple tags are separated by |. \ is
used to escape '/'.
4. Treat lines of lots of equals signs as anchors (in fact, test on the first
three characters).
5. : can also terminate tags. See a1/a1.pos:
[ a-Si*/NN:H/NP films/NNS ]
Note that : can also be a tag, so we need to exercise a little care here.
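A sketch of a line parser for these observed rules (all names are my own; it takes just the first of multiple tags, per the simplification above, and cuts a tag at ':' unless ':' is the whole tag):

```python
def parse_pos_line(line):
    """Parse one treebank .pos line into (word, tag) pairs, following the
    rules worked out by examination: '*x*' lines are copyright, '===' lines
    are anchors, bare [ and ] are phrase brackets to skip, '\\' escapes '/',
    '|' joins multiple tags, ':' may terminate a tag."""
    if line.startswith('*x*') or line[:3] == '===':
        return []
    pairs = []
    for token in line.split():
        if token in ('[', ']'):          # phrase bracket (space after it)
            continue
        # find the first '/' not escaped by '\'
        i, n = 0, len(token)
        while i < n and not (token[i] == '/' and (i == 0 or token[i-1] != '\\')):
            i += 1
        if i == 0 or i >= n:
            continue                     # no word/tag structure
        word = token[:i].replace('\\/', '/')
        tag = token[i+1:].split('|')[0]  # multiple tags: just take the first
        if tag != ':':                   # ':' terminates a tag, but is also a tag
            tag = tag.split(':')[0]
        pairs.append((word, tag))
    return pairs
```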


26-2-93
-------
For the purposes of the "what are corpora and tagsets like" strand, it will
probably be useful to have a program which takes a dictionary and transitions
matrix and gives the distribution of tags, i.e. the number of tags on each
word. We also want to get information on branching factors, i.e. the number of
transitions into and out of tags, where we ignore zero or very low (say less
than 1e-300) transitions. Call this program dtinfo.

Explanation of the output: gives the number of tags on a word and the number
of words with that many tags. Then gives the number of transitions from a tag
and the number of tags which have that many transitions from it, and similarly
for to.
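A minimal sketch of what dtinfo computes, assuming a word -> tags lexicon and a (from, to) -> probability transition map (the data shapes and the name are assumptions about the program described above):

```python
from collections import Counter

def dtinfo(lexicon, trans, floor=1e-300):
    """Distribution information: how many words have each number of tags,
    and how many tags have each in/out transition count, ignoring zero or
    very low (below floor) transitions."""
    # number of tags on a word -> number of words with that many tags
    tag_hist = Counter(len(tags) for tags in lexicon.values())
    out_deg, in_deg = Counter(), Counter()
    for (t_from, t_to), p in trans.items():
        if p < floor:            # ignore zero / vanishingly small transitions
            continue
        out_deg[t_from] += 1
        in_deg[t_to] += 1
    # transitions from a tag -> number of tags with that many, and same for to
    from_hist = Counter(out_deg.values())
    to_hist = Counter(in_deg.values())
    return tag_hist, from_hist, to_hist
```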

Next we want to find some treebank corpora with similar numbers of words to
LOB. To do this, do a fairly exhaustive test on the bits of treebank. This can
go on as a background activity. In each case we will do a self test (but not
retaining the matrices and transitions for reasons of space).
This may prove not to be the test we want, if the number of words in each file
is relatively small, since in this case, we might not get a rich enough
dictionary and transitions.

The parts to use are a1 (or a2 - different tagger), mari and t (better than t1
or t2). But looking at the sizes there will certainly be a need to merge them.
We want to aim for about 60k words to make sensible comparisons. Miss out the
following because of format errors:
 t14: Duc/NP/FW
 t2:  because/RB/IN
Also miss out whole of dj, because it is lots and lots of small files.

Now we make up combined files as follows:
a1: a1+a10+a11+a12 -> a-1
    a13+a14+a15+a2 -> a-2
    a3+a4+a5+a6    -> a-3
    a7+a8+a9       -> a-4
t:  t1+t10+t11+t12+t13+t15+t16+t17+t18+t19 -> t-1
    t20+t21+t22+t23+t3+t4 -> t-2
    t5+t6+t7+t8+t9 -> t-3

We get the following results (numbers are 'correct' under 'All' and 'Ambig'):
	Total	All		Ambig		Skip
a-1	62937	61730	98.08%	9109	88.30%	34260
a-2	63820	62676	98.21%	13259	92.06%	35028
a-3	64548	63393	98.37%	9915	90.39%	36863
a-4	39301	38830	98.80%	5665	92.31%	21235
t-1	56588	55249	97.63%	13199	90.79%	2 (yes!)
t-2	60146	58994	97.92%	16203	92.83%	1
t-3	39834	39807	98.12%	7492	90.93%	0

Small change: alter the standard file extensions to get away from the old
esprit stuff. So the dictionary becomes .lex and the transitions become .trn.


TAKING STOCK.
I am a bit undirected at the moment, so I think the thing to do is to pick a
strand and just concentrate on it. The one to look at is the practical
experience thread. So let's summarise the things we want to look at practical
experience of.

(1) Variation of success rate with corpus properties.
 (a) size of corpus
 (b) lexicon and transition properties and branching factor
 (c) tagset
(2) Success rate and convergence of Baum Welch.
 (a) with variations of initial data
 (b) with size of training data
(3) Success rate with similarity between training and test data, using things
like lexicon and transition probabilities as a similarity measure.
(4) Prediction of correctness.


Some work has been done on parts of this already. Take the points above in
order.

PRACT 1a (Variation with size of corpus)
........
How does success vary with the size of the corpus? This is interesting because
we would expect to see the success improve as the corpus gets larger, up to
some limit. To test this, use the t-1 corpus, starting from a smallish part of
it and gradually increasing in size.

Go up in units of about 5k words; each file contains the previous one as a
subset. To do this, it will be easiest to first verticalise the corpus and cut
out blank lines (or at least multiple ones). Then it is easier to see when we
have reached n% of the file.
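The cumulative slicing can be sketched as follows (purely illustrative; the real cuts were made by hand on the verticalised file, so the percentages in the table are only approximate multiples of 10%):

```python
def cumulative_slices(words, fractions):
    """Split a verticalised corpus (one word per list entry) into cumulative
    prefixes, so that each slice contains the previous one as a subset."""
    return [words[:round(len(words) * f)] for f in fractions]
```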

File	Total	(%)	All	Ambig	Obs
1	 5585	9.8%	98.73%	91.66%	0.00313
2	11210	19.8%	98.59%	91.72%	0.00323
3	16840	29.8%	98.36%	91.51%	0.00246
4	22546	39.8%	98.06%	90.63%	0.00272	
5	28353	50.1%	98.08%	90.84%	0.00344	
6	34043	60.2%	98.01%	90.67%	0.00379	
7	39709	70.2%	97.77%	90.19%	0.00409	
8	45387	80.2%	97.75%	90.56%	0.00404
9	50935	90.0%	97.66%	90.62%	0.00467
10	56588	100%	97.63%	90.79%	0.00497
* Checked.
Plot size vs. probs: no obvious relation.


Also look at 1 vs. stats from 1...10
	All	Ambig
1	98.73%	91.66%		== self test
2	98.51%	91.47%
3	98.26%	91.12%
4	98.16%	91.33%
5	98.07%	91.45%
6	97.91%	91.16%
7	97.71%	90.45%
8	97.73%	91.02%
9	97.73%	91.18%
10	97.69%	91.45%		== 1/10 below
* Checked

Also look at 1...10 vs stats from 10
	All	Ambig
1	97.69%	91.45%		== 1/10 above
2	97.77%	91.79%
3	97.77%	91.57%
4	97.62%	91.04%
5	97.75%	91.38%
6	97.72%	91.27%
7	97.67%	91.15%
8	97.69%	91.09%
9	97.63%	90.89%
10	97.63%	90.79%		== self test

PRACT 1b (Lexicon and transition probabilities)
........
Take each of the above and get the average and maximum number of tags, and the
average number of transitions from and to for each of them.
	Tags:	Av	Max	Trans:	from	to		Branching
1		1.079	4		0.408	0.408		1.326
2		1.100	4		0.449	0.469		1.379
3		1.109	5		0.469	0.490		1.498
4		1.111	6		0.551	0.469		1.620
5		1.110	6		0.592	0.510		1.704
6		1.110	6		0.592	0.510		1.725
7		1.115	6		0.510	0.592		1.774
8		1.121	6		0.571	0.612		1.809
9		1.125	6		0.612	0.449		1.851
10		1.128	6		0.633	0.571		1.875
11		1.128	6		0.653	0.653		2.007
12		1.129	6		0.653	0.653		2.013
13		1.128	6		0.653	0.673		2.017

We should do some sort of plot of this against success rates.


PRACT 1c (Variation with size of tagset)
........
This is a bit of a tricky one to do, but also the most important. I think the
way to approach it is to take a part of LOB, say lobv7c, and collapse some
likely categories. The danger is that in doing so we will affect the
proportion of ambiguous words, which affects the overall success rate. So the
measure to look at is the success rate on ambiguous words as the primary one.

A way to do this is to compare the LOB and penn tagsets, and try to make
changes which move the one to the other. Some of these changes will be
independent, and we can try running with various combinations of them, both to
see what the order is, and so we can get tagsets of similar size derived in
different ways, and see whether there is a correspondence. We should also look
at branching factors in doing this.

We could also use the condensed heading in Garside, Leech and Sampson, but
this doesn't take them down far enough.

Use Penn as a guide. We aren't actually going to the Penn tagset, because
there are some things where the correspondence is not clear, and because there
are some different tagging conventions (as described in the Penn guide, section
4). In some cases where the correspondence is in doubt (i.e. hand examination
would be needed), the most frequent use is set up as the correspondence. Penn
splits up possessives, LOB does not. So retain them as a separate category,
but group together where possible. Invent new group tags for these cases.

(* Table moved to 2-3-93 entry *)

1-3-93
------
Start the tagset tests. First change the code: add an option which reads a
second tags file, containing the original tag and the tag it is mapped to. The
overall tag file for the program will be the one with the reduced tags.
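Applying the mapping could look like this (a sketch; the file layout of one original tag and its reduced tag per line is an assumption about the second tags file):

```python
def load_tag_map(lines):
    """Read the reduction file: each line holds an original tag and the tag
    it is mapped to. Tags not listed map to themselves."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            orig, reduced = parts
            mapping[orig] = reduced
    return mapping

def reduce_tag(tag, mapping):
    """Map a tag through the reduction, defaulting to the tag itself."""
    return mapping.get(tag, tag)
```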

We will do the tests on b and l. Start with the usual self test as a base
line. Then we test with various reduced tag sets.

Comment: a change to the program means that there is better checking on
corpora, which reveals that there is a corruption in lobv7b. So results above
will change in what follows. We should check all the other corpora to. Redo
this, and get new baseline figures for all of them (including the ones which
had errors in the master version - perhaps we can get these fixed to). Most of
the errors appear to be where lines have been repeated or corrupted. So write
a small program to scan and check for numbers out of sequence and for odd
characters. Then correct them. Mail gt wtih a log. (* Done: there are now some
lines missing, but at least no errors. *)
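A sketch of such a checker (assuming each corpus line begins with a sequence number, which is a guess at the vertical-format layout):

```python
import re

def check_corpus(lines):
    """Flag lines whose leading sequence number is out of order, or which
    contain non-printable (odd) characters."""
    problems = []
    last = None
    for ln, line in enumerate(lines, 1):
        m = re.match(r'\s*(\d+)', line)
        if m:
            n = int(m.group(1))
            if last is not None and n <= last:
                problems.append((ln, 'out of sequence'))
            last = n
        if any(ord(c) < 32 and c not in '\t\n' for c in line):
            problems.append((ln, 'odd character'))
    return problems
```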

Now we can repeat the build and self tests. This shows the need to add a new
tag: NPLS$. So some previous stuff on transitions will be out of date as well.

We will probably need to do this all again if and when we get corrected
versions of the corpora.

Corpus	Fsize	Words	Dsize	LWords	AWords	All	Ambig
lobv7a	1032164	141723	11836	101458	32348	98.29	94.64
lobv7b	 616893	 84212	 7742	 60399	19538	98.36	94.92
lobv7c	 387553	 53013	 7540	 38421	10307	98.94	96.04
lobv7d	 383954	 53309	 5238	 39113	15394	98.52	96.23
lobv7e	 854341	117620	10631	 85743	29809	97.96	94.13
lobv7f	1006264	137987	12217	100498	35866	98.36	95.40
lobv7g	1744874	236629	17344	173074	66758	98.20	95.31
lobv7h	 664645	 92315	 6342	 66755	23120	98.55	95.80
lobv7j	1732230	232605	14094	168706	69464	98.40	96.12
lobv7k	 664645	 94324	 7841	 69138	27837	98.45	96.14
lobv7l	 552174	 78891	 6608	 57562	17430	98.46	94.90
lobv7m	 141212	 19999	 3181	 14491	 3294	98.97	95.48
lobv7n	 673128	 97078	 7625	 70709	25643	98.42	95.64
lobv7p	 671433	 96572	 6501	 70488	23656	98.49	95.51
lobv7r	 204452	 28608	 4440	 20936	 5926	98.77	95.65

2-3-93
------
First, do a small run to test tagset reduction (just use some arbitrary
reduction and compare output with reduced and unreduced runs).

Now we move on to the actual work. First need to identify the groups that we
will reduce. (For groups, see below). Penn tagset was used as a guideline in
identifying which distinctions to get rid of, but the result tagset is not the
same as the Penn tagset. The main map will be the standard one, which we then
add reductions to. So the group names have to be in this tagset. This leads to
some slight funnies:
 ABL covers what Penn would call PDT
 PP1A covers all the personal pronouns
 ZZ covers what Penn would call SYM
 DO covers what Penn would call VBP

Group	NewTag	Covers...
11	CC	CC	DTX
8	CD	CD	CD$	CD-CD	CD1	CD1$	CD1S	CDS
9	DT	AT	ATI	ATP$	DT	DTI	DTS	
1	EX	EX
1	&FW	&FW
6	IN	CS	IN
4	JJ	JJ	JJB	JNP	OD
1	JJR	JJR
1	JJT	JJT
1	MD	MD
2	NN	NN	NNP	NNU	NR	PN
2	NNS	NNPS	NNS	NNUS	NRS
2	NP	NP	NPL	NPT
2	NPS	NPS	NPLS	NPTS
9	ABL	ABL	ABN	ABX
10	PP1A	PP1A	PP1AS	PP1O	PP1OS	PP2	PP3	PP3A	PP3AS
		PP3O	PP3OS	PPL	PPLS
10	PP$	PP$	PP$$
5	RB	QL	QLP	RB	RI	RN	XNOT	AP	APS
1	RBR	RBR
1	RBT	RBT
1	RP	RP
7	ZZ	&FO	NC	ZZ
1	TO	TO
1	UH	UH
3	VB	BE	BEDZ	VB	VD
3	VBD	BED	DOD	HVD	VBD
3	VBG	BEG	HVG	VBG
3	VBN	BEN	HVN	VBN
3	DO	BEM	BER	DO	HV
3	VBZ	BEZ	DOZ	HVZ	VBZ
9	WDT	WDT	WDTI
11	WP	WP	WPA	WPI	WPO	WPOI
11	WP$	WP$	WP$I
1	WRB	WRB
1	*'	*'
1	**'	**'
1	(	(
1	)	)
1	,	,
7	.	!	.	?
7	:	;	:	*-	...
1	DT$	DT$
2	NN$	NN$	NNP$	NR$	PN$
2	NNS$	NNPS$	NNS$
2	NP$	NP$	NPL$	NPLS$	NPT$
2	NPS$	NPS$	NPTS$
5	RB$	AP$	APS$	RB$

Groups:
1 = no change: just list this tag in the main taglist, with nothing in the
reduce tagset file.
2 = noun reductions
3 = verb reductions
4 = adjective reductions
5 = adverb reductions
6 = preposition reductions
7 = punctuation and orthographic reductions
8 = numeral reduction
9 = determiner reductions
10 = pronoun reductions (excluding wh-pronouns: in "other minor")
11 = other minor class reductions

Now we try applying these reductions one by one. All are self tests on v7b.
Group	Red	Size	All	Ambig	Amb%	AvHyp	BF
1	0	135	98.3576	94.9227	32.3482	2.19788	1.48732
2	19	116	98.2698	94.6427	32.2952	2.19669	1.48681
3	16	119	98.2864	94.7026	32.3482	2.19788	1.48732
4	3	132	98.3278	94.8306	32.3482	2.19788	1.48732
5	9	126	98.3079	94.6821	31.8184	2.08041	1.44894
6	1	134	98.5248	95.3774	31.9128	2.0898	1.44874
7	7	128	98.341	94.8715	32.3482	2.19788	1.48732
8	6	129	98.3907	95.0251	32.3482	2.14557	1.47169
9	8	127	98.2169	94.4174	31.9409	2.18723	1.48343
10	12	123	98.3261	94.7652	31.9757	2.18652	1.48376
11	6	129	98.3576	94.8705	32.0187	2.19009	1.48417

Red = number of reductions, Size = size of resulting tagset, Amb% = percentage
of words which are ambiguous, AvHyp = average hypotheses per word, BF =
average branching factor.

We next want to try combinations of these. Trying them all would be a huge
amount to do, so let's choose some reasonable combinations for the next stage.
12 = nouns, adjectives, pronouns (2, 4, 10)
13 = verbs, adverbs (3, 5)
14 = prepositions, determiners, minor classes (6, 9, 11)
15 = punctuation, numerals (7, 8)
16 = all major classes (2, 3, 4, 5)
17 = all minor classes (6, 9, 10, 11)
18 = all (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

12	34	101	98.2251	94.4401	31.9227	2.18532	1.48326
13	25	110	98.2334	94.4479	31.8184	2.08041	1.44894
14	17	118	98.1821	94.2492	31.6115	2.17943	1.48029
15	13	122	98.3775	94.9841	32.3482	2.14557	1.47169
16	47	88	98.1556	94.1937	31.7654	2.07921	1.44844
17	27	108	98.4155	94.8562	30.8035	2.06085	1.43816
18	87	48	98.2665	94.264	30.2207	1.89901	1.38365

As a next stage, we can try some further things, which involve a new level of
collapsing.
19 = treat all nouns as being the same, except retain number and possessives
distinctions. This really means getting rid of the proper/common distinction.
20 = treat all verbs as being the same, except retain number and possessives
distinctions.
21 = treat all possessives as being the same as their underlying category
(except PP$ and PP$$).
22 = get rid of all number marking.

19	18	117	98.2798	94.6723	32.2886	2.1958	1.48646
20	21	114	98.2632	94.3934	30.9773	2.14101	1.46942
21	21	114	98.3559	94.9147	32.33	2.19733	1.48714
22	27	108	98.3179	94.7615	32.1115	2.18762	1.48416

(Some others would be possible, e.g. treat all adverbs, adjectives, etc. as
the same, but don't do this, for now at least).

Next thing to do is to plot:
(a) all vs size of tagset
(b) ambig vs size of tagset
(c) all vs ambig%
(d) ambig vs ambig%
(e) all vs ambig (condenses some of the above)

Also do self test on l to add further comparison

L tests
1	0	135	98.4556	94.8996	30.2804	2.15747	1.46732
2	19	116	98.3982	94.7103	30.2804	2.15747	1.46732
3	16	119	98.4243	94.7963	30.2804	2.14749	1.46453
4	3	132	98.4556	94.8996	30.2804	2.15747	1.46732
5	9	126	98.3235	94.3889	29.8773	2.04627	1.43037
6	1	134	98.5025	95.0038	29.9729	2.06599	1.43425
7	7	128	98.4087	94.7438	30.2752	2.15725	1.46728
8	6	129	98.4434	94.8594	30.2804	2.15646	1.46698
9	8	127	98.3253	94.4387	30.1136	2.15294	1.46576
10	12	123	98.3913	94.5269	29.3927	2.13055	1.45899
11	6	129	98.4347	94.8018	30.1119	2.15323	1.46574
12	34	101	98.3183	94.2786	29.3927	2.13055	1.45899
13	25	110	98.2784	94.2377	29.8773	2.03689	1.42758
14	17	118	98.2871	94.2797	29.9451	2.1487	1.46418
15	13	122	98.3982	94.7094	30.2752	2.15623	1.46693
16	47	88	98.2315	94.0807	29.8773	2.03689	1.42758
17	27	108	98.3566	94.2836	28.7499	2.03152	1.42276
18	87	48	98.0647	93.1715	28.3416	1.91403	1.38262
19	18	117	98.3218	94.4578	30.2804	2.15747	1.46732
20	21	114	98.3861	93.805	26.0519	2.00367	1.41617
21	21	114	98.4538	94.8939	30.2804	2.15049	1.46473
22	27	108	98.4087	94.7438	30.2752	2.15656	1.46703


3-3-93
------
Comment: base case with Penn (file 10) was
-	0	48	97.63	90.79	25.69

So, although there is a lower BF and ambig%, the ambig performance is much
lower. Why is this? Probably because the corpora are in some sense not
comparable.

Analysis.
1. Plot the results obtained above:
 a. all performance against size of tagset
 b. ambig performance against size of tagset
 c. all performance against ambig %
 d. ambig performance against ambig %
 e. all performance against ambig performance.
 f. all performance against average hypotheses per word.
 g. ambig performance against average hypotheses per word.
 h. all performance against average branching factor.
 i. ambig performance against average branching factor.

Conclusions:
1. The relation is more complex than simply one between performance and size
of tagset (as you would expect). This can be seen in the graphs being very
scattered.
2. In general, performance is higher for larger tagsets, for both all and
ambig performances.
3. There are a few cases where a smaller tagset than the base one does better
(see below).
4. There is little obvious relation between the performance and the percentage
of ambiguity.
5. Overall performance and ambiguous word performance are generally quite
closely related, with only a few exceptions, most notably:
B: 17 (all minor), 18 (all), 20 (single class of verbs)
L: 20 (single class of verbs)
6. There is no clear relation between either performance rate and either BF or
AvHyp (except possibly for ambig rates in L).

Where does a small tagset do better?
B (on basis of All):	6, 8, 15, 17.
B (on basis of Ambig):	6, 8, 15, but not 17.
L (on basis of All):	6 only
L (on basis of Ambig):	6 only
6 = prepositions, 8 = numerals, 15 = punctuation+numerals, 17 = all minor. In
fact, 6 is purely CS for IN.

Analysis: ignore 15 and 17, which are just the effects of 6 carried through.
Section l contains relatively few numerals, and hence the lack of effect of 8.

4-3-93
------
To get some more information, try running the test on some other corpora too.
Pick two more parts of LOB: g and j. Both are very large. g is belles lettres,
j is learned. We won't note all the results here, but we will look for which
cases do better than the baseline. This will take some time! (Full results are
in tagset.g and tagset.j).
Does better on:
G (all):	6, 8, 15, 20
G (ambig):	6, 8, 15
J (all):	6, 8, 15, 17, 18
J (ambig):	6, 8, 15
18 = all, 20 = cut out possessives.

Comment: the 18 result is an interesting one. Maybe it is the numbers effect
dominating.

Comment: it may be that the things that get better (the ones that differ
with corpus) depend very much on the proportion of things with that category.
But if that were so we should see a connection between performance and average
hypotheses per word and/or branching factor and/or ambig%, which we do not (or
only weakly, in the latter case). Also we see it in the performance measure on
ambig, which is perhaps a better test.

Comment: it may be that changing certain things at a finer grain than the
groups I gave would make an improvement, or that other combinations than the
ones I tried would do also. But I think the point to focus on is really the
general trend towards a bigger tagset giving better results.

We continue this work by (i) checking the program by transforming a corpus by
hand, (ii) tagging against a different corpus.
(i)  Do a change equivalent to group 4 on LOB-B. OK (i.e. the program can be
relied on).
(ii) Train from corpus g, test from corpus b. This means we will have unknown
words (this is do-able, if a little slow). 6.06% unknown.
Does better on:
BG (All):	2, 4, 5, 6, 7, 8, 11-22
BG (Ambig):	2, 4, 5, 6, 7, 8, 12, 15, 16, 18, 19, 21, 22

Comment: although the trend is towards a higher performance at larger sizes of
tagset, the connection is quite a weak one (e.g. if you took out the two
bottom points on some of the graphs, you could draw almost any line through
them). A major point is that it is different on the unknown test, where the
number of unknown words means that the number of tags makes a big difference.
To confirm this point, we should do a test where we use "g" for training, but
add a few sentences at the end to make the extra unknown words available.
(These "sentences" are actually anchor + word). Do this next (test "bg1").
Does better on:
BG1 (All):	3, 5, 6, 7, 8, 11, 13, 15-20, 22
BG1 (Ambig):	3, 5, 6, 7, 8, 13, 15-18, 20, 22
(In fact, I missed a few words in doing this, but only 0.03% so it doesn't
matter much.)

Comment: why should this be so? Perhaps we need some sort of grand plot of
performance against BF/AvHyp over all the tests. Try this and see what it
looks like. Miss out the B/G test because the factors are so much larger that
they obscure any variation in the rest, and only plot the performances versus
BF and AvHyp.

Comment: I think the conclusion is this: that there is a roughly linear
relationship between performance and BF, but only when the training and test
data are similar; so with BG, the relationship does not appear (Actually, this
is only really obvious with Ambig vs BF/AvHyp, and it may be that we actually
want a BF/AvHyp value for the things which actually were ambiguous only). So
this brings us back to the key question of:
	When is a corpus good/similar enough?
There is also a secondary conclusion that unless you take such cases into
account, conclusions about success rates must be taken with a large pinch of
salt.

Next we look at the errors on B self test with reduction 18 (a Penn-like
tagset) compared to the standard one. In doing so, we ignore any places where
there is an error in both corpora, provided it is either the same error, or
the same error when tag reduction is taken into account; for example RB/AT in
full LOB and RB/DT in reduced are taken as the same. See file b-sort1 for the
collected errors; main files in b-ln1 and b-ln18 (with line numbers).
Unfortunately, I don't think there is anything more to conclude than from any
other error analysis we've done so far. But keep it in case we wish to refer
to it.
(* Deleted b-ln files to save some space, b-sort1 retained. *)

Comment: I think all this is going round in circles in a rather useless way.
The conclusion, if there is any, is that it is not the size/nature of the
tagset or its consequences in terms of BF/AvHyp that is really a predictor of
success rate, but some measure of similarity.

Why do we want similarity measures?
(a) to predict whether a training corpus is suitable for a test corpus.
 Comment: of course, we could just do B-W estimation. But (i) we might come to
the wrong local optimum; (ii) if the test data was large, it might be too
expensive to do this.
(b) to search for predictors of performance which would enable conclusions
about tagset design to be drawn.

ERRORS
For errors, start by getting observation etc probabilities across all the
sections of LOB, and also on some cross tests. Do the former first. The files
that come out of this are HUGE, so we need some way of condensing them down. A
suitable way is to sort and uniq them, doing this file by file. When all this
is done, I will want a program which merges the results, totalling the number
of occurrences of each score (we are using "uniq -c" to get this). The awk
script "amerge" will do this, provided the data are in a single sorted file.
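What amerge does can be sketched in Python (the count-then-key line layout is the standard output of "uniq -c"; the function name just follows the script's):

```python
from collections import Counter

def amerge(uniq_outputs):
    """Merge several 'sort | uniq -c' outputs, totalling the number of
    occurrences of each distinct score line."""
    total = Counter()
    for lines in uniq_outputs:
        for line in lines:
            count, key = line.strip().split(None, 1)
            total[key] += int(count)
    return total
```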

5-3-93
------
Redo the error tests. To avoid getting huge amounts of data, output to three
significant figures. (Means numbers are of the form %.2e). Pipe results into
sort | uniq -c. Do this over all the files. Then combine the results by cat ...
| sort | awk -f amerge. Then fgrep for "O+" and pass results to plot.
(To save space, could also awk with script plot/acut, but this means a change
to the plot command: it cuts out the "O+" etc.)

We might also want to construct a running total so that we get a cumulative
distribution. This is best done by another program, since the sorter can't cope
with numbers with exponents in them.
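The percentage-point computation can be sketched numerically (illustrative only; the real program reads the merged score file, but the idea is the same: for each percentage p, find the value v such that p% of the scores are <= v):

```python
def percentage_points(scores, percents):
    """Cumulative distribution of probability scores, done numerically
    because a textual sort cannot order numbers with exponents."""
    xs = sorted(scores)
    out = {}
    for p in percents:
        idx = max(0, int(round(p / 100.0 * len(xs))) - 1)
        out[p] = xs[idx]
    return out
```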

If this work across lots of corpora gives a reasonable looking threshold, then
we shall want to try it with the individual corpora taken separately, and also
with non-self tests.

Add: only do output of T and R for ambiguous words; do output class S =
probability relative to second best. * On second thoughts, don't, because you
get very large values (e.g. 1e+307) which screw up gnuplot and gs.

Also write a program which says where the various percentage points of the
distribution are (do this in the program "total", with comments on the output
lines so plot doesn't pick it up).

Drop absolute tag probabilities since they are the same as the relative ones
with the current implementation (?because of normalisation by obs1 in gamma
calculation, I think). Also make a change to not output O when it is exactly 1.

On the basis of this we get the percentage points like this:

%	O+		O-		R+	R-
4%	3.95e-11	1.91e-17	0.777	0.503
9%	1.03e-09	2.93e-15	0.897	0.524
14%	7.62e-09	4.1e-14		0.95	0.545
19%	3.36e-08	3.19e-13	0.974	0.567
24%	1.14e-07	1.85e-12	0.987	0.592
29%	3.47e-07	7.19e-12	0.992	0.616
34%	1.01e-06	2.47e-11	0.995	0.642
39%	2.78e-06	7.88e-11	0.997	0.668
44%	6.99e-06	2.37e-10	0.998	0.694
49%	1.54e-05	6.85e-10	0.999	0.721
54%	2.94e-05	1.97e-09	0.999	0.751
59%	5.31e-05	5.56e-09	1	0.78
64%	7.34e-05	1.49e-08	1	0.809
69%	0.000138	4.17e-08	1	0.839
74%	0.000272	1.1e-07		1	0.867
79%	0.000587	3.16e-07	1	0.896
84%	0.00157		1.02e-06	1	0.925
89%	0.00711		4.38e-06	1	0.952
94%	0.0408		3.11e-05	1	0.978
99%	0.733		0.0355		1	1

% = percentage of the total observations
other columns = probability such that that percentage of the observations have
probability less than or equal to this value.
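The percentage-point computation in "total" presumably amounts to something like this (a sketch, not the actual program):

```python
import math

def percentage_points(scores, percents):
    """For each percentage p, return the smallest score such that at
    least p% of the observations have a score <= that value."""
    ordered = sorted(scores)
    n = len(ordered)
    return {p: ordered[max(1, math.ceil(n * p / 100)) - 1] for p in percents}
```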

8-3-93
------
The O values are fairly unimportant, but on the basis of the R values, we
could look for a threshold below which we would consider the tag assignment to
be probably incorrect. We want this threshold to be such that no more than
some certain proportion of the actually correct tags is so marked, and no less
than some certain proportion of the actually incorrect tags is so marked.
To get a better idea, produce a more detailed version of the above table
(won't list it here - file plot/err.rpd and plot/err.rmd). To decide on the
figure, we also want to look at a typical success rate.
Suppose:
	N = total words labelled
	a = proportion which are ambiguous
	s = proportion of ambiguous words correctly labelled
	c = proportion of correctly labelled ambiguous words which are
	    below the threshold and therefore incorrectly rejected
	i = proportion of incorrectly labelled ambiguous words which are
	    below the threshold and therefore correctly rejected.

Then we have
	Na       = total ambiguous words
	N(1-a)   = total non-ambiguous words (correctly labelled, not rejected)
	Nas      = total ambiguous words correctly labelled
	Na(1-s)  = total ambiguous words incorrectly labelled
	Nasc	 = total ambiguous words, correctly labelled, and rejected
	Na(1-s)i = total ambiguous words, incorrectly labelled, and rejected

How do we measure the success rate? There are two ways. We can either treat
words that are rejected as being skipped, and measure the success rate of the
remaining words with respect to how many words remain. Or we can assume the
existence of an oracle which assigns the correct tag, and measure the success
rate on all of them. (A third way might be to assume they are incorrect, but
if this is so, you wouldn't really want to recognise them!)

CASE 1
For the first case, the total of words which are rejected and hence marked as
skipped is Nasc + Na(1-s)i. The total of ambiguous words left is
Na - (Nasc + Na(1-s)i)
The total of words overall left is
N - (Nasc + Na(1-s)i)
And hence we have an overall ambiguous success rate of
Nas(1-c) / (Na - (Nasc + Na(1-s)i)) = s(1-c) / (1 - sc - (1-s)i)
For the overall success rate
(Nas(1-c) + N(1-a)) / (N - (Nasc + Na(1-s)i)) = (as(1-c) + 1-a) / (1 - asc - a(1-s)i)

Take some examples, by choosing how many correct words we want to reject and
selecting a threshold. We can look up to see how many incorrect words will be
rejected. Using self test b with a = 32.35% and s = 94.92%, we then get:
Thres-	c	i	Ambig	All
 hold
base	0%	0%	94.92	98.36	(checks out OK)
0.631	1%	32%	96.45	98.87
0.692	2%	44%	97.03	99.07
0.809	5%	64%	98.01	99.39
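The Case 1 formulas can be checked numerically (a sketch; the figures are the a, s, c, i values of the first non-base row above):

```python
def case1_rates(a, s, c, i):
    """Success rates when rejected words are treated as skipped.
    a = proportion ambiguous, s = ambiguous success rate,
    c = correct-and-rejected, i = incorrect-and-rejected."""
    ambig = s * (1 - c) / (1 - s * c - (1 - s) * i)
    overall = (a * s * (1 - c) + (1 - a)) / (1 - a * s * c - a * (1 - s) * i)
    return ambig, overall

# Threshold 0.631 row: gives ambig ~ 96.45%, overall ~ 98.87%
ambig, overall = case1_rates(a=0.3235, s=0.9492, c=0.01, i=0.32)
```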

CASE 2
Now we assume we have an oracle which tells us the correct tag for rejected
words. We have a total of Na ambiguous words, of which all but Na(1-s)(1-i)
are correct. This figure is the number incorrectly labelled but not spotted.
So the ambiguous word success rate is
(Na - Na(1-s)(1-i)) / Na = 1 - (1-s)(1-i) = 1 - 1 + i + s - is = i + s - is
And for all words the rate is
(Na - Na(1-s)(1-i) + N(1-a)) / N = a(i + s - is) + 1 - a

Thres-	c	i	Ambig	All
 hold
base	0%	0%	94.92	98.36	(checks out OK)
0.631	1%	32%	96.55	98.88
0.692	2%	44%	97.16	99.08
0.809	5%	64%	98.17	99.41
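And the corresponding Case 2 check (same sketch style, same a, s, i values):

```python
def case2_rates(a, s, i):
    """Success rates assuming an oracle supplies the correct tag for
    every rejected word. Note c drops out: rejected-but-correct words
    are corrected back by the oracle."""
    ambig = i + s - i * s
    overall = a * ambig + (1 - a)
    return ambig, overall

# Threshold 0.631 row: gives ambig ~ 96.55%, overall ~ 98.88%
ambig, overall = case2_rates(a=0.3235, s=0.9492, i=0.32)
```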

Now we attempt to confirm these figures using the actual test on v7b. (Numbers
are total words, correct words, percentage; also give the percentage of ambig
words and the number of skipped ones). Base information:
	Ambig			All			Ambig%	Skip
	19538	18546	94.92%	60399	59407	98.36%	32.35%	20863
Check with t = 0.0 (same result) and t = 1.1 (no ambig left). The latter must
be 1.1 and not 1.0 because there are some words where the second tag has a
very small score and rounding causes the first tag to get score 1.0 even
though it is ambiguous.

CASE 1
t	Ambig			All			Ambig%	Skip
0.631	18797	18144	96.37%	59568	58975	98.86%	31.51%	21604
0.692	18491	17938	97.01%	59352	58799	99.07%	31.15%	21910
0.809	17721	17370	98.02%	58582	58231	99.40%	30.25%	22680

CASE 2
t	Ambig			All			Ambig%	Skip
0.631	19538	18855	96.50%	60399	59716	98.87%	32.35%	20863
0.692	19538	18985	97.17%	60399	59846	99.08%	32.35%	20863
0.809	19538	19187	98.20%	60399	60048	99.42%	32.35%	20863

These agree quite closely with the predicted figures. Those figures were based
on thresholds from the whole of LOB, with the test on just v7b. As a further
test, we should now try using the same thresholds on a test from Penn, and see
what predicted and actual figures we get. Use chunk "10" of newpenn. Just
record the success rate, ambig% and skip.
	Ambig	All	Ambig%	Skip
Base:	90.79	97.63	25.59	2
Giving s = 0.9079, a = 0.2559. Predictions, based on c and i values as above
	Ambig1	All1	Ambig2	All2
0.631	93.49	98.38	93.74	98.40
0.692	94.52	98.66	94.84	98.68
0.809	96.30	99.13	96.68	99.15

CASE 1
0.631	92.93	98.26	24.60	823
0.692	94.01	98.56	24.07	1208
0.809	95.76	99.03	22.83	2098

CASE 2
0.631	93.33	98.29	25.69	2
0.692	94.50	98.59	25.69	2
0.809	96.37	99.07	25.69	2

The agreement here is less good, but this is unsurprising: we would need
thresholds from something closer to the actual data. However, the point is
that we still get better success rates.

Question: why not just keep increasing the threshold, since the success rate
keeps going up? Answer: because it means fewer words are analysed/the oracle
has to do more work.

Comment: case 2 is what we would do with an oracle. Case 1 is what we might do
if we had a training corpus and we wanted to throw out incorrect things for
later re-estimation.


Time for a new...
AGENDA ***
------
(This collects things from old agendas, adds some new ones, and cancels out
most of the YTDs above this point.)
1. Program changes
 a. allow unknown words to be merged in during re-estimation.
 DONE b. is gamma re-estimation done in the right function? Move it if not.
 DON'T c. change file format to include tags in dictionaries (maybe).
 DON'T d. do dict/trans with just bands of frequencies rather than absolute
	  values.
 e. do better number parsing for LOB
 f. Adapt training code for phrasal hypotheses
2. Tests
 DONE a. re-do the iterative tests.
 b. add changes to retain formatting
 c. add changes to split off punctuation.
 DONE d. do a big corpus test, say 500000 words (as in Kupiec, e.g.)
3. Literature
 a. get de Rose's thesis.
 b. see "boosters" paper for examples of weak complex lexical items (not as
strong as idioms, but still connected).
 c. follow up references cited by Brill (3 of them).
4. FSMs
 a. do real testing
 b. See 3b.
 c. Try FSMs which are error correction a la Brill.
5. Practical experience
 a. similarity: how to measure? how is it related to performance?
6. Write up
 DONE a. base of paper (labeller description)
 b. practical experience
7. Further ideas
 a. do B-W with stuff to skip predicted errors and see if it gives better
convergence (and if it eventually corrects those errors)
8. BT
 DONE a. split off release code
 DON'T b. raise invoice
 DONE c. download
 d. send off
 e. do full documentation
9. Remarks from Kupiec CS&L paper
 a. Add extra tags for low freq words (section 3.1)
 b. do morphological analyser
 c. he doesn't tag on WORDS at all, but on equivalence classes which then
predict the category sequence. This means you can get away with less training
data (Cutting et al suggest 3000 sentences). Try implementing this.
 d. lexical probability problems are fixed by a higher order network, i.e. a
HMM with some disallowed transitions. (page 7)
 e. follow up Rohlicek was (page 9)


Iteration tests
...............
Methodology: build a dictionary and transition matrix from a substantial part
of the LOB corpus, namely b, c, d, e, f, g, h, j. Then add JUST the unknown
words from l, without changing the rest of the dictionary or the transitions
matrix. Then do iteration tests on parts b and l using this dictionary (btojl)
and matrix (btoj). btojl.lex has 39288 entries.

In each case, we will do fewer tests than in the previous one, with 30
iterations; as follows:
0	d/g	t/g	base case
F			most frequent
1	d/n	t/g	equiv: 2 (d/T,t/g), 4(d/g,t/T)
2	d/s	t/g
5	d/n	t/T	equiv: 6(d/T,t/T)
8	1/g	t/g
9	1/n	t/g	equiv: 10(1/T,t/g), 12(1/g,t/T)
13	1/n	t/T	equiv: 14(1/T,t/T)
16	d/g	1/g
17	d/n	1/g	equiv: 18(d/T,1/g), 20(d/g,1/T)
21	d/n	1/T	equiv: 22(d/T,1/T)
24	1/g	1/g
25	1/n	1/g	equiv: 26(1/T,1/g), 28(1/g,1/T)
29	1/n	1/T	equiv: 30(1/T,1/T)
36	d/s	t/T
40	1/s	t/g
44	1/s	t/T
48	d/s	1/g
52	d/s	1/T
56	1/s	1/g
60	1/s	1/T

9-3-93
------
Results of doing this. We report the best Ambig performance and the iteration
it appeared at (or the last performance if still rising). General figures
first:
	Words	Ambig	Ambig%	Skip
b	60399	34133	56.51%	20863
l	57562	31532	54.78%	17606

(The best all performance can be found from 1 - ambig% + ambig% * ambig)
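That conversion, as a one-liner (proportions rather than percentages; checked against the 94.92%/98.36% base figures from 8-3-93):

```python
def all_rate(ambig_prop, ambig_rate):
    """Overall success rate from the ambiguous-word success rate;
    unambiguous words are always labelled correctly."""
    return (1 - ambig_prop) + ambig_prop * ambig_rate
```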

Test	d	g	Best(b)	at(b)	Best(l)	at(l)	Comments
0	d/g	t/g	95.96	-	94.77	-	base case
F	-	-	89.22	-	85.32	-	most frequent
1	d/n	t/g	95.40	2	94.44	3	= 2:d/T,t/g, 4:d/g,t/T
5	d/n	t/T	94.73	4	91.25	29	= 6:d/T,t/T
8	1/g	t/g	90.52	9	91.82	6
9	1/n	t/g	92.96	5	92.80	5	= 10:1/T,t/g,12:1/g,t/T
13	1/n	t/T	93.08	7	92.55	7	= 14:1/T,t/T
16	d/g	1/g	75.40	30	78.86	30	
17	d/n	1/g	84.74	5	86.06	5	= 18:d/T,1/g,20:d/g,1/T
21	d/n	1/T	94.06	2	92.27	3	= 22:d/T,1/T
24	1/g	1/g	65.47	30	64.24	30
25	1/n	1/g	66.51	30	72.48	30	= 26:1/T,1/g,28:1/g,1/T
29	1/n	1/T	75.49	14	80.87	30	= 30:1/T,1/T
36	d/s	t/T	94.73	4	93.81	5
40	1/s	t/g	92.96	4	92.80	6
44	1/s	t/T	93.08	7	92.57	7
48	d/s	1/g	84.75	5	86.06	5
52	d/s	1/T	94.06	2	92.27	3
56	1/s	1/g	66.53	28	72.48	30	
60	1/s	1/T	75.36	17	80.87	30	

If we take the results from b and plot the success rate against the average
observation probability, there is considerable variation in the sort of curve
seen. Roughly:
Class 1: falling linear, with knee: 1
Class 2: falling linear: 21, 52
Class 3: falling, wide scatter: 5, 36
Class 4: rising linear: 6, 16, 29, 60
Class 5: rising linear, wide scatter: 13, 25
Class 6: multilinear: 17, 24, 56
Unclassified (usually tight grouping): 9, 40, 44, 48


There are two further tests to do: firstly, the same as the above, but on a
much larger corpus. Secondly, a Cutting et al. style test where training is
on one part of the corpus, and testing on the other. In the latter, it should
be noted that they are using ambiguity classes rather than words. They use a
lexicon in training, but do not report (a) if it contains frequency
information (probably not) and (b) what their initial transition matrix is.
However, they do say that their approach CAN be given initial bias in both
ways. I think for the larger test, we should use d/n and 1/T. The former is
justifiable in that real dictionaries are often ordered by frequency, which
approximates to having d/n. Not using d/g means we don't need to know anything
about tag distribution across the whole of language, as does using 1/T. This
is test case 21.

So we first do the full collection of iteration tests on a large corpus formed
by concatenating b, c, d, e, f, g. This means that it does not exactly match
the btojl.lex and btoj.trn (as above), but it is still around 500000 words,
which is a good size.

	Words	Ambig	Ambig%	Skip
big	497878	279457	56.13	163998		(btog)

Test	d	g	Best(big)	at(big)	Comments
0	d/g	t/g	96.17		-	base case
1	d/n	t/g	95.40		3	= 2:d/T,t/g, 4:d/g,t/T
5	d/n	t/T	94.82		5	= 6:d/T,t/T
36	d/s	t/T	94.82		5
21	d/n	1/T	94.51		3	= 22:d/T,1/T
52	d/s	1/T	94.51		3
13	1/n	t/T	93.60		6	= 14:1/T,t/T
44	1/s	t/T	93.60		6
9	1/n	t/g	93.48		4	= 10:1/T,t/g,12:1/g,t/T
40	1/s	t/g	93.48		4
8	1/g	t/g	92.36		30
F	-	-	88.71		-	most frequent
17	d/n	1/g	87.55		30	= 18:d/T,1/g,20:d/g,1/T
48	d/s	1/g	87.55		30
29	1/n	1/T	79.12		17	= 30:1/T,1/T
60	1/s	1/T	79.12		17
16	d/g	1/g	78.72		30
56	1/s	1/g	66.88		30
24	1/g	1/g	66.86		30
25	1/n	1/g	55.88		30	= 26:1/T,1/g,28:1/g,1/T

Ordering on best (0 before any of the rest):
b   1, {5,36}, {21,52}, {13,44}, {9,40}, 8, F, 48, 17, 29, 16, 60, 56, 25, 24
l   1, 36, {9,40}, 44, 13, {21,52}, 8, 5, {17,48}, F, {29,60}, 16, 56, 25, 24
big 1, {5,36}, {21,52}, {13,44}, {9,40}, 8, F, {17,48}, {29,60}, 16, 56, 24, 25

10-3-93
-------
(Big test still running!)

We also want to do a second big test, in which we iteratively build the
dictionary and transitions from one large corpus and label another one against
it. Since we don't want unknown words to be left out, the procedure is as
follows:
build a basic dictionary from training text
find unknown words in test text
merge dictionaries
tag test text (baseline)
ITERATE {
re-estimate using training text and basic dictionary -> new basic dictionary
merge unknown words
tag test text
}
We have to keep on merging because the re-estimation will kill the unknown
words (since they won't occur).

A better procedure is that if we find a word is re-estimated to zero, then we
set it to have a low score, say 1/gamma for the tag. This seems better. So the
procedure above changes to:
build a basic dictionary from training text
find unknown words in test text
merge dictionaries
tag test text (baseline)
ITERATE {
re-estimate using training text and basic dictionary -> new basic dictionary
tag test text
}

(This is called xall below)
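The flooring step that replaces the repeated merge might look like this (a toy sketch: the word -> {tag: score} dictionary layout is an assumption, not the tagger's actual file format):

```python
def floor_zero_scores(dictionary, gamma):
    """Give any tag whose score was re-estimated to zero a low score of
    1/gamma, so unknown words (which never occur in the training text)
    are not killed off between iterations."""
    floor = 1.0 / gamma
    for tags in dictionary.values():
        for tag, score in tags.items():
            if score == 0.0:
                tags[tag] = floor
    return dictionary
```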


Timing: 87s to label 60399 words on B = 694 words per second

Also do some more tests to collect information about branching factors and so
on. We first change the program to collect information about average number
of hypotheses and branching factors only on ambiguous words. Then we run this
in self tests on each part of the corpus, and also on a test trained on b and
tested on 10% portions of b. We also want to find the length of ambiguous
sequences and whether they contain errors or not. So change the code for this,
using the special option. Note that the new measurement of BF and AvHyp is
different from the old one (and probably better). Do the sequence thing by
output rather like the O+ etc stuff, and analyse it later.
For sequences, we want the average length of a sequence, and the average
length of a sequence containing one or more errors (there are other measures
too).
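The sequence measures could be computed like this (a sketch; representing each ambiguous run as a list of per-word correctness flags is an assumed encoding of the special-option output):

```python
def sequence_stats(sequences):
    """sequences: list of ambiguous-word runs, each a list of booleans
    (True = correctly labelled). Returns the average run length and the
    proportion of runs containing at least one error."""
    n = len(sequences)
    avg_len = sum(len(s) for s in sequences) / n
    prop_err = sum(1 for s in sequences if not all(s)) / n
    return avg_len, prop_err
```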

Corpus	Ambig	Ambig%	AvHyp	BF	SeqLen	ESeqLen
a	94.64	31.88	3.91	2.59	1.48	0.148
b	94.92	32.35	3.78	2.58	1.42	0.132
c	96.04	26.83	4.12	2.79	1.36	0.091
d	96.23	39.36	3.99	2.59	1.60	0.120
e	94.13	34.77	4.19	2.70	1.49	0.163
f	95.40	35.69	4.14	2.66	1.52	0.135
g	95.31	38.43	4.59	2.79	1.57	0.148
h	95.80	34.63	4.01	2.66	1.41	0.111
j	96.12	41.17	4.98	2.98	1.52	0.124
k	96.14	40.26	4.11	2.55	1.69	0.129
l	94.90	30.28	3.94	2.64	1.46	0.130
m	95.48	22.73	3.13	2.41	1.27	0.080
n	95.64	36.27	3.95	2.51	1.65	0.145
p	95.51	33.56	3.75	2.48	1.55	0.141
r	95.65	28.31	3.43	2.46	1.34	0.096
Plots wanted: ambig vs ambig%, avhyp, bf, seqlen; eseqlen vs seqlen. These
show little connection. But I think we want some more data, so introduce
something which does the tests against a bigger dictionary, to try to see what
happens as the degree of ambiguity goes up. (Unknown words would give even
more). Do this by testing b...j against the btoj data we already have
Corpus	Ambig	Ambig%	AvHyp	BF
b/btoj	95.96	56.51	6.02	2.98
c/btoj	96.77	53.59	5.86	2.98
d/btoj	96.16	57.28	5.78	2.95
e/btoj	95.68	57.65	6.12	2.99
f/btoj	96.22	55.82	5.92	2.97
g/btoj	96.35	55.72	6.00	3.01
h/btoj	95.97	58.46	6.45	3.08
j/btoj	96.58	56.63	6.14	3.05

Could also do tests with unknown words, but you get vastly larger branching
factors. So don't bother. But do also carry out a test with a dictionary built
from everything and then tagging the various corpora against it. (The "all"
dictionary is 47276 entries.)
Corpus	Ambig	Ambig%	AvHyp	BF
a/all	96.08	57.99	4.90	2.22
b/all	96.14	59.14	4.98	2.24
c/all	96.80	56.23	4.68	2.18
d/all	96.24	59.70	4.90	2.25
e/all	95.80	60.37	5.12	2.28
f/all	96.41	58.42	4.88	2.22
g/all	96.35	58.06	4.96	2.25
h/all	95.99	60.24	5.44	2.36
j/all	96.55	58.53	5.14	2.30
k/all	96.14	58.40	4.59	2.14
l/all	96.12	58.92	4.63	2.15
m/all	96.02	55.45	4.33	2.08
n/all	96.08	58.64	4.54	2.12
p/all	96.16	57.47	4.46	2.10
r/all	95.94	58.06	4.73	2.19

Correlations:	self tests	whole lot
ambig to AvHyp	0.142		0.501
ambig to BF	0.095		-0.108

Another useful plot is length of sequence versus average number of errors in
sequence of this length. Do this on all the merged files from above.

Sequence length		Proportion with errors
1			4.04802%
2			8.526%
3			14.0063%
4			17.8917%
5			21.3166%
6			25.0871%
7			32.5879%
8			40.4762%
9			39.4737%
10			27.2727%
11			0%
12			66.6667%
13			0%


12-3-93
-------
Big test nearly done. Next do the xall test described above. (Writing up some
of this stuff in practtag while I do it.)

XALL
....
Training corpus = btog (497878 words, 279457 ambig, 163998 skipped)
Test corpus     = hton

Dictionary from btog + unknown words of hton is 42981 entries.
(* Results under 15-3-93 *)


15-3-93
-------
We want to repeat the xall test with dict initialised to 1/n, trans to 1/T and
both, for comparison. These are codes:
9  = 1/n,t/g
20 = d/g,1/T
29 = 1/n,1/T

Baseline performance (with no re-estimation, against perfect dictionary etc):
All 96.90%, Ambig 94.63%.
Ambig 48.31%, Total words 447361, Skipped 145616.

Ambig rates
Init	1	2	3	4	5	6	7	8	9
0	92.24	91.78	91.51	91.32	91.17	90.96	90.73	90.32	90.23
9	88.26	89.29	89.62	89.71	89.77	89.80	89.72	89.67	89.63
20	79.03	80.59	81.15	81.65	82.02	82.18	82.46	82.57	82.68
29	71.57	75.24	76.89	77.64	77.90	78.11	78.17	78.20	78.23

All rates
Init	1	2	3	4	5	6	7	8	9
0	95.74	95.52	95.39	95.30	95.23	95.13	95.02	94.90	94.78
9	93.83	94.32	94.48	94.52	94.55	94.57	94.53	94.51	94.49
20	89.36	90.12	90.39	90.63	90.81	90.89	91.02	91.07	91.13
29	85.76	87.53	88.33	88.69	88.82	88.92	88.95	88.97	88.98


(* Aside on preparing err.rp and err.rm for practtag paper: get output from
tagger containing R+ and R-. Then use total.c to get it to 2 sf, sort it and
total it. NB! This is changed from earlier procedure (5th March). Results in
err.rp1 and err.rm1. *)

SIMILARITY
..........
Experiments with a similarity measure on transitions. We will take various
tests and produce a similarity measure with the transitions matrix using the
following procedure.
1. Generate a starting matrix.
2. Label a corpus against it, re-estimating a new matrix.
3. Form the correlation coefficient for each value in the new matrix against
the value in the old matrix.
Add an option to the program for doing this (do it as a special option).
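The correlation in step 3 is presumably an ordinary Pearson coefficient over corresponding matrix cells; a sketch:

```python
import math

def matrix_correlation(old, new):
    """Pearson correlation between corresponding cells of the old and
    new transition matrices, given as flat sequences of equal length."""
    n = len(old)
    mo, mn = sum(old) / n, sum(new) / n
    cov = sum((a - mo) * (b - mn) for a, b in zip(old, new))
    so = math.sqrt(sum((a - mo) ** 2 for a in old))
    sn = math.sqrt(sum((b - mn) ** 2 for b in new))
    return cov / (so * sn)
```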

16-3-93
-------
Start doing the similarity tests above. Doesn't seem to be very successful. I
did five iterations of a B-W self test on B, and got the following

Iter	2	3	4	5
Correl	0.71121	0.99929	0.99961	0.99978
Ambig	94.92	93.89	93.34	92.86
i.e. correlation is still rising while success is still falling

But to confirm this, try doing tests using a different part of the corpus
against the B matrices. Just one iteration in each case.
	C	L	N
Correl	0.67183	0.65420	0.61197
Ambig	71.87	73.23	89.46
(These have a fair number of unknown words, so the tests are *real slow*)

And as a further test do the same against the btoj data.
	C	L	N
Correl	0.62684	0.62510	0.64476
Ambig	86.77	89.94	71.94
(L and N will have unknown words)

Conclusion: correlation between matrices is not a useful measure.

As the last thing in the practical experience thread, we will add a change to
B-W so that if it predicts an error (using the error threshold), then it will
omit the relevant word altogether from the re-estimation. This is less than
perfect, because re-estimation takes all the hypotheses into account, but it
might give an indication that something has gone wrong, and so the
re-estimation is not to be trusted. Two possibilities: ignore a word if the
chosen hypothesis is below the threshold, or ignore any individual hypothesis
below the threshold. Only explore the second one.

We test this with the thresholds above. Use self tests on B. Start by doing
some iterations with no threshold to get base case and then use the three
thresholds we had before.

(Ambig accuracy)
(* Cut this, because bug found: see better results below *)
Iteration numbers exclude the training and the first run after it, since B-W
and hence thresholding has not taken effect at this point. In all cases, the
performance on the initial run is 94.92%

Comment: I haven't recorded the all performances. The ambiguous proportion is
32.35%, from which you can calculate it: 1-0.3235 + 0.3235*a. E.g. from 94.92%
you get 98.36%.

17-3-93
-------
(* Slight change *)
In deciding when to eliminate something, the procedure for gamma and pi values
is just the hypothesis gamma value (normalised) being too low. For
transitions, it is if either gamma value is too low. Before I had this with a
test only to see if the from gamma value was too low. With this change, we get
new results as follows.
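The changed elimination rule for transitions, as a predicate (hypothetical names; a sketch of the rule, not the actual code):

```python
def keep_transition(gamma_from, gamma_to, threshold):
    """A transition is excluded from re-estimation if EITHER of its
    endpoint (normalised) gamma values falls below the threshold;
    previously only the 'from' value was tested."""
    return gamma_from >= threshold and gamma_to >= threshold
```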

Figures retaken on 5-4-93, because a bug emerged: that the thresholding WAS
working on the output, but not actually in excluding anything from the
dictionary etc. Need to hunt down why first.
... Now I understand: it's because there is a re-est threshold as well as a
normal one. I excluded this from the BT release for some reason: now
re-instate it. I started redoing these tests because I was worried I had not
made the distinction in the performance figures, but perhaps I had; do a
double check.
... So in fact no need to retake figures.

Thresh	1	2	3	4	5	6	7	8
none	93.89	93.34	92.86	92.49	92.33	92.15	92.08	91.92
0.631	94.18	93.91	93.85	93.81	93.74	93.67	93.60	93.55
0.692	94.15	93.96	93.85	93.69	93.62	93.53	93.50	93.37
0.809	93.91	93.68	93.33	93.20	92.92	92.76	92.69	92.63
(Better with this further change)
Comment: the results are improved by making this change, but it is
interesting how a LOWER threshold makes better results: presumably too much
data is getting excluded otherwise.

Why did it change? Some correction to the program perhaps? Or maybe I just
noted the figures incorrectly before.

Now do the above tests but with an initialisation code. We just want a general
idea of what happens at much lower performances, so use 29 (1/n,1/T).

(Ambig accuracy)
Thresh	1	2	3	4	5	6	7	8
none	69.31	71.49	73.14	73.79	74.48	75.01	75.46	75.59
0.631	72.23	75.51	78.74	79.47	79.59	79.60	79.50	79.48
0.692	72.23	75.67	78.82	79.59	79.70	79.72	79.60	79.52
0.809	72.23	75.66	78.70	79.42	79.62	79.53	79.49	79.43
(Exclude the first iteration again, since no re-estimation has happened yet)

We do a final set of tests in which we exclude words where the chosen
hypothesis was below the threshold. Do the same test as before, i.e. self test
and test with I29.

Self test
Thresh	1	2	3	4	5	6	7	8
none	93.89	93.34	92.86	92.49	92.33	92.15	92.08	91.92
0.631	93.87	93.25	92.94	92.58	92.93	92.21	92.18	92.10
0.692	93.80	93.23	92.82	92.61	92.35	92.33	92.15	92.24
0.809	93.44	92.87	92.50	92.24	92.15	92.02	91.98	91.92

I29 test
Thresh	1	2	3	4	5	6	7	8
none	69.31	71.49	73.14	73.79	74.48	75.01	75.46	75.59
0.631	72.23	75.17	78.01	78.93	79.20	79.43	79.36	79.38
0.692	72.23	75.33	77.84	78.76	78.88	79.02	78.99	79.10
0.809	72.23	75.00	77.71	78.39	78.68	78.85	78.90	78.78

Comment: these are lower than the other method, which seems reasonable. So
drop it (and cut it from the code).

*** These figures redone 15-4-93.

18-3-93
-------
We now declare the practical experience thread finished, for the time being at
least. FSM stuff continues in a new Notes file. However, before moving on, we
index the stuff above, where it has not been superseded by later work.

28-12-92 Justification for dictionary calculation.
18-01-93 Analysis of re-estimation tests.
19-01-93 Ted's background to the work.
25-02-93 Penn format.
26-02-93 Penn results.
26-02-93 Accuracy variation with size of corpus (not written up).
26-02-93 Accuracy variation with lexical and transition probabilities (not
	 written up).
01-03-93 Lob results.
02-03-93 Tagset variation.
08-03-93 Automatic error detection reasoning.
08-03-93 B-W tests with various starting points.
09-03-93 ... continued.
10-03-93 Big corpus tests.
10-03-93 Sequence length variation.
15-03-93 Results of B-W with automatic error detection.
17-03-93 ... continued.
16-03-93 Experiments on matrix correlation.
