Second tagging notes file. Started 18-03-93
===========================================

We are concerned mainly with the FSM work. First establish an agenda, based on
points from the previous file.



AGENDA (Now overridden)
------
(This collects things from old agendas, adds some new ones, and cancels out
most of the YTDs above this point.)
1. Program changes
 a. allow unknown words to be merged in during re-estimation.
 b. do better number parsing for LOB
 c. Adapt training code for phrasal hypotheses
 d. add changes to retain formatting
 e. add changes to split off punctuation.
3. Literature
 a. get de Rose's thesis.
 b. see "boosters" paper for examples of weak complex lexical items (not as
strong as idioms, but still connected).
 c. follow up references cited by Brill (3 of them).
4. FSMs
 a. do real testing
 b. See 3b.
 c. Try FSMs which are error correction a la Brill.
5. BT
 a. send off
 b. do full documentation
6. Remarks from Kupiec CS&L paper
 a. Add extra tags for low freq words (section 3.1)
 b. do morphological analyser
 c. he doesn't tag on WORDS at all, but on equivalence classes which then
predict the category sequence. This means you can get away with less training
data (Cutting et al suggest 3000 sentences). Try implementing this.
 d. lexical probability problems are fixed by a higher order network, i.e. a
HMM with some disallowed transitions. (page 7)
 e. follow up Rohlicek reference (page 9)
8. Brought forward: Also make skip and phrasal tests be macros. At least add a
flag to skip the base hypotheses in tagging, even if the full semantics of the
above isn't yet worked out. Hooks added for this, but no current FSM action to
make it possible.
 a. work out the semantics of this.


Procedure for getting some initial syntax rules for FSMs: write something to
automatically scan parts of a parsed corpus and pull them out. Use "lancpars"
as the corpus, which is a parsed version of parts of LOB. So we have the same
tagset.

Format: ignore untagged words. Phrases are brackets [tag ... tag]. There is
always a space between the tag and a word that follows, but it looks as if the
tag can be followed by another [. Also ] can be followed by ], or a tag, but
has a space if it is followed by a word.
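A sketch of what the extraction code has to classify, token by token. The function name and the exact bracket convention ("[N" opens a phrase labelled N, "N]" closes one) are my assumptions, not lanc.c's actual code:

```c
#include <stdio.h>
#include <string.h>

/* Token kinds for a hypothetical lancpars scanner (sketch only). */
enum tok { TOK_OPEN, TOK_CLOSE, TOK_WORD };

/* Classify one whitespace-delimited token.  "[N" opens a phrase
   labelled N; "N]" closes one; anything else is a word.  A tag glued
   to a following '[' would need splitting before this is called. */
static enum tok classify(const char *t, char *tag, size_t taglen)
{
    size_t n = strlen(t);
    if (t[0] == '[') {
        snprintf(tag, taglen, "%s", t + 1);
        return TOK_OPEN;
    }
    if (n > 0 && t[n - 1] == ']') {
        snprintf(tag, taglen, "%.*s", (int)(n - 1), t);
        return TOK_CLOSE;
    }
    tag[0] = '\0';
    return TOK_WORD;
}
```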

So initially write the code to extract syntax rules. We will keep this
separate, and just ignore the words. Ultimately, of course, it will have to be
integrated into the tagger.

19-3-93
-------
Now have a program lanc.c which takes the lancpars corpus and reports all
syntax rules in it. Many of these are pretty obscure, and the number is
further increased by compound phrase tags such as NP/NP/PPL& (it's a kind of
co-ordination). So let's run it on the whole of lancpars, get all the rules,
sort them, uniq -c them and sort them again. We then have a file listing all
the rules in increasing order of frequency, with the frequencies. Results are
in lrules.

We must first copy the lancpars files across, uncompress them, and edit out
email headers etc. We also join the bits together in doing this. This gives a
total of 11679 rules, a number of which are X -> Y in form. For example, the
second most frequent rule is V -> VBD.

To do some experiments with FSM (not sure what I am really looking for in
this!) let's start by coding up some of the more frequent noun rules. Category
N means noun phrase. Take the rules with frequency greater than 500, and throw
out those that involve phrasal categories other than N (for now); they
are indented below.

 518 N -> ATI NNS
 574 N -> AT JJ NN
 583 N -> DT
 659 N -> NNS
 712 N -> PP3O
 824 N -> AT NN
	 897 N -> ATI NN Po
1097 N -> PP2
1172 N -> PP$ NN
1554 N -> PP3
1932 N -> NN
2129 N -> NP
2521 N -> ATI NN

Now code these up as FSMs.

Some problems and thoughts emerge from this
(1) How to do sensible evaluation?
(2) Add something which reports when we have more than one FSM in parallel, so
we can see the consequences this has.
(3) Can we get FSMs which start on the basis of phrasal hypotheses from other
FSMs? (i.e. recursive networks)
(4) How to do sensible training? (a) generating the transitions (b) what to
use for lexical probabilities (1/gamma?)?
(5) Bug: gives "bad total ?" at some point in processing LOB-B. This means we
had an alpha total of <= 0. (Presumably something like NaN and hence the '?')
It happens on a phrase spanning a single word ("year" at B15 58).
(6) Lancpars uses simple X -> Y rules only on some special cases of words of
category Y, but we always recognise them. Is this right?

and the really key question:
(0) What really am I trying to achieve?


22-3-93
-------
			FSMs: initial statement
			=======================
1. Background/aims
2. FSM specification
3. Problems and procedures

1. Background/aims
We want to add some phrase structure information to the basic tagging. The
main reason for this really is for parsing, with a subsidiary aim that we
might be able to use things like identifying heads to improve the accuracy of
tagging. To do this, we will define FSMs on tags in order to simulate the
effect of syntax rules.
(a) Is this right? Or do we want to junk it and really use syntax rules?
(b) At present, FSMs are not allowed to be recursive, so it isn't really
parsing: the structure is always flat.
(c) FSMs are expected to be formed by expanding out the "rules" used in an
example parsed corpus. So if these are recursive, they also have to be
expanded out. The result may not be finite, so we have in fact imposed a
limitation.

2. FSM specification. See file "fsm".

3. Problems and procedures.
(a) Actions. The FSMs conclude with two major actions: tag and end.
* End just builds a phrasal hypothesis. It is intended for things like idioms,
where we will have completely forced the internal structure of the thing the
FSM has recognised.
* Tag tags the hypotheses it has spanned and builds a phrasal hypothesis.
** PROBLEM 1: the spanned hypotheses may start with an ambiguous word. Check
this is OK (it should be).
** PROBLEM 2: the spanned hypotheses may end with an ambiguous word. This is
not OK as it stands, so we will need something to fix it, like adding a faked
anchor.
** PROBLEM 3: a single word can be spanned. Solving problem 2 ought to fix
this.
(b) Integration.
** "End" always give a resulting score of 1, and "Tag" the observation
probability of the things it has tagged. How do we make this be comparable to
other scores we are getting?
(c) Input/Output.
** How do we produce the output in a convenient form for comparison with the
input, so that we can arrive at an accuracy measure?
(d) Training. There are two strategies for training. (1) Take a hand parsed
corpus and construct rules from it. Pass a corpus through the same set of
rules, and train it on the results. This is probably NOT what we want, because
we will have lots of cases where the rules will have incorrectly recognised
phrases in the training data. (2) Use a hand parsed corpus, and adapt the
training procedure to know about phrase boundaries. This means quite a change
to the input routines and maybe to the data structures, but it is also
probably what we will need for the input/output problems described above (and
with the current data structures it may not be so awful after all).
(e) Tag set. The Lancaster parsed version of LOB potentially has a huge number
of tags (60000). This is clearly impractical with the current implementation.
We could edit the corpora by hand to bring this down (also we might want to do
this so we are only seeing the kinds of phrases we are actually interested
in), or we could use the reduce mechanism to get rid of some of them; though
this is again probably slow.

So, the first thing to do is probably to add something which will read a
parsed corpus in Lancpars format. This means adding a new input option.
Lancpars does not have anchors in it. We could automatically insert some
whenever we see that we are at top level in the parse structure, but this is
unsatisfactory, because we may be editing down the corpus to get rid of some
of the levels of structuring. So assume that if we have them, they will have
been inserted by hand. Also use this to solve the tag set problem, by choosing
them and reducing them by hand.

I looked into what would be needed to get rid of the idea of having an
unambiguous anchor at the start. There are two places where you have to have
it: (i) for training (ii) for re-estimation. In both cases, if you don't, you
lose the link to the next model. So retain this, and accept the consequences.

Concentrate for the time being on training. For this we will need to write
code to read in the structured corpus and to learn from it. This may involve
some restructuring of the main tag_corpus loop (which is why I was looking
into dropping anchors).

Hold on this for now. I'm too pathetic today to get much done.


23-3-93: away

24-3-93
-------
Comment: using FSMs is actually quite like higher order (augmented) networks
in Kupiec CS&L paper.

To do training: read in tags as words of a special class. Keep a count of how
many open and close brackets have been seen. When you get down to zero and the
other may_tag conditions apply, then enter the training code. Do training
recursively, by finding a bracketing, and calling training within it, down to
the lowest level, at which point you do training as before. The new class
items on the stack use the ctag field for indicating what the phrase tag is;
this also allows a consistency check on the input corpus.
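A minimal sketch of that recursion (token encoding and function names are mine; the real code works on the stack and uses the ctag field):

```c
/* Bracket-recursive training sketch.  Tokens are ints: >= 0 is a tag
   id; OPEN is followed by the phrase tag, CLOSE ends the phrase.
   train_pair() stands in for the real transition-count update. */
enum { OPEN = -1, CLOSE = -2 };

static int pairs_seen;                      /* for checking only */

static void train_pair(int from, int to)
{
    (void)from; (void)to;
    pairs_seen++;
}

/* Train on one level, recursing into each embedded phrase, which then
   counts as a single item with its phrase tag at this level.  Returns
   the index of the terminating CLOSE (or n). */
static int train_span(const int *t, int i, int n)
{
    int prev = 0, have_prev = 0;
    while (i < n && t[i] != CLOSE) {
        int item;
        if (t[i] == OPEN) {
            item = t[i + 1];                /* phrase tag follows OPEN */
            i = train_span(t, i + 2, n);    /* train inside the phrase */
            i++;                            /* step over the CLOSE */
        } else {
            item = t[i++];
        }
        if (have_prev)
            train_pair(prev, item);
        prev = item;
        have_prev = 1;
    }
    return i;
}
```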


25-3-93
-------
Here's a note on something that puzzled me. After looking at the output
transitions from training on a small text sample I noticed that each gamma
value was either 1 or the number of occurrences of the tag plus 1. This is
correct: it comes from adding in a fake transition to the anchor tag for each
other tag. So no need to worry about it. (Of course, it makes the results on
very small training corpora a bit dodgy, but they are anyway.)

Training code rewritten and hand checked on a small sample: seems OK. One
question that seems to arise is what you do about things like a N (phrase)
made up of a single NN (lexical) or a single Ti (lexical). At the moment it is
the score of the enclosed item plus the transitions to and from the phrase tag
that matter.

Next we write a little program which will get rid of all phrase tags except
those from a specified set. (dephrase.c) Some source code stolen from lanc.c.

Now we start doing some testing. This will throw up some problems with the
labelling algorithm (no doubt!). We first take part b of the tagged corpus and
extract all N (noun phrase tags), discarding all others. Result in parsed/b.n.
Next we use "lanc" to extract all the noun phrase rules from it
(parsed/rules.b.n). This gives a total of 626 rules. Next we start to code
this up as FSMs. Undoubtedly, it would be better if this were done
automatically, but do it by hand for now (parsed/fsm.b.n).

(Did think of keeping all the phrase tags and using the embedded phrases as
names of states, expanding them out. Because then moving over to FSMs with
other phrases would be easier. But it's not equivalent.)

Compiling these FSMs by hand is really tedious! So I should write a program to
do it (makefsm.c)

26-3-93
-------
Implemented the program makefsm. That's all for today.

29-3-93
-------
Reimplemented makefsm to get it right! It's rather noddy at the moment and
could be improved by adding macros and/or looking for identical states. But
let's leave it as it is for now. If we run it on the non-recursive rules from
rules.b.n, we get 675 items. So next we will just let the labeller loose on
this to see whether it manages to run or not! The input will be lancpars/b.
Use input option 8. First need to train it, using a tagset which just has N as
an extra category, plus the alternative ditto tags that are used.

30-3-93
-------
Planning and running this test brings to light some problems.
(1) The FSMs do not allow recursive phrases (and it is quite hard to see how
to set them up with the current code). They *do* however allow embedded ones.
That is, they will recognise them, but you can't write FSMs to define them.

(2) The "choosing" mechanism is no good. Suppose you have an embedded phrase.
Then when it is tagged, "chosen" is set to the best hypothesis on a word
within this. Then, when the overall model containing the phrase is tagged,
"chosen" gets reset to the phrase. (I think).

(3) In fact, we are never actually managing to output a phrase at all at the
moment!

(4) Bombs out with a bad total.

3 and 4 are probably related.

Problem: executing an FSM does not set up alpha, beta, gamma scores, only an
overall phrase score. So the whole thing doesn't get integrated. That is, if
you have the end of a phrase preceding another word, in FB, then there is no
score to use.

How to fix it? Need a serious rethink of how scoring of and integration of
phrasal hypotheses works.

Re-reading the Kupiec trellis paper, I understand the way he does it, which
is that he has separate alpha and beta passes through the phrasal networks and
integrates the whole lot together using these values in something which is
only a fairly minor adaptation of the F-B algorithm. I could duplicate this
work, though it would need a fair amount of rehacking.

I think this should go on hold for a little while, to give me time to ponder
it and get clearer about it.

In the mean time what I will do is some restructuring of the program, so that
training and most_freq work on a word by word basis without having to build up
a stack in the same way (and it is silly that they do). This means that I can
then drop having initial anchors, since the other two algorithms don't need
them.

Code changes
............
These changes make train and most_freq be separate from the others, which is
useful because it means that phrasal tagging can call tag for just Vit or FB,
without the need of having an anchor at the start.

HOLD on this: is it really what we want? It makes the code a lot cleaner. The
main fly in the ointment is (I think) what to do about phrasal training. I
think the answer actually is easy: for TO phrase start and FROM phrase end, do
nothing different. For FROM phrase start and TO phrase end, just ignore the
transition.


DONE Move create_links to push_word.
DONE Split training and most_freq to have own tagging loops.
DON'T (no need) Allow output_word to be global
DONE Check options that require output_words and block in cases where only
output_word is called.
DONE Recheck results.

One thing that emerges from this as something needing to be sorted out is
whether we normalise by gamma when doing most_freq. In the old version this
did NOT. A comparison figure available is b vs. btoj which gave 89.22/93.91
(ambig/all). If we change it so that it does normalise by gamma, then we get
80.55/89.01. So it is best to leave it so that it does not. Next also need to
check that we get the same results if we train and write as two separate runs
and if we do it in one go. Former gives 83.84/94.77, latter gives 58.92/86.71.
So there is some sort of bug here! Found it: it's that the re-estimation
scores never get copied across. Rather than do this, just force most freq to
be done in a separate run from training.

31-3-93
-------
Rehacking the code reveals an interesting point. The training initialises the
pi values to a uniform amount. In the old version this didn't matter because
we always used an unambiguous word at the start. But in the new version the pi
setting does matter, and it appears to be changing the success rate downwards.
So we have two options (a) continue with an unambiguous initial word (b) find
a way of making a suitable collection of pi values. One way we might do the
latter is to total the transitions from other tags to this one and use that as
the pi value. Try adding a hack to do this (for now in the thing that reads in
transition matrices).
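The hack might look roughly like this (a sketch only; the matrix layout and names are my assumptions, not the tagger's):

```c
/* "Pi hack" sketch: set pi(i) proportional to the total transition
   mass flowing into tag i, instead of a uniform distribution. */
#define NTAGS 3

static void pi_from_transitions(double trans[NTAGS][NTAGS],
                                double pi[NTAGS])
{
    double total = 0.0;
    for (int i = 0; i < NTAGS; i++) {
        pi[i] = 0.0;
        for (int j = 0; j < NTAGS; j++)
            pi[i] += trans[j][i];           /* transitions j -> i */
        total += pi[i];
    }
    for (int i = 0; i < NTAGS; i++)         /* normalise to sum to 1 */
        pi[i] = (total > 0.0) ? pi[i] / total : 1.0 / NTAGS;
}
```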

After hacking around a bit, I also found there was a bug about Ambig_Needed
(which is now forced), which knocked the performance down a bit. Fixing this
we get:
Performance with no pi hack		93.37
Performance with pi hack		94.42 (yes 94.42)
Performance with unambiguous anchor	94.92

So we will note this in the practical experience paper and keep with the
procedure of having an unambiguous initial anchor. Also instantiate the "pi
hack" in the training code (i.e. sum from transition values).


Comment: the problem we are facing with FSMs is that Kupiec's trellis stuff
shows we need separate alpha and beta passes. What I have done ends up pretty
much equivalent to his, I think.


1-4-93
------
Chart parser written. This will replace the FSMs. We next need to integrate it
with the main data structures. Then we need to add the actual tagging to it.
It only works on tags, but I think that is probably what we want. As a
suggestion for the tagging (integration), do roughly what I was thinking of for
FSMs, which is to tag from the start symbol to the end symbol, with scores to
and from phrase brackets to do this. This is a bit like Kupiec, only not using
a separate model for each kind of phrase. Training is easy to do. Tagging I
think has to be done as separate alpha and beta stages.

Where this is really going is to get something which finds heads. So we can
then tag between them. But this is a long way off.

DONE: added stuff to link nodes to the stack and to create the edges.

Now here comes an important point. In Kupiec and in the earlier FSM approach,
the constituents of a phrase could be ambiguous, i.e. for each constituent
making up a phrase there was a range of possibilities for what it might be.
With the chart parser as proposed, the constituents are unique: there is a
single hypothesis for each. We could convert it to an ambiguous form, by
allowing phrase rules to recognise disjunctions, and whenever an edge has been
recognised, scan all other edges spanning the same points to see if it can be
used in an equivalent way (meaning that the edge was recognised by the same
point in the same rule, or possibly (not sure here) that the edge was
lexical).

So given this, what we want to do when we recognise a phrase is to generate
alpha and beta scores for it. This is easy to do, since the end points are
unambiguous, and the only special case that need be allowed for is when there
is just a single word in the phrase, in which case we can use beta = 1, alpha
= pi, and somehow bring in the word probabilities as well. Then these scores
can be assigned to the phrase tag as a whole.

Therefore, the next stage is to write code to construct phrasal hypotheses in
this form. And then to add something that does the tagging.

Comment: it is unclear whether the Kupiec approach or this one is better. This
ought to give sharper probabilities but it may be too extreme. We will see.

2-4-93
------
Insight: one way of providing further probability information for phrases is
to look at the tag sequence for them, and see where it occurs as a phrase and
where it doesn't, to get a probability that that tag sequence *is* a phrase.

Next step: write code to
DONE. move initial alpha/beta calculation.
0. Simplify data structures (see below).
1. add the phrasal hypotheses to the stack.
2. get its alpha/beta/gamma values
2a. without phrase tags in the model (need special case for single word model)
2b. with them
(use a code switch)
3. include phrase tags in re-estimation
4. output phrases and evaluate their occurrence.

The stuff for adding phrasal hypotheses to the stack reveals some
considerations about how phrases should be represented when they have phrases
inside them.
(1) Since phrases are made up of unique items, we could have a pointer for its
constituents back into the base hypothesis list, and when you look at a
phrase, you just use this (not following the "next" for the hyp). Phrasal
hypotheses would then also go in the base list. The result is a simpler data
structure, but it will potentially create problems if we ever go to
alternations in the rules (as mentioned above). Also if we want to find out
what was an original hypothesis and what was added.
(* In fact the latter problem does not apply, because we have the trace back
to the lexical entry in a hypothesis structure. Except in the case of unknown
words. So amend hypothesis structures to have a Boolean indicating if lexical
*)

(2) Suppose you have S -> NP VP and VP -> V NP and get hypotheses NP V NP.
Then do you need separate hypotheses for the VP in the S and the other one?
No: it is in the nature of the chart that they should end up in the same
place.

Go ahead with these changes: to summarise: we now only have base hypotheses,
and they have a flag for whether they are lexical. Phrases get added into this
list. (And when creating an edge from a lexical hypothesis, we should record
where it comes from.) Phrasal hypotheses will have a field which points to a
list of links, indicating the hypotheses they are made up from. We already
have a field indicating whether a hypothesis is phrasal or not. But there
might be some way we add new lexical hypotheses, so still have an extra field
here. We also get rid of the silly things for iterating over hypotheses (but
have a single for loop macro anyway).

Rethink: the problem with this is how to tag phrases separately from the base
ones, without having to write a new tagging loop. So what we will do is to
have a base hyp list for each stack level at present, and also a phrasal
hypothesis list. Phrasal hypotheses may be shared between stack levels. Each
phrasal hypothesis has a tag and a pointer to a list of links. Each link has a
pointer to the next link (ending in NULL), a pointer back to the corresponding
sp, and a list of hypotheses (with a single member at present). Such
hypotheses may themselves be phrasal. These hypotheses are NOT linked from sp,
but know their way back to it. The phrasal hypothesis also has a pointer to
the sp at the end of the phrase.

In fact, this is still not enough, because we tag on stacks. So we do want
pretty much the same data structure we always had. The only thing which really
emerges is that phrases within phrases are not linked back to their parent
stack structures.

Actually, this still doesn't really work. You have to drop the master field,
because, in the example above, the VP doesn't have a single sp that it
corresponds to. Also, lots of the stack stuff gets duplicated and quite a lot
of it is only relevant to lexical things. I think the answer may be to divide
up what is essential for tagging, which means a list of base hypotheses, a
list of phrasal hypotheses starting and ending here, and the scores for the
algorithms; and split this off from the lexical information, which could be
left as NULL for phrasal hypotheses. Phrasal hypothesis structures would then
be a list of type Hyp, but with a field in the union for a subsumed stack that
they point to in which the lexical information pointer would be NULL. Base
hypothesis lists would be allowed to be NULL, for cases such as the VP above.
Things like the VP above WOULD be anchored into the phrase list for the base,
but this would happen from just the normal end of parse edge creation. So it
would allow the VP to be recognised even if we fail to recognise the S. This
means that the VP (alone) and the VP (in the S) should be distinct objects in
the data structure, since they are actually distinct objects in the world. And
they can have different scores, because the transitions to their start and
finish can be different. Just as we duplicate the base lists on them. A phrase
can be started and ended from the same stack object.

We have to keep on using link structures, because a hypothesis can be shared
between its start and end stack levels. One minor change to the above work is
that we use the base hypothesis list for phrases other than at top level,
because they start and end at the same place, so we might as well not mess
around with links.
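To pin the above down, here is a sketch of the structures as I currently see them. Field names are illustrative only, not the tagger's actual ones:

```c
/* Sketch of the hypothesis data structures as settled on above. */
struct sp_rec;                       /* one stack position (word slot) */

struct hyp {
    int    tag;                      /* lexical or phrasal tag */
    int    is_lexical;               /* flag: came from the lexicon */
    double score;                    /* alpha/beta/gamma live elsewhere */
    struct link   *parts;            /* non-NULL iff phrasal: constituents */
    struct sp_rec *end;              /* last stack position spanned */
    struct hyp    *next;             /* next hypothesis in its list */
};

struct link {                        /* one constituent of a phrase */
    struct link   *next;             /* NULL-terminated list */
    struct sp_rec *sp;               /* back-pointer to the stack level */
    struct hyp    *hyps;             /* single member at present */
};

struct sp_rec {
    struct hyp    *base;             /* lexical hypotheses for this word */
    struct hyp    *phrases;          /* phrases starting/ending here */
    struct sp_rec *succ, *pred;
};
```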

5-4-93
------
Over the weekend, I had a thought about B-W with thresholding, namely that the
final test should have thresholding off (after the iterations with
thresholding on) to get a better estimate of accuracy.

State of the phrase code: written the stuff for generating phrases from edges.
Next we need to add the things to do the tagging. Two options (use a source
code switch for them): (1) tag just on enclosed model; (2) tag inserting
phrase brackets at start and end. For training with this we want two
transitions involving phrase tags: transition to and from the phrase tag as an
ordinary tag, and transition from and to the tag internal to a phrase.

We can implement the stuff about needing two sorts of phrase brackets, by
having a code on tags in the tag list, and internally creating more than one
tag for them: the one used as an ordinary tag, and one for the within phrase
use of the bracket; the latter is needed only with algorithm (2) above. Define
a macro which will get us from one to the other. Use the notation that + as
the first character on a line of a tag mapping means phrasal.

(Did think of doing this by having tags for [x and x]. But it's at least as
ugly and not very well principled, I reckon.)

Agenda:
DONE Code for tagging enclosed model
DONE Code for tagging with inserted phrase brackets
DONE Training work

Next we want to do some testing. To keep it all manageable, build a small
corpus of my own, from some of LOB, identifying noun phrases by hand. If this
works, then we can go on to using some real stuff. Stick to lancpars for the
format. Test text: n/text, tags n/tags.map, rules n/rules.

6-4-93
------
Debugging phrasal work. Make a program change by getting rid of for loops over
stacks that have these ugly things like:

end = to->succ;
for (t = from ; t != end ; t = t->succ)
    ...

A better way is:
end = to->succ;
for (t = from ; t != NULL ; t = (t==to) ? NULL : t->succ)
    ...

For which we will write a couple of macros: Succ and Pred. Better re-do some
tests to make sure I haven't fucked it all up in doing so. Everything is OK
apart from the re-estimation, which I may already have damaged for some other
reason. So try to track this down. (There is a small difference in observation
probability because the new version doesn't include redundant final anchors,
but it doesn't matter). Found it: you must never initialise an alpha or beta
value except when you really mean to, because they get accumulated on. In
fact, as a matter of safety, we could zap them at the start of the relevant
loops. There does seem to be a (significant) difference in obs prob on later
iterations, for some reason, but the success rate, with it run to 4
iterations, is the same.
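The macros in question might look like this, sketched on a stand-in struct: inclusive iteration from 'from' to 'to' that never dereferences to->succ.

```c
#include <stddef.h>

/* Stand-in for the real stack structure. */
struct node { struct node *succ, *pred; };

/* The loop variable becomes NULL one step after the far end, so
   for (t = from ; t != NULL ; t = Succ(t, to)) visits from..to
   inclusive without touching to->succ. */
#define Succ(t, to)   (((t) == (to)) ? NULL : (t)->succ)
#define Pred(t, from) (((t) == (from)) ? NULL : (t)->pred)
```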

7-4-93
------
The output from the tagger isn't really very convenient for comparing with
phrases. So add output option 512 which just gives words and tags, missing out
skipped ones (which includes phrase brackets). Then we can add an extra
program to do comparisons. But first prepare an agenda of ideas.

Output comparison turns out to be less easy than you might think, because you
can have things like a phrase in one corpus and not in the other (not too
bad), and crossing phrases, e.g.:
corpus 1: [ fred walks the dog ] quickly
corpus 2: fred walks [ the dog quickly ]

So I think rather than going into some complex way of trying to do this, what
I will write is a post processor which reads a bracketed corpus and outputs
lines each containing a phrase. Such phrases will include subsumed ones, but
we will then re-output subsumed phrases with a number following them to
indicate their depth. The files can then be compared with sort and diff, or
perhaps with yet another post-processor.
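The core of such a post-processor might look like this sketch. It uses bare brackets instead of lancpars' tagged ones, and all names are assumptions, not outcomp's actual code:

```c
#include <stdio.h>
#include <string.h>

#define MAXDEPTH 16
#define MAXLINE  256

/* Walk a bracketed token sequence and emit every phrase on its own
   line, suffixed with its nesting depth, so that subsumed phrases
   can still be matched up by sort/diff.  Returns the phrase count. */
static int emit_phrases(const char **toks, int ntoks,
                        char out[][MAXLINE], int maxout)
{
    char buf[MAXDEPTH][MAXLINE];
    int depth = 0, nout = 0;
    for (int i = 0; i < ntoks; i++) {
        if (strcmp(toks[i], "[") == 0) {
            buf[depth][0] = '\0';            /* open a new phrase */
            depth++;
        } else if (strcmp(toks[i], "]") == 0) {
            depth--;                         /* close innermost phrase */
            if (nout < maxout)
                snprintf(out[nout++], MAXLINE, "%s%d", buf[depth], depth);
        } else {
            for (int d = 0; d < depth; d++) {   /* a word belongs to
                                                   every open phrase */
                strncat(buf[d], toks[i], MAXLINE - strlen(buf[d]) - 2);
                strcat(buf[d], " ");
            }
        }
    }
    return nout;
}
```

Inner phrases come out first, with higher depth numbers, which is what the re-output with depth suffixes above requires.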

Program written (outcomp). Not sure how useful it will prove to be, but stick
with it for now.

Next do the work to allow disjunctions in the rules, and for the whole
hypothesis list to be matched against the rules, and all the possibilities to
be taken up. The latter requires putting an anchor at each end, so we first
test the code which does that (done).

13-4-93
-------
(After Easter)

Summary of the phrasal stuff - I should be discussing this with Ted soon, so
get it straight.

1. The FSM approach was dropped because it is similar to what Kupiec does: a
probabilistic FSM is just a restricted HMM. It was not identical: the
differences were the use of an %end operation as well as a tag one, and the
richer mechanisms for matching and modifying the tags (exactly, noneof, etc).
However, my practical experience of trying to set up FSMs for phrases taken
from LOB was that it was hard to see when you could actually use these.
Partly this comes from not being clear about how FSMs would integrate with
the other tagging algorithms.

2. So I implemented a chart parser, which is such that when it builds a
phrase, the constituents of the phrase are unambiguous, i.e. the phrase
structure rules do not have disjunctions in them. I might add disjunctions
later.
(* Aside. This is non-trivial: it is easy enough for lexical categories, but
not so good when it comes to phrases e.g. X = Y (A|B) Z where A and B are
phrases. *)
The two open questions on this are how to evaluate the output, and how to get
a good score for phrases.

3. On the latter subject, the approach is probably this. At the moment the
algorithms work on the score of the phrase, which is taken from the score of
tagging the words within it. The possibilities are:
 a. use just this score, i.e. tagging the particular sequence of categories
making up the phrase.
 b. using a score on the phrase specified for the rule.
 c. using the product of the two.
To get the score needed in (b) and (c), we could look for the number of times
the sequence of categories specified for the rule does make up the phrase and
the number of times it does not (or makes up some other phrase; perhaps this
should be the total times the sequence occurs, otherwise the probability can
be greater than 1), and derive the score for the rule from that. This looks
fine for the training procedure.
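As a sketch of option (c), with the denominator fixed as just suggested (function names are mine, and the counts are assumed to come from a corpus scan, e.g. via lanc.c):

```c
/* Rule score: fraction of the category sequence's occurrences that
   are actually bracketed as this phrase.  Using the total occurrence
   count as the denominator keeps the score in [0,1]. */
static double rule_score(int times_as_phrase, int times_total)
{
    return (times_total > 0)
        ? (double)times_as_phrase / (double)times_total
        : 0.0;
}

/* Option (c): phrasal hypothesis score = rule score x tagging score. */
static double phrase_score(int times_as_phrase, int times_total,
                           double tagging_score)
{
    return rule_score(times_as_phrase, times_total) * tagging_score;
}
```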

What about re-estimation? I think here we can draw on the Kupiec stuff, or
something like it, though this is quite complex and I don't really have a good
feel for it yet.

4. So the other question this leaves is how to evaluate the output.

Program change made: make score of a phrasal hypothesis be the product of the
rule score and the tagging score.


Question: do we try to do the thing to derive the rule probabilities in the
tagger or as a separate program? I think let's do it in a separate program.
Perhaps we can build this into or on top of lanc.c which is the program for
finding rules in lancpars.


Meeting with Ted to discuss state of the work and progress. Also see email
from him 8 April.
1. FSMs and Parsing
 a. John Carroll can supply a copy of a CF backbone derived for the Alvey
tools which may be useful for building FSMs and rules.
 b. For building FSMs, rather than using the data, use linguistic knowledge. To
avoid recursion as in NP -> NP RelCl with NP in the RelCl as well, drop PPs
and RelCls from NPs. Then we have Det AdjP N. Expand out AdjP into (e.g.)
AdvP. Expand out Det into NP+poss. Expand out NP+poss into Det N 's or Name
's. This gets you most cases and can be encoded in an FSM with iteration (i.e.
loops back to earlier states).
 c. If we are using a chart, then we could have an interactive front end which
displays charts and allows users to alter them, e.g. to add or delete edges,
and to adjust probabilities. In each case, we might want to be able to
recalculate/tag afterwards, or to select a part of a chart and run Viterbi/FB
on it. It would work on a sentence by sentence level. It would be possible to
write out charts to a file and read them back in. And presumably there would
be a re-estimate option (perhaps cumulative on each sentence, or write the
re-estimation data in a sparse form, such as xi(3,4)=n, and then have a
program which combines all this lot together). This is in part working towards
the ILD. To make this work, it would be useful to link phrases to their
original hypotheses (and also to any subphrases), so that if an edge changes,
we can recompute everything that depends on it. Consider an interface to the
GDE chart parser: can we write in a format that is interconvertible?
 d. Reinsert (and recompile, retest) the FSM stuff with some sort of define or
comment to make it available.
2. Errors and Practical experience
 a. Add analysis of major error classes to practical experience paper (base it
on the stuff already shown to Ted), and discuss how you tweak things, e.g. by
changing tagset.
 b. Do a further thresholding test on a more reasonable test of re-estimation,
e.g. using part of the LOB corpus for training and part for testing.
3. Other issues
 a. Write a letter to Sharp indicating that I aim to bring the tagger with
me and that they may make use of it for research purposes, but that the
original program was written at the Computer Lab, and that rights to it belong
to the lab and to me. Ted may be taking a copy with him to Xerox.
 b. Do more documentation on it, in particular the major data structures.
 c. Contact Sharp and suggest that in the time between me starting and moving
onto the SALT project (which probably won't start for a bit), I continue to
work on the tagger.

14-4-93
-------
Immediate actions/compiled agenda

AGENDA *** PROD items still wanted ***
......
(Things marked PROD are changes which would be needed for production quality
software rather than a research tool. Leave them as "don't" for now.)

1. Program changes
PROD allow unknown words to be merged in during re-estimation.
PROD add changes to retain formatting
PROD add changes to split off punctuation.
PROD allow equivalence classes (i.e. dictionary entries are sets of tags, and
the lexical probabilities are done on such sets; see Kupiec and Maxwell p15).
2. Remarks from Kupiec CS&L paper
PROD Add extra tags for low freq words (section 3.1)
PROD Do morphological analyser
PROD Smoothing of bigrams and word frequencies to allow for sparse data
3. Readtr
DONE Make Readtr know about inphrase tags. (* This proves to be just a matter
of recompiling with Bracketed_Phrases set. *)
DONE Get rid of readtr1 (move it to minor and get rid of it from makefile).
4. FSM/Parser
DONE Reinstate FSM code.
DONE Put custom may_tag etc calls in parser and fsm files.
DONE Define a standard interface for phrasal functions.
DONE Extend makefile for the special defines for phrasal things.
DONE Retest FSMs. (* see below *)
5. Chart
DONE Implement a chart in the stack structure, with traceback to base
hypothesis.
BELOW Write chart dumping code.
BELOW Write chart editor.
6. Paper
DONE Do extra BW tests.
DONE Add error analysis. (* see below *)
DONE Do error analysis with thresholding.
7. Other
DON'T (Ted will see them) Write to Sharp about rights.
DON'T Write to Sharp about what to work on.
DONE Document major data structures.


Standard interface for phrasal things. Suppose the name of the module is xxx.
Then we want the following calls:
xxx_may_tag	called from may_tag.
xxx_advance	called after pushing a word. (For fsms, execution is merged in
		here).
xxx_tidy	called in stack freeing.
xxx_read	read file.
xxx_read_named	read named file.

Constructed a script Make, which constructs the makefile. Run it if
dependencies change or after changing the variable Phrase at the head of this
file.

* Got it to the stage where it will recompile with FSM, but not yet tested.
May be possible to simplify data structures, e.g. by getting rid of master
somehow. (Got rid of it 15-4-93).

BW tests
........
We now do some more BW tests, using one part of LOB as initial training and
the other as test, and then using thresholding. We want to find a case where
BW is actually improving the performance, but where we are not starting with
the low performance found when an initialisation option is used.
Training: use k, l, m, and n (four of the fiction categories); add unknown
words from p (1837 of them) -> ktop.
Test: use p (the remaining fiction category).
* Does not meet requirements: performance falls initially.
* Try other corpora until we find one that works. May want to use a smaller
initial dictionary.
* A suitable test seems to be v7l against the dictionary from v7b with the
unknown words added. Dictionary for this in cor/z.lex.

(* Table moved to below *)

15-4-93
-------
Try similar tests looking for an even lower starting point.
Against v7b, the lowest is p (88.51).
Against v7k, all the values are high.
Against v7l, b gives 88.50.
Against btoj, all the values are high.

So we still aren't getting down to usefully low values. So do a test against
v7b with init options 8 and/or 17.
8	a	83.76	l	82.21	p	82.02
17	a	68.61	l	69.67	p	69.21

From which it looks as if using l/b with I8 would be a good compromise. (*
Table moved to below *)

---

16-4-93
-------
I uncovered an unfortunate bug in "total", the program which gives the
probability thresholds. It said that the threshold 0.631 meant that 1% of
correct words had probability less than or equal to this value. In fact, this
should have been 2%, and all the other values were similarly out. So we need
to redo the tests here (BORING!). Also add some extra ones.


(a) Tests
b/b with no init, I8 and I29 (self test). 32.35% ambiguity.
l/b with no init, I8 and I29. 31.29% ambiguity.
Use the thresholds from before, with some additional ones: 0.562 and 0.910.

I8  = 1/g,t/g = D2,T0
I29 = 1/n,1/T = D3,T1

(b) Predicted and actual self test accuracies.
Thresh	c	i	ignore	oracle	ignore	oracle
			(predicted)	(actual)
0.562	0.01	0.187	95.79	95.87	95.77	95.85
0.631	0.02	0.330	96.47	96.60	96.37	96.50
0.777	0.05	0.595	97.77	97.94	97.76	97.94
0.897	0.10	0.802	98.83	98.99	98.85	99.01
0.999	0.37	0.997	99.97	99.98	99.97	99.98

c =   correctly tagged words with score below threshold.
i = incorrectly tagged words with score below threshold.

We also want to do this for Penn.
Thresh	c	i	ignore	oracle	ignore	oracle
			(predicted)	(actual)
0.562	0.01	0.187	92.31	92.51	92.02	92.26
0.631	0.02	0.330	93.51	93.83	92.93	93.33
0.777	0.05	0.595	95.85	96.27	95.29	95.89
0.897	0.10	0.802	97.82	98.18	97.19	97.81
0.999	0.37	0.997	99.95	99.97	99.82	99.95

(c) Results (ambig success rates). 9 iterations, option L.
Some of these are different from before (the values here are probably the
correct ones).

b/b. No init.
	1	2	3	4	5	6	7	8	9
0	94.92	93.89	93.34	92.86	92.49	92.33	92.15	92.08	91.92
0.562	94.92	94.30	93.93	93.81	93.68	93.63	93.54	93.39	93.32
0.631	94.92	94.18	93.91	93.85	93.81	93.74	93.67	93.60	93.55
0.777	94.92	94.06	93.71	93.54	93.36	93.15	93.01	92.87	92.90
0.897	94.92	93.34	92.75	92.56	92.34	92.32	92.32	92.31	92.30

b/b. I8.
	1	2	3	4	5	6	7	8	9
0	86.70	89.81	90.65	91.15	91.24	91.41	91.48	91.37	91.44
0.562	86.70	91.68	92.23	92.47	92.55	92.44	92.39	92.38	92.31
0.631	86.70	91.66	92.25	92.51	92.55	92.46	92.40	92.44	92.38
0.777	86.70	91.55	92.14	92.24	92.08	91.82	91.60	91.63	91.62
0.897	86.70	90.93	91.08	91.01	91.02	90.93	90.87	90.79	90.81

b/b. I29.
	1	2	3	4	5	6	7	8	9
0	58.92	69.31	71.49	73.14	73.79	74.48	75.01	75.46	75.59
0.562	58.92	72.22	75.43	78.54	79.19	79.52	79.61	79.53	79.48
0.631	58.92	72.22	75.48	78.52	79.15	79.51	79.60	79.54	79.40
0.777	58.92	72.22	75.88	78.55	79.24	79.49	79.39	79.34	79.31
0.897	58.92	72.22	75.44	78.23	79.10	79.39	79.32	79.32	79.31

l/b. No init.
	1	2	3	4	5	6	7	8	9
0	88.75	90.10	90.47	90.44	90.12	90.03	89.99	89.90	89.92
0.562	88.75	90.13	90.67	90.77	90.58	90.46	90.42	90.27	90.21
0.631	88.75	90.15	90.71	90.72	90.46	90.29	90.12	90.09	90.02
0.777	88.75	89.97	90.54	90.09	89.81	89.67	89.56	89.50	89.42
0.897	88.75	89.76	89.90	89.34	88.85	88.81	88.71	88.64	88.55

l/b. I8.
	1	2	3	4	5	6	7	8	9
0	82.21	86.32	88.62	89.26	89.54	89.55	89.49	89.41	89.34
0.562	82.21	86.99	88.74	89.40	89.40	89.47	89.49	89.37	89.35
0.631	82.21	87.02	88.71	89.45	89.46	89.37	89.31	89.26	89.23
0.777	82.21	86.85	88.56	88.84	88.67	88.43	88.11	87.96	87.84
0.897	82.21	85.69	86.51	86.78	86.54	86.26	85.87	85.84	85.66

l/b. I29.
	1	2	3	4	5	6	7	8	9
0	56.84	72.42	75.82	77.59	78.79	79.29	79.38	79.52	79.58
0.562	56.84	71.77	73.98	75.31	75.45	75.29	75.10	74.93	74.84
0.631	56.84	71.77	74.00	75.20	75.16	75.09	74.87	74.56	74.47
0.777	56.84	71.77	73.10	73.99	73.94	73.39	73.22	72.83	72.65
0.897	56.84	71.77	71.39	71.78	71.89	71.48	71.39	71.27	71.27

---

What about the error analysis? The things which would be useful to put in the
paper are the major error classes, and then perhaps a summarised collection of
other errors. First I need to remind myself how to do this! It appears the way
to do it is to use option 128. We should also redo this following some automatic
correction to see if and how the categories have changed.

So start by repeating the self test on b with error analysis. I think we will
need to use the results of this rather than the earlier analysis: it is
basically the same, but there are a few minor differences, probably a
combination of typos and bug fixes.

19-4-93
-------
Four tests done: self test on b; self test on b with threshold 0.631; BW test
on b with I8 B6; the same with L = 0.631. Also do a self test with l, and a
test with Penn to see variation with corpus and with tagset.

Combining the results, we get:

Basic errors (self test)
............
	Self	Self (0.631)	Self/l
ABN/RB	1	1		9
CC/QL	1			1
CS/RN	1	1		4
IN/RI	1	1		1
JJ/IN	1	1		1
JJ/UH				1
JJ/VBD	1
JJB/IN	1	1
JJB/RP				1
JJB/VB	1	1
MD/NN				1
MD/VB	1
NN/BEG	1	1
NN/CS	1	1
NN/MD	1	1
NN/OD				1
NN/RB				1
NN/VBD				1
NN/VBZ	1	1
NNP/JNP	1
NNPS/NNP 1			1
NNS/NNS$ 1	1
NP/JNP	1	1
NPT/NPL	1
OD/RB	1	1
PN$/ATI				1
PN/ATI				1
PN/RB	1			5
PPLS/DT	1
QL/DT	1	1		2
QL/IN	1	1
QL/JJ				1
QL/PN	1
QL/RB				1
QL/RBR	1
RB/ABL	1	1		1
RB/DTI	1	1		5
RB/DTX				1
RB/JJB	1	1
RB/NN				1
RB/OD	1	1		3
RB/RBR	1	1
RB/TO	1	3
RB/UH	1	1
RB/VB	1
RB/VBN				1
RBR/JJB	1
RBT/AP	1	1
RI/IN	1			2
RI/RP				1
RN/EX	1
RP/NN	1			3
TO/CS				1
TO/RB				1
UH/VB				1
VB/MD	1	1		1
VBD/JJ				1
VBN/NN	1	1		1
WPI/WP				1
WP$I/WP$ 1	1		1
WP/DT	1	1		1
ABL/RB	2
AP/QL	2	1		1
AT/AP	2	1		1
BEZ/HVZ				2
CC/IN	2	2
CS/RB	2	1		6
DT/PPLS	2	2		1
DTI/QLP				2
HVN/HVD	2
IN/VBG				2
JJR/RBR	2			2
NN/RP				2
NN/VBN	2	2		2
NPT/NP	2	1
PP$/PP3O			2
PP3O/PP$ 2	1		4
QLP/RB				2
RB/ATI	2			4
RB/QL	2	2		9
RBR/JJR	2	2		3
RBR/QL	2	2
RBT/JJT	2	2
UH/RB				2
VB/AP	2	2		1
VB/JJ				2
VB/RB	2	1		2
VB/VBD	2	1		10
VBD/VB	2	1		4
VBG/IN	2	1
WDTI/WDT 2	2		1
WP/WPI	2	1
ATI/RB	3			3
CC/CS	3	2		2
CS/CC				3
EX/RN	3	3		3
HVD/HVN				3
IN/VB	3	1		2
NP/NPL	3	3
PPLS/CD1 3	3
QL/AP	3	2		5
QL/CS	3			3
RB/CC	3	3
RB/QLP	3
RB/RP	3			1
VBN/VB				3
WRB/RB	3	3
AP/JJ	4	1		9
ATI/PN$				4
CC/RB				4
IN/ABL	4	4
IN/QL	4	1		4
JJ/VB	4	4		1
JJ/VBG	4	4		4
JJB/NN	4	2
RB/PN	4	4		3
RBR/AP	4	3		2
TO/IN	4	3		4
VBN/JJ	4	4		7
CD/CD-CD 5	2
IN/NNU	5	5
JJ/AP	5	4		4
NP/NPT	5	5
RP/RB	5	2		7
VB/VBN	5	4		8
AP/RBR	6	1		4
HVD/MD				6
IN/CC	6	6		4
MD/HVD				6
NNS/NN	6	3		1
PN/CS				6
PN/QL	6	6		4
RB/AT	6	6		7
VBZ/NNS	6	6
HVZ/BEZ				7
IN/TO	7	5		8
VBG/JJ	7	6		1
CD-CD/CD 8	7
DT/WP	8	3		9
NNS/VBZ	8	6		1
AP/RB	9	7		11
VBG/NN	9	3		6
NN/JJB	10	9		1
NN/VBG	10	8		9
RB/ABN	11	11		22
WP/CS	11	10		22
CD/NNU	12	3		4
CS/DT	13	6		8
JJ/VBN	14	12		10
RB/JJ	14	9		29
CS/QL	15	8		9
RP/IN	15	11		52
AP/AT	16	10		6
IN/RB	16	11		9
NN/NNS	16	16
RB/AP	16	11		9
JJ/RB	17	15		28
JJ/NN	19	12		11
RB/CS	19	13		22
CD1/CD	20	15
NN/VB	20	12		9
VBN/VBD	21	10		75
DT/CS	24	22		8
IN/RP	24	19		61
VBD/VBN	24	14		28
VB/NN	30	19		9
NNU/CD	43	34
NN/JJ	50	34		18
RB/IN	52	36		33
IN/CS	73	39		43
CS/IN	81	57		49

which can be condensed as follows. First define some classes:
noun = NN NNS NNS$ NNP NNPS NNU NP NPL NPT NPS PN PPLS PP.. PP. NR
verb = VB VBD VBG VBN VBZ MD BEG HVD HVN
adj  = JJ JJB JJR JJT JNP
adv  = RB RBR RBT QL QLP RN RI RP WRB
minor= ABL ABN AP AT ATI CC DT DTI EX QL TO UH WDT WDTI WP WPI WP$ WP$I
number=OD CD CD1 CD-CD
IN
CS

Giving the following. We also add condensed errors from a test on Penn, using
the following:
noun = NN NNS NP NPS PP PP$
verb = MD VB VBD VBG VBN VBP VBZ
adj  = JJ JJR JJS
adv  = RB RBR RBS WRB
minor= CC DT EX FW LS PDT POS RP UH WDT WP WP$ " ''
number=CD
IN

Condensed errors (self test)
................
		Self	Self (0.631)	Penn	Self/l
adv/number	1	1			3
adv/verb	1				1
IN/adj					1
noun/CS		1	1			6
noun/minor	1			15	6
number/adv	1	1
adj/IN		2	2		2	1
minor/noun	2	2		5	5
noun/IN					2
verb/IN		2	1		5
verb/adv	2	1		1	2
verb/minor	2	2		4	1
IN/verb		3	1		2	4
minor/adj	4	1			10
minor/verb				4	1
IN/noun		5	5		1
adj/minor	5	4		2	5
adv/noun	6	4		5	7
minor/IN	6	5		38	4
number/minor				6
noun/adv	7	6		4	8
verb/adj	11	10		41	11
number/noun	12	3		11	4
CS/minor	13	6			20
IN/minor	17	15		102	16
CS/adv		18	10			10
adj/adv		19	15		55	31
adv/adj		20	14		74	22
adv/adv		20	14		2	11
adv/CS		22	13			27
adj/noun	23	14		38	11
adj/verb	24	21		33	15
minor/adv	27	14		55	41
minor/minor	32	19		43	29
number/number	33	24
noun/noun	36	30		54	8
minor/CS	38	34			36
noun/verb	43	31		47	22
IN/adv		45	32		62	71
noun/number	46	37		14	1
verb/noun	46	29		89	18
adv/minor	52	42		94	60
verb/verb	58	31		266	153
noun/adj	62	44		63	19
adv/IN		69	48		101	87
IN/CS		73	39			43
CS/IN		81	57			49
total		991	683		1341	879

Or, as a table:

SELF
correct	noun	verb	adj	adv	minor	num	IN	CS
chosen
noun	36	46	23	6	2	12	5
verb	43	58	24	1			3
adj	62	11		20	4
adv	7	2	19	20	27	1	45	18
minor	1	2	5	52	32		17	13
num	46			1		33
IN		2	2	69	6			81
CS	1			22	38		73

SELF (0.631)
correct	noun	verb	adj	adv	minor	num	IN	CS
chosen
noun	30	29	14	4	2	3	5
verb	31	31	21				1
adj	44	10		14	1
adv	6	1	15	14	14	1	32	10
minor		2	4	42	19		15	6
num	37			1		24
IN		1	2	48	5			57
CS	1			13	34		39


BW tests with init code 8, 6 iterations. L = 0 then L = 0.631.

Basic errors
............
	BW	BW (0.631)
ABL/IN	1	2
CC/CS	1	3
CC/QL		1
CS/NN		1
CS/RB	1	10
CS/RN	1	1
CS/WP	1
DTI/RB		1
DTX/RB		1
IN/JJB		1
JJ/IN	1	2
JJ/VBD	1	1
JJB/IN	1	1
JJB/RB	1	1
JJB/VB	1	1
JNP/NNP	1	1
MD/NN	1	2
MD/VB	1	5
NN/BEG	1
NN/CS	1	1
NN/IN		1
NN/MD	1	1
NN/OD	1	1
NN/VBZ	1	1
NNS/NNS$ 1	1
NNU/IN		1
NP/JNP	1	1
NP/NN	1	1
NP/NR	1	1
NPT/NP	1	1
NPT/NPL	1	1
OD/RB	1	1
PN/ATI		1
QL/DT	1	1
QL/IN		1
RB/ABL	1	1
RB/DTI	1	1
RB/JJB	1	1
RB/RBR	1	1
RB/UH	1	1
RB/VB	1	1
RBR/JJR	1	1
RBR/QL	1	1
RBT/JJT	1	2
RI/IN	1	1
RN/CS	1	2
RN/EX	1
RP/NN	1	1
VB/IN		1
VB/JJR	1
VB/MD	1	1
VBD/JJ		1
VBN/VB	1	1
WDTI/WDT 1	1
WP/DT	1	2
WPI/WP	1
WRB/RB	1	4
ABL/RB	2	4
AP/QL	2	3
AP/RBT	2
CC/IN	2	2
CD-CD/CD1 2
DT/QL	2	1
HVN/HVD	2	3
JJR/RBR	2	1
NN/VBN	2	2
NNP/JNP	2	2
PN/CS	2	4
PN/QL	2	8
PP3O/PP$ 2	3
QL/CC	2
RB/QL	2	4
RBR/RB	2	3
RBT/AP	2	2
RP/CS	2
UH/RB	2	2
VB/AP	2	2
VB/RB	2	1
VB/RP	2
VB/VBD	2	2
VBD/VB	2	3
WP/WPI	2	2
AT/AP	3	19
AT/RB		3
CD-CD/CD 3	8
CD1/NNU	3	1
CS/PN	3
DT/PPLS	3	3
IN/QL	3	5
IN/VB	3	2
PN/RB	3	3
PPLS/CD1 3	3
QL/RB		3
QL/RBR	3	2
RB/QLP	3	3
TO/IN	3	6
TO/RB	3	3
VBZ/NN	3	3
ABN/RB	4	10
EX/RN	4	5
JJB/NN	4	7
NP/NPL	4	4
QL/PN	4	2
VBG/IN	4	4
VBN/NN	4	4
IN/ABL	5	4
JJ/VB	5	5
NNU/CD1	5
RB/PN	5	4
RP/RB	5	4
WDT/WDTI 5	5
DT/WP	6	5
IN/NNU	6	5
NNPS/JNP 6	5
NNU/CD-CD 6	1
RB/AT	6	4
RB/CC	6	3
RB/RP	6	1
AP/JJ	7	9
JJ/VBG	7	7
VB/VBN	7	6
IN/TO	8	3
NNS/VBZ	8	7
RP/IN	8	24
VBG/JJ	8	10
CC/RB	9	36
VBN/JJ	9	12
VBZ/NNS	9	9
NP/NPT	10	10
RB/AP	10	12
RB/JJ	10	16
CD/NNU	11	4
NN/JJB	11	11
RB/ABN	11	10
WP/CS	11	10
CS/QL	12	16
VBG/NN	12	15
QL/AP	13	30
CS/DT	14	17
CD1/CD	15	19
NN/VBG	16	15
JJ/VBN	17	17
QL/RBT	17
NN/NNS	18	19
NNS/NN	20	24
AP/AT	23	9
CD/CD1	25
RB/CS	28	21
JJ/NN	30	31
DT/CS	31	26
ATI/RB	32	24
IN/RB	38	31
AP/RBR	38	38
CS/CC	38	10
IN/CC	38	4
VBN/VBD	38	30
JJ/RB	39	34
VB/NN	39	43
VBD/VBN	39	42
AP/RB	40	50
NN/VB	46	40
CD/CD-CD 47	17
IN/RP	52	20
RB/IN	52	70
NNU/CD	53	60
CS/IN	56	103
NN/JJ	69	80
IN/CS	241	87

Condensed errors
................
		BW	BW (0.631)
adj/IN		1	2
adv/verb	1	1
IN/adj			1
noun/minor		1
number/adv	1	1
noun/IN			2
verb/minor	2	2
CS/noun		3	1
IN/verb		3	2
minor/noun	3	3
noun/CS		3	5
verb/IN		4	5
verb/adv	4	1
noun/adv	5	11
IN/noun		6	5
adv/noun	6	7
minor/IN	6	11
minor/adj	7	9
adv/adj		13	20
CS/adv		14	27
number/noun	14	5
verb/adj	18	23
adv/CS		28	23
adj/verb	31	31
adj/noun	37	49
adv/adv		41	26
adj/adv		42	36
minor/CS	42	39
minor/minor	42	43
IN/minor	51	11
CS/minor	53	28
adv/minor	55	68
CS/IN		56	103
noun/noun	59	65
adv/IN		61	96
noun/number	68	65
noun/verb	75	66
verb/noun	78	76
IN/adv		87	56
noun/adj	89	99
number/number	92	36
verb/verb	93	93
minor/adv	139	182
IN/CS		241	87
total		1674	1523

As a table:

BW
correct	noun	verb	adj	adv	minor	num	IN	CS
chosen
noun	59	78	37	6	3	14	6	3
verb	75	93	31	1			3
adj	89	18		13	7
adv	5	4	42	41	139	1	87	14
minor		2		55	42		51	53
num	68					92
IN		4	1	61	6			56
CS	3			28	42		241

BW (0.631)
correct	noun	verb	adj	adv	minor	num	IN	CS
chosen
noun	65	76	49	7	3	5	5	1
verb	66	93	31	1			2
adj	99	23	36	20	9		1
adv	11	1		26	182	1	56	27
minor	1	2		68	43		11	28
num	65					49
IN	2	5	3	96	11			103
CS	5			23	39		87

20-4-93
-------
As a further test, do an error analysis on tagging with the Penn treebank and
see how the condensed tables match up. Do this on newpenn/10.

Basic errors
............
   1 EX/RB
   1 IN/JJ
   1 IN/NP
   1 IN/WRB
   1 JJ/PDT
   1 JJ/RP
   1 JJ/VB
   1 JJ/VBP
   1 MD/VBD
   1 NN/"
   1 NN/IN
   1 NP/IN
   1 RB/CC
   1 RB/RBS
   1 RBR/RB
   1 RBS/JJS
   1 VB/MD
   1 VB/RB
   1 VB/VBN
   1 VBD/MD
   1 VBN/NN
   1 VBP/JJ
   1 VBP/VBN
   1 VBP/WDT
   1 VBZ/NN
   1 WDT/RB
   2 DT/NN
   2 IN/VB
   2 JJ/IN
   2 JJ/VBD
   2 NN/FW
   2 NN/VBP
   2 NNS/NPS
   2 NP/FW
   2 PP/NN
   2 VBP/IN
   2 WP/WDT
   3 ''/POS
   3 CC/IN
   3 DT/NP
   3 DT/UH
   3 IN/CC
   3 JJ/NP
   3 MD/NN
   3 MD/VBP
   3 PP/CD
   3 VB/IN
   3 VBD/JJ
   3 VBZ/POS
   3 WRB/UH
   4 CC/DT
   4 CD/PP
   4 DT/CC
   4 NN/NNS
   4 NN/RB
   4 POS/VBZ
   4 PP$/PP
   4 RB/EX
   4 VB/JJ
   4 VBN/VB
   4 WDT/WP
   5 CC/RB
   5 DT/PDT
   5 MD/NP
   5 RB/NN
   5 VB/VBD
   5 VBP/VBD
   6 CD/LS
   6 JJR/RBR
   6 NNS/VBZ
   6 VBD/VB
   6 WDT/IN
   7 CD/NN
   7 NN/NP
   7 NN/PP
   7 UH/RB
   8 JJS/RBS
   8 NP/NN
   8 RP/RB
   8 VBG/JJ
   8 VBZ/NNS
   9 IN/WDT
   9 VBP/NN
  10 NP/DT
  10 RBR/JJR
  11 NN/CD
  13 IN/DT
  13 NN/VB
  14 DT/IN
  15 RP/IN
  18 DT/WDT
  18 NP/JJ
  18 RB/UH
  20 PP/PP$
  20 RB/DT
  25 VB/NN
  25 VBN/JJ
  26 NN/VBG
  29 JJ/VBN
  33 DT/RB
  35 JJ/NN
  36 VBN/VBD
  37 VBG/NN
  41 JJ/RB
  45 NN/JJ
  46 RB/RP
  61 IN/RB
  61 VBD/VBN
  63 RB/JJ
  70 VB/VBP
  71 VBP/VB
  77 IN/RP
 101 RB/IN

Condensed errors: see above

---

We now leave this work and move on to reconfiguring the stack to be a chart.
What we want to do is (a) to get rid of the need for a separate chart in the
parser; (b) have links between phrasal hypotheses and their base hypotheses;
(c) build fsms into the same structure.

(b) is probably the one to address first because it affects the data structures
most of all. We could simply have each lexical hypothesis which is in a
phrasal span have a link back to the base hypothesis it came from. Then if we
change a base hypothesis we search all hypothesis lists which might be related
to it, etc. But I don't like this. I think my preferred solution is to
change the data structures so that Hyp is no longer a linked list: it is just
the hypothesis data. Then we use Link structures to bind them together. So
each lexical hypothesis appears only once. Each stack level has a Link through
all lexical hypotheses here (for freeing etc), and a Link for the base
hypothesis. Subsumed hypotheses are also done using Links. So let's start to
make the change for this, and in doing so, we will allow the things subsumed
by a phrasal hypothesis to be Links rather than just Hyps, so we can have
disjunctions in phrases if we decide we need them. What we are in effect doing
is detaching structure from content.

In doing this, we also change the stack data structure so that there are two
versions, one of which is at word level and contains the relevant information,
and the other of which is just used for binding spans together. Lexical
hypotheses will then need a pointer back to the word hypothesis they came from
for reestimation. (* Actually they already have this *)

We lose some code in doing this: Special_branch stuff. Also get rid of all
tags output, which we will instead do via the code that outputs a chart.


21-4-93
-------
After yesterday's rather pathetic attempt, I now start to revise the data
structures in earnest, with a view to making them simpler and more amenable to
chart representations.
1. Get rid of stack levels for skipped words. These now get built into the
Word data structure in the form of left and right contexts for the main word.
This helps when we are tracing back to leaves from a phrase.
2. The base data structure is a stack, with one level per word. Each stack
level points to a node; stack levels are not linked together, but nodes are.
Nodes have a pointer back to a stack level, which is NULL except for ones
created from the stack.
3. Nodes are the principal data structure. The main things they have are a
list of all hypotheses which were assigned from this node (generally starting
here, though not necessarily), and a list of all hypotheses starting and
ending here in the form of Links.  This includes base hypotheses, which
simplifies the algorithms which work on them.
4. Phrasal hypotheses have pointers to the start and end of the nodes they
subsume. Lexical hypotheses have a pointer back to their stack entry. All
hypotheses are shared, i.e. the lexical edges of a phrase are also the base
hypotheses. However, since such shared things may have different scores
depending on what they are being considered for, the scores must be recorded
in the link structures. This is a bit unfortunate in that we have separate
start and end structures pointing to the same thing, so to get round this we
introduce a level of structuring midway between links and hypotheses called
SHyps. This is what Links point to; they contain the scoring information and a
pointer to the Hyp. The base score still goes on the Hyp: the SHyp just
contains alpha and beta values etc.

This covers the major points. Now we attempt to start the process of
reconstruction.

22-4-93
-------
Implemented the above data structures and got them working. However, it occurs
to me that it may be possible to make them simpler still, by noting that
(unlike the first chart parser implementation), all edges get attached to
nodes that have a lexical counterpart, which makes it look as if there is no
need for nodes at a higher level than this. Where we do need them is for
tagging sub-phrases, where they direct us to the constituents of the phrase,
accessed via their list of scored links. So on this basis, just have a check
on the data structures to see that the node contains no more than is necessary
for this.

* Also every node can have a pointer back to a stack structure, representing
where it starts and ends (though I'm not sure we need this); in place of 's'
and of the pointers in the hypothesis (perhaps). Do it this way: essentially
it means getting rid of the 's' field and moving the start and end parents out
of hypotheses, in which case they need a pointer back to their nodes instead, or
at least to their start nodes.

* Do make the following changes: re-institute 'nexthyp' so we can ignore some
sorts of edge; add a numerical code to stack edges; replace 's' in node
structure by start and end.

* We now assume that spanned sequences of nodes have pred and succ pointers
which take us back into the main sequence of things. E.g. in:
Node1 = Det, Node2 = N, Node3 = VP, Node4 = NP(Det,NP), then the succ of Node4
is Node3.

(In fact, what I finally end up with is a little different from this).

* This is now done. Next we move on to reintegrating the parser into this,
which may be quite complex. We first add support for a new sort of hypothesis
used to represent active edges.

23-4-93
-------
Now provided all hypotheses with start and end nodes. For lexical, these are
the same and point to the lexical hypothesis, for phrases they are the
subsumed span, and for active edges they point to the node at which the rule
fired and where we have got up to. This represents something of a tradeoff
between code and data: a bit more data (and some redundancy) in order to make
the code simpler in some places.

* A problem comes to light with the thing about spanned nodes, namely that if
we are advancing a word at a time we don't have the successor to set up. So
the way I will get round this is that the pred and succ of a span of subsumed
nodes points to the stack node of the same word.

26-4-93
-------
Parser now integrated with the main data structure. Next attempt to do the
same thing for FSMs (which is likely to be rather harder!). We first go for a
much simpler definition of FSMs, since what we have arrived at is really far
more complex than what we want. The definition will be on about the same level
as the parser; the main difference is to include an allowance for disjunction
and repetition (which we could, but won't, build into the parser).

Syntax (moved to file).

All this syntax is rather ugly but too bad.

FSMs may not be recursive. They operate only on lexical hypotheses (for now).

27-4-93
-------
Still working on FSMs.

Differences between FSMs and Parser:
1. FSMs are non-recursive, Parser allows phrasal constituents.
2. Parser gives phrases of single hypotheses, FSM allows sets.
3. FSM state recorded at node level, Parser at hypothesis level.

Found some possible bugs in the parser, when there are phrases within phrases,
so go back to it to check. (Done)

28-4-93
-------
Continuing on this, checking and debugging. Make Bracketed Phrases become a
permanent thing, but with an option to insert an anchor rather than a phrase
tag. NB! We are running out of space for option codes!

The next thing is to define an external format for charts. We want to make sure
of the following:
(1) that the chart can be reconstructed from the output
(2) that where there is duplicated information it appears exactly once.
(3) that where there is an external link (such as to the dictionary), it is
recorded in a form which is recoverable.

Change the dictionary format so that it includes the literal tags. I did think
of including a table at the start of the tags file giving the tags that are
used in it, but I think I will not do so, because inphrase tags have the same
printing form as non-inphrase tags. (And they never occur on lexical items, so
there is no problem here).

Now on to the design of the external format for a chart. We will also want
code for reading and writing charts.

---

Not too happy about all this chart stuff; temporarily put it all on hold and
document what I've done so far. In doing so, we want: Docn file as it stands
for options, general notes. File formats. Major data structures. Program
modules and amusing intricacies of the source code. By and large these ought
to be documented in the source files rather than in a separate document, so do
it this way.

29-4-93
-------
Things to do:
DONE Main documentation
(2) Source code documentation
DONE Literature search for materials on OO databases etc for lexical stuff +
scan recent proceedings.
(4) Prepare talk
(5) List and implement the things that would improve the labeller, e.g. adding
tags on low frequency words.
(6) Read Acquilex stuff.


What to add to the tagger?
--------------------------
1. Tag inference rules; run after loading a dictionary (or as a separate
program).
   if the word has this frequency relative to the total, and has any of the
following set of tags, then add the following tags with the same frequency.
2. Word equivalence classes: run after loading a dictionary (or as a separate
program).
   when two words have the same set of tags, then merge the tags they have and
assign the same frequencies to them
   OR
   when a word has low frequency, take it from an equivalence class
   (See Kupiec paper on this).
   Particularly look at the effect of this on BW with various starting points.

4-5-93
------
Now do the work on adding things to the tagger. First, tag inference rules.
(See documentation for how it works).
(Done, but not run any tests on it)


Adding perplexity
-----------------
Definition: (sum over W of log P(G|W)) / n
where n = number of words tagged
and P(G|W) = prob of a labelling G given a sequence of words W = prob that
comes out of the FB algorithm.
want to minimise: measures the difference between the model and the language.

Perplexity tests
(1) B-W on v7b, simple
Iter	Ambig%	Obs,e-6	Perpl
1	94.92	 4.70	-7.74903	
2	93.89	73.0	-6.78342
3	93.34	74.2	-6.78087
4	92.86	74.4	-6.77960
5	92.49	74.5	-6.77879
6	92.33	74.6	-6.77825
7	92.15	74.6	-6.77793
8	92.08	74.6	-6.77774
9	91.92	74.6	-6.77762
Falling success with higher (closer to zero) perplexity

(2) B-W on v7b, with sentence option ('a')
Iter	Ambig%	Obs	 Perpl
1	94.92	2.30e-16 -6.78466
2	93.89	1.47e-7	-5.66603
3	93.34	1.32	-5.66358
4	92.86	1.27	-5.66239
5	92.49	1.27	-5.66164
6	92.33	1.30	-5.66117
7	92.15	1.32	-5.66089
8	92.08	1.35	-5.66073
9	91.92	1.36	-5.66062
Falling success with higher perplexity

(3) Same with initialisation code 8 (D2+T0)
Iter	Ambig%	Obs	Perpl
1	86.70	1.65e-13 -12.9474
2	89.81	7.21e-5	-6.80291
3	90.65	7.39	-6.78941
4	91.15	7.40	-6.78417
5	91.24	7.40	-6.78167
6	91.41	7.41	-6.78030	
7	91.48	7.42	-6.77944
8	91.37	7.42	-6.77885
9	91.44	7.42	-6.77840
Rising success with higher perplexity

(4) Same with initialisation code 29 (D3+T1)
Iter	Ambig%	Obs	Perpl
1	58.92	5.90e-11 -5.38444
2	69.31	1.18e-7	-5.79483
3	71.49	1.23	-5.71501
4	73.14	1.19	-5.68187
5	73.79	1.20	-5.66393
6	74.48	1.26	-5.65277
7	75.01	1.34	-5.64627
8	75.46	1.40	-5.64823
9	75.59	1.44	-5.64081
Rising success with higher perplexity

(5) Same as (3), but with more iterations until we find a changeover point in
performance. (Perplexity does not turn over at this point).

(* More extensive results in plot file *)

We might also look at a perplexity figure based on gamma scores, where we sum
the log of the gammas used in re-estimation, and divide by the total
hypotheses.

(1) B-W on v7b, simple
Iter	Ambig%	Perpl
1	94.92	-73.0972
2	93.89	-73.1954
3	93.34	-73.2542
4	92.86	-73.3075
5	92.49	-73.3596
6	92.33	-73.4120
7	92.15	-73.4656
8	92.08	-73.5210
9	91.92	-73.5790
Falling success with lower perplexity (further from zero)

(2) B-W on v7b, with sentence option ('a')
Iter	Ambig%	Perpl
1	94.92	-72.0549
2	93.89	-71.4621
3	93.34	-71.5380
4	92.86	-71.6067
5	92.49	-71.6698
6	92.33	-71.7491
7	92.15	-71.8214
8	92.08	-71.9191
9	91.92	-72.0160
Falling success with rising perplexity.

(3) Same with initialisation code 8 (D2+T0)
Iter	Ambig%	Perpl
1	86.70	-72.0189
2	89.81	-72.6617
3	90.65	-72.8322
4	91.15	-72.9590
5	91.24	-73.0572
6	91.41	-73.1362
7	91.48	-73.2035
8	91.37	-73.2642
9	91.44	-73.3251
Rising success with lower perplexity

(4) Same with initialisation code 29 (D3+T1)
Iter	Ambig%	Perpl
1	58.92	-0.539377
2	69.31	-0.681451
3	71.49	-0.928813
4	73.14	-1.18675
5	73.79	-1.44333
6	74.48	-1.70778
7	75.01	-1.98222
8	75.46	-2.25305
9	75.59	-2.49521

(5) As (3) with more iterations to see what happens at changeover point.

The conclusion of this seems to be that, even though tests 1 and 2 have a
falling success rate and tests 3 and 4 have a rising one, the perplexity on
both methods of measurement changes in the same direction in all the tests. So
perplexity does not provide a useful predictor of performance. The only
anomalies are some perplexities on the first iteration.

(? This is puzzling: Sharman and others would predict that it does provide a
predictor.)
(No: I think they mean it is a predictor of convergence).

Xerox tagger
------------
Do some experiments with this, following the given file of instructions.
* compiled it (generates some errors, but see if it goes through).
* fails on adder, try it on bittern
* OK on bittern
* Can we do some speed and accuracy tests? (A bit tricky without grobbling
into it)


6-5-93
------
It seems like a good idea to do some more parsing experiments. The possible
soruce of material for this are lancpars, secpar and the treebank. The
treebank (in the parstexts directories) uses a significantly different format,
and we have already looked at lancpars, so now lets have a look at secpar.
(Created a local soft link).

Files we don't want: allns.sec, cgtree2. The others look OK. Take seca as the
working file, since it is a decent size. Format appears to be the same as
lancpars. Generate a tagset in sec.map.

Doing this reveals a difficult bug in tagging phrases, which I think leads to
some changes to the use of the data structures and to some insights into the
tagging algorithms.

(1) changes: at the moment, the succ and pred of the end nodes in a phrase
point right back to the lexical level, whereas they ought to point to the
relevant thing at the next higher level phrase. However, this is rather
difficult to arrange, because if you have e.g. an N within a P, such as
[P xx_XX [N yy_YY N] zz_ZZ]
then the N gets constructed first, before the zz node gets set up, and so we
don't know where to set the succ.
(2) in the tagging code, we ought to find t+1 and t-1 values from the
hypothesis, rather than from the node (t), since they will be different for
each hypothesis (?? is this actually true - it may not be - check).
* No it isn't true, because when we take t-1, for example, we are interested
in the hypotheses ending at t-1 and the ones starting at t. So ignore this
point.

The solution is to do with the succ and pred field of the end nodes in spanned
phrases. What we have to do is to make them point to the appropriate things in
the next higher level phrase. Now, it may be that the next higher level phrase
happens to end at the same point as the phrase we have, in which case we need
to (a) be able to detect this (b) trace to the next level up.

(* Done: it was hard *)
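
A rough sketch of the trace-up step (the Node and Phrase fields here are hypothetical stand-ins for the real chart structures):

```python
class Node:
    """Chart position; stands in for the real node type."""
    pass

class Phrase:
    """A spanned phrase with hypothetical start/end/parent fields."""
    def __init__(self, start, end, parent=None):
        self.start, self.end, self.parent = start, end, parent

def level_above(phrase):
    """Find the phrase the end node's succ should point into: climb
    past any enclosing phrases that end at the same node. Returns
    None when we run off the top of the nesting."""
    p = phrase.parent
    while p is not None and p.end is phrase.end:
        p = p.parent
    return p
```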

7-5-93
------
Now we go on with the test. I still don't have a good way of evaluating phrase
bracketing accuracy, but we can see what effect the rules have on the success
rate. (And note that it is REAL SLOW.) Moved below.

Comment: ran out of memory doing this with rules. So make a copy and insert
anchors at ends of sentences. Didn't help. Try looking at the output and edit
out troublesome sentences. This means that there is no longer an exact match
between the corpus and the rules, but this is fairly unimportant.

10-5-93
-------
Have another go at the seca test. It would also be interesting to know how
many times each defined rule actually occurred in the training corpus, so make
a new version of lanc.c to report this.

If the run works, we can also try looking at the output and attempting to do
some success evaluation by hand.

(* The execution seems to get exponentially slower: perhaps a space leak.
Write some code to check for this. Or use the faster unix malloc package. *)

It did run to conclusion, but there MUST be a space leak. So try to track it
down. Done so (I think - certainly MUCH MUCH faster now)

Without rules: ALL 93.59 AMBIG 77.59 (Words = 10302, ambig = 2945 = 28.59%)
With rules:    ALL 95.36 AMBIG 84.05 (Words = 10302, ambig = 1285 = 12.47%)

Next we want to get FSMs working again, and then do some sort of evaluation of
the rules.

11-5-93
-------
To do a significantly large test of FSMs, take the SEC rules, and weed out any
which include a phrasal category on the rhs. Then compile them into FSMs (by
hand). Try to do this so as to introduce some ambiguity, just for the sake of
interest! Also, in some places where there are repeated categories, allow
arbitrary sequences of them, for the same reason.

Next, think about how to do evaluation of phrases. One way of getting into this
might be just to look at the things that actually occurred in the output from
the parser run.

CASE 1: extra brackets.
[S+
[N
    SA01 v SA01 2 v [N
    Good JJ
    morning NN1
N]
S+]
    N]

Case 2: exact match, possibly nested.
[N
    [N
    founder NN1
[P
    [P
    of IO
[N
    [N
    the AT
    Unification NN1
    church NN1
N]
P]
N]
    N] P] N]

Case 3: missing brackets (also has an extra bracket in this example).
    ^ ^
    SA01 5 v [S& [Nr
a   Next MD
a   week NNT1
[Fr
    Nr]

?? start in right place, end in wrong place ???

Note we can also have "crossing" to confuse the whole thing:
[N
    [N
    jail NN1
N]
P]
[P
    N] P] [P
...




12-5-93
-------
Still need some similarity measure. I have a program which whips out brackets,
so that gives one way to approach it. If we pair up the brackets produced by
the labeller with the ones from the corpus, we can try comparing them.
However, this won't always be obvious. Look at the patterns.

(1) exact match: easy
(2) first n symbols are the same: delete from both and continue (may leave one
string empty).
(3) last n symbols, similarly
(4) any other cases, report to user

Anything left in the first string is an extra symbol, anything left in the
second one is missed.

(* In fact, this algorithm is not 100% reliable. It may give false matches,
for example in:
  P] N]
  G] N] P] N]
where the N] in the first string is supposed to match the first one of the
second string and not the second one.
*)

Done this: the output still needs post-editing for the "rejected" cases, and
will probably produce some spurious matches, but it gives a rough idea.
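
The matching steps above might be sketched like this (a simplification: the real output needs tokenising first, and as noted the greedy prefix/suffix stripping can misalign repeated symbols):

```python
def compare_brackets(produced, reference):
    """Compare two bracket-symbol sequences following the steps above:
    (1)/(2) strip a common prefix, (3) strip a common suffix,
    (4) report anything else back to the user as a reject.
    Leftovers in 'produced' are extra symbols; in 'reference', missed."""
    a, b = list(produced), list(reference)
    while a and b and a[0] == b[0]:      # common prefix: delete and go on
        a.pop(0); b.pop(0)
    while a and b and a[-1] == b[-1]:    # common suffix likewise
        a.pop(); b.pop()
    if a and b:                          # anything else: report to user
        return {'reject': (a, b)}
    return {'extra': a, 'missed': b}
```

On the false-match example above, the suffix pass silently consumes the wrong N], which is exactly the unreliability noted.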

Ted suggests doing an experiment with secpar/allns.sec, which is NP
definitions (some with other categories of phrase within them), but I don't
see this as yielding anything very useful.

Instead, for the time being, make a new agenda, and go into a reading phase.

AGENDA ***
......
DONE(more or less) Write chart dumping code.
DON'T Write chart editor.
DON'T Allow disjunctions in parser rules.
DONE Do a real experiment on phrase spotting. Maybe sectag/allns.sec or train
on sec/test on dow jones.
5. Put code into corpora/tools before leaving.
DONE Better similarity measure on trees needed.

----

*** Things which would be an improvement: calculate entropy in a better way.
Implement I/O with CNF grammars. Study probability/log probability as a
measure for convergence. All this from Lari and Young.



14-5-93
-------
Plans for what to do in the remaining time.
(1) Reading papers.
(2) Reading C++.
(3) Start work on a design specification for OO version.
(4) Plan the algorithms which would be built on top of this.
(5) Devise a control strategy, so algorithms could be strung together and/or
run in parallel.
(6) Devise a different user interface, in the form of a command language,
which could be either typed in, or derived from command lines, or triggered
from menus, etc.
DONE Devise an evaluation strategy for parsings (e.g. bracketing
compatibility).

Written a new version of "outcomp" for evaluation. Running it on output from
seca, we get (rules from seca/test on seca):
3166 exact	38.02%
2953 extra	35.46%
2209 missing	26.52%
(Note that the latter two categories can include places where the start was in
the right place but the end was not, and vice-versa.)
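
The percentages are just each count over the total number of brackets compared:

```python
def rates(exact, extra, missing):
    """Each category as a percentage of all brackets compared."""
    total = exact + extra + missing
    return tuple(round(100.0 * n / total, 2) for n in (exact, extra, missing))
```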

19-5-93
-------
* To carry this test a bit further, try using a different corpus with the same
rules: rules from seca/test on secb.
1117 exact	20.02%
2603 extra	46.66%
1859 missing	33.32%

* Also do this the other way round.
Rules from secb/test on secb
1869 exact	42.30%
1442 extra	32.64%
1107 missing	25.06%

Rules from secb/test on seca
1807 exact	19.49%
3897 extra	42.03%
3568 missing	38.48%

* Finally, make rules from the two corpora taken together (this is not the
same as just merging the rule sets, because of probabilities).
** Failed: ran out of memory.

--- A better test ---
* Now try a different test which should cost less in memory. Make new corpora
from seca and secb containing just NP, AP and PP categories, i.e. delete any
phrase brackets except the following:
J, J&, J+
N, N&, N+, Nr
P, P&, P+
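
A sketch of that filtering step, assuming the [tag ... tag] bracket format described earlier in these notes (the token handling details are guesses):

```python
KEEP = {'J', 'J&', 'J+', 'N', 'N&', 'N+', 'Nr', 'P', 'P&', 'P+'}

def keep_token(tok):
    """Words and word tags always survive; phrase brackets survive only
    when their category is in KEEP. Brackets are assumed to look like
    '[N' (open) or 'N]' (close)."""
    if tok.startswith('['):
        return tok[1:] in KEEP
    if tok.endswith(']'):
        return tok[:-1] in KEEP
    return True

def filter_line(line):
    """Delete unwanted phrase brackets from one line of corpus."""
    return ' '.join(t for t in line.split() if keep_token(t))
```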

- Step 1: edit down the corpora to get just these tags. Tagset file secnp.map.
Add anchors after each full stop.
- Step 2: produce rules files from the separate corpora and the two together.
Files: rules-a rules-b rules-ab. Combined corpus in secab.
869  rules in a
541  rules in b
1229 rules in ab
- Step 3: get basic success rates with no rules.
	Words	AWords%	All%	Ambig%
a/a	10302	28.59	93.97	78.91
b/b	5752	21.47	94.70	75.30	
ab/ab	16054	31.05	93.45	78.89
- Step 4: test each corpus against its own rules set. Get success rates as well
as bracketing rates.
- Step 5: test each corpus against the rule sets from the other corpora.

The results of steps 4 and 5 are:
	Words	AWords%	All%	Ambig%	exact%	extra%	missed%
a/a	10302	21.16	95.74	84.95	75.74	21.13	24.26
b/b	5752	17.32	96.97	92.07	74.99	26.08	25.01
ab/ab	16054	22.87	94.67	87.14	69.94	30.14	30.06
a/b	10302	30.61	2.54	83.54	46.19	52.28	53.81
a/ab	10302	21.29	95.13	86.18	73.67	27.96	26.33
b/a	5752	28.30	94.91	88.08	49.45	48.13	50.55
b/ab	5752	19.71	96.49	91.80	70.47	30.21	29.53
ab/a	16054	26.75	94.41	84.77	63.60	31.23	36.40
ab/b	16054	28.89	93.39	86.16	53.81	43.40	46.19

* The next thing to do is to produce a much reduced grammar of these phrases,
which we could take to a finite state fragment (expanding out phrases within
phrases). To get the evaluation correct on this, we first produce new corpora,
flata, flatb and flatab, which have only phrases of type N, N&, N+ or Nr, and
in which there are no nested phrases. The ones we retain are the ones which
look reasonable, i.e. some of the very long ones with a very complex internal
structure are deleted; and we do some general fiddling as well.
* Actually, this is really tedious and it is not clear what it will really
tell us afterwards. So halt it, I think, and start to write some of this up.


It would help if the FSM package were adapted to allow nested phrases. How
would we do this? On building a phrase, see if it can advance any of the FSMs
at the node at the left hand end of the phrase, and if so, then add the result
to the node at the right hand end of the phrase. Just pass a single hypothesis
in doing so, rather than a list as with the base hypotheses.
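
A minimal sketch of that extension (the FSM, Phrase and node bookkeeping here are all hypothetical simplifications of the real package):

```python
from collections import namedtuple

Phrase = namedtuple('Phrase', 'start end category')

class FSM:
    """Toy FSM: a dict of (state, symbol) -> next state."""
    def __init__(self, transitions):
        self.transitions = transitions
    def step(self, state, symbol):
        return self.transitions.get((state, symbol))

def advance_on_phrase(phrase, fsms_at):
    """On building a phrase, try its category against every FSM instance
    waiting at the node at the left-hand end; each one that advances is
    passed on (as a single hypothesis, not a list) to the node at the
    right-hand end, so phrases can nest inside larger FSM matches."""
    moved = [(fsm, fsm.step(state, phrase.category))
             for fsm, state in fsms_at.get(phrase.start, [])]
    moved = [(f, s) for f, s in moved if s is not None]
    fsms_at.setdefault(phrase.end, []).extend(moved)
    return moved
```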

20-5-93
-------
Done the above. So next do some experiments with just the above categories
using FSMs. When this is done, then go on to using some simplified (and
theoretical) ones, and then write it up.

First create FSMs. Do this in an automatic way to begin with, by just modifying
lanc.c to output one FSM state per step in a rule and one FSM per rule. Test
on this.
	Words	AWords%	All%	Ambig%	exact%	extra%	missed%
a/a	10302	31.76	95.23	89.39	47.86	46.50	52.14
b/b	5752	27.09	96.18	91.21	45.79	50.44	54.21
ab/ab	16054	34.65	94.11	89.61	42.25	52.28	57.75
a/b	10302	35.26	94.43	89.71	36.97	56.70	63.03
a/ab	10302	32.19	94.76	90.11	46.55	49.66	53.45
b/a	5752	33.80	95.39	89.87	37.71	55.43	62.29
b/ab	5752	29.80	95.69	90.84	43.53	52.37	56.47
ab/a	16054	36.40	94.31	88.83	40.65	51.10	59.35
ab/b	16054	36.40	94.20	89.94	36.82	55.59	63.18

(Note exact+missed = 100%)
??? why are these different ???
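
The automatic construction (one FSM per rule, one state per step) amounts to a linear chain of states; a sketch, where the representation is an assumption rather than lanc.c's actual output format:

```python
def rule_to_fsm(rhs):
    """Compile one rule's right-hand side into one linear FSM:
    state i, on seeing rhs[i], moves to state i+1; reaching state
    len(rhs) means the whole rule has matched."""
    return {(i, sym): i + 1 for i, sym in enumerate(rhs)}

def matches(fsm, n_states, symbols):
    """Run the chain over a tag sequence; True only on a full match."""
    state = 0
    for s in symbols:
        state = fsm.get((state, s))
        if state is None:
            return False
    return state == n_states
```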

---

Now we call a halt to this work, in order to write up some notes on this.

Thinking about it, I may actually want to change these evaluation criteria. We
have exact/(exact+extra+missed) etc. What I think we really want is
exact/(exact+missed), since exact+missed is the total number of phrase tags in
the original; also missed/(exact+missed). Also extra/(exact+extra), i.e. the
proportion of assigned tags which are superfluous.
Altered the code to do it. Fortunately, we can recover the figures we want
from what I output before. Done this and changed the tables above.
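
The revised measures can be computed directly from the raw counts; a minimal sketch (the counts in the test are hypothetical, since the tables above now report percentages):

```python
def bracket_scores(exact, extra, missed):
    """exact/(exact+missed): share of the original corpus brackets
    recovered (exact+missed is the total in the original).
    extra/(exact+extra): share of assigned brackets which are
    superfluous. Both returned as percentages."""
    recovered = 100.0 * exact / (exact + missed)
    superfluous = 100.0 * extra / (exact + extra)
    return round(recovered, 2), round(superfluous, 2)
```

With these definitions the recovered and missed percentages necessarily sum to 100, which is the exact+missed = 100% property noted under the table.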
