From cl.cam.ac.uk!larry.piano Fri Feb 16 21:57:48 1996
X-Mailer: exmh version 1.6.4+cl+patch 10/10/95
To: johnca@cogs.susx.ac.uk, Tung-Ho.Shih@cl.cam.ac.uk
cc: Larry.Piano@cl.cam.ac.uk
Subject: The Tagger
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 16 Feb 1996 21:59:14 +0000
From: Larry Piano <Larry.Piano@cl.cam.ac.uk>

John and Stone,

I put my latest version of the tagger in the normal place, ~ljp1003/myunk.  I 
also copied the old tagger to ~ljp1003/myunk/oldlabel if the new one really 
gives you trouble.  The new version may take somewhat longer than the old one; 
I don't know.  A sample command line is:

~ljp1003/myunk/label suste B1 C2 J/tmp/lar/sec.ftr H3 k8 K10 N Umyrules Y rsec 
m/homes/jac/corpora/sec.map O1 S>/tmp/lar/output 2>/tmp/lar/hypothesis

The /tmp/lar/sec.ftr file contains the analyzed features from the .lex file.
The /tmp/lar/output file is the tagger output, written on stdout.
The /tmp/lar/hypothesis file is known and unknown tag hypothesis information, 
written on stderr.

The .ftr and hypothesis files are controlled by the Y (debug) option.  The H 
option is the length of a suffix.  The k option is the maximum length of a 
prefix "cut" and the K option is the maximum length of a suffix "cut".  Cuts 
try to find valid prefixes and suffixes for word transitions.  For example, 
if "bag" was an unknown word and "bags" was in the lexicon as a known word, 
the tagger would learn that cutting "s" off a word sometimes changes it into 
a noun and sometimes into a verb.  That's just a quick explanation, but the 
larger the k and K numbers are, the longer it's likely to take, because more 
endings are being analyzed.  Let me know about any problems.
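To make the "cut" idea concrete, here is a rough sketch in Python (hypothetical code, not the tagger's actual source; the rule table and lexicon are toy examples) of how cut rules learned from the lexicon could generate tag hypotheses for an unknown word like "bag":

```python
# Toy cut rules: cutting this ending off a known word tends to map its
# tag as shown (illustrative values, not learned from a real corpus).
CUT_RULES = {"s": {"NNS": "NN", "VBZ": "VB"}}
LEXICON = {"bags": {"NNS", "VBZ"}}  # toy lexicon of known words

def cut_hypotheses(word, lexicon, rules, max_cut=3):
    """Hypothesize tags for an unknown word by finding known words that
    extend it by a short ending, then applying the learned tag mapping."""
    hyps = set()
    for ending, tagmap in rules.items():
        if len(ending) <= max_cut and (word + ending) in lexicon:
            for tag in lexicon[word + ending]:
                if tag in tagmap:
                    hyps.add(tagmap[tag])
    return hyps
```

Here cut_hypotheses("bag", LEXICON, CUT_RULES) yields {"NN", "VB"}: the unknown word "bag" is hypothesized to be sometimes a noun and sometimes a verb, exactly as in the example above.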

Thanks,

Larry

P.S.  I'll be away next week, so I'll answer mail as soon as I get back.


From cl.cam.ac.uk!larry.piano Thu Apr  4 17:06:26 1996
X-Mailer: exmh version 1.6.4+cl+patch 10/10/95
To: johnca@cogs.susx.ac.uk, ejb@linc.cis.upenn.edu
cc: Larry.Piano@cl.cam.ac.uk
Subject: Tagger
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Thu, 04 Apr 1996 17:06:42 +0100
From: Larry Piano <Larry.Piano@cl.cam.ac.uk>

Dear John and Ted,

Not that it'll make a grand bit of difference to anyone, but I can at least 
leave here knowing that the tagger got 85% correct on unknown words on at 
least one corpus! - which was my goal 'way back when.  Anyway, here are the 
tagger results on Susanne, Penn Treebank and LOB, just for your information.  
I ran the Penn TB corpus two ways: 

	1.  Both training corpus and test corpus were run through John's sed script 
to insert *'s on sentence-initial words.

	2.  Only the test corpus was altered with the * sed script.  This has the 
effect of turning known words into "unknown" ones, which drives them through 
some of the unknown word analysis procedures.  This method increases the 
total number of words correctly tagged by only about 150, but it is an 
increase.
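John's actual sed script isn't shown here, but the * preprocessing amounts to something like the following Python sketch (an assumption on my part: it treats the input as plain running text, whereas the real corpora are presumably tokenized):

```python
import re

def star_initial(text):
    # Prefix the first word of the text, and the first word after each
    # sentence-ending punctuation mark, with '*' so sentence-initial
    # capitalization is marked rather than taken at face value.
    return re.sub(r'(^|[.?!]\s+)(\w)', r'\1*\2', text)
```

For example, star_initial("The cat sat. Dogs bark.") gives "*The cat sat. *Dogs bark.".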

One thing that was particularly helpful was using "variable suffixes".  I 
didn't like getting the errors caused by always taking a set number of letters 
for a suffix, so I made it vary, as follows:

	- if the word is "long enough" for a suffix and ends in a consonant, scan 
back until a vowel is found, then back to the first vowel of that vowel 
group.  Examples: learned -> -ed, documentation -> -ion, parking -> -ing, 
languages -> -uages.

	- if the word ends in a vowel, scan back through the final vowels to a 
consonant, then back to a vowel, then to the first vowel of that group.  
Examples: maintenance -> -ance, licence -> -ence, charisma -> -isma, 
probable -> -able, necessary -> -ary, strictly -> -ictly (well, it's not 
perfect).
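One plausible reading of those two rules can be sketched in a few lines of Python (hypothetical code, not the tagger's actual source; it treats y as a vowel). It reproduces most of the examples above, though it yields -es rather than -uages for "languages", so the real rule evidently scans one group further in some cases:

```python
VOWELS = set("aeiouy")

def variable_suffix(word):
    """Sketch of the variable-suffix heuristic: skip any trailing vowel
    run, strip back through the consonant run, then include the
    preceding vowel run in the suffix."""
    i = len(word) - 1
    if word[i] in VOWELS:
        while i >= 0 and word[i] in VOWELS:    # trailing vowel run
            i -= 1
    while i >= 0 and word[i] not in VOWELS:    # consonant run
        i -= 1
    while i >= 0 and word[i] in VOWELS:        # preceding vowel run
        i -= 1
    return word[i + 1:]
```

This gives learned -> ed, documentation -> ion, parking -> ing, maintenance -> ance, licence -> ence, charisma -> isma, probable -> able, necessary -> ary, strictly -> ictly.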
 
My opinion of tagging, after all this, is that it is fun but quirky.  The 
supposedly hand-tagged training corpora are very inconsistently tagged and/or 
contain a lot of "noise".  Also, even if an unknown word's "correct" tag has 
the highest score upon entering the labelling phase, the tagger will often 
choose another, even the lowest scored tag.  Such is the uncertain world of 
probability.
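That effect is easy to reproduce with a toy example (hypothetical numbers and tags, not the tagger's actual model): the contextual transition probabilities can outweigh a word's lexical scores, so the best path picks a lower-scored tag hypothesis.

```python
def viterbi(words, tagsets, start_p, trans, emit):
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {t: (start_p.get(t, 0.0) * emit.get((t, words[0]), 0.0), [t])
            for t in tagsets[0]}
    for w, tags in zip(words[1:], tagsets[1:]):
        best = {t: max((best[p][0] * trans.get((p, t), 0.0) *
                        emit.get((t, w), 0.0), best[p][1] + [t])
                       for p in best)
                for t in tags}
    return max(best.values())[1]

# "plans" scores highest as NN (0.6 vs 0.4), but the strong PP -> VB
# transition wins, so the lower-scored VB hypothesis is chosen.
path = viterbi(["she", "plans"], [["PP"], ["NN", "VB"]],
               {"PP": 1.0},
               {("PP", "NN"): 0.2, ("PP", "VB"): 0.8},
               {("PP", "she"): 1.0, ("NN", "plans"): 0.6,
                ("VB", "plans"): 0.4})
```

The returned path is ["PP", "VB"] even though NN had the higher lexical score for "plans".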

Also, at the end I list some debug output which has allowed me to analyze 
problems and really helped me to improve the results.  The first part is the 
accounting of the success of each word feature category and the second is 
which words failed for each of the categories.  I will write this all up, 
whether it's before I leave or after!

Anyway, I won't be in tomorrow, Good Friday, so I hope you have a good Easter.

Larry
------------------------------------------------
Susanne
-------

Class                                              Total      Correct
=====                                              =====      =======
All words                                          21192      20263  (95.62%)
Known words                                        19877      19140  (96.29%)
Unknown words ( 6.21%)                             1315       1123   (85.40%)
All ambiguous words (45.60%)                       9664       8945   (92.56%)
Known ambiguous words                              9165       8563   (93.43%)

Average observation probability 7.85128e-05
Perplexity -12.6126


Penn Treebank - training and test w/*'s
-------------

Class                                              Total      Correct
=====                                              =====      =======
All words                                          754863     727701 (96.40%)
Known words                                        744643     719463 (96.62%)
Unknown words ( 1.35%)                             10220      8238   (80.61%)
All ambiguous words (65.23%)                       492392     467860 (95.02%)
Known ambiguous words                              488718     465323 (95.21%)

Average observation probability 1.67296
Perplexity -8.72149


Penn Treebank - test only w/*'s
-------------

Class                                              Total      Correct
=====                                              =====      =======
All words                                          754863     727851 (96.42%)
Known words                                        713880     690013 (96.66%)
Unknown words ( 5.43%)                             40983      37838  (92.33%)
All ambiguous words (63.82%)                       481774     457753 (95.01%)
Known ambiguous words                              475971     453652 (95.31%)

Average observation probability 1.73597
Perplexity -8.51928


LOB
---

Class                                              Total      Correct
=====                                              =====      =======
All words                                          282388     268734 (95.16%)
Known words                                        270828     259734 (95.90%)
Unknown words ( 4.09%)                             11560      9000   (77.85%)
All ambiguous words (58.47%)                       165108     153693 (93.09%)
Known ambiguous words                              159897     150140 (93.90%)

Average observation probability 0.00151034
Perplexity -10.3414

---------------------------------------------------------
Debug Output:

Incorrect Words
---------------

.. (repeating chars)    	can..  	 	MD      NN
3 letters (all capital) 	*GET    	VB      NP
3 letters (all capital) 	*HAD    	VBD     NP
3 letters (all capital) 	TOW     	NN      NP
3 letters (capital)     	AMs     	NNS     NP
3 letters (capital)     	Aaa     	JJ      NP
3 letters (capital)     	GMr     	NN      NP
3 letters (capital)     	Hee     	UH      NP
3 letters (capital)    		Hoe     	VB      NP
.
.
.
added init w/lowercase  	Approval 	NP      NN
added init w/lowercase  	Ardent  	NP      JJ
added init w/lowercase  	Billiards      	NP      NN
added init w/lowercase  	Blunt   	NP      JJ
added init w/lowercase  	Blunt   	NP      JJ
added init w/lowercase  	Blunt   	NP      JJ
.
.
.
downcased allcaps       	AFTERMATH      	NP      NN
downcased allcaps       	ALL     	NP      RB
downcased allcaps       	ANNUITY 	NP      NN
downcased allcaps       	APPELLATE      	NN      JJ (correct in hyp)
downcased allcaps       	ARRESTED       	VBD     VBN
.
.
.
true capital special cut        Gaslight        NN      NP
true capital special cut        Guardsmen       NPS     NP
true capital special cut        Halliburton     NN      NP
true capital special cut        Irishmen        NPS     NP


Word Features
-------------

Unknown Word Features            Total      Correct         Corr-in-hyp
====================             =====      =======         ===========
Words with features              40983      37838  (92.33%)    1308   (95.52%)
Match with Rules                 20         14     (70.00%)    3      (85.00%)
No Match after Rules             106        62     (58.49%)    26     (83.02%)
all capitals                     54         52     (96.30%)    0      (96.30%)
all capitals -                   1          1      (100.00%)   0      (100.00%)
allcaps-allcaps                  9          6      (66.67%)    1      (77.78%)
allcaps-lowercase                22         16     (72.73%)    5      (95.45%)
capital-capital                  120        103    (85.83%)    6      (90.83%)
cardinal number                  4          4      (100.00%)    0     (100.00%)
cmprsd capital                   2          1      (50.00%)    0      (50.00%)
cmprsd down capital              5          4      (80.00%)    0      (80.00%)
compressed word                  76         58     (76.32%)    4      (81.58%)
.
.
.
existing capital word            18         15     (83.33%)    2      (94.44%)
existing mix word                20         15     (75.00%)    0      (75.00%)
existing separator ending        1437       1118   (77.80%)    163    (89.14%)
initial capital container cut    4          4      (100.00%)    0     (100.00%)
initial capital prefix cut       1          1      (100.00%)    0     (100.00%)
initial capital repl suffix cut  8          8      (100.00%)    0     (100.00%)
initial capital root cut         2          1      (50.00%)    0      (50.00%)
initial capital suffix           2          2      (100.00%)    0     (100.00%)
initial capital suffix cut       12         10     (83.33%)    1      (91.67%)
lowercase cmpnd special cut      191        161    (84.29%)    9      (89.01%)
lowercase container cut          344        268    (77.91%)    27     (85.76%)
lowercase prefix cut             335        273    (81.49%)    18     (86.87%)
lowercase repl suffix cut        349        290    (83.09%)    19     (88.54%)
lowercase root cut               50         33     (66.00%)    9      (84.00%)
lowercase smart prefix cut       145        122    (84.14%)    10     (91.03%)
lowercase smart suffix cut       282        260    (92.20%)    11     (96.10%)
lowercase special cut            510        411    (80.59%)    30     (86.47%)
lowercase suffix                 31         14     (45.16%)    4      (58.06%)
lowercase suffix cut             1261       1085   (86.04%)    86     (92.86%)
number-number                    4          3      (75.00%)    0      (75.00%)
number-string                    145        134    (92.41%)    2      (93.79%)
ordinal number                   11         5      (45.45%)    0      (45.45%)
plain -                          47         14     (29.79%)    10     (51.06%)
plain /                          1          0      ( 0.00%)    0      ( 0.00%)
.
.
.
true capital                     113        111    (98.23%)    0      (98.23%)
true capital -                   4          4      (100.00%)    0     (100.00%)
true capital cmpnd special cut   19         13     (68.42%)    1      (73.68%)
true capital container cut       117        108    (92.31%)    0      (92.31%)
true capital prefix cut          86         81     (94.19%)    0      (94.19%)
true capital repl suffix cut     189        182    (96.30%)    4      (98.41%)
true capital root cut            5          5      (100.00%)    0     (100.00%)
true capital smart prefix cut    2          2      (100.00%)    0     (100.00%)
true capital smart suffix cut    43         38     (88.37%)    2      (93.02%)
true capital special cut         357        344    (96.36%)    1      (96.64%)
true capital suffix              94         86     (91.49%)    1      (92.55%)
true capital suffix cut          394        335    (85.03%)    47     (96.95%)
variable down allcaps suffix     3          3      (100.00%)    0     (100.00%)
variable down capital suffix     4          1      (25.00%)    0      (25.00%)
variable down init capital suffix 27         16     (59.26%)    7     (85.19%)
variable initial capital suffix  9          9      (100.00%)    0     (100.00%)
variable lowercase suffix        382        284    (74.35%)    17     (78.80%)
variable separator capital suffix 16         13     (81.25%)    0      (81.25%)
variable separator lowercase suffix 13       12     (92.31%)    1      (100.00%)
variable true capital suffix     677        639    (94.39%)    7      (95.42%)



From cl.cam.ac.uk!larry.piano Fri Apr 12 14:15:31 1996
X-Mailer: exmh version 1.6.4+cl+patch 10/10/95
To: Tung-Ho.Shih@cl.cam.ac.uk, johnca@cogs.susx.ac.uk
cc: Larry.Piano@cl.cam.ac.uk
Subject: Tagger
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Fri, 12 Apr 1996 14:15:52 +0100
From: Larry Piano <Larry.Piano@cl.cam.ac.uk>

John and Stone,

My last attempt at the tagger is now in ~ljp1003/myunk/label.  As before, I 
copied label to oldlabel, in case you have problems.  If you've been using the 
h option in your command line, take it out.  The h option is now used to 
indicate the maximum number of unknown words expected in the corpus.  The 
default is 100,000.  I need this number since I store each unknown word in a 
list after the tagger assigns tag hypotheses, so it doesn't have to redo a 
"known" unknown word.  It speeds things up if there are a lot of repeats.
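A sketch of that store (hypothetical Python, not the tagger's actual source), where the h option's value bounds how many analyzed unknown words are kept:

```python
class UnknownWordCache:
    """Store hypotheses for already-seen unknown words so repeats skip
    the expensive unknown-word analysis."""

    def __init__(self, max_words=100_000):   # the h option's default
        self.max_words = max_words
        self.hyps = {}

    def get(self, word, analyze):
        # Reuse the hypotheses for a "known" unknown word if we have them.
        if word in self.hyps:
            return self.hyps[word]
        result = analyze(word)
        if len(self.hyps) < self.max_words:
            self.hyps[word] = result
        return result
```

With this, the analysis function runs once per distinct unknown word, which is what speeds things up when the corpus contains a lot of repeats.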

Let me know if you have any problems.

Larry


