Detailed errors analysis
========================

Here I attempt to analyse some of the major error classes from a self test on
b and l (see file Errors) in greater detail, and with more context. The file
of errors, annotated with back references to this one, is in out/v7b-e.

Comment: the earlier figures need to be taken with a pinch of salt - I am
finding some errors which are not in the earlier file. Having a better
analysis tool (and also being more careful) should help.

Notational convention: _word marks the incorrectly tagged word of interest
(* cannot be used, as it is a LOB special character).

For error classes, Tag1/Tag2 means a word with correct tag Tag1 was assigned
Tag2.

What is the goal of this analysis? To attempt to discover (a) the distribution
of errors and (b) some ways in which they might be corrected; in particular
whether FSM/sub-models etc might help.

In a number of places, I have tried to attribute the error to either a
lexical factor or a transition one. What I mean by this is that, given the
correct tag C and the chosen tag T, and the preceding and following tags s and
e, if the transition strength of the sequence s:C:e is similar to that of
s:T:e, then the lexical probabilities have the major effect; if not, it is the
transition probabilities that matter. The former might be mended by smoothing
the lexicon, the latter by adding phrasal information, in some cases at least.
(These are hypotheses to be tested. Note also that this is a fairly
approximate test, and other factors, like ambiguity on the preceding and
following words, might obscure it. Deciding what constitutes "similar" is also
a bit subjective, so the ratios are recorded.)
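
A rough sketch of the test just described, in Python. It rests on several
assumptions of mine: that "transition strength" of s:x:e means the product of
the two bigram transition probabilities s:x and x:e; that two strengths count
as "similar" when their ratio is within a factor of 2; and that the ratio is
reported as chosen over correct. The probabilities in toy_trans are made up
for illustration and are not taken from the tagger.

```python
def sequence_strength(trans, s, x, e):
    """Strength of the tag sequence s:x:e (assumed: product of two bigrams)."""
    return trans[(s, x)] * trans[(x, e)]


def classify_error(trans, s, correct, chosen, e, similar_within=2.0):
    """Label an error as 'lexical' or 'transition' and return the ratio.

    The ratio is strength(s:chosen:e) / strength(s:correct:e). If it lies
    within [1/similar_within, similar_within] the two sequences are treated
    as similar, so the lexical probabilities are blamed.
    """
    ratio = (sequence_strength(trans, s, chosen, e)
             / sequence_strength(trans, s, correct, e))
    if 1.0 / similar_within <= ratio <= similar_within:
        return "lexical", ratio
    return "transition", ratio


# Hypothetical transition probabilities, for illustration only.
toy_trans = {
    ("NN", "VBD"): 0.02, ("VBD", "IN"): 0.2,
    ("NN", "VBN"): 0.01, ("VBN", "IN"): 0.3,
}

kind, ratio = classify_error(toy_trans, "NN", "VBD", "VBN", "IN")
print(kind, round(ratio, 2))
```

The threshold is deliberately a parameter, since what counts as "similar" is
the subjective part; recording the ratio alongside the label keeps the
judgement open to revision.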


Methodological point: how can you separate lexical from transitional factors?
You can't really. I think the thing to do is to look at the context, in the
sense of the preceding and following tags, and look for regularities within
that, because that is the sort of thing that using FSMs might fix.

Notation: X _ Y indicates the context, giving the *correct* tags of the words
on either side. The end of the line gives correct/chosen. l means the category
to the left was ambiguous, r means the one to the right was, b means both
were, n means neither was.
(* Haven't done this everywhere yet *)
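
For a better analysis tool, a line in this notation could be split into its
parts roughly as follows. This is only a sketch: it assumes the fields are
separated by tabs, as in sections 3 and 7 of this file, and the names
ErrorLine and parse_error_line are my own.

```python
import re
from typing import NamedTuple, Optional


class ErrorLine(NamedTuple):
    text: str                  # example text, with _word marking the error
    left: str                  # correct tag of the preceding word (X)
    right: str                 # correct tag of the following word (Y)
    correct: str               # correct tag of the marked word
    chosen: str                # tag the tagger chose
    ambiguity: Optional[str]   # l, r, b or n; None where not yet annotated


def parse_error_line(line: str) -> ErrorLine:
    # Split on runs of tabs, then drop remarks such as "(occurs twice)".
    fields = [f.strip() for f in re.split(r"\t+", line) if f.strip()]
    fields = [f for f in fields if not f.startswith("(occurs")]
    if len(fields) != 3:
        raise ValueError(f"expected text, context and tags: {line!r}")
    text, context, tags = fields
    left, right = context.split(" _ ")
    pair, *flag = tags.split()
    correct, chosen = pair.split("/")
    return ErrorLine(text, left, right, correct, chosen,
                     flag[0] if flag else None)


example = parse_error_line(" as a _whole but\t\t\t\t\tAT _ CC\t\tNN/JJ b")
print(example)
```

Lines that lack the context or the ambiguity flag raise an error or give
None respectively, which makes the unannotated cases easy to find.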


1. Verb/verb errors
-------------------
Larger classes: VBD/VBN (25), VBN/VBD (21).

(a) VBD/VBN
 CC _ ATI (1:0.34)
without a shot but _left the masses confused				VBD/VBN

 CC _ IN (1:3.2)
and _settled in Kettering						VBD/VBN

 CC _ RB
this situation has existed and _worked well				VBN/VBD

 DT _ ,
^ this said ,								VBN/VBD

 NN _ AT
the British government has from the beginning _sought a			VBN/VBD

 NN _ IN (1:2.2)
the check point _arrived with a flourish				VBD/VBN
the sun _set at about 5.34						VBD/VBN
the awful bloomer the producers of the film _made in showing		VBD/VBN
criticises the emergency resolution _passed by the			VBN/VBD

 NN _ RB
the inheritance _received rather than					VBN/VBD

 NN _ . (1:3.1)
some worthy cause _benefited .						VBD/VBN
the attempt _failed .							VBD/VBN
night's rest _destroyed .						VBN/VBD

 NN _ , (1:1.4)
which her husband _paid ,						VBD/VBN

 NNS _ IN (1:3.0) 
the anti-Liberal papers _suffered from an epidemic			VBD/VBN
guns _pointed at							VBN/VBD
whose gravity during the past few days _led to 				VBN/VBD
^ subways _preferred to baths						VBN/VBD

 NNS _ . (1:4.2)
under which the liberal candidates _worked .				VBD/VBN

 NNS _ ?
why were these little children _killed ?				VBN/VBD

 NP _ IN
is Wilensky _justified in being						VBN/VBD

 PP3 _ IN
I have heard it _said that						VBN/VBD
what has it _made of the Congo						VBN/VBD

 PPLS _ . (1:6.3)
which they themselves _lacked .						VBD/VBN

 RB _ ABL (1:1.8)
no conference decision ever _justified such action .			VBD/VBN

 RB _ AT
has Moscow yet _got a firm foothold					VBN/VBD

 RB _ ATI (1:1.3)
never _heard the slightest rumour					VBD/VBN
he has never _signed the cheque						VBN/VBD

 RB _ CS
it is often _said that							VBN/VBD
I have always _said that						VBN/VBD

 RB _ PP$ (ratio VBD:VBN 1:1.2)
decisively _raised their status
never _lost her affection						VBD/VBN

 RB _ RP (1:5.3)
the conservative political centre report on consumer protection recently
	_pointed out							VBD/VBN

 RB _ TO (1:6.8)
I always _used to hear							VBD/VBN
voluntarily _agreed to maintain strict control				VBD/VBN
so often _managed to succeed						VBD/VBN

 RB _ , (1:7.3)
an editor who once _proclaimed ,					VBD/VBN

 RN _ JJ (1:3.9)
and then _placed indefensible restrictions				VBD/VBN

 RN _ TO
have now _decided to go forward						VBN/VBD

 , _ AT
, _made a number of							VBN/VBD

 , _ CS (1:1.1)
, _proposed that the nation						VBD/VBN

 , _ NP
, _elected \0M \0P							VBN/VBD

 , _ PP3 (1:1.2)
, _made it very unlikely ;						VBD/VBN

 , _ QL (1:2.0)
, _put as good a face on it						VBD/VBN

 , _ RB (1:1.4)
the shipbuilders , however, _put forward				VBD/VBN

 *- _ NR
West Germany *- _followed yesterday by the Dutch *- has made		VBN/VBD



2. Verb/noun errors
-------------------
In these cases a verb is mistakenly labelled as a noun.

Comment: VB is the base form of a verb and VBZ the 3sg form.
Error classes: VB/NN (30), VBG/NN (9), VBN/NN (1), VBZ/NNS (6)

(a) Coordination
Lexical:
 to curtail and _guillotine debate					VB/NN
 they either know or _care about our terms				VB/NN
 seek and _share guidance						VB/NN
 to meet and _talk							VB/NN
 he just listens and _looks						VBZ/NNS

Phrasal:
 have a good time here and _study what is				VB/NN
 return from the forces at the age of 20 and _work for him		VB/NN

(b) Bare infinitive
 to help _finance national defence					VB/NN

(c) Number mismatch.
1. Inversion, introducing an auxiliary.
 why does Cheltenham _need *+230,000					VB/NN
2. Use of a syntactically singular noun as plural.
 the sporting public _hope that						VB/NN
3. Intervening relative clause or PP.
 what the great masses of ordinary people in the world _desire most	VB/NN
 aspects of income taxation _worry people				VB/NN
4. Intervening punctuation.
 when parents , and teachers , and children , _work as a team		VB/NN
5. "Quasi-subjunctive"
 may his tribe _increase	(occurs four times)			VB/NN

(d) Negation
 they might not _work .							VB/NN

(e) "that".
Each of these is a different use of "that"
1. Relative pronoun.
 decision are taken that _lead to lamentable waste			VB/NN
2. Demonstrative pronoun.
 if you want evidence of that _look at the wrangle			VB/NN
3. that-clause object
 publish the rebate scales used by \0f.h.a members and _state that they adhere
									VB/NN
(f) "Idiomatic" expressions.
 *' _mind if I give you							VB/NN

(g) Ellipsis.
1. unskilled jobs are reasonably well-paid and many _look attractive	VB/NN
2. (Less sure about this one)
 for these reasons it is usually argued that the first move towards a scaled
tax should be to modify the purchase-tax system into a uniform percentage rate
tax and that this should be extended, if administratively possible and right
in principle to _tax .							VB/NN

(h) What is a noun? These are cases which I think are correctly nouns, and
which the tagger tags as nouns, but which LOB tags as verbs. (* This is a
question of what is the right tag; of course, if there were enough instances
of them tagged in this way in the training data, then the verbal tag would
have been assigned. *)
1. V-ing.
 as the national hunt festival _meeting approaches			VBG/NN
 I cannot help the _feeling that the					VBG/NN
 _training in communication should 					VBG/NN
 _farming is Britain's most vital					VBG/NN
2. Compound nouns.
 this was shown at the Scottish _trades union congress			VBZ/NNS
 a member of the Oxford tutorial _classes committee			VBZ/NNS
3. Others.
 why not a local sales _tax ?						VB/NN
 
(i) "living" + place.
 a letter from an 83-years-old lady _living near Sleaford		VBG/NN
 by a lady _living in James-street					VBG/NN
 a reader _living on Yaroborough-crescent				VBG/NN
 a woman _living in Broadway						VBG/NN


(0) Not yet classified:
make their blood _run cold						VB/NN
to do more than _influence the position					VB/NN
he had known her _stop at least seven					VB/NN
all other countries _control immigration				VB/NN
in order to _increase exports						VB/NN
to do except _stand firm						VB/NN
the agreement to stop _building	bombers					VBG/NN
the honours examination question at present _set is in fact		VBN/NN
what we want is a society where the individual _matters ,		VBZ/NNS
the discussion of religion and politics _figures in			VBZ/NNS
while the goods which Germany _exports will be				VBZ/NNS


3. Noun/adjective errors
------------------------
Noun mistakenly classified as an adjective: NN/JJ (50), NN/JJB (10).
Many of these seem to be associated with specific lexical items or phrases;
listed under (b) below.

(a) Co-ordination.
Lexical
 left , _right & centre		(occurs twice)		, _ CC		NN/JJ n

*** Need further analysis here ...
(b) Lexical probabilities. These are contexts where either a NN or a JJ looks
reasonable syntactically, but where the tagger chooses JJ. It is probably the
lexical probability which is dominating.

1. "as a whole". JJ = 0.0077, NN = 0.0006
 as a _whole .			(occurs twice)		AT _ .		NN/JJ l
 as a _whole but					AT _ CC		NN/JJ b
2. "present". JJ = 0.0077, NN = 0.0011.
 are at _present over					IN _ IN		NN/JJ b
 at _present ,						IN _ ,		NN/JJ l
 at _present set					IN _ VBN	NN/JJ b
 at _present net company profits			IN _ JJB	NN/JJ b
3. "right". JJ = 0.0052, NN = 0.0013.
 has the _right to dictate policy			ATI _ TO	NN/JJ l
 about her _right to make this change			PP$ _ TO	NN/JJ b
 Britain's _right to change				NP$ _ TO	NN/JJ r
 the _right to be treated				ATI _ TO	NN/JJ r
 every _right to blame					AT _ TO		NN/JJ r
4. Others (1).
 uprooted the stakes protecting the _square and ripped	ATI _ CC	NN/JJ r
 the traders and _public at large			CC _ RB		NN/JJ b
 the sporting _public hope that				JJ _ VB		NN/JJ b
 fully indoctrinated _communist .			VBN _ .		NN/JJ n
 himself a _communist ,					AT _ ,		NN/JJ l
 as a _daily in the					AT _ IN		NN/JJ b
 is his _due .						PP$ _ .		NN/JJ l
 ^ _editorial .						^ _ .		NN/JJ n
 a blundering _general ,				JJ _ ,		NN/JJ n
 the family influence for _good is not			IN _ BEZ	NN/JJ l
 in the _main .						ATI _ ,		NN/JJB r
 a *+20 _minimum )					NNU _ )		NN/JJ l
 a _native of Kansas					AT _ IN		NN/JJ r
 into the _open ,					ATI _ ,		NN/JJ n
 a _total of						AT _ IN		NN/JJ b
 70 *' _plus ,						*' _ ,		NN/JJ n
 five prescription a _head last year			AT _ AP		NN/JJB b
 no _liberal or tory					ATI _ CC	NN/JJ l
 have stood _firm ,					VBN _ ,		NN/JJ n
 liquidation of the _firm to pay death duties		ATI _ TO	NN/JJ r

(c) Noun compounding (errors within the noun phrase, not in the last word of
it). Noun compounds are significantly dispreferred: the NN:NN transition
probability is 0.00565753, compared to JJ:NN of 0.546667. So a sequence such
as AT NN NN is nearly always dispreferred to AT JJ NN, unless the lexical
probabilities dominate (which would take a large number of NN occurrences).
Note that in some of these cases, the lexical probability also contributes to
the problem, as it does in (b).
Comment: some of these might as well have been called adjectives as nouns;
also see (d).
1. "living".
 the pressure on _living space				IN _ NN		NN/JJ l
 a _living standard					AT _ NN		NN/JJ b
 works for a _living wage				AT _ NN		NN/JJ l
 to improve _living standards				VB _ NNS	NN/JJ r
*2. "tory"
 unreasoning _tory prejudice				JJ _ NN		NN/JJ n
 a single _tory \0M.P.					JJ _ NPT	NN/JJ n
 the present _tory government				JJ _ NN		NN/JJ r
3. "post"
 the _post office	(occurs six times)		ATI _ NN	NN/JJB n
 our _post offices					PP$ _ NNS	NN/JJB n
4. Others.
 a society where the _individual matters ,		ATI _ VBZ	NN/JJ r
 the _minimum wage					ATI _ NN	NN/JJ n
 the name of _working solidarity			IN _ NN		NN/JJ l
 a reduction of _working hours				IN _ NNS	NN/JJ l
 there are ten _swimming baths				CD _ NNS	NN/JJ n
 the 88-year-old _standard bearer			JJB _ IN	NN/JJ n
 in _sporting circles					IN _ NNS	NN/JJ l
 ^ _cold war front .					^ _ NN		NN/JJ n
 the _maximum use of					ATI _ NN	NN/JJ r
*** This could be fixed up by a phrasal NP recogniser.
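
To put a number on "a large number of NN occurrences": the sketch below
computes how far the lexical probabilities would have to favour NN before
AT NN NN could beat AT JJ NN. It uses the transition figures quoted in this
file (including the VB:NN figure from 5(c)), and it assumes, as a
simplification of mine, that the transitions from the preceding tag into NN
and into JJ are about equal, so that only the transition into the final NN
and the lexical probability of the middle word differ.

```python
# Transition probabilities into a following NN, as quoted in these notes.
NN_NN = 0.00565753
JJ_NN = 0.546667
VB_NN = 0.0434348


def required_lexical_ratio(p_rival, p_wanted):
    """Factor by which the lexical probability must favour the wanted tag.

    The wanted reading wins only if its lexical advantage exceeds the
    transition advantage of the rival reading (other factors held equal).
    """
    return p_rival / p_wanted


jj_factor = required_lexical_ratio(JJ_NN, NN_NN)
vb_factor = required_lexical_ratio(VB_NN, NN_NN)
print(f"NN must beat JJ lexically by a factor of about {jj_factor:.1f}")
print(f"NN must beat VB lexically by a factor of about {vb_factor:.1f}")
```

On these figures the JJ factor is a little under 100, which is why a handful
of NN occurrences in the lexicon is not enough.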

(d) Long-range syntactic structure.
 although he is no longer a titular _chief Albert Luthuli is
							JJ _ NP		NN/JJB n
 whatever _light God has given them			WDT _ NP	NN/JJ n

(e) LOB idiosyncrasies. In these cases, it is not clear that the LOB analysis
as a NN is really preferable to JJ.
 the contributions , _total and fraction ,		, _ CC		NN/JJ r


*** perhaps a way of reducing the problems with specific lexical items is to
uniformise (or weaken the effect of) the word probabilities.
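
One possible way of weakening the word probabilities is sketched below. The
scheme (raising each lexical probability to a power alpha between 0 and 1) is
an assumption of mine, not something the tagger does at present; alpha = 1
leaves the lexicon as it is and alpha = 0 makes it uniform. The figures are
those given for "whole" in 3(b).

```python
def weakened(p, alpha):
    """Lexical probability with its effect reduced by the exponent alpha."""
    return p ** alpha


def jj_nn_ratio(lexical, alpha):
    """How strongly the weakened lexicon prefers JJ over NN."""
    return weakened(lexical["JJ"], alpha) / weakened(lexical["NN"], alpha)


# Lexical probabilities for "whole", from 3(b) above.
whole = {"JJ": 0.0077, "NN": 0.0006}

for alpha in (1.0, 0.5, 0.0):
    print(f"alpha={alpha}: JJ:NN = {jj_nn_ratio(whole, alpha):.2f}")
```

Whether any alpha weakens the lexicon enough to fix these cases without
breaking others is exactly the hypothesis that would need testing.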

4. Adjective/noun
-----------------
Adjective mistakenly classified as a noun.
JJ/NN (19), JJB/NN (4)

they have been impressed by _light , airy schools		JJ/NN
given to _public as against private interests			JJ/NN
for the _living , not the dead					JJ/NN
the first _resident tutor					JJ/NN
you will find obvious _manual workers				JJ/NN
the many fine _swimming pools					JJ/NN
their _left or centre supporters				JJ/NN
a situation _ideal for exploitation				JJ/NN
the first _humanitarian impulse					JJ/NN
put the economy _right .					JJ/NN
the apparently static common market _due in turn to the growth of	JJ/NN
if administratively possible and _right in principle		JJ/NN
the stop-watch _manufacturing methods				JJ/NN
it is just as interested in the _living .			JJ/NN
contrast his _firm , successful rule				JJ/NN
where there is little _light industry				JJ/NN
there is little for the west to do except stand _firm .		JJ/NN
to make their blood run _cold .					JJ/NN
the _patient efforts of \0UN conciliators			JJ/NN
an urgent _post world war 2 problem				JJB/NN
the immediate _post war years					JJB/NN
the _advance orders ,						JJB/NN
more than their left or _centre supporters			JJB/NN

5. Noun/verb errors
-------------------
Noun mistakenly classified as a verb.
NN/VB (19), NN/VBG (10), NNS/VBZ (8), NN/VBN (2), NN/VBZ (1)

(a) Co-ordination
 the issue is touch and _go .						NN/VB
 no collecting and _recording of monthly instalments			NN/VBG
 algebra and _set theory						NN/VB
 the integrity and _standing of their customers				NN/VBG
 his name and _address *'						NN/VB

*** look into probs ***
(b) Lexical probabilities. Similar to noun/adj problem.
 ^ telephone _calls from a man						NNS/VBZ
 the pay _claims of the few .						NNS/VBZ
 head the government _coming under fire					NN/VBG
 marked improvement in _driving standards				NN/VBG
 from incomes and _earning (						NN/VBG
 due to _fear of immigration controls					NN/VB
 will strengthen the government now in office and _lead to a 		NN/VB
 comparisons of _living are difficult					NN/VBG
 by this _means the council						NN/VBZ
 for _teaching purposes							NN/VBG
 the Chelsea flower _show so that					NN/VB
(* Comment: lack of agreement also affects the above one *)
 ^ _left , right & centre 		(occurs twice)			NN/VBN
 ^ _mention of this to the						NN/VB
(* These last two may argue for taking away start of sentence uniformising
(which is a bit silly anyway) *)

*** look into probs ***
(c) Noun compounding problems (within noun phrase). Similar to noun/adj. VB:NN
is 0.0434348, which is still nearly an order of magnitude more than NN:NN
(0.00565753), though rather less bad than JJ:NN (0.546667).

(d) Errors and idiosyncrasies in LOB.
 the local situation _calls for an exceptional ruling			NNS/VBZ
 make the health service _look as if 					NN/VB
 the local health authority _looks after maternity services		NNS/VBZ
 reports from Salisbury _show that					NN/VB

(e) Linguistic oddities
1. Discourse markers (?)
 no _wonder the tory rebels are in uproar				NN/VB
 no _wonder some returning immigrants					NN/VB
2. Headlines
 ^ _building bricks .							NN/VBG
 ^ _exports on a plateau .						NNS/VBZ
 ^ _need to disperse immigrants .					NN/VB
3. Unusual noun compounds
 the large number of services _votes from distant lands			NNS/VBZ

(0) Not yet classified
look at it again _act by act .						NN/VB
new social sciences _building at Nottingham				NN/VBG
it reports that new factory _building this year				NN/VBG
give the Africans _control in Northern Rhodesia				NN/VB
they had in _mind a kind of						NN/VB
called to _order by the							NN/VB
^ _points of view							NNS/VBZ
^ _points from readers' letters						NNS/VBZ
is at _present consulting						NN/VB
^ at _present the obstacles						NN/VB
a united nations _trust territory					NN/VB


6. Adverb/preposition
---------------------
The largest class after the CS-IN cases. RB/IN (52)

(a) Idioms. (Marked with ditto tags in LOB.)
1. _at least (22)
2. _in particular (3)
3. _of course (2)
4. _at once (1)
5. _as well (1)
6. _in full (1)
7. _in general (1)
8. _for ever (1)

(b) Adverb + measure. Probably comes from lexical probabilities, since the
transitions RB:CD and IN:CD are almost identical.
1. about (10)
2. over (7)
3. under (1)

(c) Others
 the traders and public _at large can
 reduce income tax _as well ,		(* should be idiom? *)

7. Adjective/adverb
-------------------
JJ/RB (17), JJR/RBR (2)

it is all very _well for				QL _ IN		JJ/RB b
keep people _well ,					NNS _ ,		JJ/RB l
keep people _well and					NNS _ CC	JJ/RB b
will be _well and					MD _ CC		JJ/RB b
( _late Secretary \0I.L party )				( _ NPT		JJ/RB n
30 yards _long and eats			(occurs twice)	NNS _ CC	JJ/RB l
after so _long a period					QL _ AT		JJ/RB b
are at _long last moving				RB _ AP		JJ/RB b
find it _hard ,						PP3 _ ,		JJ/RB n
stand _still ,						VB _ ,		JJ/RB l
to be left _alone .					VBN _ .		JJ/RB l
*' go it _alone **'					PP3 _ **'	JJ/RB n
ways of life _alone .					NN _ .		JJ/RB n
isolated and _alone .					CC _ .		JJ/RB l
a _forward looking government				AT _ JJ		JJ/RB b
shown himself _ready *-					PPL _ *'	JJ/RB n
getting slightly _better rather than worse	RB _ RB		JJR/RBR n
demonstrably _better than			RB _ IN		JJR/RBR n


Some other large classes
------------------------
Worth noting, probably not worth analysing (yet).

CS/IN	81
IN/CS	73
NNU/CD	43
DT/CS	25	(all of them "that")
IN/RP	24	(many of them phrasal (or quasi-phrasal) verbs)
CD1/CD	20
RB/CS	19
RP/IN	16

==============================================================================

Revised totals of errors
------------------------
Formed by grepping for "error", then editing the result, sorting and using
uniq -c. The totals include idiom errors.

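Format: x	= errors in b
	x/y	= errors in b/errors in l
	/y	= errors in l
In each table the header row gives the correct tags and the row labels the
chosen tags, except in 6 (Prep), where the header row gives the chosen tags.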
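
The counting and the cell format could be reproduced along these lines. This
is a sketch, not the procedure actually used (which was grep, hand editing,
sort and uniq -c); the function names and the toy error lists are made up for
illustration.

```python
from collections import Counter


def tally(pairs):
    """Count (correct, chosen) pairs, much as sort | uniq -c would."""
    return Counter(f"{correct}/{chosen}" for correct, chosen in pairs)


def cell(errors_b, errors_l):
    """Format one table cell as x, x/y or /y (empty if no errors)."""
    if errors_b and errors_l:
        return f"{errors_b}/{errors_l}"
    if errors_b:
        return str(errors_b)
    if errors_l:
        return f"/{errors_l}"
    return ""


# Made-up error lists for the two tests, for illustration only.
toy_b = [("NN", "JJ"), ("NN", "JJ"), ("VBD", "VBN")]
toy_l = [("NN", "JJ"), ("RP", "IN")]

counts_b, counts_l = tally(toy_b), tally(toy_l)
for key in sorted(set(counts_b) | set(counts_l)):
    # A Counter returns 0 for a class that never occurred in one test.
    print(key, cell(counts_b[key], counts_l[key]), sep="\t")
```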
1. Nouns
--------

Errors within noun class
Correct	NN	NNPS	NNS	NP	NPT	PP3O	PP$
NNS	16
NNP		1/1
NN			6/1
NP					2
NPL				3	1
NPT				5
PP$						2/4
PP3O							/2
NNS$			1

Errors with other major classes
Correct	NN	NNP	NNS	NP	PN
BEG	1
MD	1
VB	19/10
VBD	/1
VBG	10/9
VBN	2/2
VBZ	1		8/1
JJ	50/18
JJB	10/1
JNP		1		1
QL					6/4
RB	/1				1/5
RP	/2

Errors with determiner classes
Correct	PPLS	PN$	PN
DT	1
ATI		/1	/1

Errors with prep

Errors with minor classes
Correct	NN	NNU	PPLS	PN
CS	1			/6
CD		43
CD1			3
OD	/1

2. Verbs
--------

Errors within verb class
Correct	HVD	HVN	HVZ	MD	VB	VBD	VBN	BEZ
BEZ			/7
HVD		2		/6
HVN	/3
HVZ								/2
MD	/6				1/1
VB				1		2/4	/3
VBD					2/10		21/75
VBN					5/8	25/28

Errors with other major classes
Correct	VB	VBD	VBG	VBN	VBZ	MD
NN	30/9	/1	9/6	1/1		/1
NNS					6
JJ	/2	/1	8/1	4/7
RB	2/2

Errors with determiner classes
Correct	VB
AP	2/1

Errors with prep
Correct	VBG
IN	2

Errors with minor classes
Correct


3. Adj
------

Errors within adj class
Correct	

Errors with other major classes
Correct	JJ	JJB	JJR
NN	19/11	4
VB	4/1	1
VBD	1
VBG	4/4
VBN	14/10
RB	17/27
RBR			2/2
RP		/1

Errors with determiner classes
Correct	JJ
AP	5/4

Errors with prep
Correct	JJ	JJB
IN	1/1	1

Errors with minor classes
Correct	JJ
UH	/1


4. Adv
------

Errors within adv class
Correct	QL	RB	RBR	RP	QLP	RI
RBR	1	1
QL		2/9	2
QLP		3
RP		3/1				/1
RB	/1			5/7	/2

Errors with other major classes
Correct	QL	RB	RBR	RBT	RP
NN		/1			1/3
VB		1
VBN		/1
PN	1	4/3
JJ	/1	14/29
JJB		1	1
JJR			2/3
JJT				2

Errors with determiner classes
Correct	QL	RB	RBR	RBT
AP	3/5	16/9	4/2	1
ABL		1/1
ABN		11/22
AT		6/7
ATI		2/4
DT	1/2
DTI		1/5
DTX		/1

Errors with prep
Correct	QL	RB	RI	RP
IN	1	52/34	1/2	16/52

Errors with minor classes
Correct	QL	RB	RN
CC		3
CS	3/3	19/27
OD		1/3
TO		1
UH		1
EX			1

5. Det
------

Errors within det class
Correct	AP	AT	WDTI
AT	16/6
AP		2/1
WDT			2/1

Errors with major classes
Correct	ABL	ABN	AP	ATI	DT	DTI
PPLS					2/1
PN$				/4
JJ			4/9
QL			2/1
QLP						/2
RB	2	2/9	9/11	3/3
RBR			6/5

Errors with prep

Errors with minor classes
Correct	DT
CS	25/8
WP	8/9


6. Prep (IN)
------------

Errors with major classes
Chosen	NNU	VB	VBG	QL	RB	RI	RP
	5	3/2	/2	4/4	16/9	1/1	24/61

Errors with determiner classes
Chosen	ABL
	4

Errors with minor classes
Chosen	CC	CS	TO
	6/4	73/43	7/8

7. Minor
--------

Errors with other minor classes
Correct	CC	CD-CD	CD	CD1	WP$I	WP	CS	WPI	TO
CC							/3
CS	3/2					11/22			/1
CD		8		20
CD-CD			5
WP$					1/1
WPI						2
WP								/1

Errors with major classes
Correct	CC	CD	CS	EX	WRB	TO	UH	OD
NNU		12/5
VB							/1
QL	1/1		15/9
RB	/4		2/7		3	/1	/2	1
RN			1/4	3/3

Errors with prep
Correct	CC	CS	TO
IN	2	81/49	4/4

Errors with determiner classes
Correct	CS	WP
DT	13/8	1/1


