An analysis of errors from self tests on large parts of the LOB corpus
======================================================================

(Started 8-1-93)

The errors here are the result of two tests on parts of the LOB corpus. In each
case, a dictionary and transitions matrix were built from part of the corpus,
and the same corpus then tagged against them. This (experimentally) gives the
best performance you can expect in terms of accuracy. The corpora were lobv7b
and lobv7l. The only special treatment was for those things marked as
"idioms" in the corpora. Option C2 to the tagger was used, which treats the
idiom as if it were just the first word. I.e. if the corpus contains 
	such_IN as_IN"
where IN" is a ditto tag, then it is treated as just
	such_IN
This dents the results a bit, but saves having to give special treatment to
idioms and ditto tags. See below for a list of the errors resulting from this.
Option 'N' was also used, which treats most sequences containing digits as
being numbers.
(See notes file 7-1-93 for more details.)
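
The effect of option C2 on the input can be sketched as follows. This is only
an illustration of the treatment described above, not the tagger's own code:
the function name collapse_idioms and the word_TAG token format are assumptions
made for the example, and a ditto tag is taken to be any tag ending in a
double quote.

```python
def collapse_idioms(tokens):
    """Keep only the first word of each idiom by dropping ditto-tagged tokens.

    Each token is assumed to be a string of the form word_TAG, and a ditto
    tag is assumed to be a tag that ends in a double quote (e.g. IN").
    """
    kept = []
    for token in tokens:
        word, _, tag = token.rpartition("_")
        if tag.endswith('"'):
            continue  # continuation of an idiom: discarded
        kept.append((word, tag))
    return kept


# The "such as" example from the text, followed by an ordinary word.
print(collapse_idioms(['such_IN', 'as_IN"', 'the_ATI']))
# [('such', 'IN'), ('the', 'ATI')]
```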

The figures may contain some minor inaccuracies, because analysing the errors
by hand is a rather error-prone process, and because it is not always clear
how to classify an error. A tool for working on the error output would really
help.
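
A first step towards such a tool might look like the sketch below. The format
of the error output is not described here, so the sketch assumes a
hypothetical one: each line holds a word, the tag the tagger assigned and the
correct tag, separated by whitespace. The sample lines are invented for the
example and are not taken from error-b-l.

```python
from collections import Counter


def tally_errors(lines):
    """Count (assigned tag, correct tag) pairs over the mistagged words.

    Assumes a hypothetical format of three whitespace-separated fields per
    line: word, assigned tag, correct tag. Other lines are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue
        _word, assigned, correct = fields
        if assigned != correct:
            counts[(assigned, correct)] += 1
    return counts


# Invented sample data, for illustration only.
sample = ["like VB IN", "like VB IN", "but CC IN", "the ATI ATI"]
for (assigned, correct), n in tally_errors(sample).most_common():
    print(f"{assigned} instead of {correct}\t{n}")
```

The output uses the same "X instead of Y" wording as the tables below, so the
counts could be pasted in directly once the real format is known.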

Raw errors: see error-b-l.

See the end for summarised totals.


An attempt at classifying errors
================================

Attempt to identify a number of categories of error occurring in v7b and v7l.

Format: columns for b and l give the number of errors. [a] for footnotes. x/y
means x errors in corpus b, y in corpus l.


1. Incorrect tag assignment in idioms
-------------------------------------
Errors associated with sequences which LOB would tag as idioms, i.e. which
appear with ditto tags.

	Idiom		b		l
	a few		11		3
	a good many	1		0
	a little	12		12
	all but		0		1
	all right	0		9
	and so on	3		0
	apart from	3		1
	as between	1		0
	as for		2		0
	as if		1		2
	as opposed to	1		0
	as though	0		4
	as to		1		5
	as well		3		0
	as well as	6		3
	as yet		0		1
	at first	0		2
	at least	0		1
	at least	22		8
	at length	0		1
	at once		2		3
	because of	8		2
	but for		0		2
	each other	1		0
	except for	1		0
	for certain	1		0
	for ever	2		3
	for once	1		0
	in full		1		0
	in general	2		0
	in order that	2		0
	in order to	1		0
	in particular	3		1
	in so far as	1		0
	in spite of	0		1
	in that		1		1
	in vain		1		0
	more than	3		9
	no one		0 [a]		3
	no one's	0		1
	none the less	0		1
	now that	0		2
	of course	2		3
	one another	3		0
	now that	1		0
	so as (not to)	0		1
	so as to	0		1
	so that		1		2
	such as		4		0
	\0M \0P		2		0
			---		--
	Total		111		90

[a] There are occurrences of "no one" in b, but they are all correct. The same
applies in some other cases where there are zeros in this table.


2. Orthographic problems
------------------------
Problems associated with certain specific orthographic devices.
			b	l
NPT instead of NP	5 [a]	0
NPL instead of NPT	1 [b]	0

[a] Typically as initials - hard to tell from titles.
[b] \0St.

3. Lack of context
------------------
Errors in minimal constructs such as headlines and single items appearing in
parentheses.

Example: "^ headline .". headline tagged as JJ instead of NN.
			b		l
			1		0

4. Numbers
----------
Mistagging of objects analysed as numbers.
			b		l
CD instead of NNU	20		0
CD instead of CD1	20		0
NNU instead of CD	12		5
CD-CD instead of CD	4		0
CD-CD instead of CD	8		0
CD1 instead of PPLS	3 ("one")	0

Certain of these (essentially the CD vs CD1 or CD-CD errors) could be fixed by
better number parsing. Most of the l ones are headings (cf. category 3).


5. Conjunctions and lists
-------------------------
There are cases where there is a conjunction or list of single words where the
first is correct but later ones are not.
			and/b	or/b	,/b	and/l	or/l
JJ instead of AP					1
JJ instead of NN	1		2
JJ instead of VBN		1
JNP instead of NP	1
NN instead of VB	3	2
NN instead of JJ	1
NNS instead of VBZ	1
RB instead of JJ		2
RB instead of RP	1
VB instead of NN	3	1
VBD instead of VBN	1	3		1
VBG instead of NN	3
VBN instead of JJ		1
			--	--	--	--	--
			15	10	2	1	1


6. Singular/plural nouns
------------------------
Certain nouns should appear as singular/plural but appear as the other (or are
indistinguishable between the two cases).
			b	l
NN instead of NNS	6  [a]	1
NNS instead of NN	16 [b]	0

							b	l
[a]
 fish: "the plot of fish and"				0	1
 means: "the means of", "the means test"		4
 people: "a good many people"				1
 police: "the secret police has"			1
[b]
 means: "the only means of"				1
 people: "warlike people", "the British people"		14
 police: "according to police reports"			1


7. Preposition errors
---------------------
This appears to be a frequent error in both corpora. Some of the errors are
accounted for by "idioms". ('.' after number means no interesting
regularities).

Incorrect tag [1]	b	l		b	l	Correct tag [2]
NNU			5  [e]	0		0	0	NNU	
VB			3  [f]	2  [f]		0	0	VB
VBG			0	2.		2.	0	VBG
JJ			0	0		1.	1.	JJ
JJB			0	0		1.	0	JJB
QL			4.	3.		1.	0	QL
RB			16.	9.		20 [c]	14 [c]	RB
RI			1.	1.		1.	2.	RI
RP			21.	52.		16 [g]	51 [g]	RP
CC			6  [d]	2  [d]		0	0	CC
CS			59 [a]	39 [a]		74 [b]	44 [b]	CS
TO			7.	8.		3.	4.	TO

[1] IN should have been assigned.
[2] IN was incorrectly assigned.

Partial analysis (the value and well-definedness of some of these sub-classes
are not too great):

[a] CS instead of IN.
Correlative subordinators (more/less/much ... *than (6/0), more *than (2/1),
Adjective *than (2/3), as ... *as (4/14), *as ... as (1/0))
				15		18
after (0/4), before(3/1), for(1/0), since(7/2), till (0/1), until(3/2) +
temporal/locative
				14		10
as + role (live as hewers, regard apartheid as evil, as a mystic)
				23		9
Phrasal verb elements (come as, stand for, look for, exist for)
				4		0
Others				3		2

[b] IN instead of CS.
Correlative subordinators (Adjective *than (1/2), as ... *as (7/2), less/more
*than (3/1), more/less ... *than (1/2), rather than (1/0), so ... *as (0/1))
				13		8
after(1/4), as(1/7), before(10/6), since(5/1), until(3/3) + temporal
				20		21
as(2/0), because(3/1), for(16/7), since(1/0) + reason
				22		8
as(6/3), for(1/0) + role		7		3
Phrasal verb elements (accept as, report as saying, quote as saying)
				4		1
Others				8		3


[c] IN instead of RB.
about(10/11), over(7/2), under(1/0) + measure
				18		13
Others				2		1

[d] CC instead of IN: all cases are "but" (e.g. all but a handful)
[e] NNU instead of IN: all cases are "per" (e.g. per annum)
[f] VB instead of IN: all cases are "like".
[g] IN instead of RP: mostly cases where the RP ought to "adhere" more tightly
to the verb.
In b: bandy about, bring in, bring about (3), carry in, carry through, crowd
in, give in, put in, put on, send in, take in, take on. (The remaining case is
"go back *down south".)
In l (more variety): break off, bring round, check over, cheer on, come down,
drag in, draw in, fill in, flick on, go across, go over, go/went on (5), hold
down, jerk on, keep on, leave behind, pass on, pay in, plug in, pull on, put
on (4), set down, stay in, take in, take on (2), take ... on, think over,
track down, turn on, wander about. (13 other cases, not of this sort, inc.
"down south", "upside down".)


8. Noun errors
--------------
Where is a noun (or other similar category) incorrectly tagged?
(Also see section 6: figures not included in these totals)
(x/y means errors in b/errors in l; 0 if missing)
In nouns, we include pronouns.
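
For anyone processing these tables mechanically, the cell notation can be read
as in the sketch below. The function name parse_cell is an assumption made for
the example; the treatment of footnote markers such as [a] follows the way
they appear inside cells in the tables of this section and the next.

```python
import re


def parse_cell(cell):
    """Turn a table cell into a (b, l) pair of error counts.

    "14/9" gives (14, 9); a bare "0" means no errors in either corpus.
    Footnote markers such as [a] are ignored wherever they appear.
    """
    cleaned = re.sub(r"\[[a-z]\]", "", cell).replace(" ", "")
    if "/" in cleaned:
        b, l = cleaned.split("/")
        return int(b), int(l)
    if cleaned != "0":
        raise ValueError(f"cannot interpret cell: {cell!r}")
    return 0, 0


print(parse_cell("14/9"))       # (14, 9)
print(parse_cell("44 [a]/18"))  # (44, 18)
print(parse_cell("0/2 [b]"))    # (0, 2)
print(parse_cell("0"))          # (0, 0)
```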

Errors within class:
Correct ->	NNS	NP	NNPS	PP3O	PP$
Tagged as
NNS$		1/0	0	0	0	0
NPL		0	3/0	0	0	0
NNP		0	0	1/1	0	0
PP$		0	0	0	2/4 [b]	0
PP3O		0	0	0	0	0/2 [b]

Noun tagged as another major class:
Correct ->	NN		NNS	PN	NNP
Tagged as
VB		14/9		0	0	0
VBD		0/1		0	0	0
VBG		0/9		7/0	0	0
VBN		0/2		2/0	0	0
VBZ		0		8/1	0	0
MD		1/0		0	0	0
JJ		44 [a]/18	0	0	0
JJB		0/1		10/0	0	0
JNP		0		0	0	1/0
RB		0/1		0	1/3	0
RP		0/2		0	0	0
QL		0		0	6/4	0

[a] Includes a number of things that might be solved by "idioms": e.g. at
present, as a whole. Similarly: post[JJB] office.
[b] In both b and l, the errors are all "her".


9. Verb errors
--------------
Where is a verb (or other similar category) incorrectly tagged?
In verbs we include modals.

Errors within class:
Correct -> VB	MD	VBD	VBN	HVN	HVD	HVZ	BEZ
Tagged as
VB	   -	1/0	2/4	0/1	0	0	0	0
MD	   1/1	-	0	19/0	0	0/6 [a]	0	0
VBD	   2/10	0	-	1/73	0	0	0	0
VBG	   0	0	0	0	0	0	0	0
VBN	   6/8	0	24/28	-	0	0	0	0
HVD	   0	0/6 [a]	0	0	2/0	-	0	0
HVN	   0	0	0	0	0	0/2	0	0
BEZ	   0	0	0	0	0	0	0/7 [b]	-
HVZ	   0	0	0	0	0	0	0	2/0

Verb tagged as another major class:
Correct ->	VB	MD	VBD	VBG	VBN
Tagged as
NN		24/8	0/1	0/1	8/6	1/1
NNS		0	0	0	5/0	0
RB		12/2	0	0	0	0
JJ		0/2	0	0/1	8/1	4/6

[a] All of these are "'d".
[b] All of these are "'s".


10. Adj errors
--------------
Where is an adj (or other similar category) incorrectly tagged?

Errors within class: none

Adj tagged as another major class:
Correct ->	JJ	JJB	JJR
Tagged as
NN		18/11	4/0	0
VB		21/1	1/0	0
VBD		1/0	0	0
VBG		4/4	0	0
VBN		14/9	0	0
RB		2/25	0	0
RBR		0	0	2/2
RP		0	0/1	0

11. Adv errors
--------------
Where is an adv (or other similar category) incorrectly tagged?

Errors within class:
Correct ->	RB	RBR	RP	WRB	QL	RI	QLP
Tagged as
RB		-	0	4/5	3/0	0/1	0	2/0
RBR		1/0	-	0	0	1/0	0	0
RP		2/1	0	-	0	0	0/1	0
QL		1/0	2/0	0	0	-	0	0
QLP		3/0	0	0	0	0	0	-

(QL, QLP = degree adverb)

Adv tagged as another major class:
Correct ->	RB	RBR	RP	QL	RBT
Tagged as
NN		0/1	0	1/3	0	0
PN		4/2	0	0	1/0	0
VB		1/0	0	0	0	0
VBN		0/1	0	0	0	0
JJ		14/29	0	0	0/1	0
JJB		1/0	1/0	0	0	0
JJR		0	2/1	0	0	0
JJT		0	0	0	0	2/0


12. Determiner errors
---------------------
Errors within the various determiner classes and with the major classes
(determiners include quantifiers).

Errors within determiner class
Correct ->	WDT	WDTI	AP	AT
Tagged as
WDT		-	2/1	0	0
WDTI		0	-	0	0
AP		0	0	-	2/1

Det tagged as a major class:
Correct ->	AP	DT	ABL	ABN	ATI	DTI
Tagged as
PPLS		0	2/1	0	0	0	0
PN$		0	0	0	0	0/4	0
JJ		4/4	0	0	0	0	0
RB		7/9	0	2/0	2/9	3/3	0
RBR		5/6	0	0	0	0	0
QL		2/0	0	0	0	0	0
QLP		0	0	0	0	0	0/2

Major class tagged as Det:
Correct ->	QL	RB	JJ	VB	RBR	RBT
Tagged as
AP		2/5	14/8	5/4	2/1	4/2	1/0
DT		1/2	0	0	0	0	0
DTI		0/1	1/5	0	0	0	0
DTX		0	0/1	0	0	0	0
ABL		0	1/1	0	0	0	0
ABN		0	10/11	0	0	0	0
ATI		0	1/4	0	0	0	0


13. Minor class errors
----------------------
Minor class = anything other than noun, verb, adj, adv, det, prep or
punctuation.

Minor class tagged as another minor class:
Correct ->	CS	WP	WP$I	WPI
Tagged as			
CS		0	11/22	0	0
WP		0	0	0	0/1
WPI		0	2/0	0	0
WP$I		0	0	0	0
CC		0/3	0	0	0
WP$		0	0	1/1	0


Minor class tagged as a major class or Det:
Correct ->	CS	OD	WP	EX	CC	UH
Tagged as
QL		16/8	0	0	0	0	0
RB		1/6	1/0	0	0	0/4	0/2
RN		0/2	0	0	3/3	0	0
DT		13/8	0	1/1	0	0	0
VB		0	0	0	0	0  	0/1

Major class or Det tagged as a minor class:

Correct ->	RB	RN	QL	NN	PN	JJ	DT
Tagged as
OD		1/3	0	0	0/1	0	0	0
CS		13/27	0	3/3	1/0	0/5	0	27/8
EX		0	1/0	0	0	0	0	0
UH		1/0	0	0	0	0	0/1	0
TO		1/0	0	0	0	0	0	0
WP		0	0	0	0	0	0	8/8


Summarised totals
=================
				b	l
Errors attributed to idioms:	111	90
Orthographic errors:		6	0
Lack of context errors:		1	0
Numbers:			67	5
Conjunctions:			27	2

Condensed version of "class" errors
Correct ->	noun	verb	adj	adv	det	prep	minor
Classed as
noun		29/8	38/17	22/11	6/6	2/5	5/0	0/0
verb		32/22	60/146	41/14	1/1	0/0	3/4	0/1
adj		55/19	12/10	0/0	20/31	4/4	0/0	0/0
adv		7/10	12/2	4/28	19/8	21/29	42/65	21/25
det		0/0	2/1	5/4	35/40	4/2	0/0	14/9
prep		0/0	2/0	2/1	37/67	0/0	0/0	77/44
minor		1/6	0/0	0/1	20/33	35/16	77/48	14/27

Notes:
1. noun/noun includes the errors from section 6.
2. See the preceding sections to get an idea of what is included in each
class.
3. prep is just IN.

We can also combine the two directions of confusion, adding the errors for A
tagged as B to those for B tagged as A (each cell is still b/l):
Correct ->	noun	verb	adj	adv	det	prep	minor
noun		29/8	70/39	77/30	13/16	2/5	5/0	1/6
verb			60/146	53/24	13/11	2/1	5/4	0/1
adj				0/0	24/59	9/8	2/1	0/1
adv					35/40	56/69	79/132	41/55
det						4/2	0/0	49/25
prep							0/0	154/92
minor								14/27
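
The combination behind this table can be sketched as follows, using only the
noun, verb and adj cells of the condensed table to keep the example short. The
names errors and combine are assumptions made for the example.

```python
CLASSES = ["noun", "verb", "adj"]

# errors[classed_as][correct] = (b, l), copied from the condensed table.
errors = {
    "noun": {"noun": (29, 8), "verb": (38, 17), "adj": (22, 11)},
    "verb": {"noun": (32, 22), "verb": (60, 146), "adj": (41, 14)},
    "adj": {"noun": (55, 19), "verb": (12, 10), "adj": (0, 0)},
}


def combine(errors, classes):
    """Add each off-diagonal cell to its mirror image; keep the diagonal."""
    combined = {}
    for i, first in enumerate(classes):
        for second in classes[i:]:
            b, l = errors[first][second]
            if first != second:
                mirror_b, mirror_l = errors[second][first]
                b, l = b + mirror_b, l + mirror_l
            combined[(first, second)] = (b, l)
    return combined


for (first, second), (b, l) in combine(errors, CLASSES).items():
    print(f"{first}/{second}\t{b}/{l}")
```

For these three classes the sketch gives 70/39 for noun/verb, 77/30 for
noun/adj and 53/24 for verb/adj, as in the table.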


These figures are skewed by the fact that some categories are simply more
frequent than others. So we can also express each count as a proportion of the
number of words in the correct category, scaled up by 10000 (i.e. errors per
10000 words of that category).

Correct ->	noun	verb	adj	adv	det	prep	minor
noun		16/4	39/15	50/39	19/16	2/8	7/0	0/0
verb		18/13	61/134	94/50	3/2	0/0	4/8	0/2
adj		31/11	12/9	0/0	66/85	5/7	0/0	0/0
adv		4/6	12/1	9/101	63/22	27/51	64/138	42/50
det		0/0	2/0.9	11/14	116/110	5/3	0/0	28/18
prep		0/0	2/0	4/3	123/184	0/0	0/0	154/89
minor		0.5/3	0/0	0/3	66/90	46/28	117/101	28/55

Correct ->	noun	verb	adj	adv	det	prep	minor
noun		16/4	72/35	178/108	43/44	2/8	7/0	2/12
verb			61/134	122/86	53/30	2/1	7/8	0/2
adj				0/0	79/162	11/14	3/2	0/2
adv					63/110	74/122	120/280	82/112
det						6/3	0/0	98/50
prep							0/0	309/187
minor								56/55


Comment: large variation between the two corpora.



These figures are based on defining classes as follows. Punctuation and
numbers are excluded.
Noun: NN, NN$, NNP, NNP$, NNPS, NNPS$, NNS, NNS$, NNU, NNUS, NP, NP$, NPL,
NPL$, NPLS, NPS, NPS$, NPT, NPT$, NPTS, NPTS$, PN, PN$, PP$, PP$$, PP1A,
PP1AS, PP1O, PP1OS, PP2, PP3, PP3A, PP3AS, PP3O, PP3OS, PPL, PPLS. Total:
17237/16002
Verb: BE, BED, BEDZ, BEG, BEM, BEN, BER, BEZ, DO, DOD, DOZ, HV, HVD, HVG, HVN,
HVZ, MD, VB, VBD, VBG, VBN, VBZ, VD. Total: 9705/10867
Adj: JJ, JJB, JJR, JJT, JNP. Total: 4317/2766
Adv: NR, NR$, NRS, QL, QLP, RB, RB$, RBR, RBT, RI, RN, RP. Total: 3003/3627
Det: ABL, ABN, ABX, AP, AP$, APS, APS$, AT, ATI, DT, DT$, DTI, DTS, DTX, WDT,
WDTI. Total: 7521/5613
Prep: IN. Total: 6529/4708
Minor: CC, CS, EX, TO, UH, WP, WP$, WP$I, WPA, WPI, WPO, WPOI, WRB, XNOT, ZZ.
Total: 4973/4907
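
With the class totals above, the per-10000 figures can be reproduced as in the
sketch below. The whole-number entries in the tables appear to be truncated
rather than rounded; that is an inference from the figures, not something
stated in these notes, and the few fractional entries (e.g. 0.5, 0.9) are not
reproduced by this sketch. The names TOTALS, per_10000 and normalise are
assumptions made for the example.

```python
# Words per class in corpus b and corpus l, from the class definitions above.
TOTALS = {
    "noun": (17237, 16002),
    "verb": (9705, 10867),
    "adj": (4317, 2766),
    "adv": (3003, 3627),
    "det": (7521, 5613),
    "prep": (6529, 4708),
    "minor": (4973, 4907),
}


def per_10000(errors, total):
    """Errors per 10000 words, truncated to a whole number (assumed)."""
    return errors * 10000 // total


def normalise(cell, correct_class):
    """Normalise a (b, l) error cell by the size of the correct class."""
    b_total, l_total = TOTALS[correct_class]
    return per_10000(cell[0], b_total), per_10000(cell[1], l_total)


print(normalise((29, 8), "noun"))   # noun classed as noun: (16, 4)
print(normalise((38, 17), "verb"))  # verb classed as noun: (39, 15)
print(normalise((77, 30), "adj"))   # noun/adj, both directions: (178, 108)
```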


Comment: there are some words which contribute to errors in a number of
different classes. Some particular offenders are: about, as, for, so, than,
that.
