Punctuation analysis
====================

Aim: to evaluate some of the effects of punctuation on tagging.

Why?
1. Tagging works with a context of one word on either side. So if there is a
punctuation mark between two words, the connection between the words is lost,
and the tagging depends on the transitions from one word to the punctuation
mark and from the punctuation mark to the next word. So eliminating
punctuation might improve performance.
2. Punctuation can also signal constituent boundaries, such as at the start of
some relative clauses. So it may actually be contributing positively to the
performance of the tagger.
3. Punctuation may also signal the presence of elements which do not form part
of the normal grammatical structure of the sentence, and which are to be
treated independently. Example: certain parenthetics and quoted text.
Eliminating such sequences might mean that the tagger uses a transition from
the word before the parenthetic to the one after it, which it could not do
otherwise.
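
Point 1 can be illustrated with a small sketch. It is not the tagger used
here: it scores every tag sequence exhaustively (not Viterbi) under a bigram
model, and the probabilities are invented toy values, not LOB figures. The
names best_path, lexicon and trans are hypothetical. With a comma between the
two words, the score only contains the transitions VBD -> ',' and ',' -> tag;
the direct VBD -> tag transition is never consulted.

```python
from itertools import product

def best_path(words, lexicon, trans):
    """Return the highest-scoring tag sequence under a bigram model.

    lexicon: word -> {tag: P(tag | word)}          (toy values)
    trans:   (prev_tag, tag) -> P(tag | prev_tag)  (toy values)
    """
    best, best_score = None, 0.0
    for tags in product(*(lexicon[w] for w in words)):
        score, prev = 1.0, "^"  # "^" stands for the sentence anchor
        for w, t in zip(words, tags):
            score *= lexicon[w][t] * trans.get((prev, t), 0.0)
            prev = t
        if score > best_score:
            best, best_score = tags, score
    return best

# Invented numbers, chosen only so that the comma changes the outcome.
lexicon = {"stated": {"VBD": 1.0}, ",": {",": 1.0},
           "that": {"CS": 0.5, "DT": 0.5}}
trans = {("^", "VBD"): 1.0, ("VBD", "CS"): 0.3, ("VBD", "DT"): 0.1,
         ("VBD", ","): 0.2, (",", "CS"): 0.02, (",", "DT"): 0.05}

without_comma = best_path(["stated", "that"], lexicon, trans)
with_comma = best_path(["stated", ",", "that"], lexicon, trans)
```

In this toy setting the word "that" is tagged CS when it directly follows the
verb and DT when the comma intervenes, because only the ',' -> tag transition
is then available to separate the two hypotheses.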

Methodology:
As far as possible, we want to use techniques which do not rely on human
intervention, or on algorithms which draw on much grammatical knowledge, since
the tagger is meant to be as automatic and robust as possible, and because we
would like to be able to use the same technique in other languages with
minimal effort to get it going.
 Work with a dictionary and transitions matrix built from almost all of the LOB
corpus (parts b, c, d, e, f, j, k, l, m, n, p, r); called "all" below. Use
part b (newspaper text) as the major test sample, supplemented by part l
(fiction). Baseline of performance is "self-tests", i.e. tagging against
dictionary and transitions trained from the same corpus as used for the test.
This gives an upper limit on tagging accuracy.
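
A minimal sketch of how a dictionary and a transitions matrix of this kind
could be collected from tagged text. The input format (lists of word/tag
pairs), the anchor symbol "^" and the function name train are assumptions made
for illustration; the actual training code is not described in these notes.

```python
from collections import Counter, defaultdict

ANCHOR = "^"  # hypothetical symbol for the sentence anchor

def train(sentences):
    """sentences: iterable of lists of (word, tag) pairs."""
    dictionary = defaultdict(Counter)   # word -> counts of each tag
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    for sentence in sentences:
        prev = ANCHOR
        for word, tag in sentence:
            dictionary[word][tag] += 1
            transitions[prev][tag] += 1
            prev = tag
    return dictionary, transitions

# Toy corpus, not LOB text.
toy = [[("the", "ATI"), ("run", "NN"), (".", ".")],
       [("they", "PP3AS"), ("run", "VB"), (".", ".")]]
dictionary, transitions = train(toy)
```

A self-test in the sense used here would tag the same toy corpus with the
dictionary and transitions just collected from it.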

Baseline information:
Figures are number of words correctly labelled and percentage for all words
and ambiguous words only, plus the number of words skipped (i.e. markup
symbols and, in later tests, parenthetics passed over).
		All		Ambig		Skip
self test b	59661	98.36%	18632	94.93%	20964
self test l	56787	98.45%	16569	94.89%	17641
b/all		59302	97.77%	33355	96.10%	20964
l/all		56836	97.76%	31932	96.11%	17641
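
The columns of the table can be computed along the following lines. This is a
sketch only: it assumes that a word counts as ambiguous when the dictionary
lists more than one tag for it, and that skipped words are left out of both
totals. The function name score and the sample data are hypothetical.

```python
def score(gold, predicted, dictionary, skip=frozenset()):
    """Return (correct_all, pct_all, correct_ambig, pct_ambig, skipped).

    gold, predicted: equal-length lists of (word, tag) pairs.
    dictionary: word -> collection of possible tags.
    skip: words that are passed over and not scored.
    """
    def pct(ok, n):
        return 100.0 * ok / n if n else 0.0

    n_all = ok_all = n_amb = ok_amb = skipped = 0
    for (word, g), (_, p) in zip(gold, predicted):
        if word in skip:
            skipped += 1
            continue
        n_all += 1
        ok_all += (g == p)
        if len(dictionary.get(word, ())) > 1:
            n_amb += 1
            ok_amb += (g == p)
    return ok_all, pct(ok_all, n_all), ok_amb, pct(ok_amb, n_amb), skipped

# Toy data: one ambiguous word ("run"), tagged wrongly, and one markup symbol.
gold = [("the", "ATI"), ("run", "NN"), (".", "."), ("<p>", "MARK")]
pred = [("the", "ATI"), ("run", "VB"), (".", "."), ("<p>", "MARK")]
toy_dict = {"the": {"ATI"}, "run": {"NN", "VB"}, ".": {"."}}
result = score(gold, pred, toy_dict, skip={"<p>"})
```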


Occurrence frequencies
----------------------
Total frequencies of major punctuation words
	all	b	l	Comments
,	37602	2446	2640
.	36417	2685	3322	has two tags, second one rare
*'	7892	289	866
**'	7756	281	857
*-	2676	265	268
?	2295	173	304
)	1960	85	12
(	1939	85	12
;	1684	76	95
:	1271	93	51	has two tags, second one rare
!	923	46	99
...	517	12	83
'	86	0	0
**"	67	3	3
*"	41	2	2
Comment: most punctuation marks have only one tag, so they anchor models for
tagging. Where there is a second tag, it is rare; for example '.' occurs with
tag '.' 36406 times and with tag 'IN' 11 times in the "all" dictionary. So it
is the transitions to and from punctuation tags which are important.


Total frequencies of major punctuation tags.
	all	b	l
,	37614	2447	2641
.	36418	2686	3323
*'	7945	292	969
**'	7835	285	961
*-	2688	266	269
?	2307	174	305
)	1972	86	13
(	1952	86	13
;	1696	77	96
:	1283	94	52
!	935	47	100
...	529	13	84

(Some of the lower frequency words and tags are ignored in the remainder as
being of relatively low importance.)


Transitions to and from punctuation
-----------------------------------
(Figures are normalised values * 1000, rounded)
(?, ! and ... follow below.)

Comment: what is interesting is uniformities, i.e. places where there is a
range of tags with relatively similar transitions, since in these cases the
tagging is dominated by lexical probabilities and generally performs
poorly. This means looking at punctuation tags which can be preceded or
followed by a wide range of tags with similar transitions. (To a first
approximation: in fact, it really depends on whether those tags are likely to
be hypotheses on the same word).
To see where there are uniformities, a condensed form of the tables follows
the main ones.
We also need to note that the overall frequency of occurrence matters:
since '.' occurs much more often than '!', it is more productive to chase
errors associated with the former than with the latter.

Remark on sentences: in LOB, end of sentence punctuation is always followed by
a special symbol used in the tagger as an anchor, and not included in the
tables (since it is given a uniform distribution). So transitions FROM such
punctuation are not of great interest.
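
A sketch of how figures of this kind can be derived from raw transition
counts. It assumes each count is divided by the total for the preceding tag.
That is consistent with the tables, where the same value appears in both the
FROM and the TO view (for example 157 for ')' followed by ','), but the
normalisation actually used is not stated in these notes. The function names
and the counts are made up.

```python
from collections import Counter

def normalise(transitions, scale=1000):
    """transitions: prev_tag -> Counter of next-tag counts.

    Returns prev_tag -> {next_tag: round(scale * count / row_total)}.
    """
    table = {}
    for prev, row in transitions.items():
        total = sum(row.values())
        table[prev] = {nxt: round(scale * c / total) for nxt, c in row.items()}
    return table

def column(table, tag):
    """The 'transitions TO tag' view of the same table."""
    return {prev: row.get(tag, 0) for prev, row in table.items()}

# Toy counts, not LOB figures.
counts = {"UH": Counter({",": 6, ".": 1, "NN": 3}),
          "NN": Counter({",": 1, ".": 1})}
table = normalise(counts)
to_comma = column(table, ",")
```

Under this assumption each row of the full table sums to roughly 1000 over
all following tags, and the TO tables are simply columns of the same matrix.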

Transitions FROM specified tags

From ->	,	.	*'	**'	(	)	*-	;	:
To:
!	0	0	0	0	0	1	0	0	0
&FO	3	0	0	0	13	4	3	1	29
&FW	3	0	15	0	12	2	4	4	11
(	1	1	0	12	2	4	1	6	8
)	0	2	0	5	0	8	0	0	0
*'	12	1	0	3	13	6	15	13	55
**'	40	77	0	0	0	1	41	1	0
*-	2	1	0	8	0	7	0	0	28
,	0	0	0	32	0	157	0	0	0
.	0	0	0	39	0	257	1	0	0
...	0	0	2	0	0	0	0	0	0
:	0	0	0	1	0	7	0	0	0
;	0	0	0	4	0	16	0	0	0
?	0	0	0	2	0	2	1	0	0
ABL	0	0	0	0	0	0	2	1	1
ABN	2	0	3	0	2	1	4	7	1
ABX	2	0	0	0	0	1	1	1	0
AP	3	0	5	0	1	1	5	5	0
AP$	0	0	0	0	0	0	0	0	0
APS	0	0	0	0	0	0	0	2	0
APS$	0	0	0	0	0	0	0	0	0
AT	18	0	14	2	9	6	44	27	28
ATI	54	0	41	9	25	24	58	78	78
ATP$	0	0	0	0	0	0	0	0	0
BE	1	0	1	0	0	1	0	1	0
BED	2	0	0	1	1	10	3	0	1
BEDZ	6	0	2	4	1	7	7	1	0
BEG	1	0	0	0	0	0	0	1	1
BEM	0	0	1	0	0	0	0	0	0
BEN	0	0	0	0	0	0	0	0	0
BER	4	0	5	2	0	10	5	1	0
BEZ	8	0	8	9	1	29	8	2	2
CC	166	0	37	30	52	62	128	231	23
CD	13	1	8	1	217	4	11	24	57
CD$	0	0	0	0	0	0	0	0	0
CD-CD	0	0	0	0	13	0	0	2	2
CD1	4	0	3	1	21	2	6	7	9
CD1$	0	0	0	0	0	0	0	0	0
CD1S	0	0	0	0	0	0	0	0	0
CDS	0	0	0	0	0	0	0	0	0
CS	59	0	23	5	25	21	55	50	16
DO	2	0	12	0	1	0	2	1	1
DOD	1	0	4	0	0	0	0	1	1
DOZ	0	0	1	0	0	3	0	0	1
DT	7	0	21	1	8	2	23	18	14
DT$	0	0	0	0	0	0	0	0	0
DTI	2	0	2	0	1	1	4	5	1
DTS	1	0	1	0	0	0	6	6	4
DTX	1	0	0	0	0	0	2	2	0
EX	6	0	7	1	2	2	5	14	3
HV	2	0	3	1	0	6	3	1	1
HVD	3	0	1	2	0	2	3	2	1
HVG	1	0	0	0	0	0	1	0	1
HVN	0	0	0	0	0	0	0	0	0
HVZ	2	0	1	2	0	8	3	0	0
IN	80	0	24	34	54	96	73	77	29
JJ	37	0	59	3	26	12	25	31	23
JJB	1	0	3	0	2	0	1	1	2
JJR	1	0	1	0	0	1	1	1	0
JJT	0	0	1	0	1	0	0	0	0
JNP	2	0	2	0	1	1	1	2	2
MD	9	0	10	3	1	14	13	2	3
NC	0	0	14	0	1	0	0	0	0
NN	32	1	66	22	94	25	26	29	30
NN$	0	0	1	0	0	0	0	0	0
NNP	0	0	0	0	1	1	0	0	0
NNP$	0	0	0	0	0	0	0	0	0
NNPS	0	0	0	0	0	0	0	0	0
NNPS$	0	0	0	0	0	0	0	0	0
NNS	15	0	21	9	13	7	11	11	7
NNS$	0	0	0	0	1	0	0	0	1
NNU	0	0	0	0	2	1	0	1	0
NNUS	0	0	0	0	2	0	0	0	0
NP	45	1	46	42	120	5	23	35	25
NP$	1	0	1	1	2	0	3	2	1
NPL	0	0	0	0	2	1	0	0	0
NPL$	0	0	0	0	0	0	0	0	0
NPLS	0	0	0	0	0	0	0	0	0
NPS	0	0	0	0	0	0	0	0	1
NPS$	0	0	0	0	0	0	0	0	0
NPT	11	0	10	4	5	1	5	5	4
NPT$	0	0	0	0	0	0	0	0	1
NPTS	0	0	0	0	0	0	0	0	0
NPTS$	0	0	0	0	0	0	0	0	0
NR	1	0	2	0	7	0	1	1	0
NR$	0	0	0	0	0	0	0	0	0
NRS	0	0	0	0	0	0	0	0	0
OD	0	0	1	0	2	0	0	1	1
PN	2	0	7	0	1	0	9	9	0
PN$	0	0	0	0	0	0	0	0	0
PP$	13	0	12	2	4	2	10	21	12
PP$$	0	0	1	0	1	0	0	0	0
PP1A	15	0	105	15	3	1	25	13	5
PP1AS	6	0	17	0	2	2	5	6	1
PP1O	0	0	0	0	0	1	0	0	0
PP1OS	0	0	0	0	0	0	0	0	0
PP2	5	0	49	0	1	1	10	9	3
PP3	17	0	32	1	6	4	21	42	23
PP3A	24	0	20	104	4	4	23	45	15
PP3AS	6	0	7	1	1	0	7	21	5
PP3O	0	0	0	0	0	0	0	0	0
PP3OS	0	0	0	0	0	0	0	0	0
PPL	0	0	0	0	0	0	1	0	0
PPLS	0	0	0	0	0	0	0	0	0
QL	4	0	5	1	5	1	7	1	2
QLP	0	0	0	0	0	0	0	0	0
RB	65	0	37	3	57	9	58	60	20
RB$	0	0	0	0	0	0	0	0	0
RBR	0	0	1	0	1	0	1	0	1
RBT	0	0	0	0	0	0	0	0	0
RI	0	0	0	0	1	2	0	0	1
RN	9	0	10	1	2	1	7	12	2
RP	2	0	2	0	3	2	1	1	1
TO	8	0	4	2	1	14	10	5	4
UH	3	0	76	0	1	0	20	0	0
VB	12	0	47	1	29	13	17	5	2
VBD	20	0	4	50	1	17	13	3	0
VBG	40	0	7	1	10	12	12	9	2
VBN	15	0	6	3	16	7	5	2	0
VBZ	7	0	3	5	0	22	3	1	2
VD	0	0	0	0	0	0	0	0	0
WDT	17	0	20	5	19	5	13	2	5
WDTI	0	0	0	0	0	0	0	0	0
WP	9	0	0	1	5	3	4	1	0
WP$	1	0	0	0	1	0	1	0	0
WP$I	0	0	0	0	0	0	0	0	0
WPA	0	0	0	0	0	0	0	0	0
WPI	0	0	3	0	0	0	0	0	0
WPO	0	0	0	0	1	0	0	0	0
WPOI	0	0	0	0	0	0	0	0	0
WRB	14	0	23	1	6	5	13	9	7
XNOT	5	0	5	1	6	1	11	6	2
ZZ	2	0	4	0	60	2	0	0	1

Transitions FROM specified tags

From ->	!	...	?
To:
!	0	6	0
&FO	0	0	0
&FW	6	2	0
(	0	0	0
)	13	0	3
*'	0	0	0
**'	414	51	492
*-	10	0	6
,	1	6	0
.	0	482	0
...	2	0	2
:	0	0	0
;	0	0	0
?	0	25	0
ABL	0	2	0
ABN	0	2	0
ABX	0	0	0
AP	0	2	0
AP$	0	0	0
APS	0	0	0
APS$	0	0	0
AT	1	15	0
ATI	0	17	0
ATP$	0	0	0
BE	0	0	0
BED	0	0	0
BEDZ	0	2	0
BEG	0	0	0
BEM	0	0	0
BEN	0	0	0
BER	0	0	0
BEZ	0	8	0
CC	2	47	2
CD	0	6	0
CD$	0	0	0
CD-CD	0	0	0
CD1	0	2	0
CD1$	0	0	0
CD1S	0	0	0
CDS	0	0	0
CS	1	15	0
DO	0	8	0
DOD	0	4	0
DOZ	0	0	0
DT	0	6	0
DT$	0	0	0
DTI	0	0	0
DTS	0	0	0
DTX	0	0	0
EX	1	6	0
HV	0	0	0
HVD	0	2	0
HVG	0	0	0
HVN	0	0	0
HVZ	0	2	0
IN	0	26	0
JJ	0	8	0
JJB	0	0	0
JJR	0	2	0
JJT	0	0	0
JNP	0	0	0
MD	1	2	0
NC	0	0	0
NN	1	17	0
NN$	0	0	0
NNP	0	0	0
NNP$	0	0	0
NNPS	0	0	0
NNPS$	0	0	0
NNS	0	15	0
NNS$	0	4	0
NNU	0	2	0
NNUS	0	0	0
NP	1	26	0
NP$	0	2	0
NPL	0	0	0
NPL$	0	0	0
NPLS	0	0	0
NPS	0	0	0
NPS$	0	0	0
NPT	0	2	0
NPT$	0	0	0
NPTS	0	0	0
NPTS$	0	0	0
NR	0	2	0
NR$	0	0	0
NRS	0	0	0
OD	0	0	0
PN	0	2	0
PN$	0	0	0
PP$	0	2	0
PP$$	0	0	0
PP1A	0	9	0
PP1AS	0	2	0
PP1O	0	0	0
PP1OS	0	0	0
PP2	0	8	0
PP3	0	8	0
PP3A	1	17	2
PP3AS	0	0	0
PP3O	0	0	0
PP3OS	0	0	0
PPL	0	0	0
PPLS	0	0	0
QL	0	0	0
QLP	0	0	0
RB	0	19	0
RB$	0	0	0
RBR	0	0	0
RBT	0	0	0
RI	0	0	0
RN	1	9	0
RP	0	2	0
TO	0	6	0
UH	0	9	0
VB	2	19	0
VBD	0	0	0
VBG	0	13	0
VBN	0	2	0
VBZ	0	2	0
VD	0	0	0
WDT	0	4	0
WDTI	0	0	0
WP	0	0	0
WP$	0	0	0
WP$I	0	0	0
WPA	0	0	0
WPI	0	0	0
WPO	0	0	0
WPOI	0	0	0
WRB	1	8	0
XNOT	0	2	0
ZZ	0	4	0


Transitions TO specified tags

To ->	,	.	*'	**'	(	)	*-	;	:
From:
!	1	0	0	414	0	13	10	0	0
&FO	163	135	1	2	18	29	1	5	1
&FW	114	43	1	39	28	12	9	7	3
(	0	0	13	0	2	0	0	0	0
)	157	257	6	1	4	8	7	16	7
*'	0	0	0	0	0	0	0	0	0
**'	32	39	3	0	12	5	8	4	1
*-	0	1	15	41	1	0	0	0	0
,	0	0	12	40	1	0	2	0	0
.	0	0	1	77	1	2	1	0	0
...	6	482	0	51	0	0	0	0	0
:	0	0	55	0	8	0	28	0	0
;	0	0	13	1	6	0	0	0	0
?	0	0	0	492	0	3	6	0	0
ABL	15	1	1	0	0	1	0	1	1
ABN	51	51	0	0	0	1	3	1	0
ABX	17	17	2	0	0	2	2	2	0
AP	24	27	2	1	0	1	2	1	1
AP$	0	0	37	0	0	0	0	0	0
APS	193	134	0	5	5	0	0	0	0
APS$	0	0	0	0	0	0	0	0	0
AT	0	0	6	0	0	0	0	0	0
ATI	0	0	6	0	0	0	0	0	0
ATP$	0	0	0	0	0	0	0	0	0
BE	10	8	3	0	1	0	1	1	1
BED	15	7	2	0	0	1	1	1	1
BEDZ	14	5	3	0	0	0	1	0	1
BEG	6	3	9	3	0	0	0	0	2
BEM	15	14	0	0	0	0	0	0	0
BEN	5	5	1	0	0	0	1	0	0
BER	16	5	3	0	1	1	1	0	2
BEZ	20	4	4	0	1	0	2	0	1
CC	14	0	5	0	2	0	1	0	0
CD	82	94	1	3	7	76	3	8	9
CD$	0	0	0	0	0	0	0	0	0
CD-CD	75	103	0	0	5	169	5	33	5
CD1	40	51	1	3	2	23	6	1	3
CD1$	0	0	0	0	0	0	0	0	0
CD1S	140	221	0	23	0	12	23	23	35
CDS	87	92	0	5	0	0	11	22	11
CS	16	0	6	0	1	0	1	0	1
DO	31	31	0	1	0	2	4	1	1
DOD	24	25	0	0	0	0	3	1	0
DOZ	27	17	2	0	2	0	4	2	0
DT	43	31	4	1	0	0	3	1	2
DT$	0	0	0	0	0	0	0	0	0
DTI	8	6	3	0	0	0	0	0	0
DTS	10	4	6	0	1	1	1	0	1
DTX	15	11	4	0	0	0	0	0	0
EX	0	0	0	0	0	0	1	0	0
HV	6	4	1	0	0	0	1	0	0
HVD	5	2	0	0	0	0	0	0	0
HVG	15	4	4	0	0	0	0	0	0
HVN	5	30	5	0	0	0	0	0	0
HVZ	9	2	2	0	0	1	0	0	0
IN	2	2	5	0	1	0	0	0	0
JJ	54	46	2	4	1	1	3	3	1
JJB	6	2	6	6	1	0	0	0	0
JJR	41	63	1	4	1	1	1	2	0
JJT	45	21	4	3	0	1	0	0	1
JNP	34	10	7	3	1	1	1	2	1
MD	12	5	2	0	0	0	1	0	0
NC	46	93	0	454	14	9	0	0	0
NN	119	134	1	8	5	3	8	6	3
NN$	7	6	4	5	0	0	0	0	0
NNP	183	142	0	3	9	17	23	12	6
NNP$	0	36	0	0	0	0	0	0	0
NNPS	149	111	0	13	15	9	11	4	4
NNPS$	0	0	0	0	0	0	0	0	0
NNS	129	132	1	7	6	4	9	7	4
NNS$	0	7	3	0	0	0	0	0	0
NNU	51	99	1	0	25	49	5	4	1
NNUS	43	85	0	0	0	43	0	0	0
NP	157	111	2	6	9	7	9	4	3
NP$	16	13	22	1	3	1	1	1	0
NPL	243	115	1	5	12	6	8	2	5
NPL$	0	0	50	0	0	0	0	0	0
NPLS	187	147	0	0	13	13	27	13	0
NPS	221	152	0	20	12	4	4	0	0
NPS$	32	0	32	0	0	0	0	0	0
NPT	66	26	0	3	2	1	6	1	0
NPT$	0	8	16	0	0	0	0	0	0
NPTS	121	66	0	0	22	0	0	0	0
NPTS$	0	0	0	0	0	0	0	0	0
NR	156	137	0	4	2	5	11	5	2
NR$	0	32	65	0	0	0	0	0	0
NRS	92	108	0	0	0	0	0	0	0
OD	38	16	2	1	1	1	2	0	1
PN	74	82	0	3	0	1	10	3	2
PN$	0	0	0	0	0	0	0	0	0
PP$	0	0	3	0	0	0	0	0	0
PP$$	201	247	0	0	6	6	11	0	0
PP1A	3	1	1	0	0	0	3	0	0
PP1AS	5	0	0	0	0	0	2	0	0
PP1O	124	165	2	7	0	0	15	3	8
PP1OS	86	122	8	4	0	6	6	4	6
PP2	43	42	0	1	0	0	6	1	0
PP3	40	66	1	1	1	1	4	2	1
PP3A	3	0	1	0	0	0	0	0	0
PP3AS	2	0	1	0	1	0	0	0	0
PP3O	128	220	1	0	0	1	6	9	3
PP3OS	101	190	3	5	3	0	12	10	2
PPL	140	156	2	1	4	2	5	4	7
PPLS	115	196	0	6	6	0	6	9	3
QL	0	1	1	0	0	0	1	0	0
QLP	128	119	0	0	0	0	18	18	9
RB	111	72	2	1	1	2	4	3	3
RB$	0	59	0	0	0	0	0	0	0
RBR	114	124	0	0	1	2	7	1	2
RBT	36	95	0	0	0	0	0	0	0
RI	214	260	0	2	11	18	4	13	9
RN	147	89	1	1	2	0	5	3	2
RP	84	113	0	3	1	1	5	4	2
TO	1	2	2	0	0	0	1	0	0
UH	618	116	0	8	0	2	32	8	3
VB	36	44	3	2	1	0	4	2	2
VBD	53	62	2	0	0	0	2	1	9
VBG	32	32	4	1	2	0	2	2	3
VBN	59	74	3	2	2	1	5	4	2
VBZ	44	29	5	3	1	1	2	1	12
VD	0	0	0	0	0	0	0	0	0
WDT	25	1	1	0	0	0	0	0	0
WDTI	31	0	15	0	0	0	0	0	0
WP	20	0	4	0	0	0	0	0	0
WP$	0	0	0	0	0	0	0	0	0
WP$I	0	0	0	0	0	0	0	0	0
WPA	0	0	0	0	0	0	0	0	0
WPI	8	0	8	8	0	0	8	0	8
WPO	24	0	8	0	0	0	0	0	0
WPOI	0	0	0	0	0	0	0	0	0
WRB	15	3	2	0	1	0	1	0	0
XNOT	22	14	2	0	0	1	2	0	0
ZZ	162	101	0	15	10	103	2	20	2

Transitions TO specified tags

To ->	!	...	?
From:
!	0	2	0
&FO	0	0	1
&FW	8	1	5
(	0	0	0
)	1	0	2
*'	0	2	0
**'	0	0	2
*-	0	0	1
,	0	0	0
.	0	0	0
...	6	0	25
:	0	0	0
;	0	0	0
?	0	2	0
ABL	0	1	0
ABN	2	1	6
ABX	2	0	0
AP	1	1	2
AP$	0	0	0
APS	0	0	27
APS$	0	0	0
AT	0	0	0
ATI	0	0	0
ATP$	0	0	0
BE	0	1	2
BED	0	0	0
BEDZ	0	0	0
BEG	0	0	0
BEM	0	0	4
BEN	0	1	3
BER	2	0	2
BEZ	1	0	1
CC	0	0	0
CD	0	0	1
CD$	0	0	0
CD-CD	0	5	0
CD1	2	1	2
CD1$	0	0	0
CD1S	0	0	12
CDS	0	0	0
CS	0	0	0
DO	2	0	13
DOD	1	0	2
DOZ	0	0	0
DT	3	0	8
DT$	0	0	0
DTI	1	0	1
DTS	0	0	1
DTX	0	0	4
EX	0	0	1
HV	0	0	1
HVD	0	0	0
HVG	0	0	0
HVN	0	0	0
HVZ	0	0	0
IN	0	0	0
JJ	2	1	2
JJB	0	0	0
JJR	1	1	2
JJT	3	1	0
JNP	0	0	0
MD	0	0	0
NC	5	5	5
NN	2	1	6
NN$	1	0	1
NNP	0	3	9
NNP$	0	0	0
NNPS	2	0	2
NNPS$	0	0	0
NNS	2	1	5
NNS$	0	0	0
NNU	0	3	0
NNUS	0	0	0
NP	5	2	9
NP$	0	0	1
NPL	0	2	9
NPL$	0	0	0
NPLS	0	0	0
NPS	0	4	20
NPS$	0	0	0
NPT	1	1	4
NPT$	0	0	0
NPTS	11	0	0
NPTS$	0	0	0
NR	5	1	13
NR$	0	0	0
NRS	0	0	0
OD	0	1	0
PN	3	2	11
PN$	0	0	0
PP$	0	0	0
PP$$	0	0	52
PP1A	1	0	2
PP1AS	0	0	6
PP1O	9	3	25
PP1OS	2	2	10
PP2	7	4	23
PP3	3	1	12
PP3A	0	0	2
PP3AS	0	0	5
PP3O	5	3	12
PP3OS	4	1	10
PPL	4	0	7
PPLS	3	0	12
QL	0	0	0
QLP	13	9	9
RB	2	1	3
RB$	0	0	0
RBR	2	3	3
RBT	0	0	12
RI	4	0	4
RN	4	1	22
RP	4	1	6
TO	0	0	0
UH	56	18	33
VB	2	1	7
VBD	0	0	1
VBG	1	1	3
VBN	1	1	4
VBZ	1	1	1
VD	0	0	0
WDT	1	0	5
WDTI	0	0	0
WP	0	0	0
WP$	0	0	0
WP$I	0	0	0
WPA	0	0	0
WPI	0	0	16
WPO	0	0	0
WPOI	0	0	0
WRB	0	1	7
XNOT	2	0	2
ZZ	1	0	2


Condensed transitions FROM punctuation:
We look at bands of transition frequencies, and see how many tags there are in
each band. (Bands are again normalised transitions * 1000, rounded).

From ->	,	.	*'	**'	(	)	*-	;	:
0	64	125	66	83	66	64	64	66	71
1-5	34	7	33	36	41	40	34	37	40
6-10	11	0	9	4	7	13	10	10	4
11-49	19	0	21	8	12	12	20	15	15
50-99	4	1	3	1	5	2	4	4	3
100+	1	0	1	1	2	2	1	1	0

From ->	!	...	?
0	116	78	127
1-5	13	25	4
6-10	2	15	1
11-49	1	13	0
50-99	0	1	0
100+	1	1	1

To ->	,	.	*'	**'	(	)	*-	;	:
0	33	37	51	76	75	79	54	77	78
1-5	10	20	59	36	38	34	49	38	42
6-10	10	7	11	9	9	7	18	8	10
11-49	39	24	9	8	11	10	12	10	3
50-99	13	16	3	1	0	1	0	0	0
100+	28	29	0	3	0	2	0	0	0

To ->	!	...	?
0	84	90	62
1-5	42	41	40
6-10	4	1	14
11-49	2	1	16
50-99	1	0	1
100+	0	0	0
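
The banding can be reproduced as follows. The values are the non-zero entries
of the "From !" column in the full table above, padded with zeros to the 133
tags listed there. Note that a value of exactly 10 falls in the 6-10 band, so
the next band effectively starts at 11. The names BANDS and band_counts are
hypothetical.

```python
from collections import Counter

BANDS = [(0, 0, "0"), (1, 5, "1-5"), (6, 10, "6-10"),
         (11, 49, "11-49"), (50, 99, "50-99"), (100, None, "100+")]

def band_counts(values):
    """Count how many values fall into each band."""
    counts = Counter({label: 0 for _, _, label in BANDS})
    for v in values:
        for lo, hi, label in BANDS:
            if v >= lo and (hi is None or v <= hi):
                counts[label] += 1
                break
    return counts

# Non-zero transitions FROM '!' (normalised * 1000), read off the table.
nonzero = [6, 13, 414, 10, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1]
from_excl = band_counts(nonzero + [0] * (133 - len(nonzero)))
```

The result agrees with the '!' column of the condensed table: 116 tags at
zero, 13 in 1-5, 2 in 6-10, 1 in the next band, none in 50-99 and 1 at 100+.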

Comments:
1. Range. In general, there are more tags which have non-zero transitions TO
punctuation than FROM punctuation.
2. Clustering. In general, the tags with non-zero transitions TO punctuation
have a wider range of values than the ones FROM punctuation.

Conclusion: an ambiguous word which precedes a punctuation symbol will get
less help in the disambiguation than one which follows it. So we should expect
to see more errors before punctuation than after it.


Errors before/after punctuation
-------------------------------
Now we look at the frequency of various types of errors occurring immediately
before or after punctuation, to see if there are any regularities. Use all/b
as test. Notation: X/Y means correct tag X, tag Y predicted by labeller.

Note that the fact that an error occurs just before or after punctuation does
not mean that the punctuation "causes" it. The aim is to see if there are any
classes that significantly stand out.
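
A sketch of how such a tally could be made from the correct and predicted tag
sequences. The function name, the input format and the example are
assumptions; the key for each error is written X/Y as in the notation above.

```python
from collections import Counter

PUNCT_TAGS = {",", ".", "*'", "**'", "(", ")", "*-", ";", ":", "!", "...", "?"}

def adjacent_errors(gold, predicted):
    """gold, predicted: equal-length lists of tags for one text.

    Returns two Counters keyed by ("X/Y", punctuation_tag): errors
    immediately before a punctuation tag, and errors immediately after one.
    """
    before, after = Counter(), Counter()
    for i, (g, p) in enumerate(zip(gold, predicted)):
        if g == p:
            continue
        err = f"{g}/{p}"
        if i + 1 < len(gold) and gold[i + 1] in PUNCT_TAGS:
            before[(err, gold[i + 1])] += 1
        if i > 0 and gold[i - 1] in PUNCT_TAGS:
            after[(err, gold[i - 1])] += 1
    return before, after

# Toy sequences, not taken from the corpus.
gold = ["ATI", "NN", ",", "CS", "PP3A", "VBD", "."]
pred = ["ATI", "JJ", ",", "IN", "PP3A", "VBN", "."]
before, after = adjacent_errors(gold, pred)
```

An error on a word standing between two punctuation marks is counted in both
tallies, which seems the natural reading of "immediately before or after".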

(* in totals column means see other part of table.)

1. Error immediately before punctuation symbol.

	,	.	*'	**'	(	)	*-	;	Total
AP/JJ	1	1							2
AP/RB	2								2
AP/RBR	2	3					1		6
AP/RBT	2								2
CC/RI	9								9
CD/CD-CD 6								7
CD/NNU		3			1	1			5
CD-CD/CD	1							1
CD1/CD	3	1					1		5
CD1/CD-CD	1							1
CS/DT	3								3
CS/IN	1						1		2
CD/NN	1								1
DT/CS	1								1
DT/PPLS		2							2
DTI/RB	2								2
HVD/HVN		1							1
IN/CS	2								2
IN/RB	1								2
IN/RI		2							2
IN/RP	2	1					1		4
IN/TO			1				1		2
JJ/NN	4	4							8
JJ/RB	3	5							9
JJ/VB		1							2
JJ/VBG		1							1
JJ/VBN	3	5							8
NN/JJ	8	8		1		1			18
NN/JJB	1			1					2
NN/NNS	2	6							8
NN/VB	1	1		1			1		4
NN/VBG		1			1				2
NN/VBN	2								2
NNPS/JNP	2							2
NNPS/NNP	1							1
NNS/VBZ	1	1							2
NNU/CD	4	6			1	1	2		14
NP/NPL	2	2							4
NP/NR		1							1
PN/RB	2								2
RB/DTX		1							1
RB/JJ	2	2		1					5
RB/JJB	1								1
RB/NN		1							1
RB/QLP	4								4
RB/RBR	1								1
RBT/AP		1							1
RI/IN	2	1							3
RP/IN		1							1
RP/RB	1	1							2
TO/IN			1						1
VB/NN	1	3							7
VB/RB		1							1
VB/UH	1								1
VBD/VBN	2	4							6
VBG/JJ	1								1
VBN/JJ	1	2							3
VBN/VBD	5	1					1		7
VBZ/NNS	3								3
WP/DT	1								1
Total	97	80	2	4	3	3	9	0	205


	:	!	...	?	Total
CD/CD-CD		1		*
IN/RB				1	*
JJ/RB		1			*
JJ/VB				1	*
VB/NN		2		1	*
Total	0	3	1	3	*


2. Error immediately after punctuation symbol.

	,	.	*'	**'	(	)	*-	;	Total
AP/JJ		1							1
AP/RBR	1								1
CD/NNU	1								1
CD-CD/CD	1							1
CD1/CD	1					1			2
CS/IN	2	2	2				1		7
CS/QL	1	2					1		4
DT/CS		1							1
DT/PPLS				1					1
IN/CC		1							1
IN/CS	3	2							5
IN/NNU	2				1				3
JJ/NN	1								1
JJ/VBG	1	1		2					4
JJ/VBN		1							1
JJB/IN		1					1		2
NN/JJ	3	3		3					9
NN/NC		1							1
NN/VB		1							1
NN/VBG		1							1
NNS/VBZ	1								1
NNU/CD		4				2			6
NP/NPT	1	1							2
NPT/NP		1							1
OD/RB		1							1
RB/ATI		1							1
RB/CS	4								4
RB/IN		2					2		4
RB/JJ	1	1							2
RB/QL		1							1
RBR/JJR		1							1
RBR/RB	1								1
RP/IN		1				1			2
TO/IN		1							1
VB/AP	1								1
VB/NN		4		1		1			6
VB/RP		1							1
VB/VBN		1							1
VBG/JJ	1	1							2
VBG/NN	2	1							3
VBN/VBD	2	1							3
VBZ/NNS		1							1
WRB/RB		1							1
Total	30	43	2	6	1	5	5	0	93


	:	!	...	?	Total
RBT/JJT				1	1
Total	0	0	0	1	*


3. Total errors, repeated and summed.

(Before is number of errors before punctuation, After is number of errors
after punctuation.)

	,	.	*'	**'	(	)	*-	;
Before	97	80	2	4	3	3	9	0
After	30	43	2	6	1	5	5	0
Both	127	123	4	10	4	8	14	0

	:	!	...	?
Before	0	3	1	3
After	0	0	0	1
Both	0	3	1	4

Comment: there are in general more errors before punctuation than after, as
anticipated.
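
The arithmetic behind this comment, using the Before and After rows of the two
tables above:

```python
symbols = [",", ".", "*'", "**'", "(", ")", "*-", ";", ":", "!", "...", "?"]
before = [97, 80, 2, 4, 3, 3, 9, 0, 0, 3, 1, 3]
after = [30, 43, 2, 6, 1, 5, 5, 0, 0, 0, 0, 1]

both = [b + a for b, a in zip(before, after)]
total_before, total_after = sum(before), sum(after)
ratio = total_before / total_after  # errors before per error after

# Symbols for which errors after outnumber errors before.
exceptions = [s for s, b, a in zip(symbols, before, after) if a > b]
```

Overall there are 205 errors before punctuation against 93 after, a ratio of
a little over 2 to 1. The exceptions are **' and ), both closing symbols, but
the counts involved are very small.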


Performance variation with punctuation
--------------------------------------
1. Does punctuation help or hinder the labeller? To test this, first try
simply skipping certain punctuation symbols and seeing how the performance
varies. Do this test on all/b.

Skipped		All		Ambig		Skip
none		59302	97.77%	33355	96.10%	20964	(as above)
,		56804	97.58%	33303	95.95%	23410
.		56466	97.40%	30519	95.30%	23649
*'		59007	97.75%	33349	96.08%	21253
**'		59023	97.76%	33357	96.10%	21245
*-		59035	97.75%	33353	96.09%	21229
?		59119	97.74%	33345	96.07%	21137
)		59214	97.76%	33352	96.09%	21049
(		59209	97.75%	33347	96.07%	21049
;		59224	97.76%	33353	96.09%	21040
:		59205	97.76%	33258	96.07%	21057
!		59252	97.76%	33351	96.08%	21010

Conclusion: punctuation has a positive effect on performance (so if it is ever
to be skipped, it can't be done in a simple-minded way.)
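
A sketch of the skipping step, assuming the text is held as a list of
word/tag pairs and that a skipped symbol is simply removed before tagging
(the function name and sample are hypothetical). The second half checks the
Skip column against the occurrence frequencies for part b given earlier.

```python
def skip_symbols(tagged, skip):
    """tagged: list of (word, tag) pairs; skip: set of words to pass over.

    Returns the remaining pairs and the number of words skipped.
    """
    kept = [(w, t) for w, t in tagged if w not in skip]
    return kept, len(tagged) - len(kept)

# Toy sample, not corpus text.
sample = [("he", "PP3A"), ("said", "VBD"), (",", ","), ("yes", "UH"), (".", ".")]
kept, n_skipped = skip_symbols(sample, {","})

# Consistency check on figures quoted in these notes: the Skip column should
# be the baseline skip count plus the frequency of the symbol in part b.
baseline_skip = 20964
freq_in_b = {",": 2446, ".": 2685, "*'": 289}
expected_skip = {s: baseline_skip + n for s, n in freq_in_b.items()}
```

The expected values (23410, 23649 and 21253) match the Skip column, so the
extra skipped words are accounted for by the punctuation symbol alone.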

2. A related test is to ignore certain easily recognised parenthetics. The
only ones which are easy are (...) and *'...**'. This is intended to deal with
examples such as this sentence (I mean the one you are reading) where the
parenthetic breaks the syntactic structure, as opposed to this one (where it
does not). Such sequences are not skipped if they extend over a sentence
boundary.

Skipped		All		Ambig		Skip
none		59302	97.77%	33355	96.10%	20964	(as above)
(...)		58602	97.77%	33069	96.12%	21683
*'...**'	57071	97.66%	32382	95.94%	23180

Conclusion: omitting parenthesised sequences (...) brings about a slight
improvement on ambiguous words (96.12% against 96.10%), with the figure for
all words unchanged at 97.77%. These figures must be treated with a bit of
suspicion, however, because the improvement may simply be a result of there
being fewer ambiguous words, having got rid of the ones in the parenthetic.
See below for a more detailed look.

Omitting quoted sequences decreases performance. The same caveat applies.
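
A sketch of the parenthetic-skipping rule described in point 2. It assumes the
delimiters are skipped along with their contents, that '.', '!' and '?' mark
sentence boundaries, and that nesting can be ignored; none of this is stated
in these notes, and the names are hypothetical.

```python
PAIRS = {"(": ")", "*'": "**'"}
SENTENCE_END = {".", "!", "?"}  # assumption: symbols that end a sentence

def skip_parenthetics(words, opener):
    """Drop opener ... closer spans unless a sentence end occurs inside."""
    closer = PAIRS[opener]
    out, i = [], 0
    while i < len(words):
        if words[i] == opener:
            j = i + 1
            while (j < len(words) and words[j] != closer
                   and words[j] not in SENTENCE_END):
                j += 1
            if j < len(words) and words[j] == closer:
                i = j + 1  # skip the whole span, delimiters included
                continue
        out.append(words[i])
        i += 1
    return out

inside = "this sentence ( I mean this one ) breaks".split()
across = ["a", "(", "b", ".", "c", ")", "d"]
skipped_inside = skip_parenthetics(inside, "(")
skipped_across = skip_parenthetics(across, "(")
```

The first example loses its parenthetic; the second is left untouched because
the bracketed sequence extends over a sentence boundary.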


An examination of errors
------------------------
A self test on v7b was run and the error sequences examined. Where there
appeared to be some chance that the error was related to punctuation, a
related form of the text was produced which removed the effect of punctuation
and the labeller was tested with this data.

In cases where there were several closely related instances of a phenomenon,
only one was tested. (So the absolute numbers in the results need to be taken
with a pinch of salt.) Also the whole sample is rather small, which is a
further reason for caution.

In detail, in each of the following pairs, the original sentence and the test
phrase (which is always bracketed by unambiguous words and/or anchors) are
shown. In the original text, the word which was incorrect is shown with an
underscore (there may in fact have been more than one such word, but there is
one which is the focus of the test). For each example, the original error
class is given. The anchor ^ is shown where its presence is significant. yes/no
indicates whether the error was overcome. Each example is classified according
to why the change was made, as follows:
P = the punctuation was omitted because the text was still parsable without
    it.
A = adverbial or parenthetic within the punctuation; without it the result is
    parsable, so try missing it out. (Also relative clauses).
S = the element following the punctuation could form a sentence in its own
    right, so try starting a new sentence at that point.
E = same as S, but a sentence was ended.
I = parenthetic phrase tagged independently of its context (by putting an
    anchor before it). Related to S and E.
M = omission of punctuation and modifiers, such as adjectives.
L = omission of punctuation in a list. Related to P.

Counts:
	P	A	S	E	I	M	L	Total
No	17	4	9	4	7	1	2	44
Yes	11	8	3	1	2	3	2	30
Total	28	12	12	5	9	4	4	74

Conclusion: omitting parenthetics and adverbials (class A) is worth doing, as
is omitting modifiers (class M). But this should not be done blindly.
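
The proportions behind this conclusion, computed from the counts table:

```python
classes = ["P", "A", "S", "E", "I", "M", "L"]
no = [17, 4, 9, 4, 7, 1, 2]
yes = [11, 8, 3, 1, 2, 3, 2]

# Fraction of tested errors that were overcome, per class and overall.
rates = {c: y / (y + n) for c, y, n in zip(classes, yes, no)}
overall = sum(yes) / (sum(yes) + sum(no))
ranked = sorted(rates, key=rates.get, reverse=True)
```

M (3 of 4) and A (8 of 12) are the only classes in which more than half of
the errors were overcome; L is exactly half, and P, S, I and E are all below
40%. Overall 30 of the 74 tests succeeded. With samples this small the
ranking should not be pushed far.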

Full listing (ordered by class of oracle and success/failure):

** P, no

AP/AT	no	P (scare quotes)
on the grounds that *' _a little knowledge **' can
 the grounds that a little knowledge

AP/RB	no	P
in the parliamentary labour party *- _much more than in the
 party much more than in the

CS/DT	no	P
which article stated _that : *' disintegration is being
 article stated that disintegration 

CS/IN	no	P (scare quote)
criticism of the Hungarian papers as *' deadly dull .
 papers as deadly

DT/CS	no	P
today , in London , _that rash and thoughtless policy has caused a crisis
 today that rash

IN/CS	no	P
has existed and worked well *- _since the defeat of the Berlin 
 existed and worked well since the

IN/CS	no	P
this was not murder *- _because of his diminished
 murder because of his diminished

IN/RB	no	P
shortly afterwards , _at about 5.50 !
 afterwards at about 5.50 !

IN/RP	no	P (start of a list)
he was talking _about *- our post offices , the old ones , the shabby relics
 talking about our

JJ/NN	no	P
for the _living , not the dead
 the living not

JJ/RB	no	P
we shall stand _still , or collapse , \0Mr Khrushchev
 shall stand still or

JJ/RB	no	P
to *' go it _alone **'
 go it alone .

NN/JJ	no	P
as things stand at _present , certain inequalities
 things stand at present certain

NN/JJ	no	P
the contributions , _total and fraction will all go up
 contributions total and fraction

NN/VB	no	P
publish his name and _address *' as it might well cost
 name and address as it

RB/AP	no	P
statement that *' _only swimmers and learners **'
 statement that only swimmers

VBZ/NNS	no	P
a society where the individual _matters , and
 the individual matters and not

** P, yes
CS/DT	yes	Px2
which article stated _that : *' disintegration is being
 article stated that *' disintegration 

IN/CS	yes	P
social difficulty for 16 years *- _since the close of
 years since the

IN/CS	yes	P
the position of splendid isolation , _except for pipelines
 isolation except for pipelines

IN/CS	yes	P
to taking away more _than , say , 15 \0s
 away more than 15 \0s

IN/TO	yes	P
what proportion of the people belongs _to *' the opposition **'
 belongs to the opposition

IN/TO	yes (but 'subject' now wrong)	P
would be subject _to *' fall-out **' radiation
 be subject to fall-out

JJ/RB	yes	P
always shown himself _ready *- and no one who known him can doubt
 himself ready to lead the

NN/JJ	yes	P
coal miner's *' _plus , **' for instance ,
 miner's plus ,

RB/IN	yes	P
he said that *' about 40 per \0cent of
 about 40 per \0cent of the

TO/IN	yes	P
makes no matter _to *' go it alone
 matter to go it

VB/NN	yes	P
they might not *' work . **'
 not work .

** A, no
ABL/RB	no	A
but not , perhaps , _quite the impression
 not quite the

RB/CS	no	A
their books tidied ; and often , _even their shoes cleaned
 and even their

RB/JJ	no	A
the crisis comes surprisingly _late , for Britain's trading position
deteriorated sharply last year , and is now getting
 surprisingly late and is

VBD/VBN	no	A
the shipbuilders , however, _put forward
 shipbuilders put forward an

** A, yes
AP/RB	yes	A
here , however , _much , if not indeed all , may
 ^ much may

CD/NNU	yes	A
of the 550,000 young people starting work in 1960 _420,000 ( 73 per cent ) went
 starting work in 1960 420,000 went

DT/CS	yes	A
there is only one alternative , and _that , again , is war .
 is only one alternative and that is

IN/CS	yes	A
could not go much farther _than , for example , the truism
 farther than the

JJ/RB	yes	A
it would find it _hard , and perhaps indeed , impossible
 it hard to absorb

NN/JJ	yes	A
Foulkes , who is himself a communist , _put as good a face on it
 Foulkes put as good a face on it

NN/JJ	yes	A
a blundering _general , with the active encouragement of the English and the
French , who destroyed Russian
 blundering general who destroyed Russian

VBD/VBN	yes	A
Lord Melbourne , the Prime Minister , _proposed that
 Melbourne proposed that the

** S, no
CS/IN	no	S
this has gone on , _for the Fisher act was not a standstill reform
 ^ for the

CS/IN	no	S
the railway engine , and so on _in order that the solid achievements
 ^ in order that

CS/IN	no	S
gone full circle _for , after the austerities of the first war , the grim
 ^ for the grim (+reason)

CS/QL	no	S
it was a Reading car club , _so a good many people on the
 ^ so a good many people on the

DT/WP	no	S
as for the means test , _that would be retained
 ^ that would be

IN/CS	no	S
in those days of course *- _until his death in December 1855 .
 ^ until his death

IN/CS	no	S
this was not murder *- _because of his diminished
 ^ because of his diminished

JJ/AP	no	S
when great issues are shirked , _little differences are given
^ little differences

NN/JJ	no	S
the contributions , _total and fraction will all go up
 contributions total and fraction

** S, yes
AP/RB	yes	S
in the parliamentary labour party *- _much more than in the
 much more than in the

DT/CS	yes	S
today , in London , _that rash and thoughtless policy has caused a crisis
 ^ that rash

WDTI/WDT	yes	S
given four minutes' warning from Fylingdales , _which of your readers
 which of your

** E, no
IN/RP	no	E
at a pit I went _down , the list
 went down .

IN/RP	no	E
we live _in , and Oxford shares in most of the
 live in .

JJ/RB	no	E
we shall stand _still , or collapse , \0Mr Khrushchev
 shall stand still .

NN/JJ	no	E
the west have stood _firm , refusing to panic
 stood firm .

** E, yes
DT/CS	yes	E
having said _that , it must be made clear
 having said that .

** I, no
IN/ABL	no	I
older types of vehicle , _such as veteran motor-cars ,
 ^ such as veteran

IN/CS	no	I
I , _as one of the younger generation
 ^ as one of the

IN/CS	no	I
clear policy details *- _as on bases *- were
 ^ as on bases

IN/CS	no	I
parties ( _except the united federal ) have made
 ^ except the

NNPS/NNP	no	I
, defiant leader of the free _French , in Algiers
 the free French

RB/CC	no	I
railway engine , _and so on ,
 ^ and so on

VBD/VBN	no	I
whose pension , for which her husband _paid , is wiped out
 husband paid

** I, yes
DT/WP	yes	I
reduce these dates in the school year from three to two , _that is, at Easter
 ^ that is

JJ/RB	yes	I
Short ( _late Secretary \0I.L party )
^ late Secretary

** M, no
JJ/AP	no	M
they can not leave the _little , familiar ways of life alone
 the little ways

** M, yes
JJ/NN	yes	M
they have been impressed by _light , airy schools
 impressed by light schools

JJ/NN	yes	M
contrast his _firm , successful rule in Rhodesia
 contrast his firm successful

VB/NN	yes	M
parents , teachers , and children , _work as a team
 parents work as a team

** L, no
JJ/AP	no	L
they can not leave the _little , familiar ways of life alone
 the little familiar ways

NN/JJ	no	L
left , _right & centre
 left right & centre .

** L, yes
JJ/NN	yes	L
they have been impressed by _light , airy schools
 impressed by light airy schools

VB/NN	yes	L
parents , teachers , and children , _work as a team
 parents teachers and children work as a team
