Hidden Markov Models are widely used for statistical language modelling in various fields, e.g., part-of-speech tagging or speech recognition (Rabiner and Juang, 1986).
The models are based on Markov assumptions, which make it possible to view language prediction as a Markov process.
In general, we make the first-order Markov assumptions that the current tag depends only on the previous tag and that the current word depends only on the current tag.
These are very 'strong' assumptions, so first-order Hidden Markov Models have the advantage of drastically reducing the number of their parameters.
On the other hand, the assumptions prevent the model from utilizing enough of the constraints provided by the local context, and the resulting model consults only a single category as the context.
Much effort has been devoted in the past to making up for the insufficient contextual information of the first-order probabilistic model.
Second-order Hidden Markov Models with appropriate smoothing techniques show better performance than first-order models and are considered a state-of-the-art technique (Merialdo, 1994; Brants, 1996).
The complexity of the second-order model, however, is very high relative to the small improvement in performance.
IDIOMTAG serves as a front-end to the tagger and modifies some initially assigned tags in order to reduce the amount of ambiguity to be dealt with by the tagger.
IDIOMTAG can look at any combination of words and tags, with or without intervening words.
By using IDIOMTAG, the CLAWS system improved its tagging accuracy from 94% to 96-97%.
However, although IDIOMTAG proved fruitful, the manual process of producing idiom tags is very expensive.
Besides the original states representing each part-of-speech, the network contains additional states to reduce the noun/adjective confusion, and to extend the context for predicting past participles from preceding auxiliary verbs when they are separated by adverbs.
By using these additional states, the tagging system improved its accuracy from 95.7% to 96.0%.
However, the additional context is chosen by manually analyzing the tagging errors.
An automatic refining technique for Hidden Markov Models has been proposed by
It starts with some initial first-order Markov model.
Some states of the model are selected to be split or merged to take their predecessors into account.
As a result, each of the new states represents an extended context.
With this technique ,
In this paper, we present an automatic refining technique for statistical language models.
First, we examine the transition distributions of lexicalized categories.
Next, we break the uncommon ones out of their categories and make new states for them.
All processes are automated, and the user has only to determine the extent of the breaking-out.
From the statistical point of view, the tagging problem can be defined as the problem of finding the proper sequence of categories c1, ..., cn for a given sequence of words w1, ..., wn (we denote the i-th word by wi, and the category assigned to wi by ci), which is formally defined by the following equation:
argmax over c1, ..., cn of P(c1, ..., cn | w1, ..., wn)
With this model, we select the proper category for each word by making use of the contextual probabilities, P(ci | ci-1), and the lexical probabilities, P(wi | ci).
This model has the advantages of a well-founded theoretical framework, automatic learning from corpora and relatively high performance.
It has thereby been at the basis of most tagging programs created over the last few years.
For this model, the first-order Markov assumptions are made as follows:
P(ci | c1, ..., ci-1, w1, ..., wi-1) ≈ P(ci | ci-1)
P(wi | c1, ..., ci, w1, ..., wi-1) ≈ P(wi | ci)
With the first equation, we assume that the current category depends only on the immediately preceding category.
With the second equation, we assume that the current word depends only on its own category.
Through these assumptions, the Hidden Markov Models have the advantage of drastically reducing the number of parameters, thereby alleviating the sparse data problem.
However, as mentioned above, this model consults only a single category as context and does not utilize enough of the constraints provided by the local context.
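A minimal sketch of Viterbi decoding under this standard model is given below; the probability tables (p_init, p_trans, p_lex) and the small-probability floor for unseen words are illustrative assumptions rather than the exact implementation used in our experiments.

```python
from math import log

def viterbi(words, tags, p_init, p_trans, p_lex):
    """Find argmax over c1..cn of prod_i P(ci | ci-1) * P(wi | ci)."""
    floor = 1e-12  # assumed tiny probability for unseen events
    # delta[c] = best log-probability of any tag sequence ending in category c
    delta = {c: log(p_init.get(c, floor)) + log(p_lex.get(c, {}).get(words[0], floor))
             for c in tags}
    backpointers = []
    for w in words[1:]:
        prev, delta, ptr = delta, {}, {}
        for c in tags:
            # best previous category under the contextual probability P(c | b)
            best = max(prev, key=lambda b: prev[b] + log(p_trans.get(b, {}).get(c, floor)))
            delta[c] = (prev[best] + log(p_trans.get(best, {}).get(c, floor))
                        + log(p_lex.get(c, {}).get(w, floor)))
            ptr[c] = best
        backpointers.append(ptr)
    # trace back from the best final category
    current = max(delta, key=delta.get)
    sequence = [current]
    for ptr in reversed(backpointers):
        current = ptr[current]
        sequence.append(current)
    return list(reversed(sequence))
```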
The first-order Hidden Markov Models described in the previous section provide only a single category as context.
Sometimes this first-order context is sufficient to predict the following parts-of-speech, but at other times (probably much more often) it is insufficient.
The goal of the work reported here is to develop a method that can automatically refine the Hidden Markov Models to produce a more accurate language model.
We start with a careful observation of the assumptions made for the "standard" Hidden Markov Models.
The first of these equations embodies the first-order Markov assumption that the current category depends only on the immediately preceding category.
As we know, this is not always true, and the assumption restricts the disambiguation information to the first-order context.
The immediate ways of enriching the context are as follows:
(1) to lexicalize the context;
(2) to extend the context to a higher order.
To lexicalize the context, we include the preceding word in the context.
Contextual probabilities are then defined by P(ci | ci-1, wi-1).
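A minimal sketch of how such lexicalized contextual probabilities could be estimated from a tagged corpus is given below; the relative-frequency estimate, the back-off to the unlexicalized bigram for unseen contexts and the min_count threshold are illustrative assumptions, not the estimation procedure prescribed by the model.

```python
from collections import defaultdict

def estimate_lexicalized_context(tagged_sents, min_count=1):
    """Estimate P(ci | ci-1, wi-1), backing off to P(ci | ci-1) when needed."""
    lex_counts = defaultdict(lambda: defaultdict(int))   # (ci-1, wi-1) -> counts of ci
    ctx_counts = defaultdict(lambda: defaultdict(int))   # ci-1 -> counts of ci
    for sent in tagged_sents:                            # sent = [(word, tag), ...]
        for (w_prev, c_prev), (_, c_cur) in zip(sent, sent[1:]):
            lex_counts[(c_prev, w_prev)][c_cur] += 1
            ctx_counts[c_prev][c_cur] += 1

    def prob(c_cur, c_prev, w_prev):
        lex = lex_counts.get((c_prev, w_prev))
        if lex and sum(lex.values()) >= min_count:
            return lex.get(c_cur, 0) / sum(lex.values())
        ctx = ctx_counts.get(c_prev, {})
        return ctx.get(c_cur, 0) / max(sum(ctx.values()), 1)

    return prob
```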
To extend the context to a higher order, we extend the contextual probability to the second order.
Contextual probabilities are then defined by P(ci | ci-1, ci-2).
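Similarly, a second-order contextual model can be estimated from tag trigram counts; the linear-interpolation smoothing and the fixed weights in the sketch below are only one of the possible "appropriate smoothing techniques" and are assumed here purely for illustration.

```python
from collections import Counter

def estimate_second_order_context(tag_seqs, l3=0.6, l2=0.3, l1=0.1):
    """Interpolated estimate of P(ci | ci-1, ci-2)."""
    tri, bi, uni, total = Counter(), Counter(), Counter(), 0
    for tags in tag_seqs:
        tri.update(zip(tags, tags[1:], tags[2:]))
        bi.update(zip(tags, tags[1:]))
        uni.update(tags)
        total += len(tags)

    def prob(c, c_prev, c_prev2):
        p3 = tri[(c_prev2, c_prev, c)] / bi[(c_prev2, c_prev)] if bi[(c_prev2, c_prev)] else 0.0
        p2 = bi[(c_prev, c)] / uni[c_prev] if uni[c_prev] else 0.0
        p1 = uni[c] / total if total else 0.0
        return l3 * p3 + l2 * p2 + l1 * p1

    return prob
```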
The simple way of enriching the context is to extend or lexicalize it uniformly.
The uniform extension of context to the second order is feasible with an appropriate smoothing technique and is considered a state-of-the-art technique, though its complexity is very high: in the case of the Brown corpus, we need about 0.6 million trigrams (85^3 = 614,125 with our 85-category tag set).
An alternative to the uniform extension of context is the selective extension of context.
The uniform lexicalization of context is computationally prohibitive: in the case of the Brown corpus, we would need almost 3 billion lexicalized bigrams.
Moreover, many of these bigrams neither contribute to the performance of the model nor occur frequently enough to be estimated properly.
An alternative to the uniform lexicalization is the selective lexicalization of context, which is the main topic of this paper.
This section describes a new technique for refining the Hidden Markov Model, which we call selective lexicalization.
Our approach automatically finds syntactically uncommon words and makes a new state (we call it a lexicalized state) for each of these words.
Given a fixed set of categories, C = {c1, c2, ..., cn}, each state of the Hidden Markov Model corresponds to a category.
The random variable for the next category takes its values from this set (c ∈ C).
We convert the transition behaviour of each state into an n-dimensional state transition vector, V(cj), whose i-th component is the contextual probability P(ci | cj).
The (squared) distance between two arbitrary vectors, x and y, is then computed as follows:
d²(x, y) = Σi (xi - yi)²
Similarly, we define the lexicalized state transition vector, V(cj, w), whose i-th component is the lexicalized contextual probability P(ci | cj, w).
In this situation, it is possible to regard each lexicalized state transition vector, V(cj, w), as a member of the cluster whose centroid is the ordinary state transition vector, V(cj), of its category.
We can then compute the deviation of each lexicalized state transition vector, V(cj, w), from its centroid, d²(V(cj, w), V(cj)).
As the corresponding figure shows, the majority of the vectors are near their centroids, and only a small number of vectors are very far from their centroids.
In the first-order context model (without considering lexicalized context), each centroid represents all the members belonging to its cluster.
In fact, the deviation of a vector is a kind of (squared) error for the vector.
The error for a cluster is the sum of the deviations of its member vectors,
error(cj) = Σw d²(V(cj, w), V(cj)),
and the error for the overall model is simply the sum of the individual cluster errors:
E = Σj error(cj).
Now, we can break out a few lexicalized state vectors which have large deviations (i.e., those far from their centroids), thereby reducing the overall error of the model.
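A minimal sketch of this selection step is given below: it builds the ordinary and lexicalized state transition vectors from a tagged corpus, computes each squared deviation and returns the (category, word) pairs with the largest deviations as candidates for lexicalization. The min_count guard against poorly estimated vectors and the top_n cut-off are illustrative assumptions; the extent of the breaking-out is left to the user, as noted earlier.

```python
from collections import defaultdict

def lexicalization_candidates(tagged_sents, categories, min_count=10, top_n=210):
    """Rank (category, word) pairs by the squared deviation of their
    lexicalized state transition vector from the category's centroid."""
    trans = defaultdict(lambda: defaultdict(int))      # cj -> counts of the following category
    lex_trans = defaultdict(lambda: defaultdict(int))  # (cj, w) -> counts of the following category
    for sent in tagged_sents:                          # sent = [(word, tag), ...]
        for (w_prev, c_prev), (_, c_cur) in zip(sent, sent[1:]):
            trans[c_prev][c_cur] += 1
            lex_trans[(c_prev, w_prev)][c_cur] += 1

    def vector(counts):
        n = sum(counts.values())
        return [counts.get(c, 0) / n for c in categories]

    centroid = {c: vector(counts) for c, counts in trans.items()}
    scored = []
    for (c_prev, w_prev), counts in lex_trans.items():
        if sum(counts.values()) < min_count:
            continue
        v = vector(counts)
        deviation = sum((x - y) ** 2 for x, y in zip(v, centroid[c_prev]))
        scored.append((deviation, c_prev, w_prev))
    scored.sort(reverse=True)
    return scored[:top_n]
```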
As an example, let us consider the preposition cluster.
The values of the components of the centroid, V(preposition), and of the lexicalized vectors for individual prepositions are shown in the corresponding figures.
As you can see in these figures, most of the prepositions, including in and with, are immediately followed by an article (AT), a noun (NN) or a pronoun (NP), but the word out as a preposition shows a completely different distribution.
Therefore, it is a good choice to break out the lexicalized vector of out from the preposition cluster.
From the viewpoint of the network, the state representing prepositions is split into two states: one represents ordinary prepositions other than out, and the other represents the special preposition out; we call the latter a lexicalized state.
This process of splitting is illustrated in the corresponding figure.
Splitting a state results in some changes to the parameters.
The changes to the parameters resulting from lexicalizing a word, wk, in a category, cj, are indicated in the corresponding table; the affected contextual probabilities range over every category c (c ∈ C).
This full splitting increases the complexity of the model rapidly, so that estimating the parameters may suffer from data sparseness.
To alleviate this, we use pseudo splitting, which leads to a relatively small increase in the number of parameters.
The changes to the parameters under pseudo splitting are indicated in the corresponding table.
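As an illustration only, the following sketch shows one way the full-splitting case could update the underlying counts: the new lexicalized state takes over the lexicalized outgoing transitions and emits only the chosen word, while the original state keeps the remaining occurrences. This is an assumed reading of the parameter changes rather than a reproduction of the tables, and pseudo splitting would touch fewer parameters.

```python
def split_state(trans, lex_trans, emit, category, word):
    """Full splitting (sketch): add a lexicalized state for (category, word).

    trans[c_prev][c]           -- counts behind the contextual probabilities P(c | c_prev)
    lex_trans[(c_prev, w)][c]  -- counts behind the lexicalized probabilities P(c | c_prev, w)
    emit[c][w]                 -- counts behind the lexical probabilities P(w | c)
    """
    new_state = (category, word)
    # The lexicalized state receives its own outgoing transition counts.
    trans[new_state] = dict(lex_trans[(category, word)])
    # The original state keeps only the transitions not preceded by `word`.
    for c, n in lex_trans[(category, word)].items():
        trans[category][c] -= n
    # The lexicalized state emits only `word`; the original state no longer does,
    # so the two states remain mutually exclusive.
    emit[new_state] = {word: emit[category].pop(word, 0)}
    return new_state
```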
We have tested our technique through part-of-speech tagging experiments with variously lexicalized Hidden Markov Models.
In order to conduct the tagging experiments, we divided the whole (tagged) Brown corpus, containing 53,887 sentences (1,113,191 words), into two parts.
For the training set, 90% of the sentences were chosen at random, and from them we collected all of the statistical data.
We reserved the other 10% for testing.
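A minimal sketch of such a sentence-level split, with a hypothetical random seed, is:

```python
import random

def split_corpus(sentences, train_ratio=0.9, seed=0):
    """Randomly reserve 90% of the sentences for training and 10% for testing."""
    sents = list(sentences)
    random.Random(seed).shuffle(sents)
    cut = int(len(sents) * train_ratio)
    return sents[:cut], sents[cut:]
```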
We used a tag set containing 85 categories.
The amount of ambiguity in the test set is summarized in the corresponding table.
The second column shows that 52% of the words (57,808 words) are unambiguous.
The tagger attempts to resolve the ambiguity of the remaining words.
We got 95.7858% of the tags correct when we applied the standard Hidden Markov Model without any lexicalized states.
As the number of lexicalized states increases, the tagging accuracy increases until the number of lexicalized states reaches 160 (using full splitting) and 210 (using pseudo splitting).
As you can see in these figures, full splitting improves the performance of the model more rapidly but suffers more severely from the sparseness of the training data.
In this experiment, we employed Mackay and
The best precision, 95.9966%, was obtained with the model containing 210 lexicalized states using the pseudo splitting method.
In this paper, we present a method for complementing the Hidden Markov Models.
With this method, we lexicalize the Hidden Markov Model selectively and automatically by examining the transition distribution of each state with respect to particular words.
Experimental results showed that the selective lexicalization improved the tagging accuracy from about 95.79% to about 96.00%.
Using standard tests for statistical significance, we found that the improvement is significant at the 95% level of confidence.
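As an illustrative check of this claim, a two-proportion z-test with an assumed test-set size of roughly 10% of the 1,113,191-word corpus gives a difference of about 2.5 standard errors:

```python
from math import sqrt

n = 111_319                       # assumed test-set size (about 10% of the corpus tokens)
p1, p2 = 0.9579, 0.9600           # baseline vs. lexicalized model accuracy
p = (p1 + p2) / 2                 # pooled proportion
se = sqrt(2 * p * (1 - p) / n)    # standard error of the difference
z = (p2 - p1) / se
print(f"z = {z:.2f}  (values above 1.96 are significant at the 95% level)")
```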
The cost of this improvement is minimal.
The resulting network contains 210 additional lexicalized states, which are found automatically.
Moreover, the lexicalization does not decrease the tagging speed, because the lexicalized states and their corresponding original states are mutually exclusive in our lexicalized network, and thus the rate of ambiguity does not increase even when the lexicalized states are included.
Our approach leaves much room for improvement .
We have so far considered only the outgoing transitions from the target states .
As a result, we have discriminated only words with right-associativity.
We could also discriminate words with left-associativity by examining the incoming transitions to each state.
Furthermore, we could extend the context by using the second-order context, as represented in the corresponding figure.
We believe that the technique presented in this paper could equally be applied to these proposed extensions.