Computer Laboratory

Old ACS project suggestions

Project suggestions from the Natural Language and Information Processing Group and from the Speech Group in the Department of Engineering, for the CSTIT course, 2009-10

Note that these projects are aimed at those who might like to go on to undertake Ph.D. research in natural language processing.

Efficient operations on semantic dependency structures

  • Proposer: Ann Copestake
  • Supervisor: Ann Copestake

Description

Although the use of packed representations is usual in syntactic parsing, algorithms for efficient representation of highly ambiguous semantic structures are less well-studied. This project will look at operations on semantic dependency representations (including Dependency MRS, Copestake 2009). The aim is to develop efficient algorithms for operations such as structure comparison on packed representations.

The project will look at practical performance on realistic datasets rather than theoretical complexity. The primary evaluation will involve analysing performance against a naive baseline method. The objective would be to build an Open Source package which could be released as part of DELPH-IN.

Remarks

This project would be suitable for someone with a good computer science background, strong programming skills and an interest in algorithms.

Distributional semantics and identification of meaning differences between language varieties

  • Proposer: Ann Copestake
  • Supervisor: Ann Copestake

Description

There are well-known differences between the vocabulary used in different varieties of English (British English, Indian English, American English etc), though these are a relatively small percentage of the vocabulary that is found in most edited texts. Some examples of terms which have different meanings in British English and American English are boot, dorm, pavement, biscuit, table (as a verb) and court shoe. The idea of this project is to see whether such vocabulary differences can be detected automatically using distributional semantic techniques. It will probably be easiest to attempt this project looking at AmE and BrE. One issue will be acquisition of suitable corpora, since there is no large balanced corpus of AmE: newspaper text is a possibility, but not ideal. Wikipedia has an extensive list of differences between AmE and BrE which might be used in the evaluation, although the aim is also to discover subtle meaning differences which may not have been documented.
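
As a concrete illustration of the kind of comparison involved, the following sketch builds simple co-occurrence vectors for a word in each variety and compares them with cosine similarity; a low score flags a candidate meaning difference. The corpus file names are illustrative assumptions, and raw counts would in practice be replaced by association weights such as PMI.

    import math
    from collections import Counter

    def context_vector(tokens, target, window=5):
        """Count co-occurrences of `target` with words in a +/- window."""
        vec = Counter()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[tokens[j]] += 1
        return vec

    def cosine(u, v):
        dot = sum(u[w] * v[w] for w in set(u) & set(v))
        nu = math.sqrt(sum(c * c for c in u.values()))
        nv = math.sqrt(sum(c * c for c in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    # Tokenised corpus files, one per variety (file names are assumptions).
    bre = open('bre_corpus.txt').read().lower().split()
    ame = open('ame_corpus.txt').read().lower().split()

    # Words whose BrE and AmE vectors diverge are candidate meaning differences.
    for word in ['boot', 'pavement', 'biscuit', 'bank']:
        sim = cosine(context_vector(bre, word), context_vector(ame, word))
        print(word, round(sim, 3))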

Remarks

See Chapter 19 of Jurafsky and Martin for an overview of distributional similarity techniques. This project is relatively open-ended, but will involve considerable experimentation with different algorithms and possibly with different corpora.

Finding syntactically irregular multiword expressions in corpora

  • Proposer: Ann Copestake
  • Supervisor: Ann Copestake

Description

A number of statistical techniques have been used for extracting collocations (see discussion in Manning and Schuetze, 1999). However, in some applications, we are primarily interested in multiword expressions (MWEs) which are syntactically irregular in some respect. For instance, the phrase on top is syntactically irregular since top would normally be expected to have a determiner. Similarly, by and large cannot be parsed by a normal grammar rule. Some MWEs are themselves apparently syntactically regular, but occur in phrases which are superficially syntactically abnormal. For example, public sector higher education appears to be a compound where the adjective higher follows the noun sector, though in general such sequences are not possible in English nominal compounds (contrast English honey spoon with * honey English spoon). The compound is licensed because higher education is an MWE (as is public sector) and effectively behaves in the compound as though it were a noun.

The idea of this project is to use linguistic knowledge about possible and unlikely tag sequences in order to filter the results of simple statistical methods for determining collocations. If possible, this will be done by using an existing POS-tagged corpus, although it may be necessary to retrain or adapt the tagger, since existing taggers are trained on the basis of data which does not include mark-up of MWEs. The results will be evaluated against a database of MWEs extracted from a machine-readable dictionary.
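
A minimal sketch of the intended pipeline, assuming a one-pair-per-line POS-tagged corpus file and an illustrative set of "irregular" tag patterns: bigrams are ranked by pointwise mutual information and then filtered by tag sequence.

    import math
    from collections import Counter

    # One "word TAG" pair per line is assumed, e.g. "on IN".
    tagged = [tuple(line.split()[:2]) for line in open('corpus.tag') if line.strip()]

    unigrams = Counter(w for w, t in tagged)
    bigrams = Counter(zip(tagged, tagged[1:]))
    N = len(tagged)

    def pmi(pair):
        (w1, _), (w2, _) = pair
        return math.log2(bigrams[pair] * N / (unigrams[w1] * unigrams[w2]))

    # Tag sequences a normal grammar would not license, e.g. a preposition
    # directly followed by a bare singular noun ("on top"). Illustrative only.
    IRREGULAR = {('IN', 'NN'), ('IN', 'CC'), ('NN', 'JJR')}

    candidates = [p for p in bigrams
                  if bigrams[p] >= 5 and (p[0][1], p[1][1]) in IRREGULAR]
    for pair in sorted(candidates, key=pmi, reverse=True)[:20]:
        print(pair, round(pmi(pair), 2))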

Remarks

This project will involve implementing one or more of the standard techniques for extracting collocations, or, possibly, adapting existing code. This will require reasonably good programming skills, in order that the code be efficient on large corpora. Either the British National Corpus or the Wall Street Journal might be used for this project.

References

Sag, Ivan, Timothy Baldwin, Francis Bond, Ann Copestake and Dan Flickinger (2002) Multiword Expressions: A Pain in the Neck for NLP, In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002), Mexico City, Mexico, pp. 1-15

Class-based Language Models for the Rescoring of Translation Lattices

  • Proposer: Adria de Gispert
  • Supervisor: Adria de Gispert
  • Special Resources: SRI language model toolkit (for automatic clustering and language model estimation); OpenFst library (for applying grammars to translation lattices via composition of finite-state transducers)

Description

Statistical Machine Translation systems output a set of translation hypotheses structured in a word lattice, and they rely heavily on additional models to achieve significant quality improvements in a second decoding pass. Among these models, high-order language models of the target language estimated on large collections of monolingual data are commonly used. Generally speaking, these models favour those translation hypotheses which contain sequences of words that have been observed in the monolingual training data. A sequence of words must exactly match a training sequence for the hypothesis to be scored favourably.

This project will investigate the use of class-based language models for rescoring translation lattices. By putting words into classes, a class-based model can generalise, pooling the statistics of different word sequences whose class sequences coincide. These models have been used successfully in other language and speech technology tasks.
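
The following toy sketch shows the factorisation such models rely on, in the style of Brown et al. (1992): the probability of a word bigram is approximated by a class transition probability times a class membership probability. The word-to-class map and sentences are illustrative.

    from collections import Counter

    def train(sentences, word2class):
        """Collect class unigram/bigram and word-within-class counts."""
        cls_uni, cls_bi, word_in_cls = Counter(), Counter(), Counter()
        for sent in sentences:
            classes = [word2class[w] for w in sent]
            cls_uni.update(classes)
            cls_bi.update(zip(classes, classes[1:]))
            word_in_cls.update(zip(sent, classes))
        return cls_uni, cls_bi, word_in_cls

    def prob(w1, w2, word2class, cls_uni, cls_bi, word_in_cls):
        """p(w2|w1) ~= p(C(w2)|C(w1)) * p(w2|C(w2))."""
        c1, c2 = word2class[w1], word2class[w2]
        return (cls_bi[(c1, c2)] / cls_uni[c1]) * (word_in_cls[(w2, c2)] / cls_uni[c2])

    sents = [['the', 'cat', 'sat'], ['the', 'dog', 'sat']]
    w2c = {'the': 'DET', 'cat': 'N', 'dog': 'N', 'sat': 'V'}
    model = train(sents, w2c)
    print(prob('the', 'dog', w2c, *model))  # 0.5: 'dog' gets mass via class N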

With the goal of improving translation quality for a state-of-the-art statistical machine translation system, the project will study the automatic creation of word classes, the building of class-based language models, and their application in rescoring translation lattices. The implementation will make use of Weighted Finite-State Transducers, facilitating the application of standard algorithms and methods to the task.

References

  • Brown, P.F. et al.: "Class-Based n-gram Models of Natural Language". Computational Linguistics, Vol. 18, No. 4, pp. 467-479, December 1992.
  • Iglesias, G. et al.: "Hierarchical Phrase-Based Translation with Weighted Finite State Transducers". Proc. of HLT/NAACL, pp. 433-441, June 2009.
  • Brants, Th. et al.: "Large Language Models in Machine Translation". Proc. of EMNLP, pp. 858-867, June 2007.

Incorporating Syntax into Hierarchical Phrase-based Translation Rules

  • Proposer: Adria de Gispert
  • Supervisor: Adria de Gispert
  • Special Resources: Syntactic parsers

Description

Hierarchical phrase-based translation has emerged as one of the dominant current approaches to statistical machine translation. It relies on a synchronous context-free grammar that is automatically induced from word-aligned parallel corpora. In decoding, the bilingual grammar is used to parse the input sentence with the source side of the rules, while simultaneously constructing its translation from the target side of rules.

In principle, the grammar defined by the automatically-extracted hierarchical rules has no direct relationship with linguistic syntax. This causes problems of overgeneration (the grammar generates many wrong translations alongside the desired good ones) and of decoding efficiency. This project will investigate the benefits of introducing syntactic information into the statistical grammar, and measure the impact on translation quality. In particular, linguistic tags obtained from parsing tools will be used to label non-terminals and to add features to the hierarchical phrase-based system. This involves parsing either the source or target side of the parallel corpora independently, and using the parse trees to constrain translation rule extraction and application, with the goal of achieving good translation without overgeneration.
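
One simple way to operationalise the link between rule spans and parse constituents, loosely following the soft-constraint idea of Marton and Resnik (2008), is sketched below; the toy parse spans for "the old man slept" are illustrative assumptions.

    # Constituents of the source parse as (start, end) -> label.
    parse_spans = {(0, 3): 'NP', (3, 4): 'VP', (0, 4): 'S'}

    def span_feature(span, spans):
        """'match' if the rule span is a constituent, 'cross' if it
        overlaps one without nesting, else 'neither'."""
        if span in spans:
            return 'match:' + spans[span]
        s, e = span
        for cs, ce in spans:
            if cs < s < ce < e or s < cs < e < ce:
                return 'cross'
        return 'neither'

    print(span_feature((0, 3), parse_spans))  # match:NP -> label X as NP
    print(span_feature((2, 4), parse_spans))  # cross   -> penalty feature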

References

  • Chiang, D.: "Hierarchical Phrase-Based Translation". Computational Linguistics, Vol. 33, No. 2, pp. 201-228, June 2007.
  • Marton, Y. and Resnik, P.: "Soft Syntactic Constraints for Hierarchical Phrase-Based Translation". Proc. of ACL-HLT, pp. 1003-1011, June 2008.
  • Chiang, D. et al.: "11,001 New Features for Statistical Machine Translation". Proc. of HLT/NAACL, pp. 218-226, June 2009.

Unsupervised graded word sense disambiguation

  • Proposer: Diarmuid Ó Séaghdha
  • Supervisor: Diarmuid Ó Séaghdha
  • Special Resources: None

Description:

It is well known that many words are ambiguous; disambiguating ambiguous words is a core task that is relevant to many NLP applications. In some cases, a word can have two senses that are clearly unrelated: bank can denote a financial institution or the side of a body of water. A finer-grained analysis of ambiguity may also identify that as well as denoting a financial institution, bank can denote the building in which the financial institution does business. When a speaker uses an ambiguous word he/she may use it in a way that evokes one or more of the word's related senses:

  1. The bank on the corner of Market Square will give you an account.
  2. The bank was constructed with red bricks.

In sentence 1, bank is used to refer to an institution and a building at the same time. In sentence 2, on the other hand, the meaning of bank is dominated by the building sense.

As usually conceived (e.g. at the SENSEVAL competitions), the word sense disambiguation (WSD) task involves identifying a single correct sense for an ambiguous word from a predefined inventory of possible senses for that word. As shown above, however, the assumption that a token will only express a single sense is known to be a simplification of how word senses really work. Researchers have recently begun to attack the more realistic problem of "graded word sense assignment" (Erk et al., 2009; Erk and McCarthy, 2009), where the task is not to identify the correct sense of a word but to identify the correct distribution over all possible senses. For example, the sense distribution for "bank" in sentence 1 above would put a significant amount of mass on multiple senses (building and institution), while the corresponding distribution in sentence 2 would be strongly peaked at a single sense.

Erk et al. (2009) created a dataset of graded sense distributions by collecting judgements from human annotators; this dataset would be the focus of the MPhil project. Erk and McCarthy (2009) investigated supervised prediction methods for this problem. The proposed MPhil project will involve the use of unsupervised methods, which have not yet been applied to graded WSD. In particular, the project will explore the methods that Mihalcea et al. (2004) applied to SENSEVAL-style single-sense WSD. The first of these, following Lesk (1986), scores a candidate sense by the overlap between its dictionary definition and the current sentence; the second adapts Page et al.'s (1999) PageRank algorithm to identify highly salient senses in a semantic network constructed from the sentence.
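
A minimal sketch of how the Lesk-style method could produce a graded distribution rather than a single sense: overlap counts between each sense gloss and the sentence are smoothed and normalised. The glosses and stopword list are illustrative, not taken from any particular dictionary.

    STOP = {'the', 'a', 'of', 'in', 'on', 'was', 'with', 'that', 'which'}

    def graded_senses(sentence, glosses):
        """Return a normalised distribution over senses for `sentence`."""
        ctx = set(sentence.lower().split()) - STOP
        scores = {s: len(ctx & (set(g.lower().split()) - STOP)) + 0.001
                  for s, g in glosses.items()}   # smooth so no sense gets zero
        z = sum(scores.values())
        return {s: v / z for s, v in scores.items()}

    glosses = {
        'institution': 'a financial institution that accepts deposits',
        'building': 'a building in which a financial institution operates',
        'land': 'sloping land beside a body of water',
    }
    dist = graded_senses('The bank building was constructed with red bricks', glosses)
    print(dist)  # mass peaks on the 'building' sense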

Remarks:

  • This project would be suitable for a student interested in lexical semantics. The methods that would be used are conceptually straightforward, and the project may be implemented in any suitable programming language of the student's choice.

Distributional compositional semantics

  • Proposer: Diarmuid Ó Séaghdha
  • Supervisor: Diarmuid Ó Séaghdha
  • Special Resources: None

Description:

Distributional approaches to lexical semantics model word meaning in terms of co-occurrence patterns that are estimated from large corpora. Many different techniques and applications have been investigated (Schütze, 1998; Curran, 2004; Padó and Lapata, 2007; Ó Séaghdha and Copestake, 2008); typically these involve constructing a single vector of co-occurrence counts for each word and then comparing individual words using methods from linear algebra. Recently, researchers in corpus-driven semantics have begun to pay attention to the issue of compositional meaning, i.e., how distributional representations of words can be combined to give the semantics of a phrase. A related question is how to capture the way the meaning of a word is modulated by the context in which it appears - contrast the shades of meaning of bank in "the bank announced today" (institution) and "the customer is standing in the bank" (physical location).

Methods thus far suggested for compositional semantics include: simple elementwise combination of lexical co-occurrence information (Mitchell and Lapata, 2008); the use of syntactic "expectations" that represent the selectional preferences of the words being combined (Padó and Erk, 2008); and convolutional methods that attempt to compress very high-dimensional (tensorial) representations into tractable vector form while conserving as much information as possible (Plate, 1995). The efficacy of these models is usually demonstrated by correlating their predictions with human judgements of similarity or paraphrase quality, but they should have the potential to be useful for many different kinds of tasks.
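
As a concrete illustration, the additive and multiplicative models of Mitchell and Lapata (2008) reduce to elementwise operations on co-occurrence vectors; the toy vectors below are invented for illustration.

    import numpy as np

    # Co-occurrence vectors over context features [finance, water, speak].
    bank = np.array([8.0, 5.0, 0.0])
    river = np.array([1.0, 9.0, 0.0])

    additive = bank + river          # p = u + v
    multiplicative = bank * river    # p = u * v (elementwise)
    print(additive, multiplicative)
    # The multiplicative model acts like feature intersection: the 'water'
    # dimension dominates, modelling how 'river' disambiguates 'bank'.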

This is still a relatively unexplored area, and there are many open questions a project could address. One of the previously proposed models could be extended, e.g. to represent combinations of more than two words. It would also be interesting to investigate the applicability of these models to problems such as prepositional phrase attachment or compound noun interpretation, and/or to investigate whether they provide valuable features for supervised learning. A comparative study of the various techniques in a particular application context would be another possibility.

Remarks:

  • This project would be suitable for a student interested in cutting-edge computational semantics. Depending on the direction taken, a degree of mathematical competence may be required. The project may be implemented in any suitable programming language of the student's choice.

GR Graph-based Parse Selection

  • Proposer: Ted Briscoe
  • Supervisor: Ted Briscoe (with Rebecca Watson, iLexIR Ltd)
  • Special Resources: None

Description

The RASP parser produces ranked directed graphs of bilexical head-dependent grammatical relations (GRs) as output; e.g. Kim badly wants to win:

  • ncsubj(want Kim _)
  • xcomp(to want win)
  • ncsubj(win Kim)
  • ncmod(_ want badly)

(see Briscoe, or Andersen et al., for more details and examples). GRs are statistically ranked using an unlexicalized structural model, so the ranking of PP attachments, compounds, etc. can be incorrect, but the parser is also able to output weighted sets of GRs from the best n derivations.

To improve parse selection accuracy by incorporating lexical information, it is possible to discriminatively rerank derivations (Collins and Koo) based on GR incidence in manually GR-banked training data, or by self-training based on weighted GR output (van Noord). GR-banks for the BNC and WSJ exist, as does the WSJ DepBank test data, so the project would be to implement a supervised reranking scheme and/or a self-trained one, and to train and test on this data.
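
A minimal sketch of the supervised variant, assuming each derivation is represented as a set of GR tuples such as ('ncsubj', 'want', 'Kim'); a structured-perceptron update is used here as a simple stand-in for the log-linear reranking models of Collins and Koo.

    from collections import Counter

    def features(grs):
        """One GR-incidence feature per grammatical relation tuple."""
        return Counter(grs)

    def score(grs, weights):
        return sum(weights.get(f, 0.0) * v for f, v in features(grs).items())

    def rerank(nbest, weights):
        """nbest: list of GR sets, one per derivation."""
        return max(nbest, key=lambda grs: score(grs, weights))

    def perceptron_update(weights, gold, predicted):
        """Standard update: w += features(gold) - features(predicted)."""
        for f, v in (features(gold) - features(predicted)).items():
            weights[f] = weights.get(f, 0.0) + v
        for f, v in (features(predicted) - features(gold)).items():
            weights[f] = weights.get(f, 0.0) - v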

References:

Andersen, O., Nioche, J., Briscoe, E. and Carroll, J., The BNC Parsed with RASP4UIMA, Proceedings of LREC 2008

Briscoe, E., An Introduction to Tag Sequence Grammars and the RASP System Parser, CUCL-TR-662, 2006

Collins, M. and Koo, T., Discriminative Reranking for Natural Language Parsing, Computational Linguistics, 2005

van Noord, G., Using Self-Trained Bilexical Preferences to Improve Disambiguation Accuracy, ACL 2007 Workshop

Remarks

The project will involve XML file manipulation in your preferred programming language, and either the use of an ML toolkit or the implementation of the reranking algorithm, depending on the approach taken.

Named Entity Recognition and Parsing

  • Proposer: Ted Briscoe
  • Supervisor: Ted Briscoe
  • Special Resources: None

Description

The RASP (Briscoe) parser brackets NPs, and the system semantically classifies some of them (on the basis of CLAWS tag distinctions and internal structure) into names (places, people, organisations), numbers (including ranges, dates, etc.), measure phrases (ounce, year, etc.), temporal expressions (days, weeks, months), directions (north, south, etc.), partitives (sort of, etc.), pronouns, and so forth. However, most remain semantically underspecified as 'normal' because they contain a standard mass or count noun as head (e.g. the company / the man are both 'normal', though they are often further classified as named entities (NEs): ORGanisation, PERson, etc.).

NE recognition (NER) has been the subject of a series of competitions with associated datasets and evaluation software, see Wikipedia for a summary. Integrating NER and parsing would be beneficial for at least the following reasons. Firstly, it should be possible to reduce or remove the requirement for training data annotated with NE classes and boundaries by exploiting the CLAWS tags and NP bracketing (Ritchie). Secondly, many NEs contain internal structure and compositional semantics (Bank of England is an ORG containing a LOCation), and/or are encoded elliptically inside coordinate constructions (the Banks of England and France is two ORGs 'Bank of England', 'Bank of France'), and/or may contain intervening material (the Interleukin II (IL-II) promoter is a PROTein 'Interleukin-II promoter' with interleaved acronym), so are better represented and recovered from grammatical relations or compositional semantic structures. Finally, better NER integrated with parsing should improve performance on both tasks by mutually constraining output from each (Finkel and Manning, Lewin).

The project is to develop an approach to NER integrated with RASP which achieves some of these benefits without the need for full supervision or a joint model.

References:

Briscoe, E., 2006, An introduction to tag sequence grammars and the RASP system parser, CUCL-TR-662

Ritchie, A. Improving the RASP system: NER and classification, CSTIT MPhil Dissertation, 2003

Finkel, J. and Manning, C. Joint Parsing and Named Entity Recognition. Proceedings of NAACL-2009

Remarks

The implementation can be done using Unix tools, and/or Perl, and existing toolkits including components of the RASP system.

Citation and Reference Name Resolution

  • Proposer: Ted Briscoe
  • Supervisor: Ted Briscoe
  • Special Resources: None

Description

The way in which authors are identified on title pages and in citations and references in academic papers can vary (e.g. Ted Briscoe, E. Briscoe, E.J. Briscoe, Edward Briscoe,...). Web sites such as Google Scholar, Citeseer and the ACL Anthology Network (AAN) automatically analyse papers to extract citation counts, networks of collaboration, and so forth. For these to be useful the mapping of name variants to author identities must be accurate.

A number of approaches to resolving names to individuals have been tried, ranging from community-wide efforts to construct ontologies to more data-driven techniques. Many of the latter are summarised in Bhattacharya and Getoor.

The project would be to look at NLP techniques such as NER and parsing and to integrate these with a data-driven ML approach, testing on one of the datasets used by Bhattacharya and Getoor, on the AAN, or on a biomedical dataset.
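
A baseline string-level matcher of the kind such a system would start from is sketched below; the names, and the rule that initials must agree on their common prefix, are illustrative.

    def parse_name(name):
        """Split a name string into (surname, list of given-name initials)."""
        parts = name.replace('.', ' ').split()
        return parts[-1].lower(), [p[0].lower() for p in parts[:-1]]

    def compatible(a, b):
        """Two strings may denote the same author if surnames match and
        given names/initials do not conflict on their common prefix."""
        sa, ia = parse_name(a)
        sb, ib = parse_name(b)
        if sa != sb:
            return False
        return all(x == y for x, y in zip(ia, ib))

    print(compatible('E.J. Briscoe', 'Edward Briscoe'))  # True
    print(compatible('E.J. Briscoe', 'T. Briscoe'))      # False

Note that nickname variants such as Ted for Edward already defeat this baseline, which is exactly where relational, data-driven evidence (co-authors, venues, citation links) becomes useful.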

References:

Indrajit Bhattacharya and Lise Getoor, Collective Entity Resolution in Relational Data

Remarks

The implementation can be done using Unix tools, and/or Perl, and existing toolkits including components of the RASP system.

Improving Click-Through Rates using Keyword Search Terms

  • Proposer: Stephen Clark
  • Supervisor: Stephen Clark
  • Special Resources: none

Description

This project is a collaboration with Cognitive Match (http://www.cognitivematch.com/), a company which develops software to intelligently decide which "creative" to display to a user on entering a website. For example, a user entering the John Lewis website might be shown a different set of products depending on a variety of factors: the current weather in the user's locality, the user's previous website, and so on. The general research area could be described as "Computational Advertising", an exciting new research area emerging at the intersection of Machine Learning, Natural Language Processing and Information Retrieval.

The goal of the project is to use a particular factor to improve the "click-through rate" (i.e. how often users on the John Lewis homepage, say, go on to click on a product). The factor is the keyword search terms that the user entered in a search engine to land on the site in question (assuming the user has come from a search engine site). For example, if a user enters "John Lewis furniture" into Google, and then clicks on the John Lewis homepage, then the keywords give a strong indication of what products to display on the front page.
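
A toy sketch of the underlying matching step, using an invented two-product catalogue: products are scored by term overlap with the referring query. A deployed system would use weighted features and learned models rather than raw overlap.

    def overlap_score(query, product_terms):
        """Fraction of a product's descriptor terms found in the query."""
        q = set(query.lower().split())
        return len(q & product_terms) / len(product_terms)

    catalogue = {
        'oak dining table': {'oak', 'dining', 'table', 'furniture'},
        'bed linen set': {'bed', 'linen', 'set', 'bedroom'},
    }
    query = 'John Lewis furniture'
    best = max(catalogue, key=lambda p: overlap_score(query, catalogue[p]))
    print(best)  # 'oak dining table'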

The project will also be in collaboration with John Shawe-Taylor at UCL, who will provide the Machine Learning expertise.

Remarks

This is an exciting opportunity to work with a start-up company at the cutting edge of using text processing technology in a commercial application, and also an opportunity to collaborate with an international authority on Machine Learning at UCL. It will suit someone with an interest in Machine Learning and Natural Language Processing/Information Retrieval.

Improving the Performance of a Wide-Coverage Parser on Unbounded Dependency Recovery

  • Proposer: Stephen Clark
  • Supervisor: Stephen Clark
  • Special Resources: Possibly large amounts of disk space for retraining the parser depending on the direction the project takes.

Description

Rimell et al. (2009) describes a new evaluation set for natural language parsers, consisting of real examples of unbounded dependencies taken from text. Unbounded dependencies are grammatical dependencies in which the distance between head and dependent is in principle unbounded. A standard example is object extraction out of a relative clause:

The man that John likes
The man that Bill said John likes
The man that Bob heard Bill say that John likes
...

Here is a real example from the test corpus in Rimell et al.:

"the same stump which had impaled the car of many a guest in the past 30 years and which he refused to have removed."

The distance between the head in this example (removed) and the dependent (stump) is 20 words.

The main result in the Rimell et al. paper is that current parsing technology is very bad at recovering such dependencies, throwing doubt on standard parsing evaluations which rate parsing accuracy at over 90% (or at least throwing doubt on the suitability of such evaluations to accurately represent current parsing capabilities). The recovery of unbounded dependencies is crucial for full recovery of predicate-argument structure, which is necessary to fully understand a sentence.

The project will focus on the Clark and Curran (2007) parser, which was the top-performing parser in the evaluation. However, the overall performance of this parser was still only around 50%, leaving much room for improvement. The first part of the project will involve a detailed analysis of the output of the parser on the unbounded dependency data, identifying the main reasons that the parser makes mistakes (for each dependency type). On the basis of this analysis, the remainder of the project will investigate ways in which the performance of the parser might be improved.

References

Unbounded Dependency Recovery for Parser Evaluation
Laura Rimell, Stephen Clark and Mark Steedman
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-09), pp.813-821, Singapore, 2009

Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models
Stephen Clark and James R. Curran
Computational Linguistics, 33(4), pp.493-552, 2007

both papers available here: http://www.cl.cam.ac.uk/~sc609/pubs.html

Remarks

The parser code is written in C++, but the project may or may not involve modifying the code, depending on the results of the initial analysis. Perl or Python is suitable for the corpus processing required.

Will suit those with an interest in the grammar of natural languages (in particular English) and in natural language parsing.

Judging the Grammaticality of Paraphrases in Context

  • Proposer: Stephen Clark
  • Supervisor: Stephen Clark
  • Special Resources: none

Description

Automatic paraphrasing techniques, e.g. Bannard and Callison-Burch (2005), produce a broad-coverage paraphrase dictionary which would be difficult to build by hand. However, the fact that the dictionary is produced automatically means that it contains errors, i.e. (phrase, paraphrase) pairs in which the phrase cannot be correctly replaced by the paraphrase in any context. Moreover, even some of the correct (phrase, paraphrase) pairs are only suitable in certain contexts.

We have access to a dataset of human judgements in which annotators have assessed the correctness of paraphrases in context. More specifically, examples from Callison-Burch's dictionary were used to modify newspaper text, and the human judges were asked whether the new paraphrase in context is grammatical. Grammaticality is clearly only part of what it means for a paraphrase to be correct in context, but it is a necessary requirement.

The purpose of the project is to develop an automatic method for determining the grammaticality of paraphrases in context, evaluated against this dataset. Possible methods include using a language model to detect if there is an element of "surprise" when moving across the paraphrase boundaries, indicating an ungrammatical phrase. Another option is to run a parser over the paraphrase to see if the analysis of the surrounding words changes. We have some preliminary work using the Google n-gram data as a rudimentary language model against which to compare.
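
A minimal sketch of the language-model "surprise" idea, assuming smoothed bigram counts are available (for instance derived from the Google n-gram data): the score averages the log-probabilities of the bigrams spanning the two paraphrase boundaries, and a large drop after substitution suggests ungrammaticality.

    import math

    def bigram_logprob(w1, w2, bigrams, unigrams, vocab):
        """Add-one smoothed log p(w2 | w1)."""
        return math.log((bigrams.get((w1, w2), 0) + 1) /
                        (unigrams.get(w1, 0) + vocab))

    def boundary_score(tokens, start, end, bigrams, unigrams, vocab):
        """Mean log-probability of the bigrams entering and leaving the
        substituted region tokens[start:end] (assumes 0 < start and
        end < len(tokens))."""
        pairs = [(tokens[start - 1], tokens[start]),
                 (tokens[end - 1], tokens[end])]
        return sum(bigram_logprob(a, b, bigrams, unigrams, vocab)
                   for a, b in pairs) / len(pairs)

Comparing boundary_score for the original and the paraphrased sentence gives a simple grammaticality signal to evaluate against the human judgements.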

References

Paraphrasing with Bilingual Parallel Corpora. Colin Bannard and Chris Callison-Burch, 2005. In Proceedings of ACL-2005.
available at http://www.cs.jhu.edu/~ccb/

Remarks

Corpus processing in Perl or Python required.

Will suit those with an interest in automatic linguistic knowledge acquisition techniques.

Model-Based Approaches to Reverberant Noise

  • Proposer: Mark Gales
  • Supervisor: Mark Gales
  • Special Resources: None

Description

There has been a considerable amount of work addressing the problem of background noise. For example, the use of vector Taylor series (VTS) approximations allows the acoustic models to be modified to reflect a particular additive and convolutional noise environment. There has been less research in the related area of handling reverberation. For speech recognition to be deployed in, for example, the home, handling reverberation will become increasingly important.

This project will examine model-based approaches to handling reverberation. Two forms of approach will be examined. The first will be based on novel extensions to linear transforms, such as CMLLR, to handle long-term reverberant noise. The second form aims at extending predictive model-based compensation schemes, such as VTS.

The project will be evaluated on a version of the Wall Street Journal task recorded with reverberant noise. The work will involve extending the HTK toolkit VTS and CMLLR adaptation approaches.

Remarks

This project will extend the HTK toolkit.

Speaker Adaptation using the Bilinear Model

  • Proposer: Mark Gales
  • Supervisor: Mark Gales
  • Special Resources: None

Description

Speaker adaptation is an essential part of speech recognition systems. An important balance that needs to be considered is the complexity of the speaker transformation against the amount of data (and hence possible time delay) required to robustly estimate the model parameters. The dominant form of speaker adaptation is based on linear transforms such as MLLR and CMLLR. Though powerful, these transforms normally require at least 10 seconds of data to robustly estimate the transform parameters. Schemes such as Eigenvoices and Cluster Adaptive Training offer faster adaptation. However, they do not normally achieve the same level of performance as MLLR or CMLLR as the quantity of adaptation data increases.

This project will examine a recently proposed form of speaker adaptation based on the bilinear model. Here a low-dimensional sub-space is estimated from a wide range of speaker data. In this sub-space, linear transformation schemes can be applied to robustly estimate the speaker transform, as only a few transform parameters need to be estimated. By varying the dimensionality of the sub-space it is possible to alter the complexity of the transform to reflect the quantity of adaptation data available. If time permits, extensions to the standard framework will be implemented.

The performance of the system will be evaluated in a large vocabulary multi-pass adaptation framework, based on current state-of-the-art systems developed in the Speech Group.

Remarks

This project will extend the HTK toolkit.

Canonical State Acoustic Models for ASR

  • Proposers: Kai Yu and Mark Gales
  • Supervisors: Kai Yu and Mark Gales
  • Special Resources: None

Description

Hidden Markov models (HMMs) with state-specific Gaussian mixture models (GMMs) are the most popular acoustic model in speech recognition. An important issue in training these models is to ensure that there is sufficient data to robustly estimate the context-dependent phone models (usually triphones). The standard approach is to use decision-tree tying to determine the set of context-dependent states. Given these states, GMMs are then trained for each state. This has two, related, limitations. First, there must be sufficient data to robustly estimate the GMM parameters; thus diagonal covariance matrices are often used. Second, the number of context-dependent states, i.e. the depth of the decision tree, is limited by the need to robustly estimate the GMM parameters. This project will examine approaches for handling these issues by incorporating a model of the inter-state relationship. It may be viewed as a factorized form of state output distribution for HMMs.

The basic concept behind the project is a generalisation of schemes such as the HMM Error Model and the sub-space GMM approaches. The context-dependent states are not individually modelled, but are transformations of some canonical state (or set of canonical states). Thus two sets of model parameters are used. One set is the context/state-independent GMMs, referred to as the canonical state model. The other is a set of context-dependent transforms, which can adapt the canonical state parameters to state-specific parameters. The form of these transforms can be similar to those used for speaker adaptation: linear transforms such as MLLR or CMLLR, and interpolation weights as in Eigenvoices and CAT. For this form of model, training is very similar to speaker adaptive training; however, the "speakers" are now context-dependent states. This form of model gives additional flexibility in the design of the acoustic representation. The canonical state can be made highly complex, as in the UBM approach. Alternatively, powerful mixtures of transforms can be used to represent each canonical state.

The project will examine the performance of the canonical state model compared to current state-of-the-art approaches.

Remarks

This project will extend the HTK toolkit.

Speaker adaptation by voice conversion for HMM-based speech synthesis

  • Proposer: Heiga Zen, Speech Technology Group, Toshiba Research Europe Ltd.
  • Email: heiga.zen [at] crl.toshiba.co.uk
  • Address: 208 Cambridge Science Park, Milton Road, Cambridge
  • Supervisor: Heiga Zen (Toshiba) / Mark Gales (CUED)
  • Special Resources: A number of TTS databases with labels, software for performing the research, high-spec computers, a computer with a GPGPU card, disk space, and HMM-based speech synthesis experts

Description:

Text-to-speech (TTS) speech synthesis systems aim to provide a synthetic waveform equivalent of what a speaker would say when reading a given text. Such systems usually consist of a text analysis part, which deduces an intermediate acoustic specification from the orthography, and a speech synthesis part, which rebuilds the sounds of speech from that specification.

Hidden Markov model (HMM)-based speech synthesis is one of the major approaches for the speech synthesis part of TTS systems, and has grown in popularity in recent years. In HMM-based speech synthesis, speech parameters, including vocal tract and vocal source parameters, are modelled by sub-word HMMs; speech parameters for a given text to be synthesized are then generated from the estimated HMMs. One advantage of this approach over the conventional waveform-concatenative approach is its flexibility. By transforming the HMM parameters appropriately, it can produce speech with various voice characteristics, speaking styles, and emotions.

One example which demonstrates the flexibility of HMM-based speech synthesis is speaker adaptation. By adapting the HMM parameters to a target speaker using a small amount of speech data (e.g., 5 minutes), the system can synthesize speech with the target speaker's voice characteristics. Linear transform-based adaptation techniques such as maximum likelihood linear regression (MLLR) are often used to adapt the HMM parameters.

Voice conversion is a technique for transforming a source speaker's speech so that it sounds like a target speaker's. Feature mapping based on a statistical model is a popular technique in voice conversion. In this technique, the joint probability of source and target speakers' speech is modelled by a statistical model trained on a parallel corpus. At the mapping stage, the conditional probability distribution given the source speaker's speech is first obtained from the statistical model; the estimated target speaker's speech is then determined from this conditional distribution according to some criterion.
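
A compact sketch of this mapping for a joint GMM, with toy one-dimensional source and target features: the converted feature is the conditional expectation E[y|x], a weighted sum of per-component linear regressions (cf. Stylianou et al., 1998; Kain and Macon, 1998). All parameter values below are illustrative.

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def convert(x, weights, means, covs, dx):
        """E[y|x] under a joint GMM over z = [x; y]; dx = dimension of x."""
        resp = np.array([w * mvn.pdf(x, means[m][:dx], covs[m][:dx, :dx])
                         for m, w in enumerate(weights)])
        resp /= resp.sum()                      # component posteriors p(m|x)
        y = np.zeros(len(means[0]) - dx)
        for m, r in enumerate(resp):
            mu_x, mu_y = means[m][:dx], means[m][dx:]
            gain = covs[m][dx:, :dx] @ np.linalg.inv(covs[m][:dx, :dx])
            y += r * (mu_y + gain @ (x - mu_x))
        return y

    weights = [0.5, 0.5]
    means = [np.array([0.0, 1.0]), np.array([4.0, 5.0])]
    covs = [np.array([[1.0, 0.5], [0.5, 1.0]])] * 2
    print(convert(np.array([0.2]), weights, means, covs, dx=1))  # ~1.1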

The project will investigate the combination of statistical voice conversion and HMM-based speech synthesis to achieve more accurate adaptation. After producing a brief bibliographic study of the conventional techniques, the student will implement or deploy some statistical model-based voice conversion techniques. The performance of the voice conversion-based adaptation techniques will then be compared against the conventional MLLR-based techniques in terms of the quality of the synthesized speech.

Will suit those with an interest in HMM-based speech synthesis. Programming languages involved are Shell, Perl, C, and MATLAB.

Remarks:

There is considerable flexibility in the project definition and the student is expected to decide on the final scope (after discussion with the supervisors).

References:

T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, & T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," Proc. Eurospeech, pp. 2347-2350, 1999.

M. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75-98, 1998.

J. Yamagishi, "Average-voice-based speech synthesis," PhD thesis, Tokyo Institute of Technology, 2006.

Y. Stylianou, O. Cappe, & E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.

A. Kain & M. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, pp. 285-288, 1998.

T. Toda, A. W. Black, K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.

"Hidden Markov model toolkit (HTK)," http://htk.eng.cam.ac.uk/

"HMM-based speech synthesis system (HTS)," http://hts.sp.nitech.ac.jp/

"Signal processing toolkit (SPTK)," http://sp-tk.sourceforge.net/

Optimal decision trees for adaptation of HMM-based speech synthesis

  • Proposer: Heiga Zen, Speech Technology Group, Toshiba Research Europe Ltd.
  • Email: heiga.zen [at] crl.toshiba.co.uk
  • Address: 208 Cambridge Science Park, Milton Road, Cambridge
  • Supervisor: Heiga Zen (Toshiba) / Mark Gales (CUED)
  • Special Resources: A number of TTS databases with labels, software for performing the research, high-spec computers, a computer with a GPGPU card, disk space, and HMM-based speech synthesis experts

Description:

Text-to-speech (TTS) speech synthesis systems aim to provide a synthetic waveform equivalent of what a speaker would say when reading a given text. Such systems usually consist of a text analysis part, which deduces an intermediate acoustic specification from the orthography, and a speech synthesis part, which rebuilds the sounds of speech from that specification.

Hidden Markov model (HMM)-based speech synthesis is one of the major approaches for the speech synthesis part of TTS systems, and has grown in popularity in recent years. In HMM-based speech synthesis, speech parameters, including vocal tract and vocal source parameters, are modelled by sub-word HMMs; speech parameters for a given text are then generated from the estimated HMMs.

To improve the accuracy of the models, context-dependent sub-word HMMs are widely used in HMM-based speech synthesis. In speech synthesis, not only phonetic contexts but also prosodic and linguistic contexts are usually used. However, as the number of context factors increases, the number of possible context combinations increases exponentially, and it is impossible to cover all combinations with a limited amount of training data. To address this problem, the decision-tree-based state-tying technique has been widely used. It automatically and successively clusters HMM states according to pre-defined questions about contexts, so as to maximize the likelihood of the training data, and then shares HMM state-output distribution parameters among states associated with the same leaf node of the decision trees.
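
The quantity driving such clustering is the likelihood gain of a split; below is a sketch under single-Gaussian, diagonal-covariance assumptions (see Odell, 1995), assuming both answer sets are non-empty.

    import numpy as np

    def gauss_loglik(frames):
        """Log-likelihood of frames under their own ML diagonal Gaussian."""
        n, d = frames.shape
        var = frames.var(axis=0) + 1e-6   # floor to avoid log(0)
        return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

    def split_gain(frames, answers):
        """answers[i] is True if frame i answers 'yes' to the question."""
        yes, no = frames[answers], frames[~answers]
        return gauss_loglik(yes) + gauss_loglik(no) - gauss_loglik(frames)

At each node, the question with the greatest gain is chosen, and splitting stops once the gain falls below a threshold.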

One advantage of the HMM-based speech synthesis approach over the conventional waveform concatenative approach is its flexibility. By transforming the HMM parameters appropriately, it can produce speech with various voice characteristics, speaking styles, and emotions. One example which demonstrates the flexibility of HMM-based speech synthesis is speaker adaptation. By adapting the HMM parameters to a target speaker using a small amount of speech data (e.g., 5 minutes), it can synthesize speech with target speaker's voice characteristics.

The project will investigate optimal decision tree construction for the adaptation of HMM-based speech synthesis systems. Conventionally, decision trees built for general (i.e., speaker adaptively trained) HMMs are used at the adaptation stage. However, these trees are not optimal for adaptation, because they reflect the context dependency of the training data rather than of the adaptation data. After producing a brief bibliographic study of the conventional techniques, the student will implement or deploy new decision tree-based clustering techniques. Their performance will then be compared against the conventional techniques in terms of the quality of the synthesized speech.

Will suit those with an interest in HMM-based speech synthesis. Programming languages involved are Shell, Perl, C, and MATLAB.

Remarks:

There is considerable flexibility in the project definition and the student is expected to decide on the final scope (after discussion with the supervisors).

References:

T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, & T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," Proc. Eurospeech, pp. 2347-2350, 1999.

J. Yamagishi, "Average-voice-based speech synthesis," PhD thesis, Tokyo Institute of Technology, 2006.

J. Odell, "The use of context in large vocabulary speech recognition," PhD thesis, University of Cambridge, 1995.

"Hidden Markov model toolkit (HTK)," http://htk.eng.cam.ac.uk/

"HMM-based speech synthesis system (HTS)," http://hts.sp.nitech.ac.jp/

"Signal processing toolkit (SPTK)," http://sp-tk.sourceforge.net/

Model extrapolation for HMM-based speech synthesis

  • Proposer: Matt Gibson
  • Supervisor: Matt Gibson / Bill Byrne
  • Special resources: none

Description

An ideal automatic speech synthesis system should be able to produce natural-sounding speech in a variety of voices. For example, a system should be able to switch easily between multiple voices, as specified by age, gender, emotion, accent, etc. One way in which this can be done is by model interpolation: models are trained for each speaker in a pool of speakers, and these models are interpolated to create a target voice. For example, models for a 10-year-old speaker and for a 30-year-old speaker can be interpolated to produce the voice of a 20-year-old.

Modelling techniques developed for speaker interpolation can also be applied to the more challenging problem of speaker extrapolation. In speaker extrapolation, the goal is to produce speech that does not fall naturally within the training set. As an example, the system might attempt to synthesise the voice of a 50-year-old from models trained on 10-year-old and 30-year-old speakers.
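
The following toy sketch shows interpolation and extrapolation of a single pair of tied state means; a real system would blend entire model sets, and the values here are invented.

    import numpy as np

    mean_age10 = np.array([1.0, 4.0])   # a state mean from the 10-year-old model
    mean_age30 = np.array([3.0, 2.0])   # the tied state in the 30-year-old model

    def blend(m1, m2, lam):
        """lam in [0,1] interpolates; lam outside [0,1] extrapolates."""
        return (1 - lam) * m1 + lam * m2

    print(blend(mean_age10, mean_age30, 0.5))   # ~20-year-old: [2. 3.]
    print(blend(mean_age10, mean_age30, 2.0))   # ~50-year-old: [5. 0.]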

This project will investigate techniques for age extrapolation. Time permitting, gender extrapolation will also be examined.

A public domain toolkit for HMM-based speech synthesis, the HTS toolkit, will be used for this project. References on HMM-based synthesis are available on the HTS web page (http://hts.sp.nitech.ac.jp/). The CMU ARCTIC speech synthesis databases will be used as training material. The model extrapolation technique will be evaluated using human listening tests.

Remarks

None

References

  • "Speech parameter generation algorithms for HMM-based speech synthesis", K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura. Proceedings ICASSP (2000)
  • "Multi-space probability distribution HMM", K. Tokuda, T. Mausko, N. Miyazaki, T. Kobayashi,. IEICE Trans. Inf. & Syst., vol.E85-D, no.3, pp.455-464 (2002)
  • "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis", T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura. Proceedings Eurospeech (1999)
  • "Speaker interpolation in HMM-based speech synthesis system", T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, T. Kitamura. Proceedings Eurospeech (1997) http://www.sp.nitech.ac.jp/~zen/yossie/mypapers/euro_greece97.pdf
  • "HMM-based speech synthesis with various speaking styles using model interpolation". M. Tachibana, J. Yamagishi, K. Onishi, T. Masuko, T. Kobayashi. Proceedings, Speech Prosody (2004)

An HTK + OpenFst ASR Decoder

  • Proposer: Bill Byrne
  • Supervisor: Bill Byrne
  • Special resources: Baseline HVite ASR system (HTK WSJ 5K acoustic models and bigram language model); Google OpenFst toolkit (www.openfst.org)

Description

Many of the components of large vocabulary speech recognition systems are based on models which can be represented as weighted finite state transducers (WFSTs). WFSTs are finite automata whose transitions are labelled with both input and output symbols, so that a path through the automaton encodes a mapping from a sequence of input symbols to a sequence of output symbols. Weights may also be put on the transducer arcs so that costs (e.g. negative log probabilities) can be associated with these mappings. WFSTs are particularly useful because there are general-purpose algorithms for manipulating transducers - e.g. automata minimization, composition, pruning, best path search - so that models which can be expressed as WFSTs can be applied without the development or implementation of special-purpose algorithms.

As an example, a WFST can be constructed to realize a pronunciation lexicon. This transducer maps word sequences to their pronunciations, with weights supplied by the pronunciation lexicon. A second WFST can be constructed which maps monophones to triphones. By composing the first transducer with the second, a third transducer is formed which maps sequences of words directly to sequences of triphones, without any intermediate monophone representation. WFSTs can be used to implement nearly every aspect of a basic ASR system: n-gram language models, pronunciation lexicons, the mapping from context-independent to context-dependent phones, and the mapping from context-dependent phones to state-clustered HMMs.
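
To make the mechanism concrete without relying on any particular library API, the toy sketch below hand-rolls the application of two cascaded transducers to an input, adding weights along paths as in the tropical semiring; the symbols and machines are illustrative stand-ins for a lexicon and a phone-to-model mapping.

    def transduce(t, finals, s):
        """All (output, weight) pairs for input sequence s through transducer
        t = {state: [(input, output, weight, next_state)]}, start state 0;
        '' as input marks an input-epsilon arc."""
        results = []
        def walk(state, i, out, w):
            if i == len(s) and state in finals:
                results.append((tuple(out), w))
            for inp, o, wt, nxt in t.get(state, []):
                if inp == '':
                    walk(nxt, i, out + [o], w + wt)
                elif i < len(s) and inp == s[i]:
                    walk(nxt, i + 1, out + [o], w + wt)
        walk(0, 0, [], 0.0)
        return results

    def compose_apply(s, t1, f1, t2, f2):
        """Apply T1 then T2 to s; path weights add along the way."""
        return [(out2, w1 + w2)
                for out1, w1 in transduce(t1, f1, s)
                for out2, w2 in transduce(t2, f2, list(out1))]

    # T1: toy lexicon mapping the word 'cat' to its phones (final state 3).
    T1 = {0: [('cat', 'k', 0.1, 1)],
          1: [('', 'ae', 0.0, 2)],
          2: [('', 't', 0.0, 3)]}
    # T2: toy phone-to-model mapping (final state 0).
    T2 = {0: [('k', 'K', 0.0, 0), ('ae', 'AE', 0.0, 0), ('t', 'T', 0.0, 0)]}

    print(compose_apply(['cat'], T1, {3}, T2, {0}))
    # [(('K', 'AE', 'T'), 0.1)]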

The aim of this project will be to develop a hybrid ASR decoder based on HTK acoustic models and libraries and the Google OpenFst Toolkit.

References

  • Mehryar Mohri, Fernando Pereira, Michael Riley (2002). "Weighted Finite-State Transducers in Speech Recognition." Computer Speech and Language, v16, no. 1, 69--88
  • Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut , Mehryar Mohri (2007). "OpenFst: A General and Efficient Weighted Finite-State Transducer Library." CIAA 2007, LNCS, 11--23

Confidence measures for HMM-based speech synthesis

  • Proposer: Matt Gibson
  • Supervisor: Matt Gibson / Bill Byrne
  • Special resources: Baseline HMM-based TTS system with two-pass acoustic models, and baseline ASR lattices and synthesized speech generated by the system

Description

One of the advantages of the statistical approach to ASR is that confidence measures can be produced which identify possible transcription errors in the recognized speech. Words and phrases in the transcription can be assigned scores derived from the posterior distribution over possible hypotheses. It has been found experimentally that low-confidence regions are often associated with recognition errors [3,4] and that low confidence scores indicate weaknesses in the acoustic or language models used in recognition.

The objective of this project will be to extend the use of confidence measures to HMM-based speech synthesis [2]. The project will be based on a new technique which allows acoustic models to be used for both recognition and for synthesis [1]. The synthesized speech will be re-recognized to produce transcriptions and lattices which can be analyzed to identify recognition errors and regions of low confidence. This project will study whether analysis of confidence scores and transcription errors can be used to predict synthesis quality. Initial work will focus on sentence level quality. More ambitious work will assign confidence levels to words and phrases within sentences to be synthesized. Online user studies will be used to measure synthesis quality.
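
A crude sketch of the idea on an n-best list (a simple stand-in for the lattice-based word posteriors of Evermann and Woodland [3]): a word's confidence is the summed normalised posterior of the hypotheses containing it at that position. The scores and word sequences below are hypothetical.

    import math
    from collections import defaultdict

    def word_posteriors(nbest):
        """nbest: list of (log_score, word_sequence) pairs."""
        m = max(s for s, _ in nbest)
        post = [math.exp(s - m) for s, _ in nbest]   # shift for stability
        z = sum(post)
        conf = defaultdict(float)
        for p, (_, words) in zip(post, nbest):
            for i, w in enumerate(words):
                conf[(i, w)] += p / z
        return conf

    nbest = [(-10.0, ['synthesise', 'the', 'voice']),
             (-10.5, ['synthesise', 'a', 'voice']),
             (-12.0, ['synthesise', 'the', 'noise'])]
    conf = word_posteriors(nbest)
    print(round(conf[(1, 'the')], 2))  # posterior for 'the' in slot 1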

References

[1] M. Gibson. Two-pass decision tree construction for unsupervised adaptation of HMM-based synthesis models. Proceedings Interspeech, Brighton, U.K., September 2009. http://mi.eng.cam.ac.uk/~mg366/pubs/is2009-mgibson.pdf

[2] Text-to-Speech Synthesis, Paul Taylor, Cambridge University Press. Material on HMM-based speech synthesis.

[3] G. Evermann and P.C. Woodland. Large vocabulary decoding and confidence estimation using word posterior probabilities. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000.

[4] V. Goel, S. Kumar, and W. Byrne. Segmental minimum Bayes-risk decoding for automatic speech recognition. IEEE Transactions on Speech and Audio Processing, May 2004.