The meaning of many adjectives is highly context-dependent. For example, 'cool' can mean different things depending on the context, e.g.
The project would suit a student who is interested in machine learning (in particular unsupervised methods) and its application to NLP tasks. Programming language of choice.
Lin Sun and Anna Korhonen. 2009. Improving Verb Clustering with Automatically Acquired Selectional Preferences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore.
Blei, D.M. and Ng, A.Y. and Jordan, M.I. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022.
Brody, S. and Lapata, M. 2009. Bayesian word sense induction. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. 103-111.
Tim Van de Cruys and Marianna Apidianaki. 2011. Latent Semantic Word Sense Induction and Disambiguation. Proceedings of ACL. 1476-1485.
Agirre, E. and Martinez, D. and de Lacalle, O.L. and Soroa, A. 2006. Two graph-based algorithms for state-of-the-art WSD. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. 585-593.
Can Twitter data be thought of as a new genre of English? The character limit and the real-time, connected nature of the medium mean that people write quite differently on Twitter than in other genres, possibly closer to the way they talk than to the way they normally write. One feature we might expect to find is that sentence subjects become optional on Twitter, unlike in standard English. This project will involve studying the use of verbs on Twitter and addressing the challenges it poses for NLP. The project will involve
Programming language of choice.
Cedric Messiant. 2008. A Subcategorization Acquisition System for French Verbs. Proceedings of ACL.
Scientific writing tends to be fairly conventionalised. The information (or rhetorical or discourse) structure of a scientific article can be characterised by classifying sentences into categories such as Background, Objective, Method, Result, and Conclusion. Such classification can be useful for the readers of scientific literature as well as for NLP tasks such as information extraction, summarization and information retrieval.
To date, various schemes and approaches have been developed for characterising the information structure of scientific documents, the best of which have yielded promising results (e.g. Teufel and Moens, 2002; Guo et al., 2011). However, because they rely on supervised or semi-supervised learning and a body of annotated data, existing approaches are expensive to develop. This project will explore the use of unsupervised learning for this task. Because it does not rely on pre-defined categories, unsupervised learning may also help to identify new categories of information structure.
A range of clustering algorithms that have proved successful for related NLP tasks could be used, including K-means, Principal Direction Divisive Partitioning (PDDP), spectral clustering, and Expectation-Maximization (EM) for generative models such as Gaussian mixtures or latent Dirichlet allocation (LDA), among others (Jain et al., 1999; Andrews and Fox, 2007; Boley, 1998). The project will apply one or more clustering techniques to the corpus of biomedical journal papers by Guo et al. (2011), which has been annotated according to several existing schemes of information structure and has been used in previous supervised work on this task. Performance will be evaluated in various ways (Halkidi et al., 2001). A list of sentence features which have proved successful in supervised learning (Guo et al., 2011) may be adopted. The student is also welcome to explore new features for this task.
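As a rough illustration of the clustering-and-evaluation setup described above, the following sketch runs plain K-means over bag-of-words sentence vectors and scores the result against gold labels using purity. Everything in it is an assumption made for illustration: the toy sentences and labels are invented (they are not taken from the Guo et al. corpus), bag-of-words counts stand in for the richer feature set, and purity is only one simple external validation measure among the many that could be used.

```python
import random
from collections import Counter


def bag_of_words(sentence, vocab):
    """Map a sentence to a term-count vector over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    return [float(counts[w]) for w in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster id per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: attach each vector to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assign


def purity(assign, gold):
    """Fraction of items carrying the majority gold label of their cluster."""
    total = 0
    for c in set(assign):
        labels = [g for a, g in zip(assign, gold) if a == c]
        total += Counter(labels).most_common(1)[0][1]
    return total / len(gold)


# Invented toy data: (sentence, gold category) pairs, for illustration only.
data = [
    ("previous studies have examined this disease", "Background"),
    ("little is known about this disease", "Background"),
    ("we measured expression using assays", "Method"),
    ("samples were analysed using assays", "Method"),
]
vocab = sorted({w for s, _ in data for w in s.split()})
vectors = [bag_of_words(s, vocab) for s, _ in data]
clusters = kmeans(vectors, k=2)
print("purity:", purity(clusters, [g for _, g in data]))
```

Purity rewards clusters dominated by a single annotated category, but it grows trivially as the number of clusters increases, so in practice it would be read alongside other validation measures.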
The project would suit a student who is interested in machine learning (in particular unsupervised methods) and its application to NLP tasks. Programming language of choice.
Nicholas O. Andrews and Edward A. Fox. 2007. Recent Developments in Document Clustering. Technical report, Computer Science, Virginia Tech.
Daniel Boley. 1998. Principal direction divisive partitioning. Data Min. Knowl. Discov., 2:325-344.
Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Johan Hogberg, and Ulla Stenius. 2011. A comparison and user-based evaluation of models of textual information structure in the context of cancer risk assessment. BMC Bioinformatics, 12:69.
Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. 2001. On clustering validation techniques. Journal of Intelligent Information Systems, 17:107-145.
A. K. Jain, M. N. Murty, and P. J. Flynn. 1999. Data clustering: a review. ACM Comput. Surv., 31(3):264-323.
S. Teufel and M. Moens. 2002. Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28:409-445.
An increasing amount of research into sentiment and mood classification is based on textual data generated on microblogging sites such as Twitter. Recent work has examined the relationship between Twitter mood and stock market fluctuations (Bollen, Mao, and Zeng 2010), the relationship between Twitter sentiment and both consumer confidence and political opinion (O'Connor et al. 2010), and the prediction of political election results in Germany (Tumasjan et al. 2010).
In May 2012, a new mayor of London will be elected. As the date of the election approaches, voters are expected to generate a considerable amount of sentiment data on Twitter. This project involves exploring sentiment classification and opinion mining techniques that can adequately analyze microblogging data, resulting in a system that predicts the outcome of the election. Additionally, dynamic topic models (Blei & Lafferty 2006) could be explored to capture trending topics during the campaign.
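One very simple baseline for this kind of analysis is to count sentiment-lexicon hits in tweets that mention each candidate and compare a smoothed positive-to-negative ratio. The sketch below illustrates that idea only; the word lists, the candidate names `alice` and `bob`, and the example tweets are all hypothetical, and a real system would need a proper lexicon or a trained classifier, along with tokenization suited to Twitter text.

```python
# Tiny hand-made lexicons, for illustration only (not a real resource).
POSITIVE = {"good", "great", "love", "win", "support"}
NEGATIVE = {"bad", "awful", "hate", "lose", "fail"}


def tokenize(tweet):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [t.strip(".,!?#@:;\"'").lower() for t in tweet.split()]


def sentiment_counts(tweets, candidates):
    """Count lexicon hits in the tweets that mention each candidate."""
    counts = {c: {"pos": 0, "neg": 0} for c in candidates}
    for tweet in tweets:
        tokens = tokenize(tweet)
        pos = sum(t in POSITIVE for t in tokens)
        neg = sum(t in NEGATIVE for t in tokens)
        for c in candidates:
            if c in tokens:
                # A tweet naming several candidates counts towards each.
                counts[c]["pos"] += pos
                counts[c]["neg"] += neg
    return counts


def sentiment_score(count, smoothing=1.0):
    """Smoothed positive-to-negative ratio; values above 1 lean positive."""
    return (count["pos"] + smoothing) / (count["neg"] + smoothing)


# Hypothetical tweets about two made-up candidates.
tweets = [
    "I love alice, great plan!",
    "bob will fail",
    "alice and bob both bad",
]
counts = sentiment_counts(tweets, ["alice", "bob"])
for name, count in counts.items():
    print(name, sentiment_score(count))
```

Attributing a tweet's whole sentiment to every candidate it mentions is a deliberate simplification; tweets that compare candidates would need finer-grained, target-dependent analysis.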
The project would suit a student interested in unsupervised and supervised machine learning, and its application to sentiment classification and topic detection.
Blei, D. M., and Lafferty, J. D. (2006). Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning.
Bollen, J.; Mao, H.; and Zeng, X.-J. 2010. Twitter mood predicts the stock market. Journal of Computational Science 2(1):1-8.
O'Connor, B.; Balasubramanyan, R.; Routledge, B. R.; and Smith, N. A. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media.
Tumasjan, A.; Sprenger, T. O.; Sandner, P. G.; and Welpe, I. M. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Fourth International AAAI Conference on Weblogs and Social Media.