
Thesis Defense - Dligach

High-Performance Word Sense Disambiguation with Less Manual Effort
Computer Science PhD Candidate

Supervised learning is a widely used paradigm in Natural Language Processing. This paradigm involves learning a classifier from annotated examples and applying it to unseen data. We cast word sense disambiguation, our task of interest, as a supervised learning problem. We then formulate the end goal of this dissertation: to develop a series of methods aimed at achieving the highest possible word sense disambiguation performance with the least reliance on manual effort.
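As a rough illustration of the supervised setup described above, the following minimal sketch learns a sense classifier from a handful of annotated contexts and applies it to unseen sentences. It is not the system developed in the dissertation: the naive Bayes model, the bag-of-words features, the sense labels, and the toy training sentences are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical feature extractor: lowercase bag of words only.
    return text.lower().split()


def train(examples):
    """Count senses and context words from (context, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for context, sense in examples:
        sense_counts[sense] += 1
        for word in tokenize(context):
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab


def predict(model, context):
    """Return the most probable sense under add-one-smoothed naive Bayes."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense in sorted(sense_counts):
        score = math.log(sense_counts[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in tokenize(context):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy annotated examples for the ambiguous word "bank" (invented data).
TRAIN = [
    ("she sat on the bank of the river and watched the water", "bank/river"),
    ("the river bank was muddy after the flood", "bank/river"),
    ("he deposited the money at the bank this morning", "bank/finance"),
    ("the bank approved the loan and opened an account", "bank/finance"),
]

model = train(TRAIN)
print(predict(model, "they fished from the bank of the river"))
print(predict(model, "the bank charged a fee on the loan account"))
```

Every training example in the sketch carries a hand-assigned sense label, which is the manual effort the dissertation sets out to reduce.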

We begin by implementing a word sense disambiguation system that uses rich linguistic features to better represent the contexts of ambiguous words. Our state-of-the-art system captures three types of linguistic features: lexical, syntactic, and semantic. Traditionally, semantic features are extracted with the help of expensive hand-crafted lexical resources. We propose a novel unsupervised approach to extracting a similar type of semantic information from unlabeled corpora. We show that incorporating this information into a classification framework improves performance. The result is a system that outperforms traditional methods while eliminating the reliance on manual effort for extracting semantic data.

We then attack the problem of reducing manual effort from a different direction. We also want a system that can be easily ported to new domains. Supervised word sense disambiguation relies on annotated data for learning sense classifiers, particularly when porting to new domains. However, annotation is expensive because it requires a large time investment from expert labelers. We examine various annotation practices and propose several approaches for making them more efficient. We evaluate the proposed approaches and compare them with existing ones. We show that the annotation effort can often be reduced significantly without sacrificing the performance of the models trained on the annotated data.

Committee: Martha Palmer, Department of Linguistics (Chair)
James Martin, Professor
Michael Mozer, Professor
Lawrence Hunter, University of Colorado School of Medicine
Wayne Ward, Research Professor
Department of Computer Science
University of Colorado Boulder
Boulder, CO 80309-0430 USA
May 5, 2012 (14:20)