Department of Computer Science
University of Colorado Boulder
Thesis Defense - Dligach
8/19/2010 10:00am-12:00pm CINC 102
High-Performance Word Sense Disambiguation with Less Manual Effort
Computer Science PhD Candidate
Supervised learning is a widely used paradigm in Natural Language Processing.
This paradigm involves learning a classifier from annotated examples and
applying it to unseen data. We cast word sense disambiguation, our task of
interest, as a supervised learning problem. We then formulate the end goal of
this dissertation: to develop a series of methods aimed at achieving the
highest possible word sense disambiguation performance with the least reliance
on manual effort.
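
To make the supervised framing above concrete, here is a minimal sketch of word sense disambiguation as classification: a classifier is learned from sense-annotated contexts and then applied to unseen ones. Everything in it is an illustrative assumption rather than the dissertation's system: the Naive Bayes classifier, the bag-of-words features, the function names `train` and `predict`, and the toy examples for the word "bank" are all made up for this sketch.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Learn counts from (context_words, sense_label) pairs."""
    sense_counts = Counter()            # how often each sense was annotated
    word_counts = defaultdict(Counter)  # context-word counts per sense
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for w in words:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab


def predict(model, words):
    """Return the most probable sense for an unseen context."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # log prior of the sense
        # Add-one smoothing over the training vocabulary.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical annotated examples for the ambiguous word "bank".
TRAIN = [
    ("deposit money in the bank account".split(), "finance"),
    ("the bank approved the loan".split(), "finance"),
    ("we sat on the river bank".split(), "river"),
    ("fishing from the muddy bank of the stream".split(), "river"),
]

model = train(TRAIN)
print(predict(model, "she opened a bank account to deposit money".split()))
print(predict(model, "the river bank was muddy".split()))
```

Each annotated example in a setup like this has to be labeled by hand, which is the manual cost the rest of the abstract sets out to reduce.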
We begin by implementing a word sense disambiguation system that uses
rich linguistic features to better represent the contexts of ambiguous words.
Our state-of-the-art system captures three types of linguistic features:
lexical, syntactic, and semantic. Traditionally, semantic features are
extracted with the help of expensive hand-crafted lexical resources. We propose
a novel unsupervised approach to extracting a similar type of semantic
information from unlabeled corpora. We show that incorporating this information
into a classification framework leads to performance improvements. The result
is a system that outperforms traditional methods while eliminating the reliance
on manual effort for extracting semantic data.
We then attack the problem of reducing manual effort from a different
direction. We also want a system that is easy to port to new domains.
Supervised word sense disambiguation relies on annotated data for learning
sense classifiers, and the need for such data is especially acute when porting
to new domains. However, annotation is expensive since it requires a large
time investment from expert labelers.
We examine various annotation practices and propose several approaches for
making them more efficient. We evaluate the proposed approaches and compare
them to the existing ones. We show that the annotation effort can often be
reduced significantly without sacrificing the performance of the models trained
on the annotated data.