Department of Computer Science
University of Colorado Boulder
Thesis Defense - Cer
7/27/2010 1:00pm-3:00pm CINC 102
Parameterizing Phrase Based Statistical Machine Translation Models: An Analytic Study
Daniel M. Cer
Computer Science PhD Candidate
The translation of a sentence from one language to another by a statistical
machine translation system is guided by knowledge sources that score competing
candidate translations. These knowledge sources encode such factors as the
fluency of the translation, the appropriateness of individually translated
words and phrases, the word-order differences between the two languages, as well
as other factors such as preferences for longer or shorter translations. Obtaining
good translations depends critically on the proper weighting of the scores
provided by these knowledge sources. Such weighting is typically performed
using minimum error rate training (MERT). In this dissertation, I investigate
the effectiveness of different optimization algorithms for MERT and the
properties of systems trained to different learning criteria. The results show
that the most common optimization approach to MERT tends to perform worse than
alternatives. The experiments also challenge long-standing assumptions about
the relationship between the training criteria used and the actual quality of
the translations produced by the system. Specifically, it is shown that there
are no sizable differences in the performance of systems trained to different
popular surface-level criteria, such as BLEU, METEOR, and TERp. A novel method
is presented for training to a more computationally intensive and
semantically oriented training criterion. To perform
the experiments, I developed a new state-of-the-art machine translation system
known as Phrasal.