
Thesis Defense - Nix

Machine-Learning Methods for Inferring Vocal-Tract Articulation from Speech Acoustics
David Nix
Computer Science PhD Candidate

The goal of this research is to construct an inverse mapping from speech acoustics to vocal-tract articulatory trajectories. This mapping has the potential to improve speech recognition, speaker verification, low-bit-rate speech coding, and teaching speech production to the hearing impaired.

Previous attempts to construct such a mapping have been limited by the use of primarily vowel-like speech, a lack of real articulatory data for validation, and the use of homoscedastic (unimodal with constant-variance) least-mean-squared regression techniques. Instead, this project considers continuous speech, real articulatory data from five subjects, and multi-modal regression models designed to capture the non-uniqueness of the instantaneous inverse that causes typical regression models to fail.
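The failure mode described above can be illustrated with a small synthetic example (hypothetical data, not the thesis corpus): when one acoustic observation is consistent with two distinct articulations, a least-mean-squared regressor converges to the conditional mean, which lies between the two physically realizable answers, while a multi-modal model can keep both branches. The variable names and the two-branch mapping below are illustrative assumptions.

```python
import numpy as np

# Hypothetical one-to-many mapping: an acoustic value x is produced equally
# often by articulator position y = +sqrt(x) or y = -sqrt(x), plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.0, 2000)
branch = rng.integers(0, 2, 2000)  # which articulation actually produced x
y = np.where(branch == 1, np.sqrt(x), -np.sqrt(x)) + rng.normal(0, 0.02, 2000)

# A homoscedastic least-mean-squared regressor converges to E[y|x] ~ 0,
# far from both realizable articulations.
unimodal_rmse = np.sqrt(np.mean((y - 0.0) ** 2))

# A bimodal model retains both branches; scoring each sample against its
# nearer mode (as a mixture-density model permits) recovers the noise floor.
modes = np.stack([np.sqrt(x), -np.sqrt(x)])
bimodal_rmse = np.sqrt(np.mean(np.min((y - modes) ** 2, axis=0)))

print(f"unimodal rmse: {unimodal_rmse:.3f}")  # near sqrt(E[x]) ~ 0.87
print(f"bimodal  rmse: {bimodal_rmse:.3f}")   # near the 0.02 noise level
```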

We introduce Maximum-Likelihood Articulator Trajectories (MALAT) in a theoretical speech recognition framework. After discretely partitioning the speech acoustic data, the articulatory data corresponding to each acoustic centroid are modeled with a probability density function (pdf). The MALAT algorithm takes a time series of these pdfs and, using a realistic smoothness constraint, estimates those articulatory trajectories that maximize the likelihood of the observed acoustic data.
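A dynamic-programming sketch of this idea, under simplifying assumptions not taken from the thesis (a one-dimensional articulator, Gaussian per-centroid pdfs, and a quadratic jump penalty standing in for the smoothness constraint), might look like the following; the function name and parameters are illustrative.

```python
import numpy as np

def malat_path(means, stds, grid, smooth=4.0):
    """For each frame t, means[t]/stds[t] define the Gaussian pdf of the
    articulator position given that frame's acoustic centroid.  Return the
    grid path maximizing summed log-likelihood minus a quadratic penalty
    on frame-to-frame jumps (a Viterbi-style search over grid positions)."""
    T, K = len(means), len(grid)
    # Per-frame log-likelihood of each candidate grid position (the -log std
    # term is constant within a frame, so it is omitted from the argmax).
    ll = -0.5 * ((grid[None, :] - np.asarray(means)[:, None])
                 / np.asarray(stds)[:, None]) ** 2
    jump = -smooth * (grid[None, :] - grid[:, None]) ** 2  # transition score
    score = ll[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + jump            # row = previous position
        back[t] = np.argmax(total, axis=0)       # best predecessor per cell
        score = total[back[t], np.arange(K)] + ll[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 1 - 1, -1):
        path.append(back[t][path[-1]])
    return grid[np.array(path[::-1])]

# Ambiguous middle frames (broad pdfs): the smoothness term carries the
# trajectory gradually from the well-determined start to the end position.
means, stds = [0.0, 0.0, 1.0, 1.0], [0.1, 1.0, 1.0, 0.1]
print(malat_path(means, stds, np.linspace(-0.5, 1.5, 41)))
```

The design point this sketch captures is that no single frame is inverted in isolation: broad (ambiguous) pdfs defer to the smoothness constraint, which is what lets a trajectory-level likelihood resolve the instantaneous non-uniqueness.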

The model is trained using one set of data, and performance is measured on an independent test set using root-mean-squared error (rmse) and correlation between inferred and actual trajectories. MALAT with multi-modal pdfs produces correlations averaged over articulators of 0.90-0.92 and rmses of 1.3-1.5 mm compared to correlations of only 0.79-0.84 and rmses of 1.7-1.9 mm using uni-modal pdfs.
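The two test-set measures above are standard and easy to state precisely; a minimal sketch (the function name and toy trajectories are assumptions for illustration):

```python
import numpy as np

def evaluate(actual, inferred):
    """Root-mean-squared error (in the trajectories' units, e.g. mm) and
    Pearson correlation between a measured and an inferred trajectory."""
    actual = np.asarray(actual, dtype=float)
    inferred = np.asarray(inferred, dtype=float)
    rmse = np.sqrt(np.mean((actual - inferred) ** 2))
    r = np.corrcoef(actual, inferred)[0, 1]
    return rmse, r

# Toy check: a constant offset leaves the correlation at 1.0 while the
# rmse reports the offset directly.
t = np.linspace(0.0, 1.0, 200)
rmse, r = evaluate(np.sin(2 * np.pi * t), np.sin(2 * np.pi * t) + 0.1)
print(f"rmse={rmse:.3f}  r={r:.3f}")  # rmse=0.100  r=1.000
```

Reporting both measures matters for the comparison in the paragraph above: correlation rewards capturing the shape of the trajectory, while rmse is sensitive to absolute positional error in millimeters.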

We then use Maximum-Likelihood Continuity Mapping (MALCOM), an unsupervised method previously demonstrated by Hogden (1995) to accurately infer articulatory information for vowels, and obtain correlations only 0.05-0.21 lower than with MALAT. We extend MALCOM to the Two-Observable case (TO-MALCOM), incorporating phonetic labels into training to find paths that maximize the likelihood of both the phoneme sequence and the acoustics.

TO-MALCOM and MALCOM perform comparably at capturing articulatory information, but TO-MALCOM produces paths with greater phoneme discriminability. Because it requires no measured articulatory data -- only acoustics -- TO-MALCOM has the potential to be applied to speech processing tasks as an articulation-based alternative to hidden Markov models.

Committee: Michael Mozer, Associate Professor (Chair)
Timothy Brown, Department of Electrical and Computer Engineering
Daniel Jurafsky, Department of Linguistics
James Martin, Associate Professor
Satinder Singh, Assistant Professor
Department of Computer Science
University of Colorado Boulder
Boulder, CO 80309-0430 USA
May 5, 2012 (14:20)