Thesis Defense - Mangalath

The Construction of Meaning - The Role of Context in Corpus Based Approaches to Language Modeling
Praful Mangalath
Computer Science PhD Candidate

This dissertation presents a framework for statistically modeling words and sentences, focusing on the role of context in learning semantic representations from a corpus. In recent years, Latent Semantic Analysis (LSA) and probabilistic topic models such as Latent Dirichlet Allocation (LDA) have both found success in the psycholinguistics community as theories of meaning and models of language understanding. They also serve as important components of information retrieval, machine translation, and document summarization systems, among other applications. However, sentences have a rich set of semantic and syntactic features that these models cannot represent accurately, because they rest on an order-independent bag-of-words assumption. This dissertation develops a model that takes these syntagmatic and paradigmatic constraints into account and provides a better account of sentence processing.

The Construction Integration II (CI-II) model is a cognitively plausible computational account of how language is acquired and stored as representations in long-term memory, which are then retrieved contextually to generate meaning in working memory. Semantic constraints are modeled using LSA, the Topics Model, and context co-occurrence probabilities. Syntactic constraints are modeled using n-grams and Dependency Grammar. In short, I show how a text is structurally decomposed and combined with the comprehender's prior knowledge in order to understand it. The model is a dual-memory model in that it distinguishes between a gist level and an explicit level. It demonstrates how the expressiveness gained by explicitly modeling context improves word sense disambiguation.

This dissertation develops a new metric, Dependency Edit Distance, that structurally decomposes sentences into dependency relations and measures similarity in terms of the semantic and syntactic cost of transforming one sentence into the other. It further applies supervised machine learning to these measures, computed over labeled pairs of sentences, to build models whose predictive accuracy matches that of human raters. The long-term goal of this research is to turn this model into software that helps students learn in an instructional environment capable of assessing their comprehension. I show data from two experiments in which student responses were automatically graded; the results show strong potential for such a practical realization.
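
For readers unfamiliar with edit-distance metrics over dependency relations, the following is a minimal illustrative sketch in Python, not the dissertation's actual definition. It assumes each sentence has already been parsed into a list of (head, relation, dependent) triples, uses a standard Levenshtein-style dynamic program, and stands in exact string matching for a real semantic similarity measure. The names `substitution_cost` and `dependency_edit_distance`, and the equal weighting of syntactic and lexical cost, are assumptions made for illustration only.

```python
from typing import List, Tuple

# One dependency relation: (head word, relation label, dependent word).
Dep = Tuple[str, str, str]


def substitution_cost(a: Dep, b: Dep) -> float:
    """Hypothetical cost of turning relation a into relation b.

    Half of the cost is syntactic (relation label mismatch) and half is
    lexical (head/dependent word mismatch). Exact string comparison is a
    placeholder for a real semantic similarity measure.
    """
    syntactic = 0.0 if a[1] == b[1] else 1.0
    lexical = ((a[0] != b[0]) + (a[2] != b[2])) / 2.0
    return 0.5 * syntactic + 0.5 * lexical


def dependency_edit_distance(src: List[Dep], tgt: List[Dep]) -> float:
    """Levenshtein-style distance between two lists of dependency relations.

    Inserting or deleting a relation costs 1.0 (an assumed value);
    substituting one relation for another costs substitution_cost.
    """
    rows, cols = len(src) + 1, len(tgt) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete every source relation
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert every target relation
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,  # deletion
                dist[i][j - 1] + 1.0,  # insertion
                dist[i - 1][j - 1] + substitution_cost(src[i - 1], tgt[j - 1]),
            )
    return dist[-1][-1]


# Hand-written relations for two sentences that share the same bag of words.
dog_bites_man = [("bites", "nsubj", "dog"), ("bites", "dobj", "man")]
man_bites_dog = [("bites", "nsubj", "man"), ("bites", "dobj", "dog")]
print(dependency_edit_distance(dog_bites_man, man_bites_dog))
```

Under these assumptions, the two example sentences contain identical words yet receive a nonzero distance, because the words fill different dependency roles. A bag-of-words representation cannot make that distinction.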

Committee: James Martin, Professor (Chair)
Walter Kintsch, Department of Psychology
Martha Palmer, Department of Linguistics
Wayne Ward, Research Professor
Qin (Christine) Lv, Assistant Professor
Thomas Landauer, Department of Psychology
Department of Computer Science
University of Colorado Boulder
Boulder, CO 80309-0430 USA
May 5, 2012 (14:20)