skip to main content
Department of Computer Science University of Colorado Boulder
cu: home | engineering | mycuinfo | about | cu a-z | search cu | contact cu cs: about | calendar | directory | catalog | schedules | mobile | contact cs
home · events · thesis defenses · 1998-1999 · 

Thesis Defense - Mather

ECOT 831

Enhancing Cluster-Based Retrieval through Linear Algebra
Laura A. Mather
Computer Science PhD Candidate

One of the most common models in information retrieval (IR), the vector space model, represents a document set as a term-document matrix where each row corresponds to a term and each column corresponds to a document. Because of the use of matrices in IR, it is possible to apply linear algebra to this IR model. This dissertation describes two applications of linear algebra to an information retrieval task: text clustering.

The first enhancement to IR is an algorithm that identifies the terms in the term-document matrix that distinguish between the documents in a document set. The singular value decomposition of a projected term-document matrix can be used to find highly correlated terms. Groups of highly correlated terms are analyzed to determine which are useful for distinguishing between documents. It is shown that the distinguishing feature identification algorithm can be applied to the language identification problem, resulting in a small misclassification rate based on relatively small training sets.

The second enhancement to IR is a metric the quantifies the quality of a document set clustering. The metric is based on the theory that cluster quality is proportional to the number of terms that are disjoint across the clusters. The metric compares the singular values of the term-document matrix to the singular values of the matrices for each of the clusters to determine the amount of overlap of the terms between clusters. Because the metric can be difficult to interpret, a standardization of the metric is defined which specifies the number of standard deviations a clustering of a document set is from an average, random clustering of that document set.

Empirical evidence shows that the standardized cluster metric correlates with clustered retrieval performance when comparing clustering algorithms or multiple parameters for the same clustering algorithm.

Finally, a theorem is proven that shows that the ideal case for the cluster metric occurs if and only if the term-document matrix is block-diagonal.

Committee: James Martin, Associate Professor (Chair)
Richard Byrd, Professor
Dirk Grunwald, Associate Professor
Thomas Landauer, Department of Psychology
Joe McCloskey, Department of Defense
Charles Nicholas, University of Maryland

See also:
Department of Computer Science
College of Engineering and Applied Science
University of Colorado Boulder
Boulder, CO 80309-0430 USA
Send email to

Engineering Center Office Tower
ECOT 717
FAX +1-303-492-2844
XHTML 1.0/CSS2 ©2012 Regents of the University of Colorado
Privacy · Legal · Trademarks
May 5, 2012 (13:40)