skip to main content
 Department of Computer Science University of Colorado Boulder
 cu: home | engineering | mycuinfo | about | cu a-z | search cu | contact cu cs: about | calendar | directory | catalog | schedules | mobile | contact cs
 home | quick links | site map | who's where | cs a-z
home · events · thesis defenses · 1998-1999 ·

# Thesis Defense - Mather

11/24/1998
12:00pm-2:00pm
ECOT 831

Enhancing Cluster-Based Retrieval through Linear Algebra
Laura A. Mather
Computer Science PhD Candidate

One of the most common models in information retrieval (IR), the vector space model, represents a document set as a term-document matrix where each row corresponds to a term and each column corresponds to a document. Because of the use of matrices in IR, it is possible to apply linear algebra to this IR model. This dissertation describes two applications of linear algebra to an information retrieval task: text clustering.

The first enhancement to IR is an algorithm that identifies the terms in the term-document matrix that distinguish between the documents in a document set. The singular value decomposition of a projected term-document matrix can be used to find highly correlated terms. Groups of highly correlated terms are analyzed to determine which are useful for distinguishing between documents. It is shown that the distinguishing feature identification algorithm can be applied to the language identification problem, resulting in a small misclassification rate based on relatively small training sets.

The second enhancement to IR is a metric the quantifies the quality of a document set clustering. The metric is based on the theory that cluster quality is proportional to the number of terms that are disjoint across the clusters. The metric compares the singular values of the term-document matrix to the singular values of the matrices for each of the clusters to determine the amount of overlap of the terms between clusters. Because the metric can be difficult to interpret, a standardization of the metric is defined which specifies the number of standard deviations a clustering of a document set is from an average, random clustering of that document set.

Empirical evidence shows that the standardized cluster metric correlates with clustered retrieval performance when comparing clustering algorithms or multiple parameters for the same clustering algorithm.

Finally, a theorem is proven that shows that the ideal case for the cluster metric occurs if and only if the term-document matrix is block-diagonal.

 Committee: James Martin, Associate Professor (Chair) Richard Byrd, Professor Dirk Grunwald, Associate Professor Thomas Landauer, Department of Psychology Joe McCloskey, Department of Defense Charles Nicholas, University of Maryland

 Department of Computer Science College of Engineering and Applied Science University of Colorado Boulder Boulder, CO 80309-0430 USA Questions/Comments? Send email to webmaster at cs.colorado.edu Engineering Center Office Tower ECOT 717 +1-303-492-7514 FAX +1-303-492-2844
 XHTML 1.0/CSS2 ©2012 Regents of the University of Colorado Privacy · Legal · Trademarks May 5, 2012 (13:40)

.