Computer Science PhD Candidate

11/24/1998

12:00pm-2:00pm

One of the most common models in information retrieval (IR), the vector space model, represents a document set as a term-document matrix where each row corresponds to a term and each column corresponds to a document. Because of the use of matrices in IR, it is possible to apply linear algebra to this IR model. This dissertation describes two applications of linear algebra to an information retrieval task: text clustering.

The first enhancement to IR is an algorithm that identifies the terms in the term-document matrix that distinguish between the documents in a document set. The singular value decomposition of a projected term-document matrix can be used to find highly correlated terms. Groups of highly correlated terms are analyzed to determine which are useful for distinguishing between documents. It is shown that the distinguishing feature identification algorithm can be applied to the language identification problem, resulting in a small misclassification rate based on relatively small training sets.

The second enhancement to IR is a metric the quantifies the quality of a document set clustering. The metric is based on the theory that cluster quality is proportional to the number of terms that are disjoint across the clusters. The metric compares the singular values of the term-document matrix to the singular values of the matrices for each of the clusters to determine the amount of overlap of the terms between clusters. Because the metric can be difficult to interpret, a standardization of the metric is defined which specifies the number of standard deviations a clustering of a document set is from an average, random clustering of that document set.

Empirical evidence shows that the standardized cluster metric correlates with clustered retrieval performance when comparing clustering algorithms or multiple parameters for the same clustering algorithm.

Finally, a theorem is proven that shows that the ideal case for the cluster metric occurs if and only if the term-document matrix is block-diagonal.

Committee: |
James Martin, Associate Professor (Chair)Richard Byrd, ProfessorDirk Grunwald, Associate ProfessorThomas Landauer, Department of PsychologyJoe McCloskey, Department of DefenseCharles Nicholas, University of Maryland |

Department of Computer Science

University of Colorado Boulder

Boulder, CO 80309-0430 USA

webmaster@cs.colorado.edu

University of Colorado Boulder

Boulder, CO 80309-0430 USA

webmaster@cs.colorado.edu