Colloquium - Pedersen

ECCR 2-28

A Statistician's View of Information Retrieval
Jan O. Pedersen
Xerox PARC

Information Retrieval is the task of identifying documents relevant to an information need. This is hard because "relevance" is difficult to objectively assess and an information need may include context that is not explicitly represented. Nonetheless, it is possible to build useful information access systems that perform remarkably well using very simple techniques. In fact, it is notoriously difficult to improve on their performance.

I will discuss why this might be the case by analyzing a few classical information retrieval tasks as problems in statistical classification. It will emerge that the high dimensionality of the feature space defeats naive attempts to improve performance. However, careful dimensionality reduction paired with appropriate classification technology yields promising results.

Jan O. Pedersen is a statistician specializing in the quantitative analysis of text for the purposes of information access. His most recent work has focused on the development of fast clustering algorithms as applied to the organization of large document collections and the design of a software architecture for information access. His other interests have included text categorization, thesaurus induction, and document filtering and routing. Jan Pedersen has degrees from Princeton University (AB) and Stanford University (PhD). He joined Xerox Corp. in 1986 and is currently manager of the Quantitative Content Analysis Area of the Information Sciences and Technology Laboratory, Xerox PARC.

Refreshments will be served immediately before the talk at 3:30pm.
Hosted by Andreas Weigend.

