Dr. Susan Gauch

Electrical Engineering and Computer Science
University of Kansas
233 Snow Hall
Lawrence, KS 66045-2228
Phone: (913) 864-8817
Fax: (913) 864-4971
Email: sgauch@tisl.ukans.edu
URL: http://www.eecs.ukans.edu/~sgauch/corpus.html



A Testbed for the Application of Corpus Linguistics to Information Retrieval

(IRI-9409263, $100,000, Aug. 1994 - Aug. 1997)

Searching online text collections can be both rewarding and frustrating. Expanding a user's query with related words can improve search performance, but finding and using related words is an open problem. There are three main sources for related words which vary in their level of specificity:

  1. query specific (e.g., including terms from a particular query's retrieval set with or without relevance feedback);
  2. corpus specific (analyzing the contents of a particular full-text database to identify terms used in similar ways); and
  3. language specific (using generally available online thesauri which are not tailored for any particular text collection).

We have adopted a corpus-specific approach for locating related terms. We are particularly interested in these techniques because the main calculations are done a priori, before the user queries arrive. Also, because the information is built from the specific text collection, the related terms are automatically tuned for the particular database being searched. Finally, since our approach is entirely statistical it should, in principle, be applicable to databases in different languages, although we have not tested this.

We have modified a corpus linguistics approach [Finch & Chater, 1992] that creates a matrix of term-term similarities. For words to be considered similar, they need not actually co-occur; however, they must occur in similar contexts. For example, we could deduce that "color" and "colour" are highly similar words because they are used in similar contexts, even though they are unlikely to both appear in the same document. This approach is similar to others in that word usage within a given window is recorded. However, we are able to use a much smaller window size effectively (7 words rather than 50 or more) because we take word occurrence order into account. The fact that the word "the" appears immediately prior to a word w carries much more information about w (i.e., that it is a noun or an adjective) than the mere fact that "the" appears somewhere within a 7-word window around w.
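The idea of position-aware contexts can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the toy sentence data, and the use of cosine similarity as the comparison measure are all assumptions made for illustration. Each word is described by counts of (offset, neighbor) pairs within a 7-word window, so that "the" immediately before a word is a different feature from "the" two positions after it.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(tokens, half_window=3):
    """Count (offset, neighbor) pairs for every word in the token list.

    half_window=3 gives a 7-word window: 3 words before, the target, 3 after.
    Keeping the signed offset records word order, not just co-occurrence.
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for offset in range(-half_window, half_window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                vectors[word][(offset, tokens[j])] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (an assumed measure)."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (invented): "color" and "colour" never share a window,
# yet they appear in the same positional contexts.
tokens = ("the color of the paint faded . the colour of the paint faded . "
          "the price of the stock fell .").split()
vectors = context_vectors(tokens)

sim_variant = cosine(vectors["color"], vectors["colour"])
sim_other = cosine(vectors["color"], vectors["faded"])
print(sim_variant > sim_other)
```

In this toy data the two spelling variants score higher against each other than "color" does against "faded", even though they never co-occur within a window.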

Initial experiments with the Wall Street Journal database from the Tipster collection (0.25 GB, 45 queries) investigated parameter settings during the similarity calculation and query expansion phases for a single database. The following conclusions were reached:

To validate the results above, we used the same techniques on a different corpus, the Cystic Fibrosis database. Although this is a much smaller corpus (4.9 MB), we were able to achieve excellent results due to the specialized use of language in medically oriented research literature. Expanding queries based on our automatically generated similarity matrix resulted in improvements ranging from a maximum of 28.5% for documents deemed relevant by just one judge to a minimum of 6.7% for documents deemed relevant by all six judges. This confirms our belief that this approach is applicable to corpora collected around a particular topic in which a specialized sub-language is used (and for which a ready-made thesaurus is unlikely to exist).

Currently, we are extending our work to multiple databases. We have created a similarity matrix for each of the seven databases in the 2.1 GB TREC 5 Category A collection (AP, CR, FR88, FR94, FT, WSJ, and ZIFF). For comparison, we created a single similarity matrix from a sample taken across all seven databases. We expanded each query with each of the eight matrices in turn and submitted the resulting queries to the entire collection. When the queries are expanded using the single matrix formed by sampling across the databases, search performance is worse than with no expansion at all. However, if the best of the seven sub-collection matrices is selected, substantial search improvements are possible (23%). Automatically identifying the best matrix for a particular query remains under investigation.
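Expanding one query with several matrices in turn can be sketched as follows. This is a hedged illustration only: the function name expand_query, the top_k and threshold parameters, and every similarity value shown are invented for the example and are not taken from the project. Each matrix is represented as a nested dictionary mapping a term to its similar terms and scores.

```python
def expand_query(query_terms, similarity, top_k=2, threshold=0.5):
    """Append up to top_k similar terms per query term, best score first.

    `similarity` maps term -> {related term: score}; terms scoring below
    `threshold` are ignored. Both parameters are hypothetical tuning knobs.
    """
    expanded = list(query_terms)
    for term in query_terms:
        neighbors = sorted(similarity.get(term, {}).items(),
                           key=lambda pair: pair[1], reverse=True)
        for candidate, score in neighbors[:top_k]:
            if score >= threshold and candidate not in expanded:
                expanded.append(candidate)
    return expanded

# Invented per-collection matrices; real ones would be computed from each database.
similarity_by_collection = {
    "WSJ": {"stock": {"share": 0.8, "equity": 0.7, "bond": 0.4}},
    "FR94": {"stock": {"inventory": 0.6}},
}

query = ["stock", "prices"]
expanded = {name: expand_query(query, matrix)
            for name, matrix in similarity_by_collection.items()}
for name, terms in expanded.items():
    print(name, terms)
```

The same query acquires different expansion terms depending on which collection's matrix is used, which is why the choice of matrix matters for search performance.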

Five Graduate Research Assistants (two of whom are female) have received support through this grant and one undergraduate (also female) received support through an REU supplement. One Master's project has been completed based on this project [Chong, 1995] and a Master's thesis is currently underway.

Publications Citing NSF Support

  1. Casasola, E. and Gauch, S. "Intelligent Search Agents for the World Wide Web," IEEE Expert. (pending)
  2. Chong, M.K., "Corpus Linguistics Techniques Applied to a Large Data Set," Master's Project Report, University of Kansas, Dept. of EECS, December 1995.
  3. Gauch, S. and Chong, M.K. "Automatic Word Similarity Detection for TREC 4 Query Expansion," Proc. of the Fourth Text REtrieval Conference (TREC4), Nov. 1995, Gaithersburg, MD. pp. 527-536.
  4. Gauch, S. and Wang, G. "Information Fusion with ProFusion," Proc. of WebNet '96: The First World Conference of the Web Society, San Francisco, CA, October 1996. pp. 174-179.
  5. Gauch, S., Wang, G. and Gomez, M. "ProFusion: Intelligent Fusion from Multiple, Distributed Search Engines," Journal of Universal Computer Science, Vol. 2, No. 9, Sept. 1996.
  6. Gauch, S. and Wang, J. "TREC5 Query Expansion Based on Corpus Statistics," Proc. of the Fifth Text REtrieval Conference (TREC5), Nov. 1996, Gaithersburg, MD (to appear).
  7. Gauch, S. and Wang, J. "A Corpus Analysis Approach for Automatic Query Expansion and its Application to Multi-Database Collections," ACM SIGIR '97. (submitted)
  8. Haverkamp, D.S. and Gauch, S. "Intelligent Information Agents: Review and Challenges for Distributed Information Sources," Journal of the American Society for Information Science, (under revision).
