Dr. Susan Gauch
Electrical Engineering and Computer Science
University of Kansas
233 Snow Hall
Lawrence, KS 66045-2228
Phone: (913) 864-8817
Fax: (913) 864-4971
Email: sgauch@tisl.ukans.edu
URL: http://www.eecs.ukans.edu/~sgauch/corpus.html
A Testbed for the Application of Corpus Linguistics to Information Retrieval
(IRI-9409263, $100,000, Aug. 1994 - Aug. 1997)
Searching online text collections can be both rewarding and frustrating. Expanding a user's query with related words can improve search performance, but finding and using related words is an open problem. There are three main sources for related words which vary in their level of specificity:
- query specific (e.g., including terms from a particular query's retrieval set with or without relevance feedback);
- corpus specific (analyzing the contents of a particular full-text database to identify terms used in similar ways); and
- language specific (using generally available online thesauri which are not tailored for any particular text collection).
We have adopted a corpus-specific approach for locating related terms. We are particularly interested in these techniques because the main calculations are done a priori, before the user queries arrive. Also, because the information is built from the specific text collection, the related terms are automatically tuned for the particular database being searched. Finally, since our approach is entirely statistical it should, in principle, be applicable to databases in different languages, although we have not tested this.
We have modified a corpus linguistics approach [Finch & Chater, 1992] that creates a matrix of term-term similarities. For words to be considered similar, they need not actually co-occur; however, they must occur in similar contexts. For example, we could deduce that "color" and "colour" are highly similar words because they are used in similar contexts, even though they are not likely to both appear in the same document. This approach is similar to others in that word usage within a given window is recorded. However, we are able to effectively use a much smaller window size (7 words rather than 50 or more) because we take word occurrence order into account. The fact that the word "the" appears immediately prior to a word w carries much more information about w (i.e., it is a noun or an adjective) than just the fact that the word "the" appears within a 7-word window.
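To make the role of word order concrete, the following is a minimal Python sketch, not the project's actual implementation. It assumes the 7-word window means three positions on either side of the target word, and it uses cosine similarity as a stand-in because the similarity measure is not specified here. The function names, the toy sentence, and the choice of context words are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def positional_context_vectors(tokens, targets, context_words, half_window=3):
    """For each target word, count how often each context word occurs
    at each relative offset (-3..-1, +1..+3 by default)."""
    targets = set(targets)
    context_words = set(context_words)
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if word not in targets:
            continue
        for offset in range(-half_window, half_window + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            if tokens[j] in context_words:
                # The key records position as well as identity, so "the"
                # directly before a word differs from "the" two words after.
                vectors[word][(offset, tokens[j])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (an assumed measure)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: the two spellings never co-occur in a phrase,
# but they appear in the same positional contexts.
tokens = "the color of the sky and the colour of the sea".split()
vectors = positional_context_vectors(
    tokens,
    targets=["color", "colour", "sky"],
    context_words=["the", "of", "and"],
)
sim_spelling = cosine(vectors["color"], vectors["colour"])
sim_other = cosine(vectors["color"], vectors["sky"])
print(sim_spelling, sim_other)
```

On this toy input, "color" and "colour" share three positional contexts and score higher than "color" and "sky", which share only two.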
Initial experiments with the Wall Street Journal portion of the Tipster collection (0.25 GB, 45 queries) investigated parameter settings during the similarity calculation and query expansion phases for a single database. The following conclusions were reached:
- A window size of 7, a sample size of 50 MB, and 200 observed context words gave the best performance/complexity trade-off.
- Stemming the database made no discernible difference.
- Recording the position of observed context words was extremely important.
- Calculating the similarity of 4,000 fairly frequent words was sufficient.
- Expanding based on a combination of the number of expansion words per query word and similarity threshold was best.
- Normalizing the weights for each query word after expansion was important.
- Expanding the queries as described above improved the 11-point average by 8.7% over unexpanded queries.
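The expansion and normalization steps in the list above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the weighting scheme actually used in the experiments: each query word keeps a weight of 1.0, each expansion word is weighted by its similarity, and the weights in each query word's group are scaled to sum to 1. The function name, the parameter defaults, and the similarity values are hypothetical.

```python
def expand_query(query_terms, similar, max_per_term=3, threshold=0.5):
    """Expand each query term with up to max_per_term similar words whose
    similarity is at least threshold, then normalize per query term."""
    weights = {}
    for term in query_terms:
        # Rank this term's candidate expansion words by similarity.
        neighbours = sorted(
            similar.get(term, {}).items(), key=lambda kv: kv[1], reverse=True
        )
        # Apply both limits: the similarity threshold and the per-term cap.
        kept = [(w, s) for w, s in neighbours if s >= threshold][:max_per_term]
        group = [(term, 1.0)] + kept
        # Normalize so every original query word contributes a total
        # weight of 1, however many expansion words it picked up.
        total = sum(s for _, s in group)
        for word, s in group:
            weights[word] = weights.get(word, 0.0) + s / total
    return weights


# Hypothetical similarity matrix rows, stored as nested dictionaries.
similar = {"color": {"colour": 0.9, "hue": 0.6, "sky": 0.2}}
expanded = expand_query(["color", "film"], similar, max_per_term=2, threshold=0.5)
print(expanded)
```

Without the normalization step, a query word with many close neighbours would dominate the expanded query simply because it contributed more terms.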
To validate the results above, we used the same techniques on a different corpus,
the Cystic Fibrosis database. Although this is a much smaller corpus, 4.9 MB, we
were able to achieve excellent results due to the specialized use of language in
medically-oriented research literature. Expanding queries based on our
automatically generated similarity matrix resulted in improvements ranging
from a maximum of 28.5% for documents deemed relevant by just one judge to a
minimum of 6.7% for documents deemed relevant by all six judges. This confirms our belief
that this approach is applicable in corpora collected around a particular topic in
which a specialized sub-language is used (and for which there is unlikely to exist
a ready-made thesaurus).
Currently, we are extending our work to multiple databases. We have created a
similarity matrix for each of the seven databases in the 2.1 GB TREC 5 Category A
collection (AP, CR, FR88, FR94, FT, WSJ, and ZIFF). For comparison, we created
a single similarity matrix from a sample taken across all seven databases. We
expanded each query by each of the eight matrices, in turn, and submitted the
resulting queries to the entire collection. When the queries are expanded using
the single matrix formed by sampling across the databases, search performance is
worse than with no expansion at all. However, if the best of the seven
sub-collection matrices is selected, substantial search improvements are possible
(23%). Automatically identifying the best matrix for a particular query remains
under investigation.
Five Graduate Research Assistants (two of whom are female) have received
support through this grant and one undergraduate (also female) received
support through an REU supplement. One Master's project has been completed
based on this project [Chong, 1995] and a Master's thesis is currently underway.
Publications Citing NSF Support
- Casasola, E. and Gauch, S. "Intelligent Search Agents for the World Wide Web," IEEE Expert. (pending)
- Chong, M.K. "Corpus Linguistics Techniques Applied to a Large Data Set," Master's Project Report, University of Kansas, Dept. of EECS, December 1995.
- Gauch, S. and Chong, M.K. "Automatic Word Similarity Detection for TREC 4 Query Expansion," Proc. of the Fourth Text REtrieval Conference (TREC4), Nov. 1995, Gaithersburg, MD. pp. 527-536.
- Gauch, S. and Wang, G. "Information Fusion with ProFusion," Proc. of WebNet '96: The First World Conference of the Web Society, San Francisco, CA, October 1996. pp. 174-179.
- Gauch, S., Wang, G. and Gomez, M. "ProFusion: Intelligent Fusion from Multiple, Distributed Search Engines," Journal of Universal Computer Science, Vol. 2, No. 9, Sept. 1996.
- Gauch, S. and Wang, J. "TREC5 Query Expansion Based on Corpus Statistics," Proc. of the Fifth Text REtrieval Conference (TREC5), Nov. 1996, Gaithersburg, MD (to appear).
- Gauch, S. and Wang, J. "A Corpus Analysis Approach for Automatic Query Expansion and its Application to Multi-Database Collections," ACM SIGIR '97. (submitted)
- Haverkamp, D.S. and Gauch, S. "Intelligent Information Agents: Review and Challenges for Distributed Information Sources," Journal of the American Society for Information Science, (under revision).
NSF "NUGGETS"
- Statistical techniques can detect word similarities useful for query expansion in as little as 5 MB of raw text.
- Word similarities calculated from a sample taken across multiple databases degrade retrieval results when used for query expansion. However, if separate similarities are calculated for each of the databases, query expansion based on the correct similarity information may dramatically improve search performance.
- Improvements of over 20% are possible when expanding user queries from a statistically created list of similar words.