Concept Based Categorization
and Search on Internet:
A Machine Learning Approach
NSF Principal Investigator:
Professor Hsinchun Chen
Management Information Systems
The University of Arizona
Tucson, AZ, USA 85721
(Head, UA/MIS AI Group; Senior Research Scientist, NCSA)
Email: hchen@bpa.arizona.edu
Phone: (520) 621-4153
Fax: (520) 621-2433
URL: http//ai.bpa.arizona.edu
Introduction
Despite the usefulness of database technologies, users of online information systems are often overwhelmed by the amount of current information, the subject and system knowledge required to access this information, and the constant influx of new informati
on. (4) This problem is termed information overload (2). A second difficulty associated with information retrieval and information sharing is the classic vocabulary problem, which is a consequence of system users' diverse backgrounds and ar
eas of expertise (11, 12, 3). The fluidity of concepts and vocabularies in various domains further complicates the retrieval issue (3, 9, 10). A concept may be perceived differently by different searchers and research has shown that different ind
exers, well trained in an indexing scheme, often assign different index terms for the same document (1). A more intelligent and proactive search aid is needed to address information overload and the vocabulary problem in a large info
rmation space, such as the Internet, that is used by searchers of varying backgrounds.
Dr. Hsinchun Chen's Artificial Intelligence Lab at the University of Arizona has been working in conjunction with the University of Illinois Digital Library Initiative to develop techniques to help users' cope with information overload and the <
I>vocabulary problem when searching online databases. These techniques include concept spaces, self-organizing maps, and intelligent agents (spiders).
Research Questions
The main forms of retrieval used on the World Wide Web are keyword searches or hypertext browsing. Keyword searches often result in low precision, poor recall, and slow response time due to the limitation of the indexing and the communicati
on methods (bandwidth), controlled language based interface (the vocabulary problem), and the inability of searchers themselves to fully articulate their needs. Furthermore, browsing allows users to explore only a very small portion of the large Internet
information space. An extensive information space that is accessed through hypertext-like browsing can also potentially confuse and disorient its user, the embedded digression problem, and it can cause the user to spend a great deal of time witho
ut learning anything specific, a difficulty know as the art museum phenomenon.
To address these problems, we have developed several inductive machine learning techniques to:
- Create concept/subject categories of Internet servers (homepages) automatically
(Internet categorization: multi-layered Kohonen self-organizing feature map)
- Create a concept space for each subject category for associative information retrieval
(concept-based search: cluster analysis and Hopfield network association)
- Create a genetic algorithm based intelligent spider (agent) for an individual searcher
(intelligent spider (agent): genetic algorithms)
Kohnen Self-organizing Map
Demo is available at:
http://ai2.BPA.Arizona.EDU/ent/entertain1/
Quillian originally suggested that semantic networks could be used to encode and associate word meanings. Such networks could then be used to visualize or construct mental models of an information space (16). Neural network algorithms, in particular, a
ppear to be a natural starting point for organizing large amounts of information in a manner consistent with human mental models. After examining several neural network algorithms in previous work, our research group concluded that a variant of the Kohon
en self-organizing feature maps (SOM) appears to be a most promising algorithm for organizing large volumes of information. The algorithm can be used to create an intuitive, graphical display of the important concepts contained in textual information (13,
15). The ET-MAP was developed using 110,000 homepages from the Internet Yahoo! Entertainment sub-category. The following is an outline of the algorithm used:
- Initialise input nodes, output nodes, and connection weights.
- Present each homepage in order.
- Compute distances to all node.
- Select winning node j* and update weights to node j* and neighbors.
- Label regions in the map.
- Apply the above steps recursively for larger regions.
Concept Space Approach
Demo is available at:
http://ai.bpa.arizona.edu/cgi-bin/tng/ETSpace
In previous NSF-funded research, we developed a concept space approach to automatic thesaurus generation. Since then, this approach has been used successfully in generating domain-specific thesauri for various applications, including Russian computing
(6, 7), business meetings (5), molecular biology, and engineering (8). The following is a brief overview of our algorithms in the context of the Internet categorization and search project:
- Identify Internet homepage collection.
- Perform automatic indexing and term weighting.
- Perform cluster analysis.
- Use the concept space for associative retrieval.
The Itsy Bitsy Spider
Demo is available at:
http://ai.bpa.arizona.edu/~mramsey/SPIDER/itsy.html
The basic idea behind agent research is the need to develop software systems which engage and help all types of end users (17). Our goal is to create intelligent agents that performs customized, dynamic search (information retrieval) and textual analys
is in the World-Wide Web environment for any user. In order to allow our agents to be goal-oriented, cooperative, and customized, we have adopted a machine learning approach based on our previous work in textual analysis. Although des
igning the system architecture for such an agent is a non-trivial undertaking, we intentionally designed the spider to be compact, dynamic, and friendly -- a ``smart itsy bitsy spider.''
A Web agent system architecture was designed first, before we proceeded to the actual implementation. The system architecture consists of five modular components, each accomplishing a unique task:
- Task requests and search control parameters are solicited from the user.
- A graphical user interface implemented in Java takes user input and generates intermediate results and a final search summary.
- The Jaccard's similarity function computes a link score and a keyword score for each newly explored and indexed homepage.
- A search engine supports the best first search and genetic algorithm, and
- An HTTP protocol based program fetches a remote homepage.
This modular design allows us to experiment with different subcomponents over time and eventually converge on a more compact and efficient system implementation.
Findings
- Kohonen Map
Our results indicate that a Kohonen SOM-based algorithm can successfully categorize a large and eclectic Internet information space (the Entertainment sub-category of Yahoo!) into manageable sub-spaces that a user can successfully navigate to locate a hom
epage of interest. The SOM algorithm works best with browsing tasks that are very broad, and in which subjects skip around between categories. Experienced subjects especially liked the visual and graphical aspects of the map, but the map did not work well
for subjects who tried to do a directed search or preferred to use the more familiar mental models (alphabetic or hierarchical organization) for browsing.
- Concept Space
The results from concept space experiments were especially encouraging. There were no significant differences in the precision measures for the set of documents identified by subject-suggested terms, thesaurus-suggested terms and the combination of subjec
t- and thesaurus-suggested terms. The recall measurements indicated that the combination of subject and thesaurus-suggested terms was significantly superior to the recall ability of the subject-suggested terms. Furthermore, analysis of the homepages indic
ated that there was limited overlap between the homepages retrieved by the subject-suggested and thesaurus-suggested terms. Since the retrieved homepages were different for the most part, this suggests that a user can enhance and improve a keyword-based s
earch by using an automatically generated thesaurus (or concept space). The concept space approach worked best for more experienced searchers, and for searches that began with very broad terms that needed to be narrowed or refined. Subjects especially lik
ed the level of control that they could exert over the search and the fact that the terms suggested by the thesaurus were "real" (i.e. originating in the homepages) and therefore guaranteed to have retrieval success.
- Itsy Bitsy Spider
In benchmark testing, although the genetic algorithm spider did not out-perform the best first search spider, we found both results to be comparable and complementary. In user evaluation, the genetic algorithm spider obtained significantly higher recall
value than the best first search spider. However, their precision values were not statistically different. The mutation process introduced in genetic algorithm allowed users to find other potential relevant homepages that could not be explored via a con
ventional local search process. In addition, we found the Java-based interface to be a necessary component for design of a truly interactive and dynamic Web agent.
Future Directions
In our ongoing effort in the Illinois Digital Library Initiative
project, we are in the process of exploring other general-purpose
search and classification algorithms for Web
resource categorization and searching.
Publications from the Funded Research
- H. Chen, C. Schuffels, and R. Orwig, Internet Categorization and Search: A Machine Learning Approach. Journal of Visual Communication and Image Representation, Special Issue on Digital Libraries, 7(1):88-102, 1996.
- B. Schatz and H. Chen. Building Large-scale Digital Libraries. IEEE Computer, Special Issue on "Building Large-scale Digital Libraries," 29(5):22-27, May 1996.
- H. Chen, Y. Chung, M. Ransey, and C. Yang. An Intelligent Personal Spider (Agent) for Dynamic Internet/Intranet Searching. Decision Support Systems, forthcoming 1997.
- H. Chen, A. L. Houston, R. R. Sewell, and B. R. Schatz. Internet Browsing and Searching: User Evaluation of category Map and Concept Space Techniques. Journal of the American Society for Information Science, Special Issue on AI Techniques for th
e Emerging Information Systems Application, forthcoming 1997.
- H. Chen, Y. Chung, M. Ramsey, and C. Yang. A Smart Itsy Bitsy Spider. Journal of the American Society for Information Science, Special Issue on AI Techniques for the Emerging Information Systems Application, forthcoming 1997.
- H. Chen and B. R. Schatz. Three Approaches to Controlling the Information Explosion on the Internet: Category Map, Concept Space, and the Internet Agent, submitted to ACM Digital Library Conference, 1997.
- H. Chen, Y. Chung, M. Ramsey, C. Yang, P. Ma, and J. Yen. Intelligent Agents on the Internet. Proceedings of the 30th Annual Hawaii International Conference on System Sciences (HICSS-30), Maui, Hawaii, January 7-10, 1997.
- H. Chen, B. R. Schatz, and C. Lin. Concept Classification and Search on Internet Using Machine Learning and Parallel Computing Techniques. World Wide Web 4 Conference, Boston, MA, December 11-14, 1995, p. 58-59, Poster Proceedings.
- H. Chen and B. Schatz. Semantic Retrieval for the NCSA Mosaic. Proceedings of the Second International World Wide Web Conference '94, Chicago, IL, October 17-20, 1994.
References
(1) M. J. Bates. Subject access in online catalogs: A design model. Journal of the American Society for Information Science, 37(6):357-376, November 1986.
(2) D. C. Blair and M. E. Maron. An evaluation of retrieval effectiveness for a full-text document retrieval system. Communications of the ACM, 28(3):289-299, 1985.
(3) H. Chen. Collaborative systems: solving the vocabulary problem. IEEE Computer, Special Issue on Computer-Supported Cooperative Work (CSCW), 27 (5):58-66, May 1994.
(4) H. Chen and V. Dhar. User misconceptions of online information retrieval system. International Journal of Man-machine Studies, 32(6):673-692, June 1990.
(5)H. Chen, P. Hsu, R. Orwig, L. Hoopes, and J. F. Nunamaker. Automatic concept classification of text from electronic meeting. Communication of the ACM, 37(10):56-73, October 1994.
(6) H. Chen and K. J. Lynch. Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, 22(5):885j-902, September/October 1992.
(7) H. Chen, K. J. Lynch, K. Basu, and T. Ng. Generating, integrating, and activating thesauri for concept-based document retrieval. IEEE Expert, Special Series on Artificial Intelligence in Text-based Infomation Systems, 8(2):25-34, April 1993.
(8) H. Chen, B. R. Schatz, T. Yim, and D. Fye. Automatic thesuarus generation for an electronic community system. Journal of the American Society for Information Science, 46(3):175-193, April 1995.
(9) J. Courteau. Genome databases. Science, 24:201-207, October 11, 1991.
(10) K. A. Frenkel. The human genome project and informatics. Communications of the ACM, 34(11):41-51, November 1991.
(11) G. W. Furnas. Statistical semantics: How can a computer use what people name things to guess what things people mean when they name things. Proceedings of the Human Factors in Computer Systems Conference, March 1982, Gaithersburg, MD, 251-25
3.
(12) G. W.Furnas, T. K. Landauer, L. M. Gomez,and S. T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964-971, November 1987.
(13) X. Lin, D. Soergel, & G. Marchionini. A self-organizing semantic map for information retrieval. Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, October 13-16, 199
1, Chicago, IL, p. 262-269.
(14) G. Marchionini and B. Shneiderman. Finding facts vs. browsing knowledge in hypertext systems. IEEE Computer, 21(1):70-79, January 1988.
(15) R. Orwig, H. Chen, and J. F. Nunamaker. A graphical, self-organizing approach to classifying electronic meeting output. Journal of the American Society for Information Science, forthcoming 1997.
(16) M. R. Quillian. Semantic Memory. Semantic Information Processing, M. Minsky, editor, The MIT Press, Cambridge, MA, 1968, p. 227-270.
(17) D. Riecken. Intelligent agents. Communication of the ACM, 37(7):18-21, July 1994.
Return to ITO Workshop Abstracts
Return to ITO Workshop Home Page