Concept Based Categorization and Search on Internet:
A Machine Learning Approach

NSF Principal Investigator:
Professor Hsinchun Chen
Management Information Systems
The University of Arizona
Tucson, AZ, USA 85721
(Head, UA/MIS AI Group; Senior Research Scientist, NCSA)

Email: hchen@bpa.arizona.edu
Phone: (520) 621-4153
Fax: (520) 621-2433
URL: http//ai.bpa.arizona.edu


Introduction

Despite the usefulness of database technologies, users of online information systems are often overwhelmed by the amount of current information, the subject and system knowledge required to access this information, and the constant influx of new informati on. (4) This problem is termed information overload (2). A second difficulty associated with information retrieval and information sharing is the classic vocabulary problem, which is a consequence of system users' diverse backgrounds and ar eas of expertise (11, 12, 3). The fluidity of concepts and vocabularies in various domains further complicates the retrieval issue (3, 9, 10). A concept may be perceived differently by different searchers and research has shown that different ind exers, well trained in an indexing scheme, often assign different index terms for the same document (1). A more intelligent and proactive search aid is needed to address information overload and the vocabulary problem in a large info rmation space, such as the Internet, that is used by searchers of varying backgrounds.

Dr. Hsinchun Chen's Artificial Intelligence Lab at the University of Arizona has been working in conjunction with the University of Illinois Digital Library Initiative to develop techniques to help users' cope with information overload and the < I>vocabulary problem when searching online databases. These techniques include concept spaces, self-organizing maps, and intelligent agents (spiders).

Research Questions

The main forms of retrieval used on the World Wide Web are keyword searches or hypertext browsing. Keyword searches often result in low precision, poor recall, and slow response time due to the limitation of the indexing and the communicati on methods (bandwidth), controlled language based interface (the vocabulary problem), and the inability of searchers themselves to fully articulate their needs. Furthermore, browsing allows users to explore only a very small portion of the large Internet information space. An extensive information space that is accessed through hypertext-like browsing can also potentially confuse and disorient its user, the embedded digression problem, and it can cause the user to spend a great deal of time witho ut learning anything specific, a difficulty know as the art museum phenomenon.

To address these problems, we have developed several inductive machine learning techniques to:

Kohnen Self-organizing Map

Demo is available at:
http://ai2.BPA.Arizona.EDU/ent/entertain1/

Quillian originally suggested that semantic networks could be used to encode and associate word meanings. Such networks could then be used to visualize or construct mental models of an information space (16). Neural network algorithms, in particular, a ppear to be a natural starting point for organizing large amounts of information in a manner consistent with human mental models. After examining several neural network algorithms in previous work, our research group concluded that a variant of the Kohon en self-organizing feature maps (SOM) appears to be a most promising algorithm for organizing large volumes of information. The algorithm can be used to create an intuitive, graphical display of the important concepts contained in textual information (13, 15). The ET-MAP was developed using 110,000 homepages from the Internet Yahoo! Entertainment sub-category. The following is an outline of the algorithm used:

  1. Initialise input nodes, output nodes, and connection weights.
  2. Present each homepage in order.
  3. Compute distances to all node.
  4. Select winning node j* and update weights to node j* and neighbors.
  5. Label regions in the map.
  6. Apply the above steps recursively for larger regions.

Concept Space Approach

Demo is available at:
http://ai.bpa.arizona.edu/cgi-bin/tng/ETSpace

In previous NSF-funded research, we developed a concept space approach to automatic thesaurus generation. Since then, this approach has been used successfully in generating domain-specific thesauri for various applications, including Russian computing (6, 7), business meetings (5), molecular biology, and engineering (8). The following is a brief overview of our algorithms in the context of the Internet categorization and search project:

  1. Identify Internet homepage collection.
  2. Perform automatic indexing and term weighting.
  3. Perform cluster analysis.
  4. Use the concept space for associative retrieval.

The Itsy Bitsy Spider

Demo is available at:
http://ai.bpa.arizona.edu/~mramsey/SPIDER/itsy.html

The basic idea behind agent research is the need to develop software systems which engage and help all types of end users (17). Our goal is to create intelligent agents that performs customized, dynamic search (information retrieval) and textual analys is in the World-Wide Web environment for any user. In order to allow our agents to be goal-oriented, cooperative, and customized, we have adopted a machine learning approach based on our previous work in textual analysis. Although des igning the system architecture for such an agent is a non-trivial undertaking, we intentionally designed the spider to be compact, dynamic, and friendly -- a ``smart itsy bitsy spider.''

A Web agent system architecture was designed first, before we proceeded to the actual implementation. The system architecture consists of five modular components, each accomplishing a unique task:

  1. Task requests and search control parameters are solicited from the user.
  2. A graphical user interface implemented in Java takes user input and generates intermediate results and a final search summary.
  3. The Jaccard's similarity function computes a link score and a keyword score for each newly explored and indexed homepage.
  4. A search engine supports the best first search and genetic algorithm, and
  5. An HTTP protocol based program fetches a remote homepage.

This modular design allows us to experiment with different subcomponents over time and eventually converge on a more compact and efficient system implementation.

Findings

Future Directions

In our ongoing effort in the Illinois Digital Library Initiative project, we are in the process of exploring other general-purpose search and classification algorithms for Web resource categorization and searching.

Publications from the Funded Research

  1. H. Chen, C. Schuffels, and R. Orwig, Internet Categorization and Search: A Machine Learning Approach. Journal of Visual Communication and Image Representation, Special Issue on Digital Libraries, 7(1):88-102, 1996.

  2. B. Schatz and H. Chen. Building Large-scale Digital Libraries. IEEE Computer, Special Issue on "Building Large-scale Digital Libraries," 29(5):22-27, May 1996.

  3. H. Chen, Y. Chung, M. Ransey, and C. Yang. An Intelligent Personal Spider (Agent) for Dynamic Internet/Intranet Searching. Decision Support Systems, forthcoming 1997.

  4. H. Chen, A. L. Houston, R. R. Sewell, and B. R. Schatz. Internet Browsing and Searching: User Evaluation of category Map and Concept Space Techniques. Journal of the American Society for Information Science, Special Issue on AI Techniques for th e Emerging Information Systems Application, forthcoming 1997.

  5. H. Chen, Y. Chung, M. Ramsey, and C. Yang. A Smart Itsy Bitsy Spider. Journal of the American Society for Information Science, Special Issue on AI Techniques for the Emerging Information Systems Application, forthcoming 1997.

  6. H. Chen and B. R. Schatz. Three Approaches to Controlling the Information Explosion on the Internet: Category Map, Concept Space, and the Internet Agent, submitted to ACM Digital Library Conference, 1997.

  7. H. Chen, Y. Chung, M. Ramsey, C. Yang, P. Ma, and J. Yen. Intelligent Agents on the Internet. Proceedings of the 30th Annual Hawaii International Conference on System Sciences (HICSS-30), Maui, Hawaii, January 7-10, 1997.

  8. H. Chen, B. R. Schatz, and C. Lin. Concept Classification and Search on Internet Using Machine Learning and Parallel Computing Techniques. World Wide Web 4 Conference, Boston, MA, December 11-14, 1995, p. 58-59, Poster Proceedings.

  9. H. Chen and B. Schatz. Semantic Retrieval for the NCSA Mosaic. Proceedings of the Second International World Wide Web Conference '94, Chicago, IL, October 17-20, 1994.

References

(1) M. J. Bates. Subject access in online catalogs: A design model. Journal of the American Society for Information Science, 37(6):357-376, November 1986.

(2) D. C. Blair and M. E. Maron. An evaluation of retrieval effectiveness for a full-text document retrieval system. Communications of the ACM, 28(3):289-299, 1985.

(3) H. Chen. Collaborative systems: solving the vocabulary problem. IEEE Computer, Special Issue on Computer-Supported Cooperative Work (CSCW), 27 (5):58-66, May 1994.

(4) H. Chen and V. Dhar. User misconceptions of online information retrieval system. International Journal of Man-machine Studies, 32(6):673-692, June 1990.

(5)H. Chen, P. Hsu, R. Orwig, L. Hoopes, and J. F. Nunamaker. Automatic concept classification of text from electronic meeting. Communication of the ACM, 37(10):56-73, October 1994.

(6) H. Chen and K. J. Lynch. Automatic construction of networks of concepts characterizing document databases. IEEE Transactions on Systems, Man and Cybernetics, 22(5):885j-902, September/October 1992.

(7) H. Chen, K. J. Lynch, K. Basu, and T. Ng. Generating, integrating, and activating thesauri for concept-based document retrieval. IEEE Expert, Special Series on Artificial Intelligence in Text-based Infomation Systems, 8(2):25-34, April 1993.

(8) H. Chen, B. R. Schatz, T. Yim, and D. Fye. Automatic thesuarus generation for an electronic community system. Journal of the American Society for Information Science, 46(3):175-193, April 1995.

(9) J. Courteau. Genome databases. Science, 24:201-207, October 11, 1991.

(10) K. A. Frenkel. The human genome project and informatics. Communications of the ACM, 34(11):41-51, November 1991.

(11) G. W. Furnas. Statistical semantics: How can a computer use what people name things to guess what things people mean when they name things. Proceedings of the Human Factors in Computer Systems Conference, March 1982, Gaithersburg, MD, 251-25 3.

(12) G. W.Furnas, T. K. Landauer, L. M. Gomez,and S. T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):964-971, November 1987.

(13) X. Lin, D. Soergel, & G. Marchionini. A self-organizing semantic map for information retrieval. Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, October 13-16, 199 1, Chicago, IL, p. 262-269.

(14) G. Marchionini and B. Shneiderman. Finding facts vs. browsing knowledge in hypertext systems. IEEE Computer, 21(1):70-79, January 1988.

(15) R. Orwig, H. Chen, and J. F. Nunamaker. A graphical, self-organizing approach to classifying electronic meeting output. Journal of the American Society for Information Science, forthcoming 1997.

(16) M. R. Quillian. Semantic Memory. Semantic Information Processing, M. Minsky, editor, The MIT Press, Cambridge, MA, 1968, p. 227-270.

(17) D. Riecken. Intelligent agents. Communication of the ACM, 37(7):18-21, July 1994.



Return to ITO Workshop Abstracts

Return to ITO Workshop Home Page