Department of Computer Science University of Colorado Boulder
Senior Project - AutoCat

 

Automatic Blog Categorization Engine

Senior Project: 2008-2009
Eric Baer, Daniel Delany, Rafer Hazen, Daniel Kopelove and Andrew Noonan
Boulder, CO

Lijit Networks, Inc. is a fast-growing startup company located in downtown Boulder. The company, funded by the leading venture capital firms in Colorado, provides search applications for content publishers (e.g. blogs and blog networks). It is often useful to group similar publishers into categories (also called taxonomies). Examples of such categories are "Automotive", "Beauty and Style", "Business", "Sports", and "Tech". These categories may also be hierarchical, with sub-categories covering narrower topics within the parent category. For example, "Sports=>Basketball" and "Sports=>Football" are two sub-categories of the "Sports" category.
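
To make the hierarchical notation concrete, here is a minimal Java sketch of how a path such as "Sports=>Basketball" could be parsed and compared against its parent. The `CategoryPath` class and its methods are hypothetical and are not taken from AutoCat or Lijit; only the "=>" separator comes from the examples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for paths like "Sports=>Basketball"; not part of AutoCat. */
final class CategoryPath {
    static final String SEPARATOR = "=>";
    private final List<String> segments;

    private CategoryPath(List<String> segments) {
        this.segments = segments;
    }

    /** Splits a path on "=>" and rejects empty category names. */
    static CategoryPath parse(String path) {
        List<String> segments = new ArrayList<>();
        for (String part : path.split(Pattern.quote(SEPARATOR), -1)) {
            String name = part.trim();
            if (name.isEmpty()) {
                throw new IllegalArgumentException("Empty category name in: " + path);
            }
            segments.add(name);
        }
        return new CategoryPath(segments);
    }

    /** Returns the enclosing category, or null for a top-level category. */
    CategoryPath parent() {
        if (segments.size() == 1) {
            return null;
        }
        return new CategoryPath(new ArrayList<>(segments.subList(0, segments.size() - 1)));
    }

    /** True when this path lies strictly below the other path in the hierarchy. */
    boolean isDescendantOf(CategoryPath other) {
        return segments.size() > other.segments.size()
                && segments.subList(0, other.segments.size()).equals(other.segments);
    }

    @Override
    public String toString() {
        return String.join(SEPARATOR, segments);
    }
}
```

With this sketch, `CategoryPath.parse("Sports=>Basketball").isDescendantOf(CategoryPath.parse("Sports"))` evaluates to true, while a category is never a descendant of itself.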

For a small number of publishers it is possible to manually select appropriate categories/sub-categories for each publisher; Lijit has done this for its top publishers. However, as the number of publishers grows, this becomes untenable and the need for an automated categorization mechanism arises. The purpose of this project was to create an auto-categorization "engine" for content publishers.

AutoCat is a Java-based blog categorization engine. It is a tool with a low memory footprint that accepts blog data in XML format and returns category assignments, also in XML. The engine runs on any Java-enabled platform and can categorize up to 216GB of blog data per day on the target hardware. The design of AutoCat allows it to scale to any practical data set size, and the predictive power of the engine has been measured at over 74%. AutoCat is a thread-safe application that supports multiple concurrent users.
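
For a sense of scale, the daily figure can be converted to a sustained rate. The calculation below assumes decimal gigabytes (1 GB = 10^9 bytes) and a load spread evenly over 24 hours; the project description states neither.

```latex
\frac{216 \times 10^{9}\ \text{bytes}}{86{,}400\ \text{s}}
  = 2.5 \times 10^{6}\ \text{bytes/s}
  = 2.5\ \text{MB/s}
```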

The core classification algorithm of AutoCat uses a naive Bayesian approach. This type of algorithm requires a training set to classify blogs. This training data can be saved and reloaded later to minimize the time spent training the engine. AutoCat also provides a web-based management tool that allows users to categorize blogs by RSS feed, to run batch jobs, to visually review the results and then to save category assignments. AutoCat received a "Best of Section Award" at the Spring 2009 Engineering Design Expo.
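
As an illustration of the naive Bayesian approach described above, the following Java sketch trains a small multinomial naive Bayes classifier on labeled text and then picks the most probable category for new text. It is not AutoCat's implementation: the class name `NaiveBayesSketch`, the tokenizer, the add-one (Laplace) smoothing, and the choice to ignore words never seen in training are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical multinomial naive Bayes text classifier; not AutoCat's actual code. */
final class NaiveBayesSketch {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lowercases the text and splits on anything that is not a letter or digit. */
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    /** Adds one labeled document to the training set. */
    void train(String category, String text) {
        docsPerCategory.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the category with the highest log posterior for the given text. */
    String classify(String text) {
        if (totalDocs == 0) {
            throw new IllegalStateException("No training data");
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            // Start from the log prior: the share of training documents in this category.
            double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String token : tokenize(text)) {
                if (!vocabulary.contains(token)) {
                    continue; // words never seen in training carry no evidence here
                }
                // Add-one smoothing keeps unseen (category, word) pairs from zeroing the product.
                int count = counts.getOrDefault(token, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

A caller would invoke `train` once per labeled blog and then `classify` on unlabeled text. Summing log probabilities instead of multiplying raw probabilities avoids floating-point underflow on long documents. Saving and reloading the training data, as AutoCat does, would amount to persisting the count maps, which this sketch leaves out.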

[Figure: Project Concept]
[Figure: Dashboard]
 
©2012 Regents of the University of Colorado
May 5, 2012 (14:07)
 