Automatic Blog Categorization Engine
Senior Project: 2008-2009
Eric Baer, Daniel Delany, Rafer Hazen, Daniel Kopelove and Andrew Noonan
Lijit Networks, Inc. is a fast-growing startup company located in downtown
Boulder. The company, funded by the leading venture capital firms in Colorado,
provides search applications for content publishers (e.g. blogs and blog
networks). For a variety of reasons, it is desirable to group similar
publishers into categories (also called taxonomies). Examples of such categories are
"Automotive", "Beauty and Style", "Business", "Sports", "Tech", etc.
Furthermore, these categories may be hierarchical, with sub-categories that
cover narrower topics within a parent category.
For example, "Sports=>Basketball" and "Sports=>Football" are two sub-categories
of the "Sports" category.
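One way to picture the hierarchical categories described above is as a path of levels joined by "=>". The following is a minimal sketch only; the class name CategoryPath and its methods are hypothetical and are not taken from the AutoCat code base.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative (hypothetical) representation of a category such as "Sports=>Basketball". */
class CategoryPath {
    private final List<String> levels;

    /** Parses a path written with the "=>" separator used in the examples above. */
    CategoryPath(String path) {
        List<String> parts = new ArrayList<>();
        for (String part : path.split("=>")) {
            parts.add(part.trim());
        }
        this.levels = parts;
    }

    private CategoryPath(List<String> levels) {
        this.levels = new ArrayList<>(levels);
    }

    /** Returns the parent category, or null for a top-level category. */
    CategoryPath parent() {
        if (levels.size() <= 1) {
            return null;
        }
        return new CategoryPath(levels.subList(0, levels.size() - 1));
    }

    /** True if this category is a (direct or indirect) sub-category of other. */
    boolean isUnder(CategoryPath other) {
        return levels.size() > other.levels.size()
                && levels.subList(0, other.levels.size()).equals(other.levels);
    }

    @Override
    public String toString() {
        return String.join("=>", levels);
    }
}
```

With this sketch, the parent of "Sports=>Basketball" is "Sports", and both "Sports=>Basketball" and "Sports=>Football" report that they fall under "Sports".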
For a small number of publishers it is possible to manually select appropriate
categories/sub-categories for each publisher; Lijit has done this for its top
publishers. However, as the number of publishers grows, this becomes untenable
and the need for an automated categorization mechanism arises. The purpose of
this project was to create an auto-categorization "engine" for content
publishers.
AutoCat is a Java-based blog categorization engine. It is a low memory-footprint
tool that accepts blog data in XML format and returns category assignments, also
in XML. The engine runs on any Java-enabled platform and can
categorize up to 216 GB of blog data per day on the target hardware. The design
of AutoCat allows it to scale to any practical data set size, and the predictive
power of the engine has been measured at over 74%. AutoCat is thread-safe, so
multiple users can work with the system concurrently.
The core classification algorithm of AutoCat uses a naive Bayesian approach.
This type of algorithm requires a training set to classify blogs. This training
data can be saved and reloaded later to minimize the time spent training the
engine. AutoCat also provides a web-based management tool that allows users to
categorize blogs by RSS feed, to run batch jobs, to visually review the results
and then to save category assignments. AutoCat received a "Best of Section
Award" at the Spring 2009 Engineering Design Expo.
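To illustrate the naive Bayesian approach mentioned above, here is a minimal sketch of a multinomial naive Bayes text classifier: it is trained on labeled text, then assigns new text to the category with the highest log-probability. This is not AutoCat's implementation. The class name NaiveBayesSketch, the simple lowercase tokenizer, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative multinomial naive Bayes classifier (hypothetical, not AutoCat's code). */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Assumed tokenizer: lowercase, split on anything that is not a letter or digit. */
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    /** Adds one labeled document to the training set. */
    void train(String category, String text) {
        docsPerCategory.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Returns the most probable category, or null if nothing has been trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            // Log prior: fraction of training documents in this category.
            double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) {
                    continue;
                }
                // Log likelihood with add-one smoothing so unseen words do not zero out a category.
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

Because the trained state in this sketch is just a handful of count maps, it is the kind of data that could be serialized and reloaded later to avoid retraining, in the spirit of the save-and-reload feature described above.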

Project Concept
Dashboard