Home Page

This class will cover a range of topics that broadly come under the heading of text-based information retrieval (i.e., the technology underlying modern search engines like Google and Bing). We'll begin with the basic techniques used to index and query large collections of documents. From there we'll go on to study more advanced topics, including web crawling, graph-based retrieval algorithms such as PageRank, document categorization and clustering, text-based social network analysis, and sentiment analysis. A focus of this class will be on gaining hands-on experience with Lucene, a state-of-the-art open-source IR system.

Nothing beyond the typical undergraduate computer science background is required for this course, although familiarity with natural language processing, machine learning, probability, and linear algebra will be helpful.

© James H. Martin, 2011