Closing the User-Model Loop for Understanding Topics in Large Document Collections

Project funded by the National Science Foundation (IIS-1409287, UMD; IIS-1409739, BYU)
PI: Jordan Boyd-Graber (Maryland), co-PI: Leah Findlater (UW), PI: Kevin Seppi (BYU)

Overview

Individuals and organizations must cope with massive amounts of unstructured text information: individuals sifting through a lifetime of e-mail and documents, journalists understanding the activities of government organizations, companies reacting to what people say about them online, or scholars making sense of digitized documents from the ancient world. This project’s research goal is to bring together two previously disconnected components of how users understand this deluge of data: algorithms to sift through the data and interfaces to communicate the results of the algorithms. This project allows users to provide feedback to algorithms that were typically employed on a “take it or leave it” basis: if the algorithm makes a mistake or misunderstands the data, users can correct the problem using an intuitive user interface and improve the underlying analysis. This project jointly improves both the algorithms and the interfaces, leading to deeper understanding of, faster introduction to, and greater trust in the algorithms we rely on to understand massive textual datasets. Furthermore, source code and functional demos will be shared publicly, and tutorials will be shared online and in person to aid the adoption of the methodologies.

This project enables computer algorithms and humans each to apply their strengths and collaborate in managing and making sense of large volumes of textual data. It “closes the loop” in novel ways to connect users with a class of big data analysis algorithms called topic models. This connection is made through interfaces that empower the user to change the underlying models by refining the number and granularity of topics, adding or removing words considered by the model, and adding constraints on what words appear together in topics. The underlying model also enables new visualizations in the form of a Metadata Map that uses active learning to focus users’ limited attention on the most important documents in a collection. Users annotate documents with useful metadata and thereby further improve the quality of the discovered topics. The project includes evaluations of these methods through careful user studies and in-depth case studies to demonstrate that topics are more coherent, users can more quickly provide annotations, users trust the underlying algorithms more, and users can more effectively build an understanding of their textual data.
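As a rough illustration of how active learning can direct a user's attention, the sketch below ranks documents by the uncertainty of a model's predicted label distribution (entropy-based uncertainty sampling). It is not the project's Metadata Map implementation; the function names, document ids, and probabilities are hypothetical and chosen only for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_by_uncertainty(doc_probs):
    """Return document ids ordered from most to least uncertain prediction."""
    return sorted(doc_probs, key=lambda d: entropy(doc_probs[d]), reverse=True)


# Hypothetical predicted label distributions for three documents.
doc_probs = {
    "doc_a": [0.98, 0.01, 0.01],  # model is confident
    "doc_b": [0.40, 0.35, 0.25],  # model is very unsure
    "doc_c": [0.70, 0.20, 0.10],  # somewhere in between
}

# Documents the user would be asked to annotate first come first.
ranking = rank_by_uncertainty(doc_probs)
print(ranking)
```

In this toy setup, the document whose prediction is closest to uniform is shown to the user first, on the assumption that a label for it is the most informative.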


Current Project Team

	Jordan Boyd-Graber, Associate Professor, Computer Science/Language Science/iSchool/UMIACS (Maryland)
	Leah Findlater, Assistant Professor, Human Centered Design and Engineering (UW)
	Kevin Seppi, Professor, Computer Science (BYU)
	Piper Armstrong, MS Student, Computer Science (BYU)
	Stephen Cowley, MS Student, Computer Science (BYU)
	Wilson Fearn, MS Student, Computer Science (BYU)
	Courtni Byun, MS Student, Computer Science (BYU)
	Fenfei Guo, PhD Student, Computer Science (UMD)
	Varun Kumar, Applied Scientist at Amazon
	Pedro Rodriguez, PhD Student, Computer Science (Colorado)
	Alison Smith-Renner, PhD Student, Computer Science (UMD)
	Thang Nguyen, PhD Student, Computer Science (UMD)


Past Members

	Eric Ringger, Associate Professor, Computer Science (BYU); now Senior Director of Machine Learning for Personalization at Zillow
	Paul Felt, PhD Student, Computer Science (BYU); now Software Engineer at IBM
	Ethan Garofolo, MS Student, Computer Science (BYU)
	Jeff Lund, PhD Student, Computer Science (BYU); now at Google
	Connor Cook, Undergraduate, Computer Science (BYU); now an MS Student at the University of Colorado Boulder
	Tak Yeon Lee, PhD Student, Computer Science (UMD); now Research Scientist at Adobe
	You Lu, MS Student, Computer Science (Colorado); now PhD Student at Virginia Tech
	Viet-An Nguyen, PhD Student, Computer Science (UMD); now Research Scientist at Facebook
	Nozomu Okuda, MS Student, Computer Science (BYU)
	Forough Poursabzi, PhD Student, Computer Science (Colorado); now Postdoc at MSR NYC


Publications (Selected)

Jeffrey Lund, Piper Armstrong, Wilson Fearn, Stephen Cowley, Emily Hales, and Kevin Seppi. Cross-referencing Using Fine-grained Topic Modeling. North American Association for Computational Linguistics. Minneapolis, MN. 2019.
Jeffrey Lund, Stephen Cowley, Wilson Fearn, Emily Hales, and Kevin Seppi. Labeled Anchors and a Scalable, Transparent, and Interactive Classifier. Empirical Methods in Natural Language Processing. Brussels, Belgium. 2018.
Paul Felt, Eric Ringger, and Kevin Seppi. Semantic Annotation Aggregation with Conditional Crowdsourcing Models and Word Embeddings. International Conference on Computational Linguistics. Osaka, Japan. 2016.
Jeff Lund, Chace Ashcraft, Andrew McNabb, and Kevin Seppi. Mrs: High Performance MapReduce for Iterative and Asynchronous Algorithms in Python. Workshop on Python for High-Performance and Scientific Computing. Salt Lake City, Utah. 2016.
Paul Felt, Kevin Black, Eric Ringger, Kevin Seppi, and Robbie Haertel. Early Gains Matter: A Case for Preferring Generative over Discriminative Crowdsourcing Models. North American Association for Computational Linguistics. Denver, Colorado. 2015.
Paul Felt, Kevin Black, Eric Ringger, Kevin Seppi, and Robbie Haertel. On Multinomial vs. Log-linear Crowdsourcing Models with Mean-field Variational Inference. NIPS 2014 Workshop on Crowdsourcing and Machine Learning. Montreal, Canada. 2014.

Software

Datasets

Topic Labels and Quality Ratings

Media

Closing the Loop

Acknowledgments

This work is supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the researchers and do not necessarily reflect the views of the National Science Foundation.