
Colloquium - Miller

Scalable Middleware for Large Scale Systems
University of Wisconsin

I will discuss the problem of developing tools for large-scale parallel environments. We are especially interested in systems, both leadership-class parallel computers and clusters, that have tens of thousands or even millions of processors. The infrastructure that we have developed to address this problem is called MRNet, the Multicast/Reduction Network. MRNet's approach to scale is to structure control and data flow in a tree-based overlay network (TBON) that allows for efficient request distribution and flexible data reductions.
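To make the tree-based idea concrete, here is a minimal Python sketch of a request multicast down a tree followed by a reduction back up. It is an illustration under stated assumptions, not MRNet's actual API: the names `Node`, `build_tree`, `multicast_reduce`, and `depth` are hypothetical, the tree is built in a single process rather than across hosts, and the reduction is a plain sum.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One process in the overlay; leaves stand in for back-end processes."""
    rank: int
    children: List["Node"] = field(default_factory=list)


def build_tree(num_leaves: int, fanout: int) -> Node:
    """Group leaves under parents, `fanout` at a time, until one root remains."""
    if num_leaves < 1 or fanout < 2:
        raise ValueError("need at least one leaf and a fanout of at least 2")
    level = [Node(rank=i) for i in range(num_leaves)]
    next_rank = num_leaves
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            parents.append(Node(rank=next_rank, children=level[i:i + fanout]))
            next_rank += 1
        level = parents
    return level[0]


def multicast_reduce(
    node: Node,
    request: int,
    leaf_fn: Callable[[int, int], int],
    reduce_fn: Callable[[List[int]], int],
) -> int:
    """Send `request` down to every leaf, then combine replies on the way up.

    Each internal node only combines the replies of its own children, so no
    single node ever handles more than `fanout` inputs.
    """
    if not node.children:
        return leaf_fn(node.rank, request)
    partial = [
        multicast_reduce(child, request, leaf_fn, reduce_fn)
        for child in node.children
    ]
    return reduce_fn(partial)


def depth(node: Node) -> int:
    """Number of hops from this node down to its deepest leaf."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


if __name__ == "__main__":
    root = build_tree(num_leaves=1000, fanout=10)
    total = multicast_reduce(root, 2, lambda rank, req: rank * req, sum)
    print(total, depth(root))
```

In this sketch, 1,000 leaves with a fan-out of 10 give a tree three levels deep, so the root combines 10 partial results instead of 1,000 individual replies.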

The second part of this talk will present an overview of the MRNet design, architecture, and computational model, and then discuss several applications of MRNet. These include scalable automated performance analysis in Paradyn, a vision clustering application, and, most recently, an effort to develop our first petascale tool, STAT, a scalable stack trace analyzer currently running on millions of processors.

I will conclude with a brief description of a new fault tolerance design that leverages natural redundancies in the tree structure to provide recovery without checkpoints or message logging.

Department of Computer Science
University of Colorado Boulder
Boulder, CO 80309-0430 USA
May 5, 2012 (14:13)