
Colloquium - Dongarra

Self Adapting Numerical Software (SANS) Effort and Fault Tolerance in Linear Algebra Algorithms
Innovative Computing Laboratory, University of Tennessee and Oak Ridge National Laboratory

A new generation of software libraries and algorithms is needed for the effective and reliable use of wide-area, dynamic, distributed, and parallel environments. Some of the software and algorithm challenges have already been encountered, such as managing communication and memory hierarchies through a combination of compile-time and run-time techniques. However, the increased scale of computation, depth of memory hierarchies, range of latencies, and variability of the run-time environment will make these problems much harder.


Along these lines, we will discuss work on developing parameterizable and annotatable linear algebra software libraries that permit performance tuning across a broad range of architectures, including grid computing environments. Self Adapting Numerical Software (SANS) is a software effort to automatically generate highly optimized numerical kernels for high-performance computers.
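As a rough illustration of the self-adapting idea, the sketch below picks a kernel parameter by timing a few candidates on the machine at hand. It is not the SANS implementation: the toy kernel `blocked_sum`, the helper names `measure` and `autotune`, and the candidate block sizes are all hypothetical, chosen only to show the empirical search pattern.

```python
import time

def blocked_sum(data, block):
    """Toy kernel (hypothetical): sum a list in chunks of `block` elements."""
    total = 0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total

def measure(kernel, data, block, repeats=3):
    """Best-of-N wall-clock time for one parameter value."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        kernel(data, block)
        best = min(best, time.perf_counter() - t0)
    return best

def autotune(kernel, data, candidates, cost=measure):
    """Return the candidate with the lowest measured cost on this machine."""
    return min(candidates, key=lambda block: cost(kernel, data, block))

if __name__ == "__main__":
    sample = list(range(100_000))
    print("selected block size:", autotune(blocked_sum, sample, [8, 64, 512, 4096]))
```

The selected value depends on the hardware and current load, which is the point: the search is repeated on each target system rather than fixed in advance.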

In addition, we will describe an implementation of MPI that extends the message passing model to allow recovery when a process fails. Our implementation allows a user to catch the fault and then recover from it. We will also touch on issues related to using diskless checkpointing for effective recovery of an application after a process fault.
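To make the diskless checkpointing idea concrete, here is a minimal sketch that assumes an XOR parity encoding, one common scheme; the abstract does not say which encoding the talk covers. Processes are simulated as entries in a list rather than real MPI ranks, and the function names are hypothetical.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(checkpoints):
    """Parity block that a spare process would hold in memory (no disk)."""
    return reduce(xor_bytes, checkpoints)

def recover(checkpoints, parity, failed):
    """Rebuild the lost checkpoint from the survivors and the parity block."""
    survivors = [c for i, c in enumerate(checkpoints) if i != failed]
    return reduce(xor_bytes, survivors, parity)

if __name__ == "__main__":
    # In-memory state of three simulated processes (equal length by assumption).
    states = [b"rank0-state!", b"rank1-state!", b"rank2-state!"]
    parity = make_parity(states)
    after_fault = [states[0], None, states[2]]  # process 1 has failed
    assert recover(after_fault, parity, failed=1) == states[1]
```

With a single parity block, this scheme tolerates the loss of one process at a time; surviving more simultaneous failures would need a stronger encoding.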

This talk is sponsored by the National Center for Atmospheric Research Scientific Computing Division and will be held in the Main Seminar Room at the Mesa Lab.

Department of Computer Science
University of Colorado Boulder
Boulder, CO 80309-0430 USA
May 5, 2012 (14:13)