Indiana University

Grids & Cyberinfrastructure

Fault Tolerance in High Performance Computing: MPI and Checkpoint/Restart

Thursday, November 20 - 10:00 AM (US/Central)
Josh Hursey, Open MPI

Modern HPC applications must tolerate inevitable faults if they are to harness current and future HPC systems. Detecting and responding to such failures in distributed systems poses complex and intriguing research questions. Researchers at Indiana University are leading the development of transparent checkpoint/restart fault tolerance in Open MPI. Their novel architecture lets applications transparently take advantage of the fault tolerance services that Open MPI provides across the variety of interconnects it supports, including InfiniBand, Myrinet, shared memory, and Ethernet.

More at SC08

Multimedia

Stay tuned for slides, video, and more from Supercomputing 2008.