The next generation of supercomputers could be crippled by hard drive failures every few minutes, the U.S. Department of Energy has warned, and so it is funding a Petascale Data Storage Institute to solve the problem.
Los Alamos National Laboratory has commissioned RoadRunner, a 32,000-CPU supercomputer from IBM that will operate at petaflop levels, meaning a sustained speed of 1,000 trillion calculations per second. Put another way, that is a quadrillion (a million billion) operations per second.
Thousands of hard disks will be needed to keep the thousands of CPUs supplied with data. Garth Gibson, an associate professor of computer science at Carnegie Mellon University who will lead the new institute, has warned that this system "likely will require up to hundreds of thousands of magnetic hard disks to handle the data required to run simulations, provide checkpoint/restart fault tolerance and store the output of these modeling experiments. With such a large number of components, it is a given that some component will be failing at all times."
Current teraflop-level supercomputers, operating at trillions of operations per second, have disk failures once or twice a day, according to Gary Grider, a co-principal investigator at the Los Alamos National Laboratory. Once supercomputers are built out to the scale of multiple petaflops, he said, the failure rate could jump to once every few minutes.
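The scaling behind that warning can be sketched with simple arithmetic. The short Python example below assumes, purely for illustration, that every disk fails independently at the same constant annual rate; the function name, the disk counts and the 3 percent rate are hypothetical and do not come from the Department of Energy or the researchers quoted here. Under those assumptions, the expected gap between failures shrinks in direct proportion to the number of disks.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def mean_seconds_between_failures(num_disks: int, annual_failure_rate: float) -> float:
    """Expected time between disk failures across a whole system.

    Simplifying assumption: disks fail independently, each at the same
    constant annual rate, so the system sees num_disks * annual_failure_rate
    failures per year on average.
    """
    if num_disks <= 0 or annual_failure_rate <= 0:
        raise ValueError("num_disks and annual_failure_rate must be positive")
    failures_per_year = num_disks * annual_failure_rate
    return SECONDS_PER_YEAR / failures_per_year


if __name__ == "__main__":
    # Hypothetical figures chosen only to show the trend, not measured data.
    assumed_rate = 0.03  # 3 percent of disks failing per year (assumption)
    for num_disks in (10_000, 100_000, 500_000):
        gap_hours = mean_seconds_between_failures(num_disks, assumed_rate) / 3600
        print(f"{num_disks:>7} disks: one failure about every {gap_hours:.1f} hours")
```

A real system also loses controllers, cables, power supplies and other parts, so counting disks alone understates how often something fails. The sketch shows only the trend: ten times as many components means failures arrive ten times as often.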
Storage systems for these machines will need to tolerate many failures, mask their effects, and continue to operate reliably. "It's beyond daunting," Grider said of the challenge facing the new institute. "Imagine failures every minute or two in your PC and you'll have an idea of how a high performance computer might be crippled." He emphasized: "For simulations of phenomena such as global weather or nuclear stockpile safety, we're talking about running for months and months and months to get meaningful results."