IBM is to sponsor IT research at CERN, the European Organization for Nuclear Research, testing its future storage networking technology, Storage Tank, at the CERN openlab for DataGrid applications, the company said Wednesday.
CERN conducts experiments in nuclear physics. In 2007, it expects to begin operating the Large Hadron Collider (LHC), a particle accelerator that will bring protons and ions into head-on collisions at higher energies than ever achieved before. The experiment, which aims to recreate the conditions prevailing in the early universe, just after the "Big Bang," will generate around 10 petabytes (10 million gigabytes) of data each year, according to CERN.
More interesting for IT managers, though, is how CERN plans to make that volume of data available to the scientific community for analysis. It will do that by building a distributed data storage and grid computing network, accessible to researchers around the world.
"The drive is to get a working grid up that can deal with the petabytes of data coming out of the LHC by 2007," said François Grey, CERN openlab development officer.
"We are investigating techniques that are not yet commercial but will be by the time LHC is up and running," he said. It's also an opportunity for CERN's industrial partners to test their technology in real-world applications, he added.
The first two industrial sponsors, Hewlett-Packard Co. and Enterasys Networks Inc., joined the effort last September. HP contributed a 32-node cluster of computers built around Intel Corp.'s Itanium 2 processors. Enterasys donated a 10G bps Ethernet network to connect them, and agreed to provide engineering assistance and product and technology forums for a total investment it valued at US$1.5 million.
For its part, IBM will supply 20T bytes (20,000G bytes) of disk storage, a cluster of six eServer xSeries systems running Linux, and on-site engineering support to a total value of US$2.5 million, it announced Wednesday. The equipment will be delivered by the end of the year.
That 20T bytes of storage is a long way from the volume that CERN ultimately envisages, but the goal is to bring in more storage progressively, so as to conduct tests with around a petabyte of storage by 2005, Grey said. The data for those tests will come from simulations of hadron collisions based on current theories. Comparing these with the petabytes of data gathered from experimental observations will enable scientists to test their models.
With the collider generating 100M bytes of data per second in operation, the data management task is huge.
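To put that rate in perspective, the short Python calculation below estimates how long a sustained 100M-bytes-per-second stream would take to fill IBM's initial 20T bytes of disk and the roughly one petabyte planned for the 2005 tests. It is a back-of-the-envelope sketch only: it assumes decimal units (as in the article's "20T bytes (20,000G bytes)") and a constant, uninterrupted data rate, neither of which CERN or IBM has stated. The function and constant names are illustrative.

```python
# Rough fill-time estimate. Assumptions (not from CERN or IBM):
# decimal units and a constant, uninterrupted data rate.
RATE_BYTES_PER_SECOND = 100 * 10**6   # 100M bytes per second, as quoted
SECONDS_PER_DAY = 24 * 60 * 60

INITIAL_STORAGE_BYTES = 20 * 10**12   # IBM's initial 20T bytes
TEST_TARGET_BYTES = 10**15            # about a petabyte, planned for 2005


def days_to_fill(capacity_bytes: int, rate: int = RATE_BYTES_PER_SECOND) -> float:
    """Return the days needed to fill capacity_bytes at a steady rate."""
    return capacity_bytes / rate / SECONDS_PER_DAY


print(f"20T bytes: {days_to_fill(INITIAL_STORAGE_BYTES):.1f} days")  # 2.3 days
print(f"1P byte:   {days_to_fill(TEST_TARGET_BYTES):.1f} days")      # 115.7 days
```

Under those assumptions, the initial donation would hold a little over two days of collider output.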
"It's really out of the scope of traditional network-attached storage. When you have these quantities of data, managing and organizing them is a problem," said Brian Carpenter, distinguished engineer at IBM Systems Group. "That's where Storage Tank comes in."
Storage Tank uses metadata servers to keep track of where data is located. Network clients ask the servers where to find the data they want, then download it straight from the network storage devices where it is located -- rather like the way the Internet's DNS (Domain Name System) points clients towards hosts, but doesn't intervene in the transfer of data from them, Carpenter said.
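The lookup-then-fetch pattern Carpenter describes can be sketched in a few lines of Python. This is an illustrative model only, not IBM's implementation: the class names StorageDevice, MetadataServer and Client, their methods, and the file path are all hypothetical, and a real client would reach the servers and storage devices over the network rather than through in-process objects.

```python
# Illustrative model of metadata-directed storage access.
# All names here are hypothetical; this is not the Storage Tank API.
from dataclasses import dataclass, field


@dataclass
class StorageDevice:
    """A network storage device that holds the actual file contents."""
    name: str
    blobs: dict = field(default_factory=dict)

    def write(self, path: str, data: bytes) -> None:
        self.blobs[path] = data

    def read(self, path: str) -> bytes:
        return self.blobs[path]


@dataclass
class MetadataServer:
    """Records where each file lives; it never handles file contents."""
    locations: dict = field(default_factory=dict)
    lookups: int = 0

    def register(self, path: str, device: StorageDevice) -> None:
        self.locations[path] = device.name

    def locate(self, path: str) -> str:
        self.lookups += 1
        return self.locations[path]


class Client:
    """Asks the metadata server where a file is, then reads it directly."""

    def __init__(self, metadata: MetadataServer, devices: list):
        self.metadata = metadata
        self.devices = {device.name: device for device in devices}

    def fetch(self, path: str) -> bytes:
        device_name = self.metadata.locate(path)      # step 1: ask where
        return self.devices[device_name].read(path)   # step 2: read direct


disk_a, disk_b = StorageDevice("disk-a"), StorageDevice("disk-b")
metadata = MetadataServer()
disk_b.write("/lhc/run1/events.dat", b"simulated collision data")
metadata.register("/lhc/run1/events.dat", disk_b)

client = Client(metadata, [disk_a, disk_b])
print(client.fetch("/lhc/run1/events.dat"))
print("metadata lookups:", metadata.lookups)
```

As in the DNS comparison, the metadata server in this sketch answers only the "where" question; the bytes themselves travel straight from the storage device to the client.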
IBM plans to use the project as a testbed for this storage virtualization and file management technology, which it says will play a pivotal role in its work with CERN.
This implementation of Storage Tank will use the iSCSI SAN (storage area network) protocol, running over 10G bps Ethernet, but "the way Storage Tank is designed, it could be over any SAN in the back end," Carpenter said. The system runs principally on Linux, but the idea is to make the software more widely available than that, particularly the client software needed to integrate with the local file system, he said.
The Storage Tank client software will work with the Windows, AIX, Solaris and Linux operating systems, according to the Web site for IBM's research center in Almaden, California, where Storage Tank is being developed.
While smashing subatomic particles together may not seem like a great business proposition, there are other applications of considerable economic importance that involve the scientific study of similarly large data sets, such as the analysis of seismological data for oil exploration, Carpenter said.
"If we can scale up to this 10P-byte level that they have as their goal, it will be a good test for Storage Tank," he said.
Grey sees a benefit in the interoperability stress test that comes from making the system work for many differently minded users.
"The idea is 8,000 scientists around the world should be able to access the data from their own labs, using all kinds of computer and system technologies," he said. "It's the most anarchic test case you can imagine. If things work in this community, they will work elsewhere."