The computer, once a tool for scientists, is becoming a collaborator
- 28 October, 2008 08:09
Computer science -- it's not just about hardware and software anymore.
It's about oceans, stars, cancer cells, proteins and networks of friends. Ken Birman, a computer science professor at Cornell University, says his discipline is on the way to becoming "the universal science," a framework underpinning all others, including the social sciences.
An extravagant claim from someone with a vested interest? The essence of Birman's assertion is that computers have gone from being a tool serving science -- basically an improvement on the slide rule and abacus -- to being part of the science. Consider these recent developments:
"Systems biologists" at Harvard Medical School have developed a "computational language" called "Little b" for modeling biological processes. Going beyond the familiar logical, arithmetic and control constructs of most languages, it reasons about biological data, learns from it, and incorporates past learning into new models and predictors of cells' behaviors. Its creators call it a "scientific collaborator."
Microsoft Research (MSR) is supporting a US-Canadian consortium building an enormous underwater observatory on the Juan de Fuca Plate off the coast of Washington state. Project Neptune will connect thousands of chemical, geological and biological sensors on more than 1,000 miles of fiber-optic cables and will stream data continuously to scientists for as long as a decade. Researchers will be able to test their theories by looking at the data, but software tools that MSR is developing will search for patterns and events not anticipated by scientists and present their findings to the scientists.
Last year, researchers from Harvard Medical School and the University of California used statistical analysis to mine heart-disease data from 12,000 people in the Framingham Heart Study and learned that obesity appears to spread via social ties. They were able to construct social networks from previously unused information about acquaintances that had been gathered solely to locate subjects during the 32-year study.
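The underlying idea can be sketched in a few lines. This toy illustration uses invented names and trait labels, not the Framingham study's actual records or statistical methods: build a small friendship graph and ask whether a trait is concordant along ties more often than chance would predict.

```python
# Toy sketch: does a trait cluster along social ties?
# All data here is hypothetical, invented for illustration.
friendships = [("ann", "bob"), ("bob", "cara"), ("cara", "dan"),
               ("dan", "eve"), ("ann", "cara"), ("eve", "fay")]
people = {"ann", "bob", "cara", "dan", "eve", "fay"}
obese = {"ann", "bob", "cara"}  # hypothetical trait labels

# Count ties where both endpoints share the same status.
ties = len(friendships)
concordant = sum(1 for a, b in friendships if (a in obese) == (b in obese))

# Concordance expected if the trait were assigned independently of ties.
p = len(obese) / len(people)
expected = p * p + (1 - p) * (1 - p)

print(f"{concordant}/{ties} ties concordant; {expected:.2f} expected by chance")
```

Here 5 of 6 ties are concordant against an expected 0.50, hinting at clustering; the real study needed longitudinal data and careful controls to argue that the effect was contagion rather than people simply befriending similar people.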
Computer scientists and plant biologists at Cornell developed algorithms to build and analyze 3-D maps of tomato proteins. They discovered the "plumping" factor that is responsible for the evolution of the tomato from a small berry to the big fruit we eat today. Researchers then devised an algorithm for matching 3-D shapes and used it to determine that the tomato-plumping gene fragment closely resembles an oncogene associated with human cancers. That work would have taken decades without computer science, researchers say.
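One common family of 3-D shape-matching techniques, sketched below with made-up point clouds (this is an illustration of the general idea, not necessarily the Cornell team's method), summarizes each shape as a histogram of pairwise distances between sample points and then compares histograms. Similar shapes yield similar distance distributions regardless of orientation.

```python
# Toy sketch of distance-histogram shape matching.
# Point clouds are invented; not the actual tomato-protein data.
import math
import random
from itertools import combinations

random.seed(1)

def shape_signature(points, bins=10, max_d=4.0):
    """Normalized histogram of all pairwise point distances."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    hist = [0] * bins
    for d in dists:
        hist[min(int(d / max_d * bins), bins - 1)] += 1
    return [h / len(dists) for h in hist]

def signature_gap(a, b):
    """L1 distance between two shape signatures (0 = identical)."""
    return sum(abs(x - y) for x, y in zip(shape_signature(a), shape_signature(b)))

# A unit cube, a slightly jittered copy, and a stretched copy.
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
jittered = [(x + random.uniform(-0.05, 0.05),
             y + random.uniform(-0.05, 0.05),
             z + random.uniform(-0.05, 0.05)) for x, y, z in cube]
stretched = [(3 * x, y, z) for x, y, z in cube]

print(signature_gap(cube, jittered), signature_gap(cube, stretched))
```

The jittered cube scores far closer to the original than the stretched one does, which is the property such a matcher exploits when hunting for look-alike protein structures.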
While these applications might seem to have little in common, they represent a class of scientific problems involving experimental data that is voluminous and complex. In fact, the raw information is so overwhelming that scientists are often at a loss to know where to begin to make sense of it. Computer science is pointing the way.
"A trend that is becoming increasingly clear is that computer science is not just a discipline that provides computational tools to scientists," says Jon Kleinberg, a Cornell professor who won a MacArthur "genius" grant in 2005 for his work on social networks. "It actually becomes part of the way in which scientists build theories and think about their own problems."
Kleinberg, who discovered the underlying rules that govern the widely publicized "six degrees of separation" phenomenon, says that computer algorithms will be to science in the 21st century what mathematics was in the 20th century. Tackling a problem algorithmically, he says, allows scientists to change the question from "what is" to "how to."
For example, the "small world" principle -- in which any two people are connected by short chains of acquaintances -- was demonstrably true, but no one understood just how these chains worked or why they were so short. "Looking at it as a computer scientist, I saw there was really an algorithm going on, a subtle algorithm based on distributed routing," Kleinberg says. His predictions about how friendships are formed at different distances, based on those algorithms, have been borne out by experiments.
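The routing idea can be sketched in miniature. In this toy one-dimensional version (Kleinberg's published analysis concerns a two-dimensional grid with inverse-square link probabilities; the parameters and setup below are illustrative assumptions), each node on a ring knows its two immediate neighbors plus one long-range contact chosen with probability proportional to 1/distance, and a message is forwarded greedily to whichever known contact is closest to the target.

```python
# A minimal sketch of Kleinberg-style greedy routing on a ring.
import random

random.seed(42)
n = 1000  # number of nodes on the ring

def ring_dist(a, b):
    d = abs(a - b)
    return min(d, n - d)

def long_range_contact(u):
    # One extra contact per node, biased toward nearby nodes (~1/d).
    others = [v for v in range(n) if v != u]
    weights = [1 / ring_dist(u, v) for v in others]
    return random.choices(others, weights=weights)[0]

contacts = {u: [(u - 1) % n, (u + 1) % n, long_range_contact(u)]
            for u in range(n)}

def greedy_route(src, dst):
    # Forward to the known contact closest to the destination.
    hops, cur = 0, src
    while cur != dst:
        cur = min(contacts[cur], key=lambda v: ring_dist(v, dst))
        hops += 1
    return hops

hops = greedy_route(0, n // 2)
print(f"delivered in {hops} hops across a ring of {n} nodes")
```

Because every node always has a neighbor one step closer to the target, delivery is guaranteed; the long-range contacts are what let the greedy rule find short chains far below the worst case, echoing the short acquaintance chains of the small-world experiments.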
In another example, biologists struggled with something called Levinthal's Paradox: From an astronomical number of possibilities, proteins fold in the optimum way far faster than can be explained by trial and error. Biologists and computer scientists working together developed algorithms that in essence showed how the proteins find shortcuts to optimum folds without trying every possibility. "That turned out to be a very nice 'how to' problem," Kleinberg says.
Tony Hey, Microsoft's vice president for external research, speaks of "e-science," a set of technologies for supporting scientific projects where there is a huge amount of data (often distributed), the data and multiple collaborators are networked, and multiple disciplines, including computer science, converge. These projects tend to be enormously complex, and sorting them out is what the tools, algorithms and theories of computer science can help do, he says.
Hey says a "fourth paradigm" in science is emerging. For thousands of years, we have had experimental science, he says. Since Newton, we have had theoretical science, by which experimental results can be predicted by equations. Then, in the second half of the 20th century, we added simulation science, enabled by fancier equations and supercomputers. Now, Hey says, we are entering the era of "data-centric science."
The essence of data-centric science is to aggregate data, often in large quantities and from multiple sources, and then mine it for insights that would never emerge from manual inspection or from analysis of any one data source. He cites as an example a project called Galaxy Zoo, in which the public was invited to help classify millions of galaxies as either spiral or elliptical based on a million detailed images posted online by the Sloan Digital Sky Survey.
The work behind Galaxy Zoo is simple, boring even, and the goal was just to establish a large-scale inventory that would help scientists derive theories about how galaxies evolve. But a year ago, a strange and wondrous thing happened. A high school teacher and Galaxy Zoo volunteer in the Netherlands discovered what would become known as Hanny's Voorwerp, an enigmatic object of a type never seen before. No one is sure just what the distant green cloud is -- perhaps an extremely rare type of quasar -- and it is now getting intense scrutiny from astronomers.
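The aggregation step behind such a project can be sketched simply. This toy version (the votes and tie-breaking rule are invented for illustration, not Galaxy Zoo's actual pipeline) labels each galaxy by majority vote among volunteers and flags ties for expert review:

```python
# Toy sketch of crowd-classification aggregation, Galaxy Zoo style.
# Vote data is invented for illustration.
from collections import Counter

votes = {
    "galaxy_001": ["spiral", "spiral", "elliptical", "spiral"],
    "galaxy_002": ["elliptical", "elliptical", "spiral"],
    "galaxy_003": ["spiral", "elliptical"],  # tie: needs expert review
}

def classify(ballots):
    tally = Counter(ballots).most_common()
    if len(tally) > 1 and tally[0][1] == tally[1][1]:
        return "undecided"
    return tally[0][0]

labels = {galaxy: classify(ballots) for galaxy, ballots in votes.items()}
print(labels)
```

The scientific leverage comes not from any single vote but from the scale: millions of cheap, redundant judgments add up to a reliable catalog, with the disagreements themselves pointing to the oddball objects worth a closer look.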
Mountains of Data
Roger Barga is a researcher at MSR who is developing tools for e-science, which he calls "in silico science -- science done inside the computer." He says two technological developments are driving e-science. "The first is that our ability to capture data -- through bigger machines, bigger colliders, more sensors and so on -- is outpacing our ability to analyze it by conventional means."
The second is the emergence of new tools for pattern recognition and machine learning -- algorithms that improve over time as they deal with more and more data, without human programming -- and other new ways to organize, access and mine vast amounts of data. For the Neptune ocean observatory, MSR is building a "scientific workflow workbench" on top of Microsoft Windows Workflow, to save, systematize and catalog all the data. It will help scientists visualize oceanographic data in real time and compose and conduct experiments.
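The flavor of automated event-spotting in a sensor stream can be sketched with a simple running statistic. This is a minimal illustration, not the actual Neptune or MSR tooling: maintain a running mean and variance (Welford's online method) and flag readings that deviate sharply from the recent norm.

```python
# Minimal sketch of flagging unanticipated events in a sensor stream.
# The readings below are invented; threshold and warmup are assumptions.
import math

def detect_anomalies(stream, threshold=3.0, warmup=10):
    n, mean, m2 = 0, 0.0, 0.0
    flagged = []
    for i, x in enumerate(stream):
        if n >= warmup:  # only judge once a baseline exists
            std = math.sqrt(m2 / (n - 1))
            if std > 0 and abs(x - mean) / std > threshold:
                flagged.append((i, x))
        # Welford's online update of mean and sum of squared deviations.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return flagged

readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 9.9, 10.0,
            10.2, 35.0, 10.1]  # one injected spike
anomalies = detect_anomalies(readings)
print(anomalies)
```

Real observatory tooling layers far more sophistication on top (learned models, multi-sensor correlation), but the principle is the same: the software, not the scientist, watches the firehose and surfaces what does not fit.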
The workbench work recognizes that it isn't enough to just analyze data. When data is distributed, complex and voluminous, just getting organized and keeping track of progress is a daunting job for the scientist. The days of microscope, pencil and notebook research are long gone.
Barga says e-science will profoundly affect the practice of science. "Scientists will have to ask themselves if they are theoreticians or bench scientists or one of these new computational scientists in their area. You'll see the branding of a new kind of scientist."
The availability of petabytes of data from the Internet will transform the practice of sciences involving human behavior, Kleinberg says. "For millennia, social interaction has been transient, ephemeral and essentially invisible to the standard techniques of scientific measurement," he says. "It's hard to go around measuring people's friendships and conversations, or why they make decisions. But now we have these digital trails that were never available before. Google is not just looking for simple correlations; all that data is being passed through very sophisticated probability models."
He says the vast data stores and analytical techniques now available mean that scientists no longer have to formulate detailed theories and models and then test them on experimental data. Sketchy ideas can be tried against the data, with the data and tools fleshing out the model, in effect collaborating with the researcher to develop a theory. "The mass of data lets you fill in the details whose broad outline you have created," he says. "Then you run massive amounts of data through it and discover that in the specifics, certain things matter much more than we thought and certain things much less."
Jeremy Gunawardena, director of the Virtual Cell Program at Harvard Medical School, says an emerging model of the cell likens it to a computer -- with inputs and outputs and logical decision-making processes.
"A number of biologists with significant stature in the field really feel this is the new way forward for biology," he says. "But we are still in the very early days."