As part of the National Geographic Society's global Genographic Project, IBM distinguished engineer Peter Rodriguez submitted his own cheek swab for genetic analysis to trace his family's roots. The results surprised him.
"I'm Hispanic, but the reality is that when I got back my DNA results, it showed I am associated with people basically from the Balkans, the Vikings and others from Norway, and maybe the Celts," he said yesterday. "It turns out that some of them migrated 20,000 years ago to Spain."
IBM is announcing Wednesday that it has begun to deploy custom data-gathering software developed with the National Geographic Society as part of their partnership on the massive, five-year Genographic Project. Under the initiative, hundreds of thousands of human DNA samples will be gathered to map how the Earth was populated and how tribes and groups may have migrated through the ages.
The project has just begun and will be daunting in its complexity, said Ajay Royyuru, senior manager of IBM's Computational Biology Center.
For example, thousands of researchers will gather and store blood samples and personal surveys from more than 100,000 indigenous people, Royyuru said. Many will record information gathered in jungles and deserts on ruggedized laptops equipped with fingerprint readers for security.
When the project was first revealed in April, project director Spencer Wells, an explorer in residence for the National Geographic Society, dubbed it the "moonshot of anthropology," designed to fill in gaps in our understanding of human history.
Royyuru said the project will also aid in understanding the origins of languages, which are taught independently of genetic makeup.
The data gathering is so massive that it poses an interesting case study for IT managers, Rodriguez said. Ten universities around the world will work together to gather and analyze the data, but each has been using its own custom spreadsheets, which had to be unified, he explained.
"We tend to think scientists are very advanced, but they are not necessarily advanced in the different ways they collect data," he said. "We see ourselves beating them into submission to play with one another."
The custom software unifying all the systems relies on standards such as BioSystem Markup Language, an XML-based format. The laptops will run Linux and will store and forward the data to laboratories throughout the world, which will then transmit it to a 2TB data center at the National Geographic Society's headquarters in Washington.
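As a rough illustration of what unifying each university's custom spreadsheets can involve, the Python sketch below renames differently labeled columns to one shared set of field names. The field names, the alias table and the `normalize_row` helper are all hypothetical; the article does not describe how IBM's software performs this step.

```python
# Hypothetical alias table: shared field name -> headers a lab might use.
COLUMN_ALIASES = {
    "sample_id": {"sample_id", "sample id", "id", "specimen"},
    "hair_color": {"hair_color", "hair colour", "hair"},
    "latitude": {"latitude", "lat"},
    "longitude": {"longitude", "lon", "long"},
}

# Invert the table once so each known header maps to its shared field name.
HEADER_TO_FIELD = {
    alias: field
    for field, aliases in COLUMN_ALIASES.items()
    for alias in aliases
}


def normalize_row(row):
    """Return a copy of one spreadsheet row keyed by the shared field names."""
    unified = {}
    for header, value in row.items():
        field = HEADER_TO_FIELD.get(header.strip().lower())
        if field is None:
            # Fail loudly rather than silently dropping an unknown column.
            raise KeyError(f"unmapped column: {header!r}")
        unified[field] = value.strip()
    return unified


# Two labs, two header conventions, one resulting layout.
lab_a = {"Specimen": "A-17 ", "Hair Colour": "black", "Lat": "41.9", "Long": "21.4"}
lab_b = {"id": "B-03", "hair": "brown", "latitude": "-1.3", "longitude": "36.8"}
rows = [normalize_row(lab_a), normalize_row(lab_b)]
```

Rejecting unmapped columns is one possible design choice; a real system might instead log them for review.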
At the field labs, phenotypes from subjects -- such as hair color and skin color -- will be matched with genetic sequencing data from blood samples, and the combined data will be packaged as an XML object for transmission to Washington, Rodriguez said. Accompanying the data will be the geographic coordinates where the subject was interviewed.
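To make the idea of a single XML object concrete, here is a minimal Python sketch that bundles phenotypes, genetic markers and interview coordinates into one record. The element and attribute names, the `build_record` function and the sample values are invented for illustration; they are not the BioSystem Markup Language schema or the project's actual format.

```python
import xml.etree.ElementTree as ET


def build_record(sample_id, phenotypes, markers, latitude, longitude):
    """Serialize one subject's data as an XML string (hypothetical layout)."""
    record = ET.Element("record", {"id": sample_id})

    # Observable traits such as hair color and skin color.
    traits = ET.SubElement(record, "phenotypes")
    for name, value in sorted(phenotypes.items()):
        ET.SubElement(traits, "trait", {"name": name}).text = value

    # Results derived from sequencing the blood sample.
    genetics = ET.SubElement(record, "markers")
    for name, value in sorted(markers.items()):
        ET.SubElement(genetics, "marker", {"name": name}).text = value

    # Where the subject was interviewed, in decimal degrees.
    ET.SubElement(
        record,
        "location",
        {"lat": f"{latitude:.4f}", "lon": f"{longitude:.4f}"},
    )
    return ET.tostring(record, encoding="unicode")


# Illustrative values only.
xml_text = build_record(
    "S-001",
    {"hair_color": "black", "skin_color": "brown"},
    {"marker_1": "T"},
    41.9981,
    21.4254,
)
```

Keeping the coordinates inside the same record means each sample carries its own location when it is forwarded from lab to lab.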
Separate from the field research, the project also involves online participation, whereby members of the public can order a Participation Kit for US$100 and submit a cheek swab sample to learn their own migratory history. The results are stored securely and anonymously. Already, the National Geographic Society has received 60,000 kits.
Royyuru estimated the total cost of the project at US$40 million, primarily to cover years of salaries for thousands of researchers. But the software development has also been a data mining challenge, he said. "The lessons we have learned are clearly something we will replicate in other projects."