SAN MATEO (06/12/2000) - When Human Genome Project (HGP) statisticians were looking for a way to mine the billions of bases in the human genetic code and find out which ones cause certain diseases, they turned to SAS Institute Inc. and its Enterprise Miner for help.
"We've had a long relationship with SAS and pointed out to them that this was an area of application that probably needed their expertise," says Dr. Bruce Weir, a statistician at North Carolina State University, in Raleigh, who works on the HGP.
The HGP has been trying for more than a decade to decipher the complete sequence of the human genetic code. The genome contains roughly three billion base pairs, so the process is time-consuming. But the hope is that the research eventually will reveal which genetic patterns cause diseases, with the possibility of cures to follow.
"This is where data mining comes in," Weir says. "We're going to look at data, like a million bits of information per person, maybe a thousand people -- some of whom have a disease and some of whom don't -- and then we'll compare those patterns."
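The comparison Weir describes can be illustrated with a small sketch. The article does not say which statistics the HGP team uses; the example below assumes a simple Pearson chi-square test on allele counts at a single marker, with genotypes coded as the number of copies of the less common allele (0, 1, or 2). The function names and the sample data are hypothetical.

```python
def allele_counts(genotypes):
    """Count minor and major alleles; each genotype is 0, 1, or 2 minor-allele copies."""
    minor = sum(genotypes)
    major = 2 * len(genotypes) - minor
    return minor, major


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so there is nothing to compare
    return n * (a * d - b * c) ** 2 / denom


# Made-up genotypes at one marker for eight people with the disease and eight without.
cases = [2, 1, 2, 1, 2, 1, 1, 2]
controls = [0, 1, 0, 0, 1, 0, 1, 0]

stat = chi_square_2x2(*allele_counts(cases), *allele_counts(controls))
print(f"chi-square statistic: {stat:.3f}")
```

A statistic above 3.841 (the 5 percent critical value for one degree of freedom) would suggest that the allele is more common in one group than the other. In a real study the same test would be repeated across a very large number of markers.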
With the race to decipher the human genome heating up, the HGP is relying more than ever on SAS and its technology to mine and warehouse data while keeping productivity high.
"Now we're getting into really quite a mess of data mining issues," Weir says.
"We're doing some things, but [we're] certainly having to handle a lot of data, and products like Enterprise Miner, with their warehousing capabilities, are well set up [to handle the task]."
In particular, scientists are using SEMMA (Sample, Explore, Modify, Model, Assess), a built-in guide to the data mining process, along with decision-tree analysis of SNPs (single nucleotide polymorphisms), which serve as genetic markers on the human genome. The data must be managed in a way that keeps it logically structured and easy to match and organize.
Enterprise Miner's decision-tree analysis creates a graphical representation of the data, making it easier to see how genetic patterns line up with phenotypes.
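To give a sense of what a decision tree does with SNP data, here is a minimal sketch of the first step: choosing the one marker that best separates people by phenotype. It is not Enterprise Miner's algorithm, which the article does not describe. It assumes Gini impurity as the splitting measure, and the marker names and samples are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all labels agree)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_snp_split(samples, labels):
    """Return (snp_name, impurity_reduction) for the best single-SNP split."""
    parent = gini(labels)
    best = (None, 0.0)
    for snp in samples[0]:
        # Group the phenotype labels by the genotype value at this SNP.
        groups = {}
        for sample, label in zip(samples, labels):
            groups.setdefault(sample[snp], []).append(label)
        weighted = sum(len(g) / len(labels) * gini(g) for g in groups.values())
        gain = parent - weighted
        if gain > best[1]:
            best = (snp, gain)
    return best


# Hypothetical genotypes (0, 1, or 2 copies of the variant allele) for four people.
samples = [
    {"snp_a": 0, "snp_b": 1},
    {"snp_a": 0, "snp_b": 0},
    {"snp_a": 2, "snp_b": 1},
    {"snp_a": 2, "snp_b": 0},
]
labels = ["healthy", "healthy", "disease", "disease"]

snp, gain = best_snp_split(samples, labels)
print(snp, gain)
```

A full tree would repeat this step within each resulting group, and the branches are what a graphical view would display.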
To further aid HGP scientists, SAS Institute has developed several features specifically for mining genomic data, including a filter for false alarms, or "noise": SNPs that look like markers signaling an abnormality but really aren't. The company also has provided a tool that looks at how different markers relate to one another and a tool that helps factor in family history patterns.
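The article does not explain how the SAS noise filter works. One common way to screen out false alarms when many markers are tested at once is a Bonferroni correction, sketched below as an assumption rather than a description of the SAS feature. The marker names and p-values are made up.

```python
def bonferroni_filter(p_values, alpha=0.05):
    """Keep only markers whose p-value beats alpha divided by the number of tests."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {snp: p for snp, p in p_values.items() if p < threshold}


# Hypothetical p-values from testing four markers against a disease.
p_values = {"snp_a": 0.00001, "snp_b": 0.03, "snp_c": 0.2, "snp_d": 0.004}

kept = bonferroni_filter(p_values)
print(sorted(kept))
```

Here `snp_b` would pass a conventional 0.05 cutoff on its own, but it is dropped once the threshold is tightened to account for four tests.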
"Those people [at SAS] are used to seeing lots of data, and in the medical community we're not, so we're feeling a bit overwhelmed," Weir concedes.
The hope is that once the genome project is complete, doctors will be able to match drugs to specific gene types and, taking gene therapy to new levels, predict or cure genetic malformations.
Make no mistake about it: The race to find cures for diseases such as Alzheimer's may rest in the hands of these scientists and the data mining solutions that SAS provides.