CSIRO brings open source data mining to business
- 15 January, 2004 07:19
With Linux and open source software, CSIRO’s mathematical and information sciences division is now able to model information for business benefit without relying on proprietary software, according to principal computer scientist for enterprise data mining Dr Graham Williams.
“Data mining is all about building models of the world which can be used to reason about the world and identify things that can be used for business benefit,” Williams said. “And anytime data mining is about getting an answer now.”
After experiencing commercial data mining software, Dr Williams’ team now uses a variety of open source software running on the Debian GNU/Linux operating system.
“A lot of government departments use SAS for data mining,” he said. “At around $100,000 per seat per year it is a good product but once you get over the ‘woo’ features you hit a brick wall because you can’t customise it.”
For its data mining, CSIRO uses a number of “toolkits”, including R, GNOME and Python scripting.
“By taking the open source option we have data mining software that is free and can be modified,” Williams said. “Commercial software is available but there are quality assurance concerns about correct implementations, additional functionality is required for individual requirements, and who knows if they are going to be around in five years’ time.”
Williams cited the Health Insurance Commission and the NRMA as organisations using CSIRO’s open source and custom developed data mining applications to “identify groups of data according to certain characteristics”.
“Data mining is used at the NRMA for vehicle insurance premium setting which involves analysis of several million transactions annually,” he said. “At the HIC, some patients lodged all their Medicare claims at once creating a regular pattern of fraud. Hot spots are identified which are classified by clusters, rule induction, and then interestingness.”
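The hot spot steps Williams describes (cluster, describe each cluster with a rule, then rank by interestingness) can be illustrated with a minimal Python sketch. This is not CSIRO’s implementation: the toy one-dimensional k-means, the range-style rule, the interestingness score (cluster mean divided by overall mean), the function names and the claim counts are all assumptions made up for illustration.

```python
from statistics import mean


def kmeans_1d(values, k, iterations=20):
    """Toy 1-D k-means (an assumed stand-in for the clustering step)."""
    ordered = sorted(values)
    # Seed the centres deterministically, spread across the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest centre.
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster (keep it if empty).
        centres = [mean(c) if c else centres[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def hot_spots(values, k=2):
    """Cluster, describe each cluster by a range rule, rank by interestingness."""
    overall = mean(values)
    spots = []
    for cluster in kmeans_1d(values, k):
        spots.append({
            # Hypothetical "rule": the value range that covers the cluster.
            "rule": (min(cluster), max(cluster)),
            "size": len(cluster),
            # Hypothetical interestingness: cluster mean relative to overall mean.
            "interestingness": mean(cluster) / overall,
        })
    return sorted(spots, key=lambda s: s["interestingness"], reverse=True)


# Invented data: number of claims lodged per visit, one entry per patient.
claims_per_lodgement = [1, 1, 2, 1, 2, 1, 1, 14, 15, 2, 1, 16]
ranked = hot_spots(claims_per_lodgement, k=2)
for spot in ranked:
    print(spot)
```

On this invented data the small group of patients lodging 14 to 16 claims at once ranks first, which is the kind of group an analyst would then examine by hand.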
For its data mining activities, CSIRO is working with the Department of Health and Ageing’s research group, which has a “secure data mining facility”.
“The Department of Health and Ageing has a 200 CPU cluster running Debian Linux,” Williams said. “Debian is a stable server operating system that is easy to maintain and we also use it on desktops.”