PEERING INTO THE FUTURE: Text mining promises to cull answers from random text

If it is true that everything has already been written, then it's also true that every question has already been answered. The catch is creating a search engine that can find those answers regardless of what sort of document they were written in.

Data mining technologies let a company extract knowledge from data held in well-defined structures such as relational tables - and their use is becoming common business practice.

A few organisations are attempting to create a similar tool for mining from a much more adventurous source: unstructured text. This is great news for virtually all organisations that have large and ever-increasing numbers of online documents, e-mail messages, and customer requests, which often contain information of great value.

In short, the technology promises that by applying some of the same types of analysis used in data mining - such as knowledge discovery or trend analysis - to uncategorised textual data, a user or application could examine the text for structure and so get to the information buried within.

Although the task seems simple enough, the technology is still evolving and far from perfect. One of the biggest hurdles to overcome to date has been the fact that words, as opposed to single letters, were never designed for use with computers. Computers only deal with them for human convenience.

The only thing a computer understands about text is its ASCII assignment. A word such as hi has the same ASCII representation regardless of language, even though it does not have the same meaning. In fact, to a computer it has no meaning at all; it is simply the letter H and the letter I with no space in between. Therefore, making it searchable, or at least meaningfully searchable, is problematic to say the least.
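As a small illustration of that point, the following Python sketch (the variable names are ours and purely illustrative) shows what a program actually holds when it stores the word hi: two numeric codes and nothing about meaning.

```python
# Illustrative sketch: what a computer "sees" in the word "hi".
word = "hi"

# Encode the word as ASCII and list the numeric code of each letter.
codes = list(word.encode("ascii"))
print(codes)

# The codes are identical whatever language the writer had in mind;
# nothing in them records a meaning.
english_hi = "hi"  # intended as an English greeting
other_hi = "hi"    # the same letters typed by a speaker of another language
print(english_hi.encode("ascii") == other_hi.encode("ascii"))
```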

But what if you were to take that combination of letters and put it in a database that could be referenced with another combination of letters that spells hello or even bonjour? You could then, for example, search all your e-mails to create a list of everyone you greeted, regardless of the language used.
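The lookup idea can be sketched in a few lines of Python. Everything here is hypothetical: the GREETINGS table, the sample messages and the function name find_greeted are invented for illustration, and a real text mining product would need far richer linguistic data than a hand-built word list.

```python
import re

# Hypothetical lookup table: letter combinations mapped to one shared concept.
GREETINGS = {"hi": "greeting", "hello": "greeting", "bonjour": "greeting"}

def find_greeted(messages):
    """Return recipients whose message contains any word tagged as a greeting."""
    greeted = []
    for recipient, body in messages:
        # Split the text into lowercase words so "this" does not match "hi".
        words = re.findall(r"[a-z]+", body.lower())
        if any(GREETINGS.get(w) == "greeting" for w in words):
            greeted.append(recipient)
    return greeted

# Invented sample data: (recipient, message text) pairs.
sample = [
    ("alice@example.com", "Hi Alice, the report is attached."),
    ("bruno@example.com", "Bonjour Bruno, see you Monday."),
    ("carol@example.com", "Please send the invoice."),
]
print(find_greeted(sample))
```

The hard part, as the effort described in this article suggests, is not the lookup itself but building and maintaining a table that covers real language.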

The potential for this technology is staggering. Unfortunately, so is the amount of effort and thought required to get it going.

Computer vendors ranging from IBM with its Intelligent Miner to SAS with its Enterprise Miner are beginning to offer a wide range of text-analysis tools, full-text retrieval components, and Web-access tools for extending knowledge management and business intelligence solutions.

We are impressed with what we see so far, and you can count on the technology improving to a point at which organisations will quickly realise the business benefit of culling their otherwise spent information to create an "intelligent base" of information for any purpose.

If done right, text mining will become as important in the next five years as data mining was in the 1990s. As the world moves away from the traditional client/server technologies to more Web- and wireless-based approaches, the ability to search data regardless of its original purpose will be of massive benefit to virtually every organisation and company.

Future watch: Text mining

Organisations that choose to augment their current business intelligence capabilities with text mining may find considerable amounts of actionable information buried within otherwise useless unstructured text such as notes and documents. Conceivably, that unstructured text could be categorised and placed in indexes that would make it far more useful and reusable in a myriad of business applications.
