Like everyone else who's used an online search engine, you're well acquainted with the sorry state of content retrieval. It doesn't seem to matter how carefully you craft your query - some results will be wholly unrelated to your topic. The inaccuracy of content retrieval is merely frustrating to casual users, but it costs professionals real money.
When a computer delivers an erroneous response to an information request, the fault may lie in any number of places, from short circuits to programmer error. But most content retrieval failures trace back to the data itself.
Currently, databases use keyword lists and indexes to speed content retrieval. Instead of searching for the data you want to find, you're supposed to build queries that consist primarily of keywords and indexed data.
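To make the keyword-and-index approach concrete, here is a minimal sketch of an inverted index in Python. The sample documents and the function names (`build_index`, `keyword_search`) are invented for illustration and do not describe any particular database product.

```python
from collections import defaultdict

# Hypothetical sample documents, invented for illustration.
DOCUMENTS = {
    1: "Quarterly sales report for the northern region",
    2: "Sales contract with a regional supplier",
    3: "Notes from the quarterly planning meeting",
}


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the matches for the first keyword, then intersect the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


INDEX = build_index(DOCUMENTS)
print(keyword_search(INDEX, "quarterly sales"))
print(keyword_search(INDEX, "region"))
```

In this sketch a query for "region" finds only document 1, even though document 2 concerns a regional supplier. The index knows the exact keywords it was built from and nothing else, which is the limitation described above.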
To make content retrieval work as it should, systems must alter the way they represent and manage data. The current row/column/index scheme exists to overcome the limitations of today's computers. Future systems, by contrast, will hold all of your content in memory and scan it with a huge array of supercomputers, so no indexes or keywords will be necessary. Brute-force searches will execute in real time, obviating the need for index generation. That shift has already begun: RAM buffers and solid-state disks are employed today to speed content retrieval, and part of the long-term solution is to increase the use of such high-speed data storage.
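A brute-force search of in-memory content can be sketched in a few lines of Python. This is only an illustration at toy scale: the sample documents and the `full_scan` function are hypothetical, and a real system would need far more hardware than a single loop to make this approach practical.

```python
import re

# Hypothetical sample documents held entirely in memory.
DOCUMENTS = {
    1: "Quarterly sales report for the northern region",
    2: "Sales contract with a regional supplier",
    3: "Notes from the quarterly planning meeting",
}


def full_scan(documents, pattern):
    """Scan the full text of every document; no index is built or consulted."""
    regex = re.compile(pattern, re.IGNORECASE)
    return {doc_id for doc_id, text in documents.items() if regex.search(text)}


print(full_scan(DOCUMENTS, "region"))
print(full_scan(DOCUMENTS, r"quarterly .* meeting"))
```

Because the scan reads the content itself, a search for "region" also matches "regional", and arbitrary patterns work without anyone having chosen keywords in advance. The cost is that every query touches every document, which is why the approach depends on the faster storage and processing described above.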
With this in mind, today's solution architects should structure data in a way that works with today's retrieval methods, but that will also take advantage of ultra-efficient servers.
Future-ready content retrieval applications hang on to all original data, without attempting to quantify each data element's current relevance or searchability. Nor should an application discard data simply because it doesn't map well to the database's rigid structure. If you think some bit of data may be important, find a way to store it and worry about retrieving it later.
Today's solutions will give way to hierarchical databases with flexible schemas. This transition is already under way with the rapid acceptance of XML, which has all the qualities needed to enable a powerful content retrieval system, save one: performance. XML data is hierarchical, a more realistic arrangement than the two-dimensional structure of relational databases. XML creates order by enforcing structural rules, yet it also permits changes to that structure. That builds in a degree of adaptability that's hard to manage with current database technology.
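As a small illustration of hierarchical data with a flexible structure, the Python sketch below parses two customer records that do not share the same shape. The element names (`customer`, `orders`, `note`) and the `summarize` helper are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical customer records; the two entries have different structures.
XML_DATA = """
<customers>
  <customer id="1">
    <name>Ada</name>
    <orders>
      <order total="40.00"/>
      <order total="15.50"/>
    </orders>
  </customer>
  <customer id="2">
    <name>Grace</name>
    <note>Prefers phone contact</note>
  </customer>
</customers>
""".strip()

root = ET.fromstring(XML_DATA)


def summarize(customer):
    """Read what is present; missing branches yield None or an empty list."""
    return {
        "name": customer.findtext("name"),
        "order_count": len(customer.findall("orders/order")),
        "note": customer.findtext("note"),  # None when the element is absent
    }


SUMMARIES = [summarize(c) for c in root.findall("customer")]
for summary in SUMMARIES:
    print(summary)
```

The second record carries a `note` element that the first lacks and has no `orders` branch at all, yet both are stored and read without changing a table definition. A relational table would typically need a new column or a separate table to hold the same variation.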
Because of its flexibility, XML can capture application data today for later retrieval by more advanced systems. Current applications can readily churn through XML data to populate relational databases, as IBM's WebSphere does to create e-commerce shopper profiles. Some databases can even use relational facilities to store hierarchical data, and Microsoft's Data Shaping technology can convert relational data to hierarchical result sets on the fly.
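The general idea of churning through XML to populate relational tables can be sketched with Python's standard library. This is not how WebSphere or any other product does it; the XML layout, table names, and `load_profiles` function are hypothetical, and an in-memory SQLite database stands in for a production system.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical shopper data; element and attribute names are invented.
XML_DATA = """<shoppers>
  <shopper id="1"><name>Ada</name>
    <viewed>laptop</viewed><viewed>mouse</viewed></shopper>
  <shopper id="2"><name>Grace</name>
    <viewed>monitor</viewed></shopper>
</shoppers>"""


def load_profiles(xml_text):
    """Flatten the XML hierarchy into a parent table and a child table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shopper (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE viewed (shopper_id INTEGER REFERENCES shopper(id), item TEXT)"
    )
    for shopper in ET.fromstring(xml_text).findall("shopper"):
        shopper_id = int(shopper.get("id"))
        conn.execute(
            "INSERT INTO shopper VALUES (?, ?)",
            (shopper_id, shopper.findtext("name")),
        )
        # Each nested <viewed> element becomes one row in the child table.
        conn.executemany(
            "INSERT INTO viewed VALUES (?, ?)",
            [(shopper_id, v.text) for v in shopper.findall("viewed")],
        )
    conn.commit()
    return conn


conn = load_profiles(XML_DATA)
rows = conn.execute(
    "SELECT s.name, COUNT(v.item) FROM shopper s "
    "JOIN viewed v ON v.shopper_id = s.id GROUP BY s.id ORDER BY s.id"
).fetchall()
print(rows)
```

The nesting in the XML turns into a foreign-key relationship between two flat tables. The original XML can be kept alongside the tables, so nothing is lost if a later system can query the hierarchy directly.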
Future Watch: Content retrieval
Erroneous responses from search engines are responsible for everything from lost business to lawsuits. Prepare your company for the bigger and faster content retrieval systems of the future, which will eventually permit efficient full-content searches of data. Until then, use XML as a model for the structure of your business data and prepare for hierarchical databases, which are due to reach the market in two to five years.