With the proliferation of electronic documents and the archival pressures that various industry regulations exert on companies, enterprise search has become an important IT requirement during the past few years. Many search solutions -- including search appliances and more-robust, federated search engines, such as those from IBM and Verity -- have come to market recently to meet the demand. Specialty products such as Vivisimo Velocity fill additional niches.
I spent a day at IBM's San Jose, Calif., office testing an enterprise implementation of the company's federated search engine, the recently renamed WebSphere Information Integrator, OmniFind Edition, v. 8.2. The product first rolled out late in 2004 under the DB2 database brand name.
OmniFind is a true enterprise-scale search engine that IBM itself uses to find items in its databases, e-mail archives, and its 10,000 Web sites. The product comes in one of two configurations: on a single server or on a four-way system comprising a crawler, a parser-indexer, and two redundant run times that provide client interface services.
Clients most commonly interact with OmniFind through a browser, but they can also do so through a Java API. The latter enables a department to query search results for specific items, with the results returned to the application as Java objects. The Java API is useful for handling custom software, such as a knowledge-base search facility embedded in a help-desk application.
OmniFind uses a crawler to spider through a company's online assets. Results are parsed into individual words and links, which are then reassembled into an index. This index fields the search queries.
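The parse-then-index step can be illustrated with a toy inverted index. This is a minimal sketch of the general technique, not OmniFind's implementation: the function names, the sample documents, and the rule that every query word must match are all assumptions made for illustration, and link extraction is left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical crawled pages, keyed by path.
documents = {
    "hr/policy.html": "Vacation policy for full-time employees",
    "it/vpn.html": "VPN setup guide for remote employees",
}
index = build_index(documents)
print(sorted(search(index, "employees")))
print(sorted(search(index, "vpn employees")))
```

Queries are answered from the index alone, so the original documents never need to be reopened at search time.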
The crawler is a highly configurable piece of software. Its underlying technology uses two mechanisms. The first combs through databases and extracts searchable data; the second searches through unstructured data, including e-mail archives, content management systems, and a variety of document files.
There is also a pure intranet crawler, configured to adjust its spidering dynamically. The crawler tracks how often documents or data change and computes how frequently certain venues need to be reindexed. Exclusion lists and tools such as robots.txt files, which specify what can and cannot be accessed on given sites, can keep the crawler out of specific files and Web sites. You also have the option of identifying sites or resources that are difficult or slow to access, which can keep the crawl from overwhelming low-speed connections.
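Two of these behaviors are easy to sketch: honoring robots.txt and adapting the recrawl interval to how often a page changes. The robots.txt check below uses Python's standard `urllib.robotparser`. The interval rule (halve on change, double otherwise, within fixed bounds) is an assumed heuristic, since IBM did not describe its actual computation, and the URLs, crawler name, and bounds are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an intranet site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def load_rules(robots_text):
    """Parse robots.txt text into a rule set the crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def next_interval(current_hours, changed, min_hours=1.0, max_hours=168.0):
    """Assumed heuristic: revisit sooner if the page changed, later if not."""
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))


rules = load_rules(ROBOTS_TXT)
base = "http://intranet.example.com"
print(rules.can_fetch("example-crawler", base + "/private/salaries.html"))
print(rules.can_fetch("example-crawler", base + "/docs/handbook.html"))
print(next_interval(24, changed=True))
print(next_interval(24, changed=False))
```

The lower and upper bounds keep a frequently changing page from being crawled continuously and a static one from being forgotten.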
Data from the crawl is fed into the indexing engine, which relies on a specially configured, embedded instance of DB2. After parsing, categorizing, and weighting the data from the crawl, the engine generates a large index that becomes the file from which queries are answered.
The weighting algorithms are as important in enterprise searches as they are on Web engines -- perhaps even more so because enterprise users will often know that a specific document does exist somewhere but won't know exactly where. As a result, OmniFind uses algorithms that are distinct from those used by Web search engines. The latter depend heavily on the number of links pointing to a specific page to judge relevancy. Intranets, however, are rarely linked extensively to other intranet sites; they are often silos unto themselves, and so link counts are much less useful.
Instead, OmniFind weights its searches with data such as how often a keyword appears in the page, whether it appears in the title or subheads, and how often it appears in anchor text. OmniFind also uses a dynamic mechanism that tracks how often previous searches on a specific keyword have resulted in clicks to a particular page. So, as more searches are performed, the quality of the ranking improves significantly.
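A weighted sum is one simple way to combine signals like these. The sketch below is an illustration only: the weights, the `Page` structure, and the whole-word matching are assumptions, and OmniFind's real formula was not disclosed.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    title: str
    subheads: list
    body: str
    anchor_texts: list  # text of links that point at this page
    clicks: dict = field(default_factory=dict)  # keyword -> past click count


# Assumed weights, chosen only to make the example readable.
WEIGHTS = {"body": 1.0, "title": 5.0, "subhead": 3.0, "anchor": 2.0, "click": 0.5}


def count(keyword, text):
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    return text.lower().split().count(keyword)


def record_click(page, keyword):
    """Remember that a search on keyword led to a click on this page."""
    kw = keyword.lower()
    page.clicks[kw] = page.clicks.get(kw, 0) + 1


def score(page, keyword):
    """Combine on-page signals and click history into one relevance score."""
    kw = keyword.lower()
    body_hits = count(kw, page.body)
    in_title = 1 if count(kw, page.title) else 0
    in_subhead = 1 if any(count(kw, s) for s in page.subheads) else 0
    anchor_hits = sum(count(kw, a) for a in page.anchor_texts)
    click_hits = page.clicks.get(kw, 0)
    return (WEIGHTS["body"] * body_hits
            + WEIGHTS["title"] * in_title
            + WEIGHTS["subhead"] * in_subhead
            + WEIGHTS["anchor"] * anchor_hits
            + WEIGHTS["click"] * click_hits)


page = Page(
    title="Travel policy",
    subheads=["Booking flights"],
    body="The travel policy covers travel by air and rail",
    anchor_texts=["travel policy"],
    clicks={"travel": 4},
)
print(score(page, "travel"))  # prints 11.0
```

Each call to `record_click` raises the page's score for that keyword, which is the sense in which ranking improves as more searches are performed.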
Users have limited access to the ranking mechanism: They can specify links that must show up first for a given keyword, but they can't do much more to tweak rankings. This could prove a limitation for companies that have considerable material for a given keyword and want to make specific documents more salient. Administrators can also schedule reindexing for times when systems will be least affected.
The results display offers broad capabilities for selecting search items. A keyword search is the base level. A user, however, can ask for specific records or data items using an SQL-like query language. If the results derive from a database, they are shown in complete field detail.
OmniFind's security currently is coarse-grained. The display mechanism checks authorization levels before displaying data to make sure an employee is entitled to see a given result. Unfortunately, OmniFind lacks document-level security. Moreover, no mechanism exists to support an LDAP directory to automate access to an employee's credentials, although this feature is forthcoming.
OmniFind is an impressive tool in terms of the sheer volume of data that it can federate. It is clearly designed for enterprise use and scales to handle huge amounts of data.
I was surprised, however, by some limitations. For example, the crawler doesn't open .zip or .tar files. Help files, which frequently contain a wealth of searchable information, are also skipped.
Performance was hard to assess. IBM claims a minimum of 30 dps (documents per second) for crawling and the same rate for indexing, with bursts of 100 dps. My experience was that these numbers were aggressive: Indexing is gated by disk I/O and, in the demo I saw, it wasn't near 30 dps. The test system was not set up to simulate a true crawl -- as all the documents were local -- so crawl performance was more difficult to ascertain.
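To put the claimed rates in perspective, a quick calculation converts documents per second into elapsed time. The one-million-document corpus is a hypothetical figure chosen for illustration, not a number from IBM or from the demo.

```python
def hours_to_process(doc_count, docs_per_second):
    """Return the hours needed to process doc_count documents at a steady rate."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return doc_count / docs_per_second / 3600


corpus = 1_000_000  # hypothetical corpus size
for rate in (30, 100):
    hours = hours_to_process(corpus, rate)
    print(f"{rate} dps: {hours:.1f} hours")
```

At the claimed minimum of 30 dps, such a corpus would take a little over nine hours to index; any shortfall from that rate stretches the window proportionally.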
OmniFind is, for the most part, an easy-to-run, configurable, scalable, and intelligent enterprise search engine. However, the lack of document-level security, the absence of LDAP support, and the ignored file types suggest OmniFind's first release needs some tweaks.