Opinion: Mining the intranet

One of my early uses of Web services, back in 1999, predated SOAP and WSDL. It was a script to calculate what I called Web mindshare. It combined Yahoo's ability to enumerate the sites in a category with AltaVista's ability to count inbound links to each of those sites. It was a primitive version of what Google, then in beta, went on to prove dramatically: Links measure authority. What interested me even more, though, was how easily that little script was able to compose a novel service -- ranking everything in a category -- from two existing but unrelated services.
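Here, for flavor, is a minimal Perl sketch of the idea. The Yahoo step is stubbed out with placeholder sites, and the AltaVista URL and result pattern are reconstructions of that era's pages, not the 1999 original:

```perl
#!/usr/bin/perl -w
# Mindshare sketch: rank the sites in a category by inbound-link count.
# The sites list stands in for a scraped Yahoo category; the AltaVista
# URL and count pattern are illustrative reconstructions.
use strict;
use LWP::Simple;
use URI::Escape;

my @sites = ('example.com', 'example.org');

my %mindshare;
for my $site (@sites) {
    # AltaVista's link: operator counts pages that link to a site
    my $url  = 'http://www.altavista.com/web/results?q=' . uri_escape("link:$site");
    my $page = get($url) or next;
    if ($page =~ /found ([\d,]+) results/) {
        (my $count = $1) =~ tr/,//d;   # "43,000" -> "43000"
        $mindshare{$site} = $count;
    }
}

# print the category ranked by inbound links
for my $site (sort { $mindshare{$b} <=> $mindshare{$a} } keys %mindshare) {
    print "$site: $mindshare{$site}\n";
}
```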

I was reminded of the mindshare calculator this week when I noticed that the new book Spidering Hacks by Kevin Hemenway and Tara Calishain includes an updated version that works with Google. Naturally, I had to try it out. But first I needed to find my API key because only registered users can call Google's Web services. Then I had to install Perl's SOAP::Lite module, which wasn't on the machine I was using. Then I had to find a copy of Google's WSDL file. All this to do exactly what the 1999 script had done without any of this paraphernalia.
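For the curious, the Google call looks roughly like this. The ten positional arguments belong to the old doGoogleSearch method of Google's SOAP Search API; the key and the WSDL path are placeholders you'd have to supply yourself:

```perl
#!/usr/bin/perl -w
# Calling Google's SOAP Search API via its WSDL file.
# The API key and the WSDL location are placeholders.
use strict;
use SOAP::Lite;

my $key   = 'YOUR-GOOGLE-API-KEY';
my $query = 'link:example.com';

my $google = SOAP::Lite->service('file:GoogleSearch.wsdl');

# doGoogleSearch takes ten positional arguments: key, query, start,
# maxResults, filter, restrict, safeSearch, lr, ie, oe
my $result = $google->doGoogleSearch(
    $key, $query, 0, 10, 'false', '', 'false', '', 'latin1', 'latin1');

print "estimated results: $result->{estimatedTotalResultsCount}\n";
```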

This isn't just an old-fart story about how things used to be simpler. In many cases, simple things still are (or can be) simple. My recent plunge back into the primordial soup that became Web services reminded me why simplicity is a good thing. Neither Yahoo nor AltaVista offers a SOAP/WSDL interface, so when I decided to rerun the old script to compare previous results with current ones, it was reasonable to expect a disaster. Conventional wisdom says HTML screen-scraping is a poor excuse for a formal API and won't survive the test of time. And yet in this case, the Yahoo format was unchanged, and AltaVista needed only one trivial tweak: it used to report "about 43,000 pages," and now it reports "found 43,000 results."

With current AltaVista data in hand, I decided to look at comparable results for Google and AllTheWeb. Because Google now discourages screen-scraping, I used the SOAP/WSDL method. But for AllTheWeb, which like AltaVista offers no formal API, I used the original technique, tweaking only the URL of the search engine and the pattern of the result count. Moving to SOAP would mean two analogous tweaks: specifying an endpoint instead of a URL, and parsing an XML result instead of matching a pattern. But if I'd had to register for an API key and locate WSDL documentation for each of the three services whose results I compared, I probably wouldn't have bothered.
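In practice, each engine reduces to a two-field table entry: a query URL and a count pattern. A sketch, with both fields for both engines reconstructed from memory of the circa-2003 result pages rather than verified:

```perl
#!/usr/bin/perl -w
# Per-engine result counts by screen-scraping: each engine needs only
# a query URL and a result-count pattern. Both fields below are
# illustrative reconstructions, not verified specifics.
use strict;
use LWP::Simple;
use URI::Escape;

my %engines = (
    altavista => {
        url     => 'http://www.altavista.com/web/results?q=',
        pattern => qr/found ([\d,]+) results/,   # was "about ... pages" in 1999
    },
    alltheweb => {
        url     => 'http://www.alltheweb.com/search?q=',
        pattern => qr/([\d,]+) results/,
    },
);

my $query = uri_escape('link:example.com');
for my $name (sort keys %engines) {
    my $e    = $engines{$name};
    my $page = get($e->{url} . $query) or next;
    if ($page =~ $e->{pattern}) {
        (my $count = $1) =~ tr/,//d;
        print "$name: $count\n";
    }
}
```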

Of course, sites such as Amazon and Google have reasons to create formal APIs and control access to them. But on an enterprise intranet the threat is disuse, not overuse. You're publishing information that you want people to find, exploit, and recombine. When it's appropriate to use SOAP and WSDL -- for example, when queries require fancy authorization or complex inputs -- then do so. But when a simpler strategy will suffice, don't be ashamed to use it. Between the primordial tag soup of HTML and the formal realm of Web services exists a large and fertile middle ground: XHTML.

Information that you publish in XHTML can be directly consumed by browsers, and it is much friendlier to spiders than ill-formed HTML. It's true that creating XHTML pages requires more discipline than hacking out HTML, and it may incur some retraining costs. But if you hope people will mine your intranet, make the job as easy as it can be.
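To see why well-formedness matters to a spider, consider this sketch. It assumes an XML parser (XML::LibXML here, though any would do) pulling structured data straight out of an XHTML page with XPath, with no regex guesswork:

```perl
#!/usr/bin/perl -w
# Because XHTML is well-formed XML, a spider can use a real XML parser
# and XPath instead of brittle regexes. XML::LibXML is an assumption
# here; the table is a made-up example.
use strict;
use XML::LibXML;

my $xhtml = <<'END';
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table id="sales">
      <tr><td>Q1</td><td>120</td></tr>
      <tr><td>Q2</td><td>135</td></tr>
    </table>
  </body>
</html>
END

my $doc = XML::LibXML->new->parse_string($xhtml);
my $xc  = XML::LibXML::XPathContext->new($doc);
$xc->registerNs(x => 'http://www.w3.org/1999/xhtml');

# pull each row of the table with one XPath query
for my $row ($xc->findnodes('//x:table[@id="sales"]/x:tr')) {
    my @cells = map { $_->textContent } $xc->findnodes('x:td', $row);
    print join(': ', @cells), "\n";
}
```

One well-formed page, and the same data serves browsers and spiders alike.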
