XML: The Right Tool for Odd Jobs

SAN FRANCISCO (04/21/2000) - As regular readers of this column know, I am a big fan of self-describing data files. In "A Lazy Afternoon" and "Smart Data", I covered some aspects of this technique.

Consequently, I have been watching XML (Extensible Markup Language) with a great deal of interest. XML has a good shot at being a really pervasive technology that may cause changes in all parts of the computing community.

Let's look at some XML fundamentals, to see how this could be.

XML looks quite a bit like HTML. The languages employ very similar syntax, though XML is distinctly pickier about compliance. Fundamentally, however, HTML and XML have rather different goals and some corresponding design differences.

The goal of a typical HTML file is to present information to a human reader, mediated by a browsing program. Although HTML documents can be written in a very abstract way, coders often use tricks to specify the exact appearance of the resulting document.

This makes Webpages interesting to look at, to be sure, but it greatly increases the difficulty of writing programs to read the data. Worse, the HTML organization of a page may change at any time, subject to the whims of the developing organization.

XML documents, in contrast, are optimized for processing by computer programs. Their tight rules of syntax allow both consistency checking by the creator and ease of access by arbitrary clients.
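Here is a minimal sketch of what that strictness buys, using the parser in Python's standard xml.etree.ElementTree module. The sample markup and the helper name is_well_formed are made up for illustration. A browser would typically render the second snippet as best it could; an XML parser rejects it outright.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: the second one never closes its <title> element.
good = "<book><title>Example</title></book>"
bad = "<book><title>Example</book>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The creator of a file can run the same kind of check before publishing it, which is the consistency checking mentioned above.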

Further, the vocabulary and structure (the high-level syntax) of XML files can be defined by a document type definition (DTD) file or an XML schema. These can provide a programmer (or a particularly deft program) with a guide to reading and parsing the actual XML file. (HTML has DTDs, as well, but they tend not to be as detailed.)

XML has many other interesting characteristics, but these should get us started. Let's explore some possible XML-based applications.

Book catalogs

Consider the task of parsing publishers' Webpages to generate a comprehensive list of books on a given topical area. Each publisher uses a different format, of course, and every time a publisher rearranges a page or adds a feature, some programmer must figure out (again) how to parse the format.

Big companies such as Amazon.com Inc. simply step around this problem, requiring publishers to give them listings in a specified format.

Unfortunately, this means that each publisher now has to generate a different listing for each online reseller.

Wouldn't it be more reasonable for publishers and resellers to agree on a single listing format? The pages could be transmitted privately or posted on the World Wide Web for more general access. In either case, however, the target audience would be programs, rather than humans.

If the format were well documented, special-purpose search programs could be hacked up in Perl or a similar scripting language. As an occasional book reviewer, I would love to have a program that could generate lists of books on specified topics!

XML is aimed at precisely this kind of problem. Publishers and resellers could easily (from a technical perspective) define a common vocabulary and structure for XML-based catalogs.
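To make the idea concrete, here is a rough sketch of such a search program in Python. The catalog layout (catalog, book, title, and topic elements, plus an isbn attribute) is invented for this example and is not taken from any existing publishing standard; the function name titles_on_topic is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; element and attribute names are illustrative only.
CATALOG = """
<catalog publisher="Example Press">
  <book isbn="0-000-00000-1">
    <title>First Example Book</title>
    <topic>XML</topic>
    <topic>Perl</topic>
  </book>
  <book isbn="0-000-00000-2">
    <title>Second Example Book</title>
    <topic>Unix</topic>
  </book>
</catalog>
"""


def titles_on_topic(catalog_xml, topic):
    """Return the titles of all books tagged with the given topic."""
    root = ET.fromstring(catalog_xml.strip())
    wanted = topic.lower()
    return [
        book.findtext("title")
        for book in root.iter("book")
        if any(
            (t.text or "").strip().lower() == wanted
            for t in book.findall("topic")
        )
    ]


print(titles_on_topic(CATALOG, "xml"))
```

Because every publisher would use the same element names, the same few lines would work against any publisher's catalog, with no per-site parsing code.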

Although this could be accomplished by a prose description, a DTD or XML schema really should be used to specify the exact format. Existing bibliographic formats (e.g., BiblioML and MARC) cover very similar problems, so the publishing community could probably adopt (or adapt) one for its own use.
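For a sense of what such a specification might look like, the sketch below pairs a small, invented DTD fragment with a Python check. The parser in Python's standard library does not validate against a DTD, so the function book_problems is a hand-written stand-in that covers only a few of the rules (it ignores element order, for instance). A validating parser would enforce all of them directly from the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD fragment, shown for reference only; it is not parsed here.
CATALOG_DTD = """
<!ELEMENT catalog (book*)>
<!ELEMENT book (title, author+, topic*)>
<!ATTLIST book isbn CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT topic (#PCDATA)>
"""


def book_problems(book):
    """List the ways one <book> element departs from the rules above."""
    problems = []
    if "isbn" not in book.attrib:
        problems.append("missing isbn attribute")
    if len(book.findall("title")) != 1:
        problems.append("needs exactly one title")
    if not book.findall("author"):
        problems.append("needs at least one author")
    return problems


doc = ET.fromstring(
    "<catalog>"
    "<book isbn='0-000-00000-1'>"
    "<title>Example</title><author>A. Writer</author>"
    "</book>"
    "<book><title>Untitled Draft</title></book>"
    "</catalog>"
)
for number, book in enumerate(doc.findall("book"), start=1):
    print(number, book_problems(book))
```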

Once agreement has been reached on the DTD, each publisher must find a way to convert its local catalog format into (and, where needed, out of) the XML format. This is a relatively trivial effort, however, compared with generating formats for an arbitrary (and steadily increasing) number of resellers.
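As a rough illustration of how small that conversion can be, the sketch below assumes a publisher keeps its catalog as comma-separated text. The column names and the output element names are assumptions made for this example, not part of any agreed format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical in-house export; the column names are assumptions.
LOCAL_CATALOG = """isbn,title,topic
0-000-00000-1,First Example Book,XML
0-000-00000-2,Second Example Book,Unix
"""


def csv_to_catalog(csv_text):
    """Build a <catalog> element with one <book> per CSV row."""
    catalog = ET.Element("catalog")
    for row in csv.DictReader(io.StringIO(csv_text)):
        book = ET.SubElement(catalog, "book", isbn=row["isbn"])
        ET.SubElement(book, "title").text = row["title"]
        ET.SubElement(book, "topic").text = row["topic"]
    return catalog


catalog = csv_to_catalog(LOCAL_CATALOG)
print(ET.tostring(catalog, encoding="unicode"))
```

The publisher writes this converter once; every reseller that accepts the shared format can then use its output.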

I would love to be able to tell you that the publishing industry is well on the way to having such a system in place. Sadly, even publishers that have myriad books about XML haven't (yet) published their catalogs in XML form. I predict that it will happen, however, and probably sooner rather than later.

Software building and distribution

In the Unix community, software builds are commonly controlled by a version of the make utility. make files describe dependency relationships between files (e.g., "foo is built from foo.c and foo.h"), using a largely declarative syntax supplemented by snippets of shell code.

Because make is a very flexible language, wizards can cause it to do spectacular things. The FreeBSD Project's Jordan Hubbard, for instance, has created a 2,500+ line make file as the basis for the FreeBSD Ports Collection.

In concert with a small specification file for each package, Jordan's make file automates the downloading, patching, building, and installation of given open source packages. About 3,000 of these specification files currently exist, covering a very wide range of packages.

Unfortunately, the system depends heavily on Berkeley-style make and has a variety of FreeBSD dependencies. Consequently, adapting the system to support Solaris (let alone Linux) might be a challenge.

I have speculated about the possibility of using XML as the basis for a rewritten system. In the new system, the description files would be both abstract (no OS dependencies) and totally declarative (no embedded snippets of code).
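A sketch of what I have in mind follows. The description format is entirely hypothetical (it is not the Ports Collection's format, nor any published one), as are the package names and URLs. The point is that the file states only facts, such as where the source lives and what the package depends on, and leaves the "how" to whatever tool reads it. Here a few lines of Python work out a safe build order.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # Python 3.9 or later

# Hypothetical description format; no name here comes from a real system.
PACKAGES = """
<packages>
  <package name="tool" version="1.0">
    <source url="http://example.org/tool-1.0.tar.gz"/>
    <depends on="libbeta"/>
  </package>
  <package name="libbeta" version="2.1">
    <source url="http://example.org/libbeta-2.1.tar.gz"/>
    <depends on="libalpha"/>
  </package>
  <package name="libalpha" version="0.9">
    <source url="http://example.org/libalpha-0.9.tar.gz"/>
  </package>
</packages>
"""


def build_order(packages_xml):
    """Return package names ordered so each one follows its dependencies."""
    root = ET.fromstring(packages_xml.strip())
    # Map each package to the set of packages that must be built first.
    graph = {
        pkg.get("name"): {dep.get("on") for dep in pkg.findall("depends")}
        for pkg in root.findall("package")
    }
    return list(TopologicalSorter(graph).static_order())


print(build_order(PACKAGES))
```

Nothing in the description mentions a shell, a compiler, or an operating system, so the same file could in principle drive a FreeBSD, Solaris, or Linux build tool.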

Looking around a bit, I discovered that I was not alone in considering this approach. The Open Software Description, developed by folks at Marimba and Microsoft, proposes XML as the foundation for a complete software packaging and distribution system.

Apple is also reported to be making heavy use of XML in the software build and distribution mechanisms for Mac OS X. And, of course, XML plays a large role in Apple's WebObjects system.

Oh, yes, Webpages

Although I have discounted the use of XML for Webpages, there are some really interesting possibilities here, as well. In an effort to make Webpages more interesting and dynamic, programmers are stuffing all sorts of executable code (e.g., Java, JavaScript, Perl, and Tk) into HTML pages.

This makes me more than a bit twitchy, as I have no way of knowing the real intentions (or, for that matter, simple competence) of the programmers who wrote the code. So, I tend to leave these facilities off in my browser, missing pizzazz in return for a bit more safety.

Instead of sending executable code, however, programmers could send declarative descriptions of items, along with possible presentation modes. These modes, defined by style sheets, can support interactive graphics, multimedia, and more. What they do not do is pump arbitrary code into the viewer's machine.
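The following toy sketch shows the shape of the idea, not a real style sheet mechanism. The item format, the mode names, and the function names are all assumptions for illustration. The page supplies data and the name of a presentation mode; the rendering code itself lives on the viewer's machine, and an unrecognized mode falls back to plain text rather than running anything new.

```python
import xml.etree.ElementTree as ET

# Hypothetical item description sent by a site: data plus a requested mode.
ITEM = "<item mode='headline'><label>Spring sale</label></item>"


def as_plain(label):
    """Show the label unchanged."""
    return label


def as_headline(label):
    """Show the label in capitals."""
    return label.upper()


# Renderers are installed locally; the document can only name one of them.
TRUSTED_MODES = {"plain": as_plain, "headline": as_headline}


def present(item_xml):
    """Render an item with a locally trusted mode, never with code it carries."""
    item = ET.fromstring(item_xml)
    render = TRUSTED_MODES.get(item.get("mode"), as_plain)
    return render(item.findtext("label", default=""))


print(present(ITEM))
```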

Although I suspect that evildoers could find ways to subvert even XML, the opportunities are more limited. So I look forward to upcoming uses of XML that will take advantage of "trusted" presentation code to give me both pizzazz and safety.

There are many books on XML that deal with assorted aspects of the standard. The ones I have listed in the Resources section are simply the ones that I found useful as introductions.

Rich Morin operates Canta Forda Computer Laboratory, a computer consulting firm specializing in open source software, and serves as a consulting advisor to Addison Wesley. He lives in San Bruno, California, on the San Francisco Peninsula.
