As founder and former CEO of Tandem Computers -- a vendor of very high-availability systems and now a Compaq Computer subsidiary -- James Treybig knows a lot about what it takes to achieve high levels of application uptime. Today, he's a partner at Austin, Texas, venture capital firm Austin Ventures and invests in high-tech start-ups.
Treybig recently spoke with Computerworld senior editor Jaikumar Vijayan about high-availability issues on the Web.
Q: What are some of the biggest challenges companies face in building reliable, scalable Web environments?
A: Ensuring data integrity. The hardest problem is making sure that when something fails, you don't lose data. For many companies, as long as you can get back on the air quickly, failure is OK if you can do two things: [take] a [system] dump to find out what caused the problem, and [make sure] no data got corrupted. Failure always raises the problem that you lose data. ... Over time, it's like cancer in your database. ... You have a huge crash, and you can't recover any data.
Q: There has been a spate of high-profile service outages recently. Why?
A: Some of the companies doing e-commerce are new ones. They start without much money and without having a way to address all these issues. They build systems; they explode; they build them again. They don't have good application testing; they don't do failure analysis; they don't do stress tests.
Then you have the brick-and-mortar companies who have been around a long time -- but not necessarily online. When you look at e-commerce, your business revolves around the Web. That means changing systems, upgrading them, doing new software releases. ... These are all problems.
Q: So what should companies do?
A: Fault tolerance is like having a dial tone. You can't look at only the [hardware] system anymore. The architecture of the whole complex is really key to availability, reliability, scalability and data integrity.
Q: Isn't that expensive to achieve?
A: It is not. You want to be cost-effective. You may have all your databases on Unix boxes, you may be running your applications on NT boxes. You can partition your data over lots of systems that are reliable so that if something fails, you don't lose data ... or you have duplicate data running on separate systems. ... The architecture of the site is how you achieve this, not individual systems. What mattered in the old days was having one system that was scalable, reliable, etc.
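The point about duplicate data on separate systems can be illustrated with a little arithmetic. The sketch below is not from the interview: the function name and the figures are hypothetical, and it assumes that each copy fails independently of the others, which real sites rarely achieve because of shared power, networks and software bugs.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Chance that at least one of `replicas` copies is up.

    Assumes every copy has the same availability `single` (a fraction
    between 0 and 1) and that copies fail independently. Both are
    simplifying assumptions made for illustration only.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The data is unreachable only when every copy is down at once.
    return 1.0 - (1.0 - single) ** replicas


# Two hypothetical boxes, each up 99 percent of the time.
print(f"{combined_availability(0.99, 2):.4%}")
```

Under those assumptions, two ordinary boxes holding the same data do better together than either does alone, which is the sense in which the site architecture, rather than any single system, determines availability.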
Q: A few vendors are saying they might soon be guaranteeing better than 99 percent availability on their Unix boxes.
A: I don't believe that for a second. There is a kind of naïveté when people talk of things like 99.99 percent uptime and fault tolerance -- you know it's not possible. There is no stand-alone Unix box that is anywhere near 99.99 percent availability -- and there is no NT box for sure.
If you don't have underlying box, database and application protection, you are not going to get anywhere near that.
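To put the percentages in perspective, an availability figure can be converted into allowed downtime per year. This is a rough sketch, not something Treybig presented; the function name is hypothetical, and it assumes a 365-day year and counts every minute of outage, planned or not.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a system may be down at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


for pct in (99.0, 99.99):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime a year")
```

At 99 percent, a system can be down for roughly 5,256 minutes a year, or more than three and a half days. At 99.99 percent, the allowance shrinks to under an hour.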