As the recently appointed CIO of eBay, Maynard Webb's first task is to stem a series of embarrassing service disruptions that have already cost the world's largest online auctioneer millions of dollars in revenue and stock value. Webb, a former CIO of PC maker Gateway and an IT executive at companies like Bay Networks and Quantum, recently spoke with Computerworld senior editor Jaikumar Vijayan about how companies like eBay face Fortune 100-like IT problems despite being a fraction of the size.
Q: Is this job very different from your previous ones?
A: Absolutely. That is why I am here today. eBay, I believe, has the hottest and biggest technology challenge in the Internet space.
Q: How is that?
A: If you take a look at a lot of the other e-commerce sites, even though they have a lot of volume and activity, the actual intensity can be pretty low. You do a lot of batch updates to a back-end database and it looks pretty recent, but you still have to fly packages around the world, and getting updates on the packages isn't a hard thing to do - though it is important.
If you take a look at eBay, you are talking about an extremely tight integration between all the Web transactions and the database. The volume and the intensity put it in the Fortune-100 kind of transaction volumes.
Q: What's been causing all those service problems at eBay recently?
A: The good news is these are high-class problems to have. We have an extremely scalable and tight application that is all written in C++ and has a lot of headroom and legs left to run. What we didn't do so well was to put as much focus on reliability and availability of our platform. As a result, with some of the bigger outages we didn't have as much recovery and flexibility as we needed. We didn't have hardware redundancy and failover. So if our database server crashed for any reason, we had to fix all of the elements of the server itself to be able to roll back and get the site back up.
Q: What are you doing about it?
A: We already have a warm backup situation where we should be able to get back up pretty quickly - within two to four hours of an outage - at any time. By the middle of October, we will have a high-availability backup (with fully redundant servers) that will have us back up within an hour. At the same time, we are working on our next-generation architecture plan to (eliminate) any single points of failure. We are looking at distributing the application and database over multiple servers to make sure we can handle the 100X growth in database activity we are experiencing.
Q: What kind of testing are you doing?
A: Building a test environment to simulate all this is not a trivial thing. As an old pro at this, I would like to spend more time testing [applications], but we've got time-to-market issues. We are doing things in Internet time where everything moves at warp speed. We've spent a lot of time improving our quality assurance capability. I think we have done a reasonable job of testing a lot of the changes.
Q: So how do you figure out how much capacity you need and how much is too much?
A: It is an art, not a science. I was just at a meeting where we were talking about (immediately) adding more DASD (storage) than we would have in a six- to eight-month period. You've got to get more disciplined in what you do. You simply have to get tighter, simpler and be smarter on things like archiving and DASD management. The demand grows and grows and grows. The volume of traffic from our site has almost doubled since our June outage and grown by 35 per cent just in the one month since I was announced.
Q: Are there any tools or metrics that help you do this?
A: There are a lot of tools out there, but most of them are not geared to deliver what I need. So we have to write a bunch of what we need ourselves. Most of the time we get an off-the-shelf product and then we've got to customise it. This is not rote behavior we are talking about. This is breaking new ground.
Q: How do you figure out how much to spend on upgrading your site?
A: We know exactly what downtime can cost us in lost revenue. We have a very strong and very loyal user community and the biggest roadblock is our inability to scale. We will spend cost-effectively and prudently. It would be silly for us not to buy the capacity we need and to stay ahead of our wildest dreams on capacity.
The architecture that we lay both in the near term and long term has to be about two times our wildest estimation, so that we can stay way the heck ahead. We think we are pretty big now, but a year later we are going to grow 10X. We are sitting here 100 times the size we were last year.... The judicious thing to do is to pick solutions that let you scale almost infinitely.
Q: What advice do you have for companies grappling with similar issues?
A: I think you need to bring an elephant gun to kill a mouse. Hardware is cheap, the pace of the game is frenetic and being the first mover in an industry like this is very important. You really need to figure out what the business plan is, do a what-if scenario that is beyond your wildest dreams and build an architecture that lets you scale beyond your wildest estimation. You quickly need to be a world-class organisation to handle your technology. It is like managing a Fortune-100 kind of computing environment when your company is still a toddler.