Armando Fox believes that, if you can't build fail-proof systems, you should at least build systems that can recover so quickly that service blips become negligible. A Research Associate with the University of California, Berkeley's Reliable Adaptive Distributed Systems Laboratory (RAD Lab), Fox was one of the leads on the joint Berkeley/Stanford Recovery-Oriented Computing (ROC) Project, which investigated techniques for building dependable Internet services with an emphasis on "recovery from failures rather than failure-avoidance."
Fox has since brought some of the ROC lessons forward into the RAD Lab, which was launched in 2005 with US$7.5 million in funding from Google, Microsoft and Sun. Affiliate members include IBM, HP, Nortel, NTT-MCL and Oracle. The RAD Lab is focused on problems that plague large Internet-based businesses because the environments represent an extreme, but Fox says the lessons learned should ultimately trickle down to enterprise users. Network World Editor-in-Chief John Dix asked Fox to explain the vision.
Let's start with a review of ROC. What was that all about?
The philosophy of the ROC Project was "stuff happens." Despite our best efforts to design and debug these complicated Internet systems, they inevitably end up failing in ways we didn't expect. Hardware is not perfect. Software has bugs. Even with really, really well-tested software like Oracle, you find bugs after it's been out in the field. And, you know, humans are in charge of running these systems and sometimes they make mistakes.
So the ROC Project philosophy was, let's accept that those things are going to happen and start thinking about designing for fast recovery as opposed to designing to avoid failure, which is not really a realistic goal. One way to improve system availability is to never fail. But, another way to improve is to make recovery from failure so fast that the contribution to availability is negligible.
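The arithmetic behind that trade-off can be sketched with the standard steady-state availability formula, MTTF / (MTTF + MTTR), where MTTF is mean time to failure and MTTR is mean time to recover. The function name and all figures below are illustrative assumptions, not measurements from the ROC Project.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers, chosen only to illustrate the comparison.
baseline = availability(mttf_hours=1000.0, mttr_hours=1.0)

# Option A: make failures ten times rarer (failure avoidance).
fewer_failures = availability(mttf_hours=10000.0, mttr_hours=1.0)

# Option B: make recovery ten times faster (recovery-oriented design).
faster_recovery = availability(mttf_hours=1000.0, mttr_hours=0.1)

print(f"baseline:        {baseline:.6f}")
print(f"fewer failures:  {fewer_failures:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
```

Under these assumed numbers, cutting recovery time by a factor of ten buys the same availability as making failures ten times rarer, which is the sense in which fast recovery can stand in for never failing.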
Why do you start off with the assumption that you can never build systems that won't fail?
Because we don't think we're smart enough to counter the last, what, 60 years of computer science history. There are a lot of people working on design by correctness and other techniques to improve systems and minimize bugs. And that's a good thing. But so far, despite our best efforts, I cannot think of a single computer system ever designed in which no bugs were ever found once it was in the field.
So, I suppose we could take the position that somehow, in the future, that's all going to change. But we've been saying that for decades. And it's not that we're stupid, right? I mean, in terms of performance, storage density, network communications speeds, look what we've been able to do in 30 years. But then compare that with what we have been able to do in terms of reliability. The complexity of these systems has gotten to the point where it's very difficult for any one individual to understand how one of these systems works.
Plus, market reality being what it is, it's not as if you'd polish the whole thing, deploy it and then leave it alone. Systems have to evolve. You add new features, get more users, scale your system up. All of those processes work counter to reliability. Some of the most reliable software out there is the software that runs the Space Shuttle, and ask those guys how they make changes in their software. They have to write thousands of pages of documentation and have hundreds of hours of design reviews before a single line of code gets touched. So they have super reliable software, but it comes at a price.
And the reality is most Internet companies can't pay that price. Amazon can't have hundreds of hours of design meetings before deciding whether it can roll out a new feature. So the ROC Project basically said, look, we need to find a way to deal with this issue in the context of what commercial realities are. Because, yes, these systems evolve rapidly. And, yes, that's bad for reliability. But that innovation is where a lot of the value of these systems comes from. And we're not going to, as academics, propose an approach to the problem that says, you can fix your systems, but at the cost of rapid innovation.
So, that was the philosophy of ROC. And we actually made a fair amount of progress in identifying a couple of things. We identified some specific techniques that could be built into software systems that would help recover from certain kinds of common problems, really fast. In fact, so fast that sometimes you might not even notice it, except a minor blip in performance. So, that was an important finding. And, those ideas are starting to find their way into some commercial products.
How about an example?
Sure. One idea we worked on was called micro-rebooting. When you have a weird, unexpected, unrecoverable bug and don't know what else is wrong, you reboot your machine. Sometimes that's enough to fix the problem. But rebooting takes a long time. So, given that applications have evolved to this componentized architecture using things like Enterprise JavaBeans (EJB), our idea was to apply this concept of rebooting to a small number of components at a time. Instead of rebooting the whole EJB server, which can take minutes, you micro-reboot only the EJB components that appear to have been failing. You reset the thing that was failing, but at much lower cost, because you're only doing it to the EJB component you believe was the actual source of the problem.
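The idea Fox describes can be illustrated with a toy model. The sketch below is not EJB code and not the ROC Project's implementation; `Component`, `Server`, `micro_reboot` and the component names are all hypothetical, and a "reboot" here simply resets a health flag and counts the restart.

```python
class Component:
    """Hypothetical stand-in for one application component (such as an EJB)."""

    def __init__(self, name):
        self.name = name
        self.starts = 0
        self.healthy = False
        self.start()

    def start(self):
        # A (re)start puts the component back into a known-good state.
        self.healthy = True
        self.starts += 1

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is in a bad state")
        return f"{self.name} handled {request}"


class Server:
    """Hypothetical container hosting several components."""

    def __init__(self, names):
        self.components = {name: Component(name) for name in names}

    def full_reboot(self):
        # The expensive option: restart every component.
        for component in self.components.values():
            component.start()

    def micro_reboot(self, name):
        # The cheap option: restart only the suspect component.
        self.components[name].start()

    def call(self, name, request):
        try:
            return self.components[name].handle(request)
        except RuntimeError:
            # Assume the failing component is the culprit, restart it, retry once.
            self.micro_reboot(name)
            return self.components[name].handle(request)


server = Server(["cart", "catalog", "checkout"])
server.components["cart"].healthy = False  # simulate a fault in one component
result = server.call("cart", "add item")
print(result)
print({name: c.starts for name, c in server.components.items()})
```

In this toy run only the faulty component is restarted; the other two are left untouched, which is where the cost saving over a full reboot comes from. A real system would also need a way to decide which component to suspect and to escalate to a broader restart if the micro-reboot does not help.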