Customers praise Microsoft's 'no BS' explanation of cloud service outage

A Microsoft executive last week gave what some users called a "refreshingly direct" and "no BS, straight-talking" explanation of a several-hour outage of the cloud-based Visual Studio Online, collecting kudos for acknowledging mistakes and spelling out what happened.

In an Aug. 22 blog post, Brian Harry, a Microsoft Technical Fellow, corporate vice president, and product unit manager for Team Foundation Server, detailed the Aug. 14 outage of Visual Studio Online, the cloud service designed to help development teams manage complex projects.

Visual Studio Online was offline in some regions late Wednesday and early Thursday, Aug. 13-14, but troubles mounted Thursday morning until they became a total outage that lasted five-and-a-half hours. "This duration and severity makes this one of the worst incidents we've ever had on VS Online," Harry wrote.

In his 1,750-word blog -- not nearly Nadella- or Sinofsky-esque in verbosity -- Harry by turns apologized for the outage, dove into a technical explanation of what triggered the blackout, and laid out some steps the team planned to take to stymie a repeat.

But what got the attention of commenters was Harry's candor.

"We've gotten sloppy. Sloppy is probably too harsh. As with any team, we are pulled in the tension between eating our Wheaties and adding capabilities that customers are asking for," said Harry. "In the drive toward rapid cadence, value every sprint, etc., we've allowed some of the engineering rigor that we had put in place back then to atrophy -- or more precisely, not carried it forward to new code that we've been writing. This, I believe, is the root cause."

That got applause from customers.

"As a fellow who has trod some similar ground over the years, let me simply say: nice analysis write-up, that was refreshingly direct," said Benjamin Treynor in a comment appended to Harry's piece.

"A perfect template for no BS straight talking. Well done, very impressed," added someone identified only as "Craig" in a latter comment. "Lots of good lessons in there, too, that we can all benefit and learn from."

"I always enjoy reading these retrospectives. I believe they are critical to a healthy customer relationship and maintaining trust," chimed in Tyler Jensen.

Harry's admission that Microsoft's push for a faster pace was behind the outage struck a chord because customers and other observers have questioned the tempo, worried that quality will suffer as speed becomes the top priority. And Microsoft is on a mission to accelerate development and its release schedule as it shifts to, as CEO Satya Nadella has put it, a "mobile-first, cloud-first" strategy.

For some people, nearly every recent Microsoft misstep, including a flawed security patch that crippled tens of thousands of PCs worldwide, has an easy explanation: the faster pace. If "agile," a buzzword not just at Microsoft, has become the new black in development, then there are many who don't want anything to do with the results.

"You may need to slow down the agile pace and make the product solid," cautioned another commenter, Weijie Jin, on Harry's blog. "I've seen several outages of VS Online in just several months, please don't make it worse."

Wes Miller, an analyst with Directions on Microsoft, was on the side of the concerned. "It's hard to evolve and ship big products fast," said Miller, who, like many of the analysts at Directions, once worked at the Redmond, Wash., company. "We've seen this with recalled updates across almost every Microsoft product over the last two years."

Microsoft's accelerated cadence runs counter to decades of practice, noted Miller, implying that that in itself makes a transition difficult. "Microsoft worked for over 20 years on the same premise as an automaker. Start with an idea. Develop. Test. Release. Monolithic and multi-year," said Miller in an email reply to questions.

But another contributor to problems like the Visual Studio Online outage has been the sheer weight of the past. Historically, Microsoft has prided itself on backwards compatibility: the ability to support older technologies and software even as it moves ahead.

"As we learned with Windows when the componentization effort began -- during XP, but then really evolved during Longhorn [which launched as Windows Vista] after the reset -- the dependency stacks in Microsoft technology are very complex, and effectively unknown by anyone on the team," Miller argued. "As a result, it's like the game pick-up sticks. You can't see the complex interdependencies unless something you do affects them."

Code changes, Miller contended, may seem simple on the surface, but the reality is that the burden of the past has created such a complex snakes' nest that it's impossible to predict their impact.

"Unless you completely rewrite and refactor the software to get rid of the legacy completely, [that complexity] is always there," Miller said. "As a result, any time you short-circuit testing, and assume that unit testing of just the new feature covers everything, you'll eventually discover that it doesn't."

"Unit testing" is pretty much what it sounds like: The practice of testing discrete units of code, not the whole, to see if things work.

Harry was on the same page as Miller.

"Developers can't fully understand the cost/impact of a change they make because we don't have sufficient visibility across the layers of software/abstraction, and we don't have automated regression tests to flag when code changes trigger measurable increases in overall resource cost of operations," Harry said in his Monday blog. "You must, of course, be able to do this in synthetic test environments -- like unit tests -- but also in production environments because you'll never catch everything in your tests."

Harry's commentary was in stark contrast to Microsoft's typical silence on many of its blunders and miscues. Two weeks after an Aug. 12 security update caused some Windows 7 PCs to lock up with the dreaded Blue Screen of Death, Microsoft has yet to officially offer an explanation.

"I completely agree that this was unique and refreshing to see," said Miller of Harry's mea culpa. "But we need to see [the same from] the Office and Windows teams. Alas, they're generally proceeding forward and seemingly ignoring the complexity of the legacy stacks and complex hardware their customers live upon."
