Amazon Web Services has published a post-mortem of an outage in its Sydney Region that knocked a number of Australian customers offline.
Amid wild weather, a range of the cloud provider’s services in one Availability Zone within the Sydney Region went down.
In its post-mortem, AWS said that its utility provider suffered a loss of power at a regional substation during the severe storm that gripped Sydney.
“In one of the facilities, our power redundancy didn't work as designed, and we lost power to a significant number of instances in that Availability Zone,” AWS said.
The cloud provider said its diesel rotary uninterruptible power supply (DRUPS) setup had not functioned as expected:
“The specific signature of this weekend’s utility power failure resulted in an unusually long voltage sag (rather than a complete outage). Because of the unexpected nature of this voltage sag, a set of breakers responsible for isolating the DRUPS from utility power failed to open quickly enough. Normally, these breakers would assure that the DRUPS reserve power is used to support the datacenter load during the transition to generator power.
“Instead, the DRUPS system’s energy reserve quickly drained into the degraded power grid. The rapid, unexpected loss of power from DRUPS resulted in DRUPS shutting down, meaning the generators which had started up could not be engaged and connected to the datacenter racks. DRUPS shutting down this rapidly and in this fashion is unusual and required some inspection. Once our on-site technicians were able to determine it was safe to manually re-engage the power line-ups, power was restored at 11:46PM PDT [4:46pm AEST].”
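The failure sequence AWS describes can be illustrated with a toy model. The sketch below is not AWS's control logic, and every number in it is a made-up assumption chosen only for illustration: the reserve is expressed as seconds of datacenter load, and `grid_drain_rate` is a hypothetical multiplier for how much faster the reserve empties while the isolation breakers are still closed and the DRUPS is feeding the sagging grid.

```python
def drups_ride_through(breaker_open_s, generator_ready_s,
                       reserve_s=20.0, grid_drain_rate=10.0):
    """Return True if the DRUPS reserve lasts until the generators are ready.

    All parameters are hypothetical illustration values:
      breaker_open_s    -- seconds until the isolation breakers open
      generator_ready_s -- seconds until the generators can take the load
      reserve_s         -- reserve energy, in seconds of datacenter load
      grid_drain_rate   -- extra drain into the degraded grid, as a
                           multiple of the datacenter load
    """
    # Phase 1: breakers still closed, so the reserve feeds both the
    # datacenter load and the degraded utility grid.
    connected_s = min(breaker_open_s, generator_ready_s)
    reserve_s -= connected_s * (1.0 + grid_drain_rate)
    if reserve_s <= 0:
        return False  # reserve exhausted; DRUPS shuts down

    # Phase 2: breakers open, so the reserve feeds only the datacenter load.
    isolated_s = generator_ready_s - connected_s
    reserve_s -= isolated_s
    return reserve_s > 0


fast_breakers = drups_ride_through(breaker_open_s=0.1, generator_ready_s=10.0)
slow_breakers = drups_ride_through(breaker_open_s=3.0, generator_ready_s=10.0)
print(fast_breakers, slow_breakers)
```

Under these assumed figures, breakers that open almost immediately leave enough reserve to reach generator power, while breakers that stay closed for a few seconds let the reserve drain before the generators can be connected.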
Compounding the problem, the data centre’s DNS servers came under heavy load as instances recovered once power was restored. In addition, a bug in the company’s instance management software caused slower-than-expected recovery for some instances. Calls to the cloud provider’s APIs were also affected, AWS said.
“For this event, customers that were running their applications across multiple Availability Zones in the Region were able to maintain availability throughout the event,” AWS said.
“For customers that need the highest availability for their applications, we continue to recommend running applications with this architecture. We know that it was problematic that for a period of time there were errors and delays for the APIs that launch instances. We are working on changes that will assure our APIs are even more resilient to failure and believe these changes will be rolled out to the Sydney Region in July.”
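The multi-zone recommendation can be shown with a minimal sketch. It makes no calls to any AWS API; the zone labels and instance names are hypothetical placeholders, and the round-robin placement is just one simple way to spread instances, not a description of how AWS or any customer actually schedules them.

```python
from itertools import cycle


def spread_across_zones(instance_ids, zones):
    """Assign each instance to a zone in round-robin order."""
    return dict(zip(instance_ids, cycle(zones)))


def surviving_instances(placement, failed_zone):
    """Return the instances that are not in the failed zone."""
    return [inst for inst, zone in placement.items() if zone != failed_zone]


# Hypothetical zone labels and instance names, for illustration only.
zones = ["zone-a", "zone-b", "zone-c"]
instances = ["web-1", "web-2", "web-3", "web-4"]

multi_zone = spread_across_zones(instances, zones)
single_zone = spread_across_zones(instances, ["zone-a"])

multi_survivors = surviving_instances(multi_zone, "zone-a")
single_survivors = surviving_instances(single_zone, "zone-a")
print(multi_survivors)   # instances still running after zone-a fails
print(single_survivors)  # empty: everything was in the failed zone
```

An application placed in a single zone loses every instance when that zone fails, while the spread placement keeps some capacity running, which is the property AWS says protected its multi-zone customers during this event.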