William Hill Australia has rolled out PagerDuty’s SaaS offering and is using it to manage its critical infrastructure alerts while the online bookmaker focuses on delivering 100 per cent uptime during major events.
William Hill’s head of infrastructure and operations, Alan Alderson, said that when it comes to availability his team has two different targets: “One is for what we class as tier 1 events, basically all the Group 1 racing days during the year and major sporting events such as State of Origin and the Australian Open, and we have to be up 100 per cent of the time for those events during the day,” Alderson told Computerworld. At other times, the goal is 99.99 per cent availability, he said.
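To put the two targets in perspective, the arithmetic behind an availability figure can be sketched as follows. This is an illustrative calculation only: the function name and the choice of measurement period are assumptions, and the article does not say over what window William Hill measures its 99.99 per cent goal.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the downtime (in minutes) permitted by an availability target.

    availability_pct: target such as 99.99 (per cent)
    period_days: length of the measurement window in days (an assumption;
                 the article does not state the window used)
    """
    total_minutes = period_days * 24 * 60
    allowed_fraction = 1 - availability_pct / 100
    return total_minutes * allowed_fraction


if __name__ == "__main__":
    # 99.99 per cent over a 365-day year
    print(round(downtime_budget_minutes(99.99, 365), 2))
    # 100 per cent during a tier 1 event day leaves no budget at all
    print(downtime_budget_minutes(100.0, 1))
```

Measured over a 365-day year, 99.99 per cent leaves roughly 52.6 minutes of permitted downtime, while the 100 per cent target for tier 1 event days leaves none.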
He said the rollout of PagerDuty is an effort to make sure the right alert is routed to the right team. “We had a team that was just constantly getting targeted with alerts, and the fatigue was becoming evident,” he said, and things were being missed.
The bookmaker uses CA Unified Infrastructure Management and has started feeding its CDM alerts (for CPU, disk and memory utilisation) into PagerDuty.
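The article does not describe how the CA Unified Infrastructure Management alerts reach PagerDuty. As a rough sketch of what a CDM-style alert could look like in the format of PagerDuty's public Events API v2, the snippet below builds (but does not send) a trigger event. The function name, host name, routing key and the 90 per cent threshold are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical threshold: the article does not state the values William Hill uses.
CRITICAL_THRESHOLD_PCT = 90.0


def build_cdm_event(
    routing_key: str,
    host: str,
    metric: str,
    value_pct: float,
    threshold_pct: float = CRITICAL_THRESHOLD_PCT,
) -> Dict[str, Any]:
    """Build an Events API v2 style trigger event for a CPU, disk or memory alert."""
    # Events API v2 accepts the severities critical, error, warning and info.
    severity = "critical" if value_pct >= threshold_pct else "warning"
    return {
        "routing_key": routing_key,        # identifies the receiving integration
        "event_action": "trigger",         # opens (or updates) an alert
        "dedup_key": f"{host}:{metric}",   # repeat breaches map to the same alert
        "payload": {
            "summary": f"{metric} utilisation at {value_pct:.0f}% on {host}",
            "source": host,
            "severity": severity,
            "component": metric,
            "custom_details": {
                "value_pct": value_pct,
                "threshold_pct": threshold_pct,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder routing key and host name, for illustration only.
    print(build_cdm_event("EXAMPLE_ROUTING_KEY", "web-01", "cpu", 97.0))
```

Using a stable `dedup_key` per host and metric is one way to keep a sustained breach from generating a fresh alert on every polling cycle, which bears on the alert fatigue Alderson describes.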
“If it’s a critical alert, there’s an immediate phone call to the support team — which right now is the service desk — and they will investigate and troubleshoot, triage and then escalate if necessary,” Alderson said.
“Before we would be relying on an email coming in and someone watching for the email to say there’s a CPU threshold being breached.”
Since rolling out the new system in late October, mean time to acknowledge has dropped from minutes to seconds, he added.
“You just could be away from your desk for a couple of minutes and you’ve missed a critical email coming in — whereas now you’ll get a phone call,” he said.
“I’m confident now if that person is away from his desk for a couple of minutes, it will then get escalated on to the next level of support quickly if it’s not acknowledged.”
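The escalation behaviour Alderson describes, where an unacknowledged alert moves on to the next level of support, can be modelled with a short simulation. This is a simplified sketch and not PagerDuty's implementation: the level names, the five-minute timeout and the function name are assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence


def notified_levels(
    levels: Sequence[str],
    timeout_minutes: float,
    acked_after_minutes: Optional[float] = None,
) -> List[str]:
    """Return the support levels that get notified before the alert is acknowledged.

    Level 0 is notified at minute 0; each later level is notified one timeout
    after the previous one, unless the alert has been acknowledged by then.
    acked_after_minutes=None means nobody ever acknowledges the alert.
    """
    notified: List[str] = []
    for index, level in enumerate(levels):
        notify_at = index * timeout_minutes
        if acked_after_minutes is not None and notify_at >= acked_after_minutes:
            break  # acknowledged before this level would have been contacted
        notified.append(level)
    return notified


if __name__ == "__main__":
    # Hypothetical escalation chain and timeout.
    chain = ["service desk", "infrastructure on-call", "infrastructure manager"]
    print(notified_levels(chain, timeout_minutes=5, acked_after_minutes=2))
    print(notified_levels(chain, timeout_minutes=5, acked_after_minutes=7))
    print(notified_levels(chain, timeout_minutes=5))
```

In this model, an acknowledgement within the first timeout keeps the alert with the service desk, while an alert nobody acknowledges works its way through every level in the chain.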
Although PagerDuty is currently only managing William Hill’s CDM alerts, the plan is to expand it to encompass a much broader range of alerts, including those generated by Splunk and its APM system.
“We want to pick up all our critical alerts across all our monitoring platforms straight into PagerDuty and then PagerDuty will then do what it needs to do to get it out to the right team, which can start working on the problem,” the infrastructure manager said.
“We just want to get the CDM process absolutely nailed down and then we’ll start working on other mature alerting processes that we’ve got and start building that into the teams,” Alderson said.
“Once the CDM process absolutely has been nailed down then we’ll start working on other mature alerting processes and pipe them through PagerDuty,” he added.
“Eventually we want PagerDuty to push these alerts straight to the support teams, further automating the process and therefore reducing resolution times.”