A technical investigation into the signalling issues that caused delays to 40 per cent of the Sydney metro rail network in early April has recommended a complete review of RailCorp asset management protocols to more clearly define refresh cycles for critical equipment.
In a post-mortem report released Friday (PDF), the government department attributed the problems to an eight-year-old Cisco 3550XL network switch forming part of the Advanced Train Running Information Control System (ATRICS) LAN at Sydenham station.
The report indicated that intermittent failures in the switch just after 7.30am on 12 April this year caused the entire network to reconfigure itself, ultimately leading to a failure that wasn’t completely resolved until after 4pm that day.
As a result, 240 trains were cancelled and 847 trains were delayed across most of the rail network, with an average waiting time of 27 minutes.
The age of the switch, which was designed to last at least another seven years, was not called into question. However, the report found the failed switch was one of a batch of Cisco devices the manufacturer had warned could fail due to a fault in the power supply capacitor.
The failures were noted up to two months before the outage, but were not acted on. In a patch bulletin released on the issue in 2003, the manufacturer suggested the switch be replaced on failure.
“Although ‘replace-on-fail’ may be appropriate for an enterprise network, processes need to be introduced to consider the risk this approach poses in a high criticality application,” the report reads.
Among seven recommendations outlined in the report, the technical investigation indicated a more clearly defined refresh cycle and asset management protocol were required to prevent future incidents. The team also recommended a review of the ATRICS software’s ability to manage fail-over scenarios and of the slowness of the network at Sydenham, which prevented engineers from applying fixes more quickly.
The faulty switch was quarantined and analysed as part of the investigation, while issues discovered at both the Sydenham ATRICS network and Revesby were fixed.
Follow James Hutchinson on Twitter: @j_hutch
Follow Computerworld Australia on Twitter: @ComputerworldAU