A data centre brings together the best of light, cooling, power and telecommunications in one place, and by sharing the space with others you gain the benefit of buying only the amount that you need, often at a higher quality level than you could afford on your own. However, there are some hidden costs.
The ‘lights out’ data centre — a data centre approach that, as the name suggests, has few if any lights on and often also features a climate controlled, enclosed computer room with very limited access — is essentially a place for machines not humans.
There are significant advantages in using this approach: restricting physical access removes many sources of human error, such as other clients of the data centre unplugging your equipment or overloading your power circuit when they plug in the monitoring cart. However, these advantages can quickly become disadvantages when disaster strikes.
For example, a Lights Out Management (LOM) card cannot change a faulty Ethernet card, so redundancy needs to be built into the system. Redundancy can be added in many ways, including cold spares, warm spares, hot-standby and automatic failover. Cold spares are simply machines sitting in storage, ready to take a configuration and be used. It is critical to check that cold spares still work, because switch-mode power supplies are notorious for powering up after a long rest and then breaking down either on the next power cycle or a few days later.
Warm spares are generally unconfigured, but kept in a working state. Hot-standby machines can be remotely configured to take over a failed machine’s role. Automatic failover is a warm machine that takes up a job when required — this becomes more complex when state and partial transactions are thrown into the mix.
Each step in enhancing the availability of the system increases the complexity of the problem to be solved and the interactions between the parts of the system.
Ironically, adding redundancy sometimes reduces the overall reliability of a system. An analogy is the difference between single-engine and twin-engine aircraft. Twins come with a spare engine, and one would presume that they would be more reliable overall because of it; unfortunately, twin-engine aircraft tend to be more difficult to fly and have more parts to maintain. As a result, single-engine aircraft can be more reliable than twins.
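The arithmetic behind this effect can be sketched in a few lines of Python. This is a minimal illustration only: the availability figures are hypothetical, failures are assumed to be independent, and the "failover layer" stands in for whatever extra shared component the redundant design depends on.

```python
def parallel(*availabilities):
    """Redundant components: the group fails only if every member fails."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


def series(*availabilities):
    """Chained components: the group works only if every member works."""
    available = 1.0
    for a in availabilities:
        available *= a
    return available


# Hypothetical figures, chosen only for illustration.
SERVER = 0.99          # one server on its own
FAILOVER_LAYER = 0.98  # extra shared component the redundant pair depends on

single = SERVER
redundant_pair = series(parallel(SERVER, SERVER), FAILOVER_LAYER)

print(f"single server:  {single:.4f}")
print(f"redundant pair: {redundant_pair:.4f}")
```

With these made-up numbers the redundant pair comes out at roughly 0.980, below the 0.990 of the single server, because the extra layer it relies on is less dependable than the machine it protects. Different figures can just as easily favour the redundant design.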
In the case of redundant computing services there are typically additional layers of networking, storage and management, any of which can be misconfigured and so disable an otherwise working system. Also, there may be additional costs for either your technicians or the data centre’s technicians to do work on the equipment. In addition to the monetary costs, there can be a significant difference in experience and capability between a random data centre employee and your own staff.
You may be unlucky and get stuck with a Microsoft engineer for your Unix systems after the flu has swept through the network operations centre (NOC) and everyone else is ill; in this case it really helps to have practised talking a person through the GRUB boot screen or using fsck (file system check).
Finally, lights out data centres encourage users to put their equipment in remote locations. When taken to extremes this can form an additional risk. Even data centres can have major failures that affect many customers at one time. At these times they may relax their rules to help get things up and running; this doesn’t help you if your equipment is too far away for your staff to reach.
Maurice Castro is a member of SAGE-AU, a not-for-profit IT operations and system administrator professional organisation.