What went wrong?

Actually, we know very well what went wrong in each of the following horror stories …

14 steps to total Infrastructure Meltdown.

  • A large company had computer rooms, holding key email and business information, in each of their London offices.
  • It was decided that they should be replaced by a single commercial co-location facility.
  • Our advice was that, though the decision to concentrate was correct, the selected facility was flawed and lacking in business continuity measures.
  • Our advice was ignored.

Here’s what happened next:

  • The local electricity supply suffered a general power failure.
  • The resulting power surge blew a 600 Amp ceramic fuse.
  • The spare could not be found.
  • The Uninterruptible Power Supply (UPS) was designed to carry the load for 20 minutes.
  • The SLA with the customer called for the generator to start within 6 minutes.
  • When it did, its exhaust flue ignited waste material.
  • The smoke was drawn into the generator enclosure, triggering the fire detection system and shutting down the generator.
  • After 20 minutes, the UPS shut down.
  • All power was now lost to the data centre for 4 hours.
  • The operators panicked and failed to alert their customers.
  • Complex operational systems and databases crashed.
  • To make matters worse, the move into the facility had seen a lot of corners cut, to meet a tight deadline.
  • The new systems had been created manually, without backup disks.
  • It took 4 days to restore most of the corporate email system.

ECA had foreseen all these problems and had advised the customer not to place their business-critical systems in jeopardy. Here are a few of our comments:

Reliance on a single generator is poor practice.

  • Multiple generators should all start as soon as they sense a mains power failure and reach their full design load within seconds.
  • Thereafter, the switchgear should manage load shedding, shutting down spare generators while maintaining full load.
  • A UPS capacity of 20 minutes is inadequate if you only have one generator.
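The arithmetic behind these comments can be sketched in a few lines of Python. This is a rough illustration only, not a model of the actual site: the `Generator` class, the `minutes_without_power` function and the two-minute generator run time are our assumptions, and the UPS is assumed not to recharge while a generator is running. Only the 20-minute UPS, the 6-minute start and the roughly 4-hour loss of power come from the incident.

```python
from dataclasses import dataclass


@dataclass
class Generator:
    start_delay_min: float  # minutes from mains failure until it carries load
    runs_for_min: float     # minutes it keeps running (float("inf") if healthy)


def minutes_without_power(outage_min, ups_autonomy_min, generators):
    """Count the minutes of a mains outage during which the load is dark.

    Simplified minute-by-minute model: the UPS battery drains whenever no
    generator is running and (assumption) is never recharged.
    """
    battery = ups_autonomy_min
    dark = 0
    for minute in range(outage_min):
        on_generator = any(
            g.start_delay_min <= minute < g.start_delay_min + g.runs_for_min
            for g in generators
        )
        if on_generator:
            continue          # generator carries the load
        if battery > 0:
            battery -= 1      # UPS carries the load
        else:
            dark += 1         # nothing left to carry the load
    return dark


# Single generator that trips two minutes after starting (hypothetical figure).
single = [Generator(start_delay_min=6, runs_for_min=2)]
# Same site with a second, healthy generator (hypothetical).
dual = single + [Generator(start_delay_min=6, runs_for_min=float("inf"))]

print(minutes_without_power(260, 20, single))
print(minutes_without_power(260, 20, dual))
```

With one generator, the 20 minutes of UPS capacity is all that stands between a generator trip and a blackout. With a second healthy generator, the same UPS only has to bridge the start-up delay.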

Digger shuts down hospital.

When a completely even mains supply is crucial, it can be ‘smoothed’ via a UPS, consisting of large battery banks that also maintain power if the mains supply fails. It is essential that the entire end-to-end supply - UPS, backup generator(s), and associated fuel and controls - is fully tested on installation and after maintenance. Sometimes, the UPS is only designed to hold the electrical load for a very short time, until the generator kicks in. This is a minimalist design, not suited to essential installations like hospitals.

This is what happened in a recent incident:

  • A contractor’s JCB dug up the electricity main.
  • The UPS took over and - as designed - switched off when the generator started up.
  • Some time later, the generators ran out of fuel, because their fuel pump depended on mains electricity!
  • The hospital lost all power and suffered a blackout.
  • Fortunately, no lives were lost.

As can be seen, this design contained several single points of failure.

  • It should have been checked thoroughly, during design, installation and commissioning.
  • It should have been run under service load conditions, to prove its operational effectiveness.
  • It should have been regularly tested.
  • The fact that the generators start is no proof that electrical supply will be maintained.
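A design review can catch this kind of hidden dependency before a digger does. The sketch below, in Python, is a minimal illustration under our own assumptions: the component names, the dependency maps and the `failed_components` function are hypothetical, and the model ignores timing (the real generators ran for a while on the fuel they already had).

```python
def failed_components(depends_on, initial_failure):
    """Return every component that stops working once `initial_failure` is lost.

    `depends_on` maps a component to the set of components it cannot run
    without. A component fails if anything it depends on has failed.
    """
    failed = {initial_failure}
    changed = True
    while changed:
        changed = False
        for component, needs in depends_on.items():
            if component not in failed and needs & failed:
                failed.add(component)
                changed = True
    return failed


# Supply chain during a mains outage, as described in the incident.
as_built = {
    "fuel_pump": {"mains"},          # the hidden single point of failure
    "generator": {"fuel_pump"},
    "hospital_load": {"generator"},
}

# Hypothetical corrected design: the pump has its own battery supply.
corrected = {
    "fuel_pump": {"pump_battery"},
    "generator": {"fuel_pump"},
    "hospital_load": {"generator"},
}

print(sorted(failed_components(as_built, "mains")))
print(sorted(failed_components(corrected, "mains")))
```

In the as-built model, losing the mains eventually takes the hospital load with it. In the corrected model, nothing else fails. A check like this is no substitute for running the system under service load, but it shows where to look.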

Software shuts down at midnight.

This was an avoidable chain of events!

  • The system was built in a hurry.
  • The main software vendor installed the software - an early implementer version - by download from their own website.
  • They failed to supply discs at build stage.
  • However, the system went live, settled down and operated well.
  • Then, one night - sharp at midnight - it failed.

And so did a similar system in Singapore.

  • It transpired that for two weeks, the application software maintenance people had been receiving ‘evaluation licence about to expire’ notices.
  • They didn’t believe them.
  • Having left a ‘logic time bomb’ in their software, the vendor’s development staff had corrected it - but they had forgotten to tell their own field staff.
  • Results: damaged business; damaged reputations.
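The two weeks of warnings were the chance to act. A monitoring check along the following lines would have turned them into an escalation rather than something to disbelieve. This Python sketch is illustrative only: the `licence_status` function, the 14-day threshold and the dates are our assumptions, and it has nothing to do with how the vendor's software actually checked its licence.

```python
from datetime import date


def licence_status(expiry, today, warn_days=14):
    """Classify a licence by how close it is to expiry.

    `expiry` is taken to be the first day on which the software will no
    longer run, i.e. it stops at midnight as that day begins (assumption).
    """
    days_left = (expiry - today).days
    if days_left <= 0:
        return "expired"
    if days_left <= warn_days:
        return "escalate"   # raise with the vendor now, do not ignore
    return "ok"


# Hypothetical dates for illustration.
expiry = date(2024, 1, 15)
print(licence_status(expiry, date(2023, 12, 1)))
print(licence_status(expiry, date(2024, 1, 1)))
print(licence_status(expiry, date(2024, 1, 15)))
```

The point is procedural rather than technical: an expiry warning on a live system should always reach someone with the authority to question the vendor.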

 
