Posted by Matt Edgley on 31-May-2017 12:07:34

What the BA outage tells us about data centre resilience

The IT issues that affected British Airways (BA) over the late May bank holiday weekend offer a sobering lesson in what downtime means for a modern digital business. The outage meant the airline had to ground hundreds of flights from Gatwick and Heathrow, was unable to communicate with 75,000 stranded passengers, and now faces a compensation bill that could reach £150 million.

But what do we know about the outage itself? What went wrong with BA’s IT infrastructure, and how can other firms prevent theirs from succumbing to the same fate?

At the time of writing, details are thin on the ground – but from what we’ve heard so far, the incident highlights the importance of redundancy in the data centre, as well as the way that redundancy is managed. And this is a commonly misunderstood area.

(Recommended reading: The data centre services buyer's guide)

When failover systems fail

According to statements from BA, the outage wasn’t the result of a sophisticated cyber-attack or malware infection (which many firms are still jumpy about after last month’s WannaCry debacle), but simply of a power supply failure in one of its UK data centres.

This alone shouldn’t be enough to cause a three-day outage for one of the world’s biggest airlines, of course. Just about any private data centre or colocation facility will have failover systems in place (such as UPS units) to ensure service continuity through unplanned outages.

And BA’s was no different – only on this occasion the backup systems also dropped out, turning what should have been a minor issue into a cascading failure and a genuine catastrophe for the airline.

The most detailed summary of the incident to date comes from Telegraph Business, which reported on Monday 30th that a UPS unit at one of BA’s Heathrow data centres had been “shut down – the reasons for which are not yet known”. The report continues:

“Under normal circumstances, power would have been returned to the servers in Boadicea House slowly, allowing the airline’s other Heathrow data centre, at Comet House, to take up some of the slack.

But, on Saturday morning, just minutes after the UPS went down, power was resumed in what one source described as ‘uncontrolled fashion.’ ‘It should have been gradual,’ the source went on.”

This failure to restore power correctly after two unplanned outages reportedly caused catastrophic damage to other parts of BA’s infrastructure, adding significantly to the amount of work the airline’s IT staff had to carry out to resume normal services.

The lessons for colocation buyers

So what should colocation-reliant firms take away from the bank holiday BA outage? At this stage, there are a few observations we can make.

Firstly, the incident drives home the message that redundancy in the data centre is a complex and mission-critical area, and one that you should ignore or brush over at your peril.

Many colocation buyers who aren’t data centre professionals themselves have a rather unsophisticated view of redundancy – they check for labels like N+1 or N+N (or Tier 3 or Tier 4) and feel this tells them enough about the facility to be confident their infrastructure is protected against outages. In reality, failover systems can fail, and it takes a more sophisticated understanding of resilience to be really sure your redundancy measures are sufficient and risk-appropriate.

(To find out more, see our previous blog: 5 questions to ask your colocation provider about resilience.)
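To make the redundancy labels concrete, here is a minimal Python sketch of what N+1 and N+N actually promise. The function name and the unit count are illustrative assumptions, not figures from BA or any particular facility. The sketch only counts independent unit failures; it says nothing about how failover is triggered or how power is restored afterwards, which is exactly the part the labels leave out.

```python
# Illustrative only: `survives` and the unit counts below are assumptions
# made for this example, not data from any real facility.

def survives(units_installed: int, units_required: int, units_failed: int) -> bool:
    """Return True if the units still working can carry the full load."""
    return units_installed - units_failed >= units_required


N = 4  # hypothetical number of UPS units needed to carry the load

n_plus_1 = N + 1  # N+1: the required units plus one spare
n_plus_n = N + N  # N+N: a complete duplicate set

for failed in range(3):
    print(
        f"{failed} failed:",
        "N+1 ok" if survives(n_plus_1, N, failed) else "N+1 down",
        "|",
        "N+N ok" if survives(n_plus_n, N, failed) else "N+N down",
    )
```

In this simplified model an N+1 design tolerates one failed unit and an N+N design tolerates N of them, but neither figure accounts for a shared fault or a badly handled restart affecting primary and backup systems together.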

Secondly, the BA outage highlights the importance of testing your business continuity plan. Obviously, we don’t know how often or how comprehensively the airline tested the process of restoring power after an outage, but the Telegraph’s version of events suggests this was the point at which the issue became a catastrophe. Even with the best redundancy measures in the world, a business continuity plan is of little use if it hasn’t been rehearsed and isn’t followed to the letter.

Finally, a lot has already been said about BA’s decision last year to outsource hundreds of IT jobs to overseas specialists.

Again, it’s unclear whether this was a contributing factor in the outage – the airline insists it wasn’t – and it’s not uncommon for firms to reduce their IT spend through outsourcing. Nonetheless, there’s much to be said for the value of internal and external experts who have years of experience in the complex infrastructure and systems on which your business depends – particularly in a crisis on the scale of this one.

Find out more about data centre resilience in our buyer's guide.

Topics: redundancy, colocation, resilience, data centre