Posted by Matt Edgley on 11-Jul-2017 15:44:16

Why it’s time we stopped blaming human error for data centre downtime

Since we last commented on the British Airways outage back in May, the dust has started to clear and we’ve learned more – from both the airline and other sources – about the root causes of the IT failure. The verdict: the outage was the result of human error in the data centre.

On June 9th, however, we saw a conflicting view from Lee Kirby, president of the Uptime Institute. Speaking to Computer Weekly, he described human error as little more than a convenient way for data centres to brush incidents under the carpet, hiding any number of problems in the design and maintenance of their facilities:

“We have collected incident data and conducted root-cause analysis for more than 20 years … [and] one thing we have noticed is that human error is an overarching label that describes the outcomes of poor management decisions.”

There’s a lot of truth in this statement. Human error is an easy excuse to hide behind, and an easy way to paint the occasional outage as inevitable on the basis that “accidents happen”. In reality, as any expert will tell you, many data centre best practices have been developed and refined over the years precisely for this reason – to address the root causes of accidents, and to ensure their impact on performance and availability is as limited as possible.

(Recommended reading: The data centre services buyer's guide)

So perhaps Kirby is right – perhaps it is time we stopped blaming human error for data centre downtime, and focused more on the steps we can take to tackle the problem at the root. Here are three examples.

1. Staff selection, training and development

It should be no surprise that one of the most important steps you can take to avoid downtime as a result of human error is to be more rigorous in the way you hire, train and manage staff, as well as how you assess their performance and manage their professional development.

A stringent selection process from the get-go will filter out individuals who don’t fit the bill in terms of skills, competencies and attention to detail. What’s more, it’s unfair to blame employees if they aren’t given the skills they need to carry out their jobs, so investing in high-quality staff training is essential.

You should also think about the areas of your business where downtime presents the biggest risk, and prioritise those areas for investment in skills and training. Otherwise, you may find that issues like staff absences or departures leave mission-critical infrastructure in the lurch without a quick or simple fix.

Limiting the time that staff work on their own is also important – there’s less chance of them making mistakes while working in a team or alongside a colleague, and morale should be higher, too.

2. Policies and processes

Good day-to-day management of the data centre environment can prevent any number of errors that would otherwise result in downtime. Specifically, rigorous and well-documented policies and processes will help to avoid the gaps in visibility and accountability that allow accidents to happen.

Tasks shouldn’t be handed out to staff on an ad hoc basis, either. They should be assigned to individuals based on their skills, competencies and even qualifications, reducing the chances that high-risk work is handed out to inexperienced staff, or that responsibilities change hands too often for staff to understand them fully and carry them out to the right standard.

3. Technical resilience

Finally, one of your best lines of defence against human error – and unplanned outages in general – is a high level of technical resilience. By definition, concurrently maintainable infrastructure delivers continuity of services even when certain components in the data centre are shut down for maintenance – or, say, shut down as a result of human error. You may not think you need this level of redundancy for your day-to-day work, but what about when accidents are factored in?
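
To make the idea concrete, here is a minimal sketch in Python of the kind of check that sits behind that question: can the remaining units still carry the load if any single one is taken offline, whether for planned maintenance or by mistake? The component names, capacities and the `survives_single_outage` helper are hypothetical and purely illustrative, and the check is a simplification: real concurrent maintainability also covers distribution paths, cooling and more, not just capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str           # hypothetical label, for illustration only
    capacity_kw: float  # load this unit can carry on its own


def survives_single_outage(components, load_kw):
    """Return True if the load is still covered with any one unit offline."""
    if len(components) < 2:
        # A single unit has no backup, so any shutdown means downtime.
        return False
    total = sum(c.capacity_kw for c in components)
    # Remove each unit in turn and check that the rest can carry the load.
    return all(total - c.capacity_kw >= load_kw for c in components)


# Assumed example: three 500 kW UPS units.
ups_units = [
    Component("UPS-A", 500.0),
    Component("UPS-B", 500.0),
    Component("UPS-C", 500.0),
]

print(survives_single_outage(ups_units, 900.0))   # True: 1,000 kW remains
print(survives_single_outage(ups_units, 1100.0))  # False: 1,000 kW is not enough
```

The point of the sketch is that the same arithmetic applies whether a unit is switched off on purpose or by accident, which is why resilience decisions and human-error risk belong in the same conversation.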

It’s easy to assume human error is separate from other areas of downtime risk – again, because “accidents happen”. In reality, any decision you make about technical resilience should take the risk of human error into account – your business depends on it.

In the market for high-resilience data centre solutions? Click the link below to download a copy of our buyer’s guide.

FREE download: The data centre services buyer’s guide >
