Those puzzling over the cause of British Airways’ recent IT systems failure are missing the point. The real issue is why the airline took so long to get back up and running again.
It’s possible we’ll never know exactly what caused British Airways’ crippling IT outage, which disrupted more than 75,000 passengers around the world late last month.
But that’s a distraction from the real issue. Whether the cause was a power outage, disk failure, human error, out-of-date IT or scheduled maintenance that went awry, the fallout should never have been on such a scale. Outages happen; it’s the recovery plan that determines the impact.
So why did BA take so long to reboot its systems and resume normal service? Why wasn’t system recovery much more immediate?
We can only speculate, but the fault possibly lies in an inadequate disaster-recovery plan.
It could have happened to anyone
That’s not to suggest that BA doesn’t have all of the expected backup facilities and failover plans common to other airlines and major enterprises – mirroring systems across more than one data center, with the means to switch from one to the other in the event of a site-specific problem.
Indeed, these days most big companies have what they think are robust business-continuity plans. But some may have been lured into a false sense of security with regard to their backup systems.
Devastating events such as Hurricane Sandy in 2012, which took out data centers along the east coast of the US, were a stark reminder of the need to place distance between primary and secondary sites – maximum geographical diversity being preferable. Cloud-based facilities have enabled this for a reasonable cost because of the economies of scale associated with using someone else’s existing, and very flexible, infrastructure.
But these provisions alone do not guarantee a speedy recovery and a resumption of business as usual.
Modern airlines process terabytes, if not petabytes, of data every day – data which is changing continuously. Unless the data at the disaster-recovery/secondary-data center site is a pristine, real-time copy of that which exists on primary systems, timely system recovery cannot happen.
Backup systems need to be fit for purpose
Recovery isn’t ‘just’ a matter of switching over from one set of systems to another, all other conditions being equal.
Traditionally it hasn’t been possible to maintain two identical copies of live data in different places, which means that recovered systems have to roll back to the last time the two sites were synchronized. Unless the recovery-point objective (RPO), as defined within the company’s business-continuity plan, is zero – or as close to that as possible – some data will have been lost. The recovered system will not be up to date. In an airline context, that could mean a danger of double-booking seats or a failure to reflect the latest changes to flight schedules.
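The rollback problem can be sketched in miniature. The toy example below (an illustration only, not a description of any real airline system) uses periodic snapshot replication: the secondary site is refreshed at intervals, so any writes made after the last synchronization vanish on failover – that gap is the recovery-point objective.

```python
# Toy sketch of snapshot-based replication (hypothetical, for illustration).
# The secondary is only refreshed when sync() runs; writes after the last
# sync are lost on failover -- that window is the recovery-point objective.

primary = {}
secondary = {}

def sync():
    """Copy the primary's state to the secondary (a scheduled snapshot)."""
    secondary.clear()
    secondary.update(primary)

# Bookings arrive continuously...
primary["seat-12A"] = "booked"
sync()                              # last scheduled synchronization
primary["seat-12B"] = "booked"      # written after the sync

# Primary fails; failover rolls the system back to the last snapshot.
lost = [k for k in primary if k not in secondary]
print(lost)  # → ['seat-12B']: the booking made after the last sync is gone
```

Scale the same gap up to terabytes of continuously changing reservation data and the cost of even a short synchronization interval becomes clear.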
The longer the outage, the greater the impact. It used to be that 15 minutes of IT service loss could be tolerated fairly well; but in a Big Data world where huge volumes of data continue to multiply by the second, even just a few minutes of downtime has the potential to bring mighty organizations such as BA to a grinding halt.
That’s why airlines need to look seriously at active data replication as a means of keeping primary and secondary data centers continuously in sync – something we have made possible. Enable this and an IT outage should be undetectable.
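The principle behind active replication can also be shown in a toy model (again purely illustrative, not a description of any particular product): a write is acknowledged only once every site has committed it, so a surviving site always holds an exact copy and failover involves no rollback at all.

```python
# Toy sketch of synchronous ("active") replication -- hypothetical.
# A write succeeds only when every replica commits it, so the secondary
# is always an exact, up-to-the-moment copy of the primary.

class Replica:
    def __init__(self, name):
        self.name = name
        self.records = {}
        self.available = True

    def commit(self, key, value):
        if not self.available:
            raise IOError(f"{self.name} unreachable")
        self.records[key] = value

def replicated_write(replicas, key, value):
    # All replicas must commit before the client sees success.
    for r in replicas:
        r.commit(key, value)
    return "ack"

primary = Replica("primary-dc")
secondary = Replica("secondary-dc")

replicated_write([primary, secondary], "seat-12A", "booked")

# Primary goes down; the secondary already holds identical data,
# so service continues with no rollback and no lost bookings.
primary.available = False
assert secondary.records == {"seat-12A": "booked"}
```

The trade-off, of course, is that each write pays the latency of confirming at both sites – the price of guaranteeing that neither copy is ever stale.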
There’s a reason why aircraft have two parallel sets of all critical on-board equipment, which mirror each other exactly and could be used interchangeably at any moment. It’s so that the unthinkable does not happen.
As the world’s most vital services come to depend ever more heavily on IT, the same approach needs to be replicated on the ground.