Cloud storage undermines ‘fail-safe’ design of the internet

The internet was designed to not have any single point of failure.

Supposedly this was to ensure a communications network would survive a nuclear attack, a presumption that has since been cast into doubt. It’s certainly the case that down at the invisible-to-us level of the internet’s basic plumbing, its early engineers came up with an extraordinarily flexible structure that didn’t require the net’s data – be it a single email or all the information that produces a web page full of text, video and images – to be lumped together and sent down the same digital road.

Instead, information is split up into many tiny data “packets” and scattered down major routes and minor byways in milliseconds, rerouting around occasional roadblocks. It doesn’t matter if a few bits get lost along the way, because enough arrives to turn those intangible packets into a cat video or The Irish Times homepage.

But this week, a major internet outage demonstrated that even if this structural invincibility was a design principle, much has changed. Slowly, without most of us understanding the implications, a small set of virtual overlords has established structural choke points.

These are an unwanted adjunct of data moving inexorably into what is vaguely termed “the cloud” – not a fluffy data nirvana, but a growing empire of steel, concrete, cable and wires: data centres. Yes, those very same anonymous warehouse boxes Ireland has scattered along motorways and nestled in bland industrial estates across the State.

And as in all empires, power is concentrated in a few at the top. Despite the unimaginable vastness of the 21st century internet, most of it is now delivered to us from the cloud structures controlled by just a handful of companies, primarily Amazon via Amazon Web Services (AWS), Microsoft and Google, though there are others.

Configuration error

The outage this week was caused by a configuration error at Fastly, a company in a cloud subspeciality known as a CDN, a content delivery or distribution network. These cloud intermediaries reduce the time it takes web pages to load by hosting data across numerous data centres, rather than just one.

Fastly manages 10 per cent of global website traffic. Just two other CDN companies carry the majority of the rest – Cloudflare and Amazon’s CloudFront. That is, until they don’t, and web pages don’t load at all.

Many media companies rely on Fastly’s service, and for about an hour on Tuesday morning, numerous media websites failed, including those of The Irish Times, the New York Times, the Financial Times and the Guardian. But other sites were affected, too, including Amazon (ironically) and GitHub, a site where millions of software developers manage their code.

Years ago, when we all relied less on the internet, outages were more common. Sites might go down due to problems with a web-hosting company, or a network, or a glitch in a company’s own web-hosting servers, or a hacking attempt. Cloud services have long been touted as a solution to that lack of reliability, on the basis that they offer a more professional, concentrated hosting, security and management service.

That might have been true for a while. But cloud services have consolidated into a few powerful apex specialists, with further ranks of cloud service providers arrayed within their fiefdoms, in an opaque sector subject to little regulation. Much of what such companies claim to be able to do is taken on faith.

Risks

Increasingly, we are seeing the weaknesses and risks inherent in reducing the net’s diversity and creating huge potential points of failure and vulnerability. A network outage now doesn’t take down one or 1,000 websites; it can take down thousands, even millions, at enormous collective cost to the organisations affected and their service users.

We know the largest cloud services are fallible. Last year, an outage at AWS collapsed websites for many hours all down the US’s east coast, while an earlier failure at CDN Cloudflare took out much of the web.

While the internet overall still has redundancy – that ability to route traffic elsewhere when there are basic plumbing problems – the narrowly controlled, highly concentrated cloud-based infrastructure built on top of that plumbing has created single points of failure for much of the visible internet. Almost everything we interact with is beholden to the few.

Rather than mitigating risk, the web’s infrastructure now concentrates risk, and not just from the inadvertent glitch. Even if the glitches and networking errors have been fixed relatively swiftly, anyone can see the disasters waiting to happen.

In particular: what happens when one of these few points of cloud concentration goes down due to a cyberattack? When a single point of failure causes millions of global knock-on points of failure? When much of the net collapses (AWS alone carries about 40 per cent of the net)? When our cloud-stored data leaks?

The aftermath of the Health Service Executive cyberattack should make us all aware of just how catastrophic that would be – and should compel Ireland to push for answers to this global infrastructure problem from its current position on the United Nations Security Council.