Lessons learned from the Facebook outage? Tech can break, and it will.

During and after the blackout at Facebook last week, social media, IT companies, and people on LinkedIn all gave their views on the matter. Between disdain and sincere sympathy over the cause of the outage and the time it took to fix, there was a lot of talk about BGP (and plenty of misrepresentation of what that networking protocol acronym actually means).
Experts rushed to explain how the internet works and how Facebook should have built and managed its network better. Vendors fired up their marketing machines, claiming that their BGP implementation is better and that your service would never suffer this fate if you bought from them. Facebook shared in an official statement that human error triggered the multi-hour blackout: because its network infrastructure is heavily automated, the mistake propagated rapidly throughout Facebook's global network. Similar problems have happened around the world – recently at Lumen (a large global ISP), for example – and there are plenty of cases just like it.
It all really shows one thing – technology can fail, and it will.
Stuff will break, people will make mistakes, and processes can have loopholes that nobody thought about. Even when you have automated all of your systems with the correct approvals and audit trails, things will slip through. And because of that same automation, a simple mistake that historically would have impacted only a single node or part of the network can turn into a massive outage very quickly.
Knowing that, what can you do about it? As a person or an enterprise?
Our tip would be to do a thought exercise: for each critical application, system, or connection in your IT network, envision it suddenly breaking and taking hours or days to fix. Start by considering how failure domains are isolated in your network (what happens if a single server, rack, office, datacenter site, or even a country/continent goes down?). Then make sure there are proper procedures in place for either out-of-band reachability (via 4G, separate connections, etc.) or on-site recovery – it's probably cheaper to have a 4G router in your rack than to have datacenter smart hands resolve your unreachable server in the middle of the night.
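If it helps to make that exercise concrete, here is a minimal sketch in Python of how you might keep such an inventory and flag the weak spots. The system names, fields, and thresholds are entirely hypothetical – they just illustrate the kind of questions to ask of each critical system.

```python
from dataclasses import dataclass


@dataclass
class CriticalSystem:
    name: str
    failure_domains: list   # sites/racks/carriers this system depends on (hypothetical labels)
    out_of_band: bool       # reachable via 4G / separate connection if the primary path fails?
    recovery_hours: float   # realistic estimate of time to restore if it breaks right now


# Hypothetical inventory -- replace with your own systems and sites.
INVENTORY = [
    CriticalSystem("core-router", ["datacenter-A"], out_of_band=False, recovery_hours=8),
    CriticalSystem("erp-app", ["datacenter-A", "datacenter-B"], out_of_band=True, recovery_hours=2),
    CriticalSystem("office-uplink", ["carrier-X"], out_of_band=True, recovery_hours=6),
]


def review(inventory):
    """Print a simple risk report: single failure domains, missing out-of-band access, long recovery."""
    for system in inventory:
        risks = []
        if len(set(system.failure_domains)) < 2:
            risks.append("single failure domain: " + ", ".join(system.failure_domains))
        if not system.out_of_band:
            risks.append("no out-of-band access")
        if system.recovery_hours > 4:
            risks.append(f"long recovery estimate ({system.recovery_hours}h)")
        print(f"{system.name:15} -> {'; '.join(risks) if risks else 'ok'}")


if __name__ == "__main__":
    review(INVENTORY)
```

A spreadsheet works just as well; the point is the habit. Any line that reads "single failure domain" or "no out-of-band access" marks exactly the system where a Facebook-style mistake would lock you out of your own network.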
Most importantly, try not to rely on a single partner, carrier, vendor, or solution – and if there's really no practical way to avoid that (there always is; it's a matter of cost), make sure you understand the impact on your business. Because that's the trade-off.
On the connectivity side (which is our thing), there are plenty of options to make yourself more resilient to big outages – it helps that we're neutral and that everything is managed by GNX People (BGP).

We are the leading provider of global internet and private connectivity solutions, here to guide you on your next steps. Get in touch with our team to learn more.