Lessons learned from the Facebook outage? Tech can break, and it will.
During and after the blackout at Facebook last week, a lot of people on social media, at IT companies and on LinkedIn gave their view on the matter. Amid the disdain and sincere compassion about the cause of the issues and the time required to fix them, there was a great deal of talk about BGP (and plenty of misrepresentation of what that networking protocol acronym actually means). Experts rushed to explain how the internet works and how Facebook should have built and managed its network better, while vendors fired up their marketing machines, claiming that their BGP implementation is better and that your service would never suffer this fate if you bought from them. Facebook shared in an official statement that a human error triggered the multi-hour blackout; because of its heavily automated network infrastructure, the mistake propagated rapidly throughout Facebook's global network. Similar problems have happened around the world, for example recently at Lumen (a big global ISP), and there are plenty of other examples just like it.
It really all shows one thing: technology can fail, and it will. Stuff will break, people will make mistakes, and processes will have loopholes that nobody thought about. Even when you have automated all of your systems with the correct approvals and audit trails, things will slip through. And because of that automation, a simple mistake that historically would have impacted only a single node or part of the network can turn into a massive outage very quickly.
Knowing that, what can you do about it? As a person or an enterprise?
Our tip would be to do a thought exercise: for each critical application, system or connection in your IT network, envision it suddenly breaking and taking hours or days to fix. Start by considering how failure domains are isolated in your network (can a single server, rack, office, datacenter site or country/continent fail without dragging everything else down?). Then make sure there are proper procedures in place for either out-of-band reachability (via 4G, separate connections, etc.) or on-site recovery (it's probably cheaper to have a 4G router in your rack than to have datacenter smart hands resolve your unreachable server in the middle of the night).
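To make that exercise a bit more concrete, here is a minimal sketch of how you could inventory critical systems, their failure domains and their out-of-band options. The system names, domains and vendors below are purely hypothetical examples, not a recommendation for any particular tooling.

```python
# Hypothetical inventory for the "what if it breaks for hours?" exercise.
# Each entry records which failure domain a system lives in, which vendors or
# carriers it depends on, and whether an out-of-band (OOB) recovery path
# exists, e.g. a 4G router in the rack.

from dataclasses import dataclass


@dataclass
class CriticalSystem:
    name: str
    failure_domain: str   # e.g. "rack-A3", "dc-ams-1", "region-eu"
    vendors: list         # carriers/vendors this system depends on
    out_of_band: bool     # reachable via 4G / separate connection?


inventory = [
    CriticalSystem("core-router-1", "dc-ams-1",  ["carrier-x"], out_of_band=True),
    CriticalSystem("erp-database",  "dc-ams-1",  ["vendor-y"],  out_of_band=False),
    CriticalSystem("vpn-gateway",   "office-hq", ["carrier-x"], out_of_band=False),
]

for system in inventory:
    risks = []
    if not system.out_of_band:
        risks.append("no out-of-band access: on-site hands needed if unreachable")
    if len(system.vendors) < 2:
        risks.append(f"single vendor/carrier dependency: {system.vendors[0]}")
    if risks:
        print(f"{system.name} ({system.failure_domain}):")
        for risk in risks:
            print(f"  - {risk}")

# Grouping entries by failure_domain also shows whether one rack or site
# failing would take down several critical systems at once.
```

Even a simple list like this quickly surfaces the single points of failure that deserve a closer look.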
Most importantly, try not to rely on a single partner, carrier, vendor or solution. If there is really no practical way to avoid that (there always is; it's a matter of cost), make sure you understand the impact on your business. Because that's the trade-off.
On the connectivity side (which is our thing), there are plenty of options to make yourself more resilient to big outages. It helps that we're neutral and that everything is managed By GNX People (BGP).