I’ve been writing about the effects of the IT-revolution on organisations and society for a while now. It must have been about 7 years ago that I first seriously contemplated what would happen to an organisation if it lost all of its IT overnight. The scenario in question was an old nuclear reactor blowing up, Chernobyl-style, ending access to a twin-data-center setup. Such an event is not likely enough to warrant setting up an entire, very costly, third data center (or moving both to another location, where of course other disasters can happen), but it is a risk for which you at least want your backups to be in a third location at quite a distance. But there is more to continuity than having good backups.
TL;DR
The world is waking up to the systemic vulnerabilities of our massive use of interdependent, large, logical (IT) landscapes. These not only lead to inertia — change becomes harder and harder — but also to a brittleness of our organisations — and thus society — that we do not find acceptable. One such vulnerability is the No-IT scenario. Be it a Chernobyl-style near-meltdown of a nuclear reactor next to your highly available twin data centers, or an extremely effective and insidious ransomware attack after which you have to turn off your IT and then have to address the question: “how do we get up and running again?”, followed by: “within days, please, or we will go out of business.”
The #1 concern here is that you need to address prevention. Really. You need to make your landscape more resilient. Prevention is many times as effective as a ‘cure’. Think ‘Zero-Trust‘. But even then, invulnerability doesn’t really exist. So, you need to think about recovery from a (catastrophic) No-IT situation anyway. This requires fixing two issues in the following order:
• Out-of-Systems: you need to rebuild your infrastructure and application landscape;
• Out-of-Sync: you need to make sure that the data in your landscape is complete and, above all, in sync again. It will not do to have your sales system running two weeks ahead of your accounting system.
The first one is relatively doable: infrastructure as code, CI/CD pipelines: for many elements we have the technology. But the second one is a nasty, messy nightmare which potentially isn’t really solvable. It is not just that you cannot really do it; (probably) nobody can. The problem is too unpredictable, the situation too complicated. Preparing for this essential element of recovery may require you to think about your ‘Minimum Viable Organisation’, the landscape that comes with that, and, for each element, what scenarios there are for getting back to business as usual. None of these will be simple or easy, of course, but then again: what in real IT is?
In many organisations, this kind of disaster is now taken much more seriously. Not because of exploding nuclear reactors, but because of what organisations have seen as the result of ransomware attacks. The most extreme version of a successful ransomware attack produces the same kind of ‘armageddon’ as that nuclear disaster scenario: suddenly, your IT is gone. Entirely.
And these attacks are all too real. They happen. We hear and read about them. It may be a large harbour one week, or elements of health care in another. They are apparently much more real than this nuclear disaster. And this makes everyone, from regulators and insurers to the management of organisations, very alert. Granted, most attacks are still effectively limited in scope, as generally only a limited number of systems will become compromised. But even then, on average, recovery from a simple ransomware attack takes three weeks. Of course, there are situations where organisations have lost access to their data for years (more or less permanently), but that simply means the company did not have decent backups of that data (i.e. no backups, or the backups were ransomed as well).
You won’t see organisations talking about this publicly, and there is a good reason for that: when that nuclear reactor explodes, you’re not to blame. But when you suffer a successful ransomware attack, it is your security that failed. It makes you look bad, and openly discussing that you are vulnerable — even if everybody is — only reinforces that, because the world around you will be going: “is your IT that bad that you can be hacked?”. Trust is brittle too, and if it is lost, the damage can be huge. So, it’s very much ‘hush-hush’ everywhere.
Security professionals — and if they listen: management — know that you simply cannot rule out a breach: a previously unknown vulnerability can appear out of nowhere, and humans make errors, to name a few unavoidable elements. That is why security professionals have been promoting ‘zero trust’ architectures: architectures of landscapes that assume a breach and try to make the landscape more resilient against such a breach. Such landscape architectures focus on limiting access, e.g. via segmentation/compartmentalisation, strict identity and access management, etc. Turning existing — often performance-, cost-, and functionality-driven — landscapes into ‘zero-trust’ landscapes is pretty hard, by the way. A simple — “they are so fast these days, you won’t even notice” — firewall put in somewhere can create an unacceptable hundredfold drop in the performance of a critical function, simply because of the cumulative effect of that almost unnoticeable latency added to traffic. Such effects may not only slow something down, they may effectively kill it. Prevention really is a hard problem, because security is seldom a free lunch.
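To make that cumulative effect concrete with some purely invented numbers (a back-of-the-envelope sketch, not a measurement): a function that makes tens of thousands of small sequential calls barely notices 10 microseconds per call, but one extra millisecond per call from an inline firewall changes the picture completely.

```python
# Purely illustrative numbers (assumptions, not measurements):
# a batch function makes many small sequential calls over the network,
# and a new inline firewall adds ~1 ms of latency to each call.

calls = 50_000               # sequential calls in one run of the function
base_per_call_s = 0.00001    # 10 µs per call on the old, direct path
added_latency_s = 0.001      # 1 ms extra per call introduced by the firewall

before_s = calls * base_per_call_s                      # 0.5 seconds
after_s = calls * (base_per_call_s + added_latency_s)   # ~50.5 seconds

print(f"before: {before_s:.1f} s, after: {after_s:.1f} s, "
      f"slowdown: {after_s / before_s:.0f}x")           # roughly a hundredfold
```

Whether that matters depends entirely on the function: a nightly batch growing from half a second to almost a minute may be fine, but the same arithmetic applied to a chain that must complete within a second kills it.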
But that still leaves you to consider the — deemed not entirely avoidable — situation in which your measures failed (which can happen even if your landscape is ‘perfect’) and someday you wake up to a landscape that has been hijacked, the data encrypted by evil parties who demand ransom. What if you have obligations that give you very little leeway in postponing action? What if it is the 24th of the month and the next day you need to pay the salaries of thousands of families? What if you have massive financial obligations where risk management requires daily collateral movements, and not providing these within days will potentially bankrupt you in a jiffy?
This leads to the demand that regulators, customers, shareholders, and management are increasingly putting to the IT people: “Create a solution that will let us recover within a certain number of days from a complete No-IT scenario.”
The nasty, messy issue of recovery
Let’s assume a really worst case scenario. The attackers have been in your systems for an extended period, undetected. They have wreaked havoc and you find out in the late evening of December 31. The situation is dire: You do not know which systems have been compromised, so every system that is turned on can reinfect the rest unless you know it is clean. So, the first thing you have to do is turn everything off to make sure the infection doesn’t spread further.
Now you truly are in a No-IT situation. What is required to recover your organisation? In fact, there are two separate elements:
- You need your systems running again. That means all your infra, your platforms, your applications. The No-IT situation has left you in a situation of Out of Systems;
- When you have succeeded in getting your systems back up, you need those systems to have correct data, and above all that data must be in sync. Having your sales system with data from two weeks ago, your accounting system with data from today, and your Identity and Access Management with data from a week ago is not a workable situation. The No-IT situation has left you in a situation of Out of Sync.
Of these two, the former is relatively easy. With the advent of ‘infrastructure as code’ and ‘CI/CD pipelines’, your systems have become ‘data’ already, and as luck would have it, that data is pretty non-volatile compared to what you normally see as ‘data’. It is still easy enough to underestimate all the stuff that is not as easy to restore (e.g. what about those appliances over which you have little control? what about configuration, can you recover that as easily? what about all those containers you deploy when your own secure, scanned repository isn’t available?).
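As a minimal sketch of what ‘systems as data’ can mean for recovery (component names invented, and not a description of any particular product): if the landscape is described declaratively, including dependencies, a rebuild order can be derived mechanically, and the components for which no clean, offline build definition exists become painfully visible.

```python
# Sketch: derive a rebuild order from a declarative landscape description.
# Component names and the repository contents are invented for illustration.
from graphlib import TopologicalSorter

# component -> components it depends on (and which must be rebuilt first)
landscape = {
    "network":        [],
    "identity":       ["network"],
    "database":       ["network"],
    "app-platform":   ["network", "identity"],
    "accounting-app": ["app-platform", "database", "identity"],
    "sales-app":      ["app-platform", "database", "identity"],
}

# components for which a clean, offline build definition actually exists
in_offline_repo = {"network", "identity", "database", "app-platform", "accounting-app"}

order = list(TopologicalSorter(landscape).static_order())
print("rebuild order:", " -> ".join(order))

missing = [c for c in order if c not in in_offline_repo]
if missing:
    # appliances, hand-tuned configuration, third-party containers, ...
    print("no clean build definition for:", ", ".join(missing))
```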
While it is no simple matter, it is at least conceivable that you can mostly make this manageable. Hosting providers that specialise in availability are getting into this market too, though your local IT-professional will try to explain to you that continuity and availability are really two different things.
But the second element is truly nasty and messy. It has aspects of being potentially fundamentally unsolvable. Large complex landscapes — and that includes all the SAAS you use, by the way (that the cloud automatically solves most of this is a simplistic myth) — where everything is logically dependent on everything else can fall down if even one part is unavailable or corrupted.
It is a prime example of a point I have been making for a while now: the IT-revolution has increased our productivity several times over, but IT-landscapes are built on ‘unforgiving’ logic and as a result are brittle (and thus also very hard to change). The productivity gain from the IT-revolution is no free lunch; the price we pay is in part brittleness (of landscapes, and thus organisations, but also of societies), and with that comes a lack of agility, but also this not completely avoidable vulnerability.
Suppose you want to consider this complete IT-armageddon and you must be able to recover (and I can guarantee you that many of you fulfil such key functions in society — think banks, health care, electricity — that ignoring this is not an option). Then you need to put in quite a bit of work to get to a situation where recovery is possible.
The first thing to do remains, of course, to up the level of your prevention-resilience if you still need to: everything you can do to limit the extent of that breach that one day will happen. ‘Zero-trust’ really is a key architectural requirement for your landscape. I would put it at #1, whatever else you do. Prevention is much, much, much cheaper than a cure. (Note, this is true also for human-induced ecological disasters, and our track record here should give us quite a bit of pause regarding human intelligence, but I digress, as usual.)
Anyway, the issue we are discussing here is to make sure you can recover in case it does go amiss. For that, it is probably best to imagine some sort of ‘Minimum Viable Organisation’ (MVO) that can be up in days and keep running for a longer period (months, maybe even more than a year). Maybe your MVO can do without some systems and temporarily use extra human work, for instance.
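A minimal sketch of what such an MVO exercise could produce (names, numbers, and fallbacks all invented for illustration): a small catalogue that states, per element, how long the organisation survives without it, what the temporary workaround is, and where usable data could come from.

```python
# Hypothetical MVO catalogue; every entry is an invented example.
from dataclasses import dataclass

@dataclass
class MVOElement:
    name: str
    max_days_down: int       # how long the organisation survives without it
    fallback: str            # manual or external workaround in the meantime
    recovery_sources: list   # where usable data could come from

mvo = [
    MVOElement("identity & access", 1, "pre-arranged emergency credentials",
               ["offline backup"]),
    MVOElement("payroll", 3, "re-run last month's payment batch manually",
               ["offline backup", "bank statements"]),
    MVOElement("accounting", 10, "bookkeeping on paper/spreadsheets, reconcile later",
               ["offline backup", "bank records", "supplier and customer statements"]),
]

for e in sorted(mvo, key=lambda e: e.max_days_down):
    print(f"{e.name}: back within {e.max_days_down} day(s); fallback: {e.fallback}")
```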
And with that MVO architecture in mind, you first need to imagine how you can recover from Out-of-Systems.
Let’s assume you have done that.
Then comes getting your data back into order, i.e. recovering from Out-of-Sync. This is a much harder nut to crack. I would be surprised if — maybe with the exception of sectors like the military or the nuclear industry, which are used to thinking about vulnerability, recovery, and extreme scenarios — there is any ‘normal’ organisation in the world that is ready for this at the scale where starting up your ‘MVO’ is required.
I can imagine that for your MVO systems, you need ‘runbooks’ to recover from outdated backups (as more recent ones may be compromised) and other information. E.g. if you have to recover an Out-of-Sync accounting system, you might want to recover using information from your (potentially unreliable) internal systems, your bank records, your suppliers or customers (i.e. prepare to make deals with external parties to support each other in case of No-IT).
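A minimal, hypothetical sketch of one such runbook step (record formats and amounts invented): after restoring the accounting system from an older, clean backup, compare it with bank statement records to find transactions that are missing or that diverge, and queue those for re-entry.

```python
# Sketch of a reconciliation step after restoring from an outdated backup.
# The records and field names below are invented for illustration.

restored_ledger = [
    {"ref": "INV-1001", "amount": 1200.00},
    {"ref": "INV-1002", "amount": 540.50},
]

bank_statement = [
    {"ref": "INV-1001", "amount": 1200.00},
    {"ref": "INV-1002", "amount": 540.50},
    {"ref": "INV-1003", "amount": 980.00},   # booked after the backup was made
]

ledger_by_ref = {r["ref"]: r["amount"] for r in restored_ledger}

for record in bank_statement:
    known = ledger_by_ref.get(record["ref"])
    if known is None:
        print(f"missing in restored ledger, re-enter: {record['ref']} {record['amount']:.2f}")
    elif abs(known - record["amount"]) > 0.005:
        print(f"amount mismatch for {record['ref']}: ledger {known:.2f} vs bank {record['amount']:.2f}")
```

In reality the bank is only one of several external sources, and the matching will be far messier than a lookup on a reference number, which is exactly why this needs to be thought through before the event.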
The whole complicated situation is ripe for many simplistic — silver bullet type — suggestions. Like presenting a solution for only Out-of-Systems as something that solves all of No-IT (which it really doesn’t, not even close).
And especially, one potential ‘solution’ that merits some critical attention is the maximum use of SAAS (Software as a Service, where your application runs in the infrastructure of your vendor, with that infrastructure possibly in turn running in some public cloud landscape).
Superficially, moving as much as you can to SAAS seems a no-brainer. After all, it separates those systems from potential breaches in the rest of your landscape, and you do actually lose some complexity as an organisation. But have you considered the other way around: a No-IT scenario for one of your SAAS providers? If you move 9 key systems to SAAS, even if you do not have problems with latency etc., you have suddenly increased your attack surface from 1 stack to 10, each with a different set of vulnerabilities… And while this tenfold increase comes with a limited effect per breach, your entire operation really falls flat when that single key system — say your accounting system — is down. And recovery from your vendor’s No-IT might be a nightmare: you don’t have access to anything. Maybe ‘recovery from a vendor’s No-IT’ should be something to add to your selection criteria…
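Some purely illustrative arithmetic (the probabilities are invented, not data): with ten independent stacks that each carry a small yearly chance of a catastrophic breach, the chance that at least one of them hits you in a given year is already substantial, and if any single key system being down stops the whole operation, that combined number is the one that matters.

```python
# Illustrative only: assumed, invented probabilities.
n_stacks = 10          # your own landscape plus nine SAAS vendors
p_per_stack = 0.02     # assumed 2% yearly chance of a catastrophic 'No-IT' event each

p_at_least_one = 1 - (1 - p_per_stack) ** n_stacks
print(f"chance of at least one No-IT event per year: {p_at_least_one:.0%}")   # ~18%
```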
It is important not to forget that with all the technical separation we can do, we cannot separate logically. It is nice that your web shop doesn’t crash when your warehouse system is down, but it won’t be able to fulfil its function either.
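A tiny, invented sketch of that difference: the web shop survives the warehouse system being down (technical separation), but it still cannot do its job (no logical separation), because confirming an order needs stock information.

```python
def warehouse_stock(item):
    # stands in for a call to the warehouse system, which is currently down
    raise ConnectionError("warehouse system is down")

def place_order(item):
    try:
        stock = warehouse_stock(item)
    except ConnectionError:
        # technically separated: the web shop does not crash...
        return "sorry, we cannot confirm your order right now"
    return "order confirmed" if stock > 0 else "out of stock"

print(place_order("widget"))   # ...but logically it cannot fulfil its function
```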
Maybe it is good to warn especially about plans that are based on the idea that this is a problem that should not exist. Poor prevention is a problem you should address immediately and vigorously. But not having a good chance of getting back to business as usual within days after a catastrophic No-IT event is something (a) almost the entire world has, and (b) almost the entire world will not solve anytime soon. Our IT simply has never been set up with that in mind, nor is it, by definition, possible to set it up entirely that way. Assuming this problem can be solved completely and soon is going to backfire, as much as demanding a better-prepared Titanic when you are already speeding along in Arctic waters.
All of this is part of the phase of the IT revolution that is the ‘complexity crunch’. Forget ‘the singularity’ (really, it is simplistic nonsense). The IT revolution too has the characteristics of an S-curve. This No-IT stuff is part of that. And like vulnerabilities, it is unavoidable. I strongly suspect the best future performers will be those that understand this before anyone else and adapt their strategy. And if you don’t, to paraphrase Steve McConnell:
“As Thomas Hobbes observed in the 17th century, life under mob rule is solitary, poor, nasty, brutish and short. Life on a poorly run software project is solitary, poor, nasty, brutish, and hardly ever short enough.”
Steve McConnell, Software Project Survival Guide
The same is true for the life of good enterprise architects and IT strategists, at the scale of the organisation itself, if realism is not one of the organisation’s core strengths.
Oh, and did I mention that addressing all of this is going to be (a) unexpectedly expensive and (b) unexpectedly time-consuming? I didn’t, did I? 😎
Maybe we can become open about this and do something about it if we accept that we’re all in the same boat.