As the IT world has been largely taken over by Agile methods, the concepts of Definition of Ready and Definition of Done have become mainstream. While these concepts were introduced at the story/sprint level in Scrum, they have taken on a wider role and are generally used at all levels these days, not just on stories, but also on features and epics, the larger items in the agile tree. This is not the first time in the IT revolution that concepts from software engineering have simply been moved to other areas, but I digress (as usual, but we can hope this gets the digressions out of the way…). There is, however, a new concept that we might use and that may be very helpful at those higher levels: a Definition of Broken.
TL;DR
When we create new things, they may go wrong in unexpected ways. There may be unintended consequences. It may even go horribly wrong, fail, etc. While we do try to prevent this when starting a new endeavour (e.g. enabling a new platform in our organisation), our focus is naturally on ‘what success will bring’. It is hard for humans to take a really good look at ‘what can go wrong’.
A little managerial gem called a pre-mortem helps by backcasting from the ‘broken’ end result to the choices of today. To help the pre-mortem process, we might work with an initial Definition of Broken, just as your standard increment works with a Definition of Done.
[Update 29/4, because I noticed an unclarity in a discussion on LinkedIn that followed: the Definition of Done is generally used as the definition of the result of an epic or feature (or, as we talked about it in the past, the result of a project). The Definition of Broken is meant to be used at the level of what we get if we use the result of that ‘done’ epic in our organisation for years. And from that Definition of Broken we can work back to things for our Definition of Done which might otherwise have been overlooked. So, you can reach your Definition of Done successfully, but the unintended long-term consequence can be something really bad. The pre-mortem is for trying to identify these consequences.]
A Scenario
Suppose you are developing a platform on which all kinds of agile teams in your organisation will develop their applications. The platform offers them self-service deployment of various components they can use inside an environment you offer. Now — just for the sake of argument — let’s assume these software engineers are not that focused on cybersecurity. It’s not their thing. For instance, let’s assume we are talking UX (User eXperience) specialists who love to code seamless experiences in React and who are highly focused on giving the user their best experience. And let’s also assume this is a landing zone where it is very hard to separate infrastructure deployment (that interesting stack of platforms) from application deployment. So, instead of offering them a perfectly controlled environment where you can actually narrowly manage what they can and cannot do (read: break), you are forced to move from a ‘no, unless explicitly allowed’ to a ‘yes, unless explicitly forbidden’ governance. For instance: you create deployment templates for them to use and adapt to their own needs, and you set a few guardrails to limit the things you have imagined beforehand that would be wrong.
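To make the ‘guardrail’ idea a little more concrete, here is a minimal, purely illustrative sketch of one such pre-imagined restriction in an AWS-style landing zone: a Service Control Policy that denies handing out public ACLs on S3 buckets and objects. The policy name and where you attach it are my own assumptions, not part of the scenario.

```sh
# Illustrative sketch only: one guardrail in a 'yes, unless explicitly forbidden' setup.
# Policy name and placement are hypothetical; the SCP denies granting public ACLs on S3.
aws organizations create-policy \
  --name "deny-public-s3-acls" \
  --type SERVICE_CONTROL_POLICY \
  --description "Guardrail: no public S3 ACLs" \
  --content '{
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "DenyPublicS3Acls",
      "Effect": "Deny",
      "Action": ["s3:PutBucketAcl", "s3:PutObjectAcl"],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-acl": ["public-read", "public-read-write"]
        }
      }
    }]
  }'
```

The point is not this specific rule, but that each guardrail only blocks a failure mode you have already imagined; everything you have not imagined remains allowed.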
No way that could go horribly wrong, right?
Actually, this can of course go horribly wrong. You clearly cannot prevent all the things that could go wrong (that list is endless), which is of course why ‘yes, unless explicitly forbidden’ is much less in control than ‘no, unless explicitly allowed’.
In the previous scenario, suppose it is some AWS-based setup and suppose you have allowed the app-builders all the freedoms that Amazon gives them. It is easy enough to imagine a scenario where they have opened up their environment to the internet and have so little security in place that at some point you find out your data has been leaked to and ransomed by some evil operator. The news will not say, “Amazon had a leak”. It will say, “Company X had a leak, because they had made errors in their AWS setup”. After all, you own what you can change.
Or, in other words, your nice design choices may have unintended consequences, and a ‘yes, unless explicitly forbidden’ has more ways to go wrong than you can ever prevent beforehand. So, are we helpless here? Not quite.
Enter the Pre-Mortem
A little gem of a managerial strategy, the pre-mortem, can help us here. Normally, people will try to answer the question ‘what can go wrong?’. But that question is very vulnerable to the effects of ‘happy flow’ focus, groupthink, and missed aspects. A pre-mortem turns the question on its head. Instead of asking ‘what can go wrong?’, it says: ‘Imagine it has gone wrong. What led to that state of affairs?’ It is like the backcasting element of strategy, but from a failure perspective, not a success perspective.
The pre-mortem idea is of course taken from the post-mortem, which generally is a root cause analysis performed after something has gone wrong. A post-mortem literally means: examine what has died to find the cause of death.
The outcome of that pre-mortem exercise is a set of scenarios. By spelling out those scenarios you can identify risks. Some of these scenarios will be extremely unlikely and you can then classify them as ‘acceptable risk’. Remember: working safely is consciously taking acceptable risk. There may be scenarios that are all too realistic and believable. Those you need to address now, before they blow up in your face later.
For instance, in the scenario above, the pre-mortem might contain this: “Our platform has been taken over by evil people who have stolen and ransomed our data”. And the pre-mortem contains a scenario where the ‘happy UX hackers’ have downloaded all sorts of nice plugins from the internet and used them in their application, one of which contained a trojan. That is clearly an unacceptable risk, and a design decision can then be added to not accept any code unless it has been scanned by a vulnerability scanner, or, if you do not trust technological fixes alone, to allow only stuff that is in your own repository, with a strict process (including vulnerability scanning) for getting stuff into that repository. And maybe, for good measure, the ‘happy UX hackers’ are all trained on vulnerabilities and required to grow up.
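To give an idea of what that design decision could look like in practice, here is a minimal sketch of a CI gate for such React teams: resolve packages only from an internal, vetted registry and fail the build on known serious vulnerabilities. The registry URL and the severity threshold are hypothetical assumptions of mine, not something from the scenario.

```sh
# Hypothetical CI gate: packages come only from an internal, vetted registry
# (URL is made up for illustration), and the build fails on high/critical advisories.
npm config set registry https://npm.internal.example.com/
npm ci                           # install exactly what the lockfile specifies
npm audit --audit-level=high     # exits non-zero on high/critical findings, failing the pipeline
```

Whether you scan in CI, gate the internal repository, or both, is exactly the kind of design decision the pre-mortem is meant to force into the open.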
A key element of the pre-mortem is your choice of ‘when’ in the future. This needs to be far enough in the future for longer scenarios to play out. I would suggest choosing a moment at which you consider your new platform to have been fully in use and everyone is comfortable with it. It has become part of the ‘furniture’ of your organisation. A period of 4-5 years into the future sounds right. In any case it will be way further in the future than your Definition of Done. Way, way further. Note that the lessons learned from a Definition of Broken (4-5 years in the future) may of course end up in your Definition of Done (at the end of the increment).
A Definition of Broken
The pre-mortem process starts with identifying the ‘broken’ situations you want to attempt to create scenarios for. These situations together make up your definition of ‘broken’. For instance, if we have 10 agile teams and they all use completely different techniques, our organisation becomes vulnerable in several ways. It makes it hard for something like Security Operations to manage it all, for one. But there is more than just security: using 10 different programming languages, 10 different sets of frameworks and runtimes and databases, etc. also makes it impossible to adapt quickly if one area needs more work at the expense of another. You have to add the ‘cost’ of retraining. When your organisation has 100 agile teams and uses 10 different languages, however, this training issue may become less of a problem.
From an organisation’s perspective (and let’s not forget: that is the only perspective that matters in the end), broken can mean much more than just a security breach. It can mean a loss of control and assurance, a loss of agility, excessive cost, a loss of reputation, and much more. So, when you start your pre-mortem, the first thing is to get a Definition of Broken, a description of “it all went pear-shaped and here is what we ended up with”.
Let’s sum up a few examples of what could go into that Definition of Broken. Some entries are of course quite obvious (security breaches, for instance), but some others may be unintended consequences that need attention. Here is a start:
It is 5 years from now. For each of the following situations in 5 years’ time, the question to answer is “how did this state of affairs come to pass?”. Our (starting) Definition of Broken is:
- (The big #1) Our data has been stolen and/or ransomwared through our new platform;
- We need to change the way teams use our platform. But we cannot because the teams are unable to adapt;
- The teams need a change in the platform, but we are unable to adapt;
- Cost for the organisation has spiralled out of control thanks to our platform;
- We can no longer give assurance of control to our regulators thanks to our platform;
- We built the platform, but in the end nobody started to use it and the innovation was scuttled;
- There has been a long, slow accumulation of debt and the platform has become a millstone around our collective necks.
Then, for all the scenarios, you either accept the risk if it is small enough (or too big to do anything about), or you mitigate it in your design decisions. The above items may of course lead to common causes. So much the better.
P.S.
I happen to like the division of strategy into 4 elements:
- Forecasting — the stuff that is going to happen to us;
- Backcasting — the future state we want;
- Uncertainty (scenario) planning — taking uncertainties into account;
- Strategic Agility — the fact that IT is slowly petrifying us because IT has a fundamental inertia.
You could perhaps say that a pre-mortem is an important enrichment of backcasting. And yes, that backcasting item (mostly including the first one) — without the pre-mortem — is what dominates orthodox Enterprise Architecture and IT Strategy (and that is one of the main problems of orthodox EA).
Image: Rembrandt – The Anatomy Lesson of Dr Nicolaes Tulp (post mortem, but what the heck, how do you illustrate a pre-mortem?)