From Dark Scrum to Broken SAFe — some real problems of Agile-at-scale. And a way out.

In 2016, Ron Jeffries — one of the founders of Extreme Programming, a precursor to modern Agile approaches — wrote an influential (in some circles at least) article titled Dark Scrum. In it, he dissects how Scrum (a form of Agile software development) can go horribly wrong, both for the organisation (which does not get the results it expects) and the poor slobs doing the work. Entirely in line with what Steve McConnell once said:

“As Thomas Hobbes observed in the 17th century, life under mob rule is solitary, poor, nasty, brutish and short. Life on a poorly run software project is solitary, poor, nasty, brutish, and hardly ever short enough.”

Steve McConnell, Software Project Survival Guide

Now, there is quite a bit of naïveté in the origins of Agile (such as the idea that teams will automatically do the right thing if you tell them the ‘why’, that all of them automatically come up with good architecture, or that YAGNI is always a good idea) but as far as I can see, Agile methodology has become essential to be able to keep changing in an ever more complex environment. Agile really works better. Waterfalls and top-down approaches are seldom feasible in the crazy complex worlds of massive volumes of IT we now live in, though they do sometimes have their use, especially when incomplete states are not an option (think: a new networking and network security implementation).

We are experiencing a tipping point in the information revolution because all the IT volume is starting to show ever more signs of inertia: the landscape becomes harder and harder to change. Adding is relatively easy, changing is extremely hard. So we have no choice. In the end, we do Agile because we have to, and we flexible humans start to organise ourselves around the demands of (inflexible) IT, in Agile organisational structures: product teams, value streams. From Conway’s Law to Inverse Conway’s Law.

And this is what many organisations are now doing. And it can easily lead to something we could call Broken SAFe (or Dark Scrum at scale).

Now, Ron pictured an extremely dark version of Scrum, one where ‘the power holders’ just use agile rituals as a way to do more, and more detailed, planning and to pile up the work for the developers, instead of letting the team decide how much can be taken on. His answer is: just double down and work in an agile fashion: deliver regular working releases, measure actual speed, let ‘the power holders’ use that to get a more realistic sense of possibilities, and just use the backlog (“It [is] best to have us work on the most important backlog items first. It [is] best to trim each item down to be as lean as possible.”).

There is of course a bit of a conflict between ‘power holders’ that are supposed to be happy above all with new features and fixed defects on the one hand, and the fact that sometimes work needs to be done that doesn’t bring any features at all (refactoring for better architecture for instance) on the other. A happy agile team needs a mature product owner. In that sense, nothing has changed from the upfront-design period. Oh, and while we’re at it: creating designs is still a thing. Agile teams are not like monkeys typing code that accidentally ends up being something with a recognisable ‘design’.

But even if, at the team level, the team plays ‘planning poker’, thinks of solutions themselves, and is taken seriously when they tell the owners what is required for a good product, even then, something akin to Dark Scrum may be happening at the organisation level. The fact that agile methodologies (which were invented to keep teams of programmers happy and productive) are now scaling up to organisations is a repeat of a pattern we have seen many times in the past decades. The fact that this scaling-up is often just theoretical (not yet a ‘practice’, let alone ‘best’) is a repeat of older architecture approaches that tried to scale up from the software world to the organisational world (yes, I am talking about you, ISO/IEC/IEEE 42010 and friends like TOGAF). These never really worked or produced the expected benefits.

Problems of Agile at scale

Agile/DevOps organisations are mostly organised around products, in product teams. These product teams have product owners, and the owner and the team together decide how to produce a good product. Two-week sprints and all that. But in complex landscapes, there are all kinds of dependencies between teams. If a new feature requires the availability of something from another team, it is a bit useless to create the possibility if the other team is not doing its part. Hence, in SAFe for instance, we organise quarterly PI-plannings (PI = Program Increment) where all teams and all owners of a value stream come together and coordinate what needs to be done in the next PI. Out of this comes a ‘PI-planning’, in fact an overall planning of a quarter year, generally 6 two-week sprints with some other work surrounding these, like retrospectives etc.

But reality bites: while agile-at-scale thinks in ‘value streams’ and assumes dependencies are handled inside a ‘value stream’, in practice there are dependencies between value streams as well, and sometimes even between business units. So, the PI-planning may be organised BU-wide and other business units might partake in the PI-planning as stakeholders as well.

One step up we find the Board of Directors, and they generally steer on the basis of strategic initiatives. There is always more to be done than what we have capacity for, so we see all kinds of mechanisms directed at prioritising. The prioritising idea itself is good, but as there is always more than can be done, the pressure from the top results in a lot of pressure in PI-planning to do as much as possible. You might ask: “What is wrong with that? It’s not that we should be slacking, right?” No, we shouldn’t, but there still is a problem. In fact, there are a couple.

Problem 1: If all these teams in their value streams in their business units are like a machine with many cogs, putting a lot of pressure on it is akin to pressuring the mechanism so hard that all the lubrication is squeezed out. The organisation is going to move less efficiently, anything unexpected has large effects, and there is a lot of friction and ‘pain’. The way to solve this is actually removing pressure (or growing resources), but this is the opposite of what the organisation-wide system naturally does: everybody fights for priorities and extra structural cost is often a taboo of sorts, as the organisation itself is always under external pressure to keep cost down. When the pressure is too high, though, efficiency diminishes — a bit of ‘penny wise, pound foolish’.

Problem 2: Both top-down tree-like structures (“epics, features, stories” and “business units, value streams, teams”) assume a form of reductionism that does not match reality. Our management and ‘change-subject’ structures are trees, but the actual dependencies are often webs. So the mechanisms are not a good fit for the dependencies they need to solve. Basically, this means that everywhere in the organisation there is a constant stream of unexpected demands. SAFe suggests that you need to keep room to manoeuvre in your PIs, which is wise, but it does not really extend that to the ‘current’ PI. So, in the here and now, the stream of unexpected demands plays havoc and produces a lot of friction and waste.

Problem 3: One of the basics of agile is a form of team autonomy. While this is partly naive, it is required in modern complex settings. But our tree-like, reductionist planning approaches in fact turn the formally ‘agile’ organisation into some sort of a continuous series of organisation-wide 3-month projects with little autonomy: our Agile Release Trains and Architectural Runways and Program Increments. While we wanted to become ‘agile’, we can end up with one huge organisation-wide ‘project’. Yes, a single project. After all, it has effectively a single planning (and in the case of SAFe is based on a framework that now comes in a book the size of a whopping 320(!) pages; see PPS below).

So Agile-at-scale can be seen to morph an organisation into one very big top-down steered project. Rather ironic, given that this state is about the opposite of what people who understand the essence of Agile (and why it is needed) want.

The teams ‘suffering’ in these circumstances pile pressure on top management to ‘do something about prioritising better’. Top-down planning fits common belief systems, so additional top-down efforts on prioritisation follow, which can make matters even worse. Agile teams demanding more top-down planning is of course also rather ironic.

Solution?

Agile methods try to optimise by measuring ‘speed’ and thus create a sort of feedback loop to be able to plan better, and that is good, but it doesn’t solve the problems (and can even make them worse by trying to optimise the hell out of the change capacity of the organisation). There is in my view a basic insight that should guide us:

If our planning is perfect, we should be as often early as we are late. If we aim to be never late 100% of the time, we are creating pressure which has negative effects on overall speed because of the organisational friction that comes with it. The same is true for ‘scope’ (with some exceptions, such as ‘hard requirements’) and ‘budget’.

Me. Now. The first part taken from early criticism on Standish Group reports about success rates of IT projects, see PS below. I just don’t remember where I picked it up. [Adapted]

Given the negative effect of too much pressure from top-down prioritisation schemes and the negative effect of the stream of the unexpected when this is done at scale, the answer is quite simple: create room (and not just on paper, but in reality). Here is a suggestion for a mechanism to do that:

Suppose your organisation has 4 layers in its agile setup:

  • Board of Directors
  • Business Unit Management
  • Value Stream Managers
  • Product Owners

We decide to give every layer (except the BoD, the buck has to stop somewhere…), say, 20% ‘off’. ‘Off’ meaning: no work is being planned here, but work is being done based on immediate need and if there is no need: additional improvement. Additional, because improvement is also part of your real planning. The number 20% is just for illustration of the effect. Now, this means that every higher layer will have less it can put pressure on:

Layer | Product Team Prioritises | Value Stream Prioritises | Business Unit Prioritises | BoD Prioritises
Product Team Produces | 20% | 80% | - | -
Value Stream Produces | - | 20% | 80% | -
Business Unit Produces | - | - | 20% | 80%
Example 80/20 Basic Prioritisation/Autonomy Approach

The effect is that, for instance, the Business Unit prioritises 80% of the Value Stream, which prioritises 80% of the Product Team, or, in other words: the Business Unit prioritises 80% of 80% of what the Product teams do. That is 64%. And the strategic initiatives prioritise 80% of 80% of 80% of what the Product Teams do, or 51% (gasp!). Which looks like this:

Layer | Product Team Prioritises | Value Stream Prioritises | Business Unit Prioritises | BoD Prioritises
Product Team Produces | 20% | 80% | 64% | 51%
Value Stream Produces | - | 20% | 80% | 64%
Business Unit Produces | - | - | 20% | 80%
Example 80/20 Agile Prioritisation/Autonomy Matrix
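
The cascade behind this matrix is just repeated multiplication, and a few lines of Python make it easy to play with. This is a minimal sketch, not part of any framework: the function name `cascaded_shares` and the layer list are made up for illustration, and the 80% share is the same illustrative number used above.

```python
from itertools import accumulate
from operator import mul


def cascaded_shares(shares):
    """Return how much of the Product Teams' work each higher layer prioritises.

    `shares` lists, from the bottom up, the fraction of a layer's work that
    is prioritised by the layer directly above it (hypothetical input format).
    """
    # Running product: 0.8, then 0.8 * 0.8, then 0.8 * 0.8 * 0.8, ...
    return list(accumulate(shares, mul))


# Illustrative layers and the 80% share from the text (20% is left 'off').
layers = ["Value Stream", "Business Unit", "BoD"]
shares = [0.8, 0.8, 0.8]

for layer, share in zip(layers, cascaded_shares(shares)):
    print(f"{layer} prioritises {share:.0%} of what the Product Teams do")
```

With three layers at 80% each this prints 80%, 64% and 51%, matching the first row of the matrix. Changing a single entry in `shares` shows how quickly the reach of the top layers shrinks or grows.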

Of course, it is possible to play with this. In some organisations with more independent Business Units, a BoD might only prioritise 30% of what a Business Unit does. Keeping the rest equal, we get:

Layer | Product Team Prioritises | Value Stream Prioritises | Business Unit Prioritises | BoD Prioritises
Product Team Produces | 20% | 80% | 64% | 19%
Value Stream Produces | - | 20% | 80% | 24%
Business Unit Produces | - | - | 70% | 30%
Example More Autonomous Business Units Prioritisation/Autonomy Matrix

Conclusion

In Agile-at-scale, SAFe or otherwise, we need to guard against the effect of too much pressure. This requires a ‘letting go’ attitude of higher levels, and really thinking about levels of ‘autonomy’. This is really hard. It flies in the face of all the responsibilities that are heaped upon those higher layers (really, these people often have difficult jobs to do). Telling a Board of Directors that they should not go beyond 50% of the change capacity of the organisation when deciding on strategic initiatives is truly a hard sell. Of course, the reality is that if they try to prioritise 100% of the change capacity, the results will be both late and less, the effect probably even worse than the (partly) ‘letting go’ suggested here. Inverse Conway’s Law requires more and more autonomy, but, again, it will be a hard sell.

Very hard.

Image (c) 2014 dogrando

PS. In 1994, the Standish Group reported that of all IT projects, 31% failed with nothing to show for it, with only 16% fully succeeding. The problem with their analysis was (and maybe is) that they defined a project as a success only when it was on time, within budget, and delivered all the planned results. If we are perfect estimators/planners, our estimates should in 50% of the cases be too negative, and in 50% be too positive. In that sense, we would expect that ‘successful projects’ would make up 12.5% of the total (50% for time, of 50% for budget, of 50% for delivered results) and the result was 16%, which actually is a good result. Of course, that 31% total failures should have been 12.5% as well, so the message was still valid. Standish has produced such reports roughly every two years after 1994.
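
The 12.5% figure can be checked with a few lines of Python. This is only a sketch of the reasoning in the PS: it assumes each of the three criteria is met with probability 1/2 and, importantly, that the three are independent of each other, which is a simplification. The names `P_MET` and `outcome_probability` are made up for this illustration.

```python
from fractions import Fraction

# Assumption: a perfect (unbiased) planner meets each criterion half the time,
# and the three criteria are independent of each other.
P_MET = Fraction(1, 2)
CRITERIA = ("on time", "within budget", "all planned results")


def outcome_probability(outcome, p=P_MET):
    """Probability of one specific met/missed combination across the criteria."""
    prob = Fraction(1)
    for met in outcome:
        prob *= p if met else 1 - p
    return prob


all_met = outcome_probability((True,) * len(CRITERIA))
none_met = outcome_probability((False,) * len(CRITERIA))

print(f"all criteria met:  {float(all_met):.1%}")   # 12.5%
print(f"no criterion met:  {float(none_met):.1%}")  # 12.5%
```

Under these assumptions both ‘fully successful’ and ‘missed on every count’ come out at 12.5%, which is the baseline the reported 16% and 31% are compared against above.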

PPS. SAFe 5.0 is a book of a whopping 320 pages. Not yet the even more gargantuan 542 pages the orthodox TOGAF EA framework has grown into, but still. Basically, this too is a pattern of the IT revolution:

As IT governance and management frameworks are logically structured and reality is not, in an attempt to catch up with reality such frameworks tend to balloon, until they become unwieldy and not really executable.

Me. Now.

Watch out SAFe, it might be happening to you too…

5 comments

  1. How do you derive the assertion that “If our planning is perfect, 50% of the time we should be early and 50% of the time we should be late.”? Is this assuming a normal distribution, or probabilistic estimate that a perfect plan as Time=X would mean random chance would result in being under and over X 50% of the time?

    1. I did not derive it, it came from some scientific publications I read about the Standish methodology. But yes, it assumes we cannot be perfect, so there must be a distribution. Planning is always an estimate (we’re in the real world, not in a perfect logical abstraction) and both underestimating and overestimating thus arise and the optimum lies in the middle.
