Most people will agree that the world has been getting a lot more complex. A major element in that rising complexity has been the huge amount of machine logic that we humans have been adding to the world. Both that logic itself and what it enables (think globalisation of trade and communication) have made most of our lives more complex and complicated in one way or another. And while it has brought us much, it also has a serious number of unwanted side effects.
We run into the boundaries of our ability to handle that complexity on a daily basis, be it in the large IT projects that invariably run late, cost too much, or even fail outright, or in how we must try to stay secure in a digital world full of brittle logic and human weaknesses.
[Note 1: This article is meant to be understandable (with a bit of effort) by non-specialists. There is quite a bit of jargon in it, but I try my best to explain all of it, including a table below with extensive explanations of a number of key terms. If I don’t explain a term, it is safe to ignore it if you don’t know what it is (e.g. when I mention a ‘tomcat server’ as an example). Such a term is helpful for those who know it, but you don’t need to know what it is to understand the story.]
So, it isn’t a surprise that a constant drive in IT over the last decades has been to reduce complexity. ‘Reducing complexity’ sells. Managers in IT are especially sensitive to it, as complexity is generally their biggest headache. Hence, in IT, people are in a perennial fight to make the complexity bearable. One method that has been popular for decades is standardisation and rationalisation of the digital tools we use, a basic “let’s minimise the number of applications we use”. This was actually part 1 of this story: A tale of application rationalisation (not). That story from 2015 explains how many rationalisation efforts were partly lies. (And while we’re at it: enjoy this Dilbert cartoon that is referenced therein.) Most of the time, multiple applications were replaced by a single platform (in short: a platform is software that can run other software) and the applications had to be ‘rewritten’ to work ‘inside’ that platform. So you ended up with one extra platform, the same number of applications, and generally a few extra ways of ‘programming’ specific to that platform. That doesn’t mean it is all lies. The new platform is generally dedicated to a certain type of application, which makes programming these applications simpler. But the situation is not as simple as the platform vendors argue. As Frederick Brooks already told us in 1986: There Is No Silver Bullet.
Another drive has been to encapsulate (hide) complexity and access it through simpler interfaces. And a third has been to automate IT itself, creating complex ‘management IT’. All three play a role when we start to outsource IT to cloud services like Microsoft Azure or AWS.
[Note 2: This article has gotten out of hand. Totally. Quite long while I don’t digress a lot — as I often do. But exposing hidden complexity cannot be done without presenting it to you. And not understanding how complex the real IT world is leads to bad outcomes. When the software for supporting the Covid-19 vaccination campaign was a few weeks(!) late because testing wasn’t done yet, I read about a leading politician stating something like: “Come on, a bit of testing, how hard can it be?”. That is a cringeworthy display of not understanding how complex IT is. And that is partly why I write this. Because until our leaders actually start to understand this, they will create more and more disasters out of ignorance. Anyway, back to the story.]
Cloud services have generally been explained (sold) to us with a graphic like this:
For non-technical people, here is a basic explanation of terms used in the above figure and in the text.
On Premises: Using your own hardware.
Application: Software that you use, e.g. Microsoft Word.
Data: The data in your program, say your text in Word.
Runtime: Software that is required for other software to run, e.g. basic functionality for Java programs (‘Java runtime’). (Java is a programming language.)
Middleware: More complex software that is required for other software to run but may also have its own function. E.g. a database.
Operating System: The lowest layer of software that sits between the machine and all other software. Like Windows or macOS at home.
Virtualisation: A way to turn one very big real ‘machine’ (computer) into many virtual machines by arranging multiple operating systems to share the big machine. Sharing increases efficiency because not all operating systems are busy at the same time. It also has other advantages.
Server: The big ‘real’ machine. Like your computer at home but much bigger with multiple processors and lots of memory so it can be shared.
Storage: Separate machine that is optimised to provide storage (disks), can be used by multiple servers. At home people sometimes have this too in the form of a NAS. These are generally ‘appliances’, that is specialised hardware (in this case with a lot of disks) with specialised software to manage them.
Networking: Separate machine that enables data traffic between systems. Like your modem, router and Wifi Access Point at home. Again, an appliance with specialised hardware (in this case network interfaces such as wifi antennas and sockets for network cables) that has specialised software to manage these.
The suggestion is this: as we move away from our own IT ‘on-premises’ (which includes whatever co-hosting data center you use, in this context it just means you own your own IT hardware) to more and more in the cloud, we are outsourcing more and more, we are responsible for less, our life simplifies. More cloud is cheaper, simpler and more flexible. What is there not to like?
What there is not to like is that this suggestion is for a large part a lie. And a nasty one.
Take for instance networking. According to the graphic, as soon as you move to the cloud, it’s no longer your responsibility. But that is a lie, except for the rightmost option (SAAS, more about this later). If you set up your IAAS or PAAS in the public cloud, say Microsoft Azure, you have to manage quite a bit of networking. In fact, while Microsoft runs the underlying hardware, much of what has to be managed will be managed by you. You decide on segmenting, networking, VPNs (virtual private networks, a way to protect traffic between networks), routing, firewalls, etc.; you’re just using Azure tooling to set it up. It’s easier, but it’s far from all gone.
It is best explained with an example. Suppose you open up some of your cloud-based systems to access from the public internet? You can do that. And suppose you shouldn’t have, because these systems contain sensitive data? And suppose this data is stolen in a very public breach? Who is to blame? Microsoft for providing you with enough rope to hang yourself, or you? It is clear that if this is a big news story, the heading will not be “Microsoft was lax with its security and management”, but “Company X was lax with its security and management in the cloud”.
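To make that scenario concrete, here is a minimal sketch of the kind of check that remains your job once the cloud provider has handed you the rope. The `InboundRule` structure, the rule names and the port numbers are hypothetical illustrations of my own; this is not the rule format or API of Azure or of any other provider.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class InboundRule:
    """Hypothetical, simplified inbound firewall rule (not a real cloud API object)."""
    name: str
    source_cidr: str  # e.g. "10.1.2.0/24", or "0.0.0.0/0" for "anywhere"
    port: int
    allow: bool


def is_public_source(cidr: str) -> bool:
    """True when the CIDR covers the entire IPv4 or IPv6 address space."""
    return ip_network(cidr, strict=False).prefixlen == 0


def publicly_exposed(rules: list[InboundRule], sensitive_ports: set[int]) -> list[InboundRule]:
    """Return the allow-rules that open a sensitive port to the whole internet."""
    return [
        rule for rule in rules
        if rule.allow and rule.port in sensitive_ports and is_public_source(rule.source_cidr)
    ]


if __name__ == "__main__":
    # Hypothetical rule set: a public web port and a database port.
    rules = [
        InboundRule("web-from-anywhere", "0.0.0.0/0", 443, True),
        InboundRule("db-from-anywhere", "0.0.0.0/0", 5432, True),
        InboundRule("db-from-app-subnet", "10.1.2.0/24", 5432, True),
    ]
    for rule in publicly_exposed(rules, sensitive_ports={5432}):
        print(f"Review needed: {rule.name} opens port {rule.port} to the internet")
```

The provider will happily accept all three rules. Deciding that the second one is a mistake is a judgement about your data, and that judgement stays with you.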
So, the reality of the situation is more like this:
The hardware — the iron — is indeed something that is completely handled by a cloud provider. This includes things like connecting the server to networks, power, etc., and replacing disk drives, fans, etc. In a fully self-sourced setup this actually turns out to be a limited affair. Most of the work is not hardware these days, it is software. Companies that run their own on-premises data centers don’t have a lot of data center hardware operators. Take the networking engineers. They may lay a few cables to a switch, but after that they quickly move to the management console and manage the appliance through its management interface — in other words: software work. Networking engineers, storage engineers, and compute engineers alike, their main tool is not a screwdriver, their main tool is a keyboard. Only the basic servers for virtual machines have little in terms of configuration. Networking and storage are appliances, specialised hardware with specialised software. The cloud provider has an interface on top of these that gives you that ‘enough rope to hang yourself with’, i.e. much of this is actually set up and maintained by you. Microsoft doesn’t create or manage your firewall settings, it only offers you an interface to create a virtual firewall running on their appliances and manage that yourself. So it is a shared responsibility, and especially in networking: you do most of the work in much of the same way you would have to do when you were running your own appliances. Using a firewall in Azure is Microsoft spinning up a virtual appliance for you. And from that moment on, the work is all yours.
The only form of cloud where you really get rid of a lot of responsibility is Software-as-a-Service, or SAAS. That is because SAAS actually simplifies matters… …for the vendor. As explained in the EAPJ article Vertical Integration versus (horizontal) standardisation, the big advantage of SAAS is that the vendor of an application doesn’t need to support a myriad of technical landscapes out there, no myriad of different Linux versions and Java versions, nor their configurations (security baselines, anyone?), just a single stack they fully manage themselves. That brings a huge standardisation for the vendor, and the advantage of that can be sold (in part) to the customer. Your responsibility is generally limited to a bit of Application tinkering (maybe add plugins, do some configuration) and of course your content. (And even the almost total outsourcing of SAAS is a little lie, as it leaves out whatever you need to do to make your data in that application available to other applications, which often includes some work in the lower layers. As you can see, some complexity is even hidden by me here.)
The interesting services are thus IAAS and PAAS.
As you can see, with Infrastructure-as-a-Service, or IAAS, there is actually not much you get rid of. Except for hardware and virtualisation, you need to do the work for everything else. This is why so-called “lift and shift” operations (move your infra as-is to the cloud) have mostly failed. There was little gain in the work you had to do (which is most of your cost), and cloud resources that are in constant use are much more expensive than your own, at least if you have a bit of scale yourself. So, the simplistic over-enthusiasm of Cloud-First policies quickly gave way to huge headaches. Especially financial ones.
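The financial headache is simple arithmetic. The sketch below compares a pay-as-you-go price with the fixed monthly cost of an equivalent machine you own. All numbers are hypothetical, chosen only to make the sums easy to follow; real prices differ per provider, per machine type and per organisation.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month


def pay_as_you_go_cost(hourly_rate: float, hours_running: float) -> float:
    """Monthly cloud bill for one machine that runs the given number of hours."""
    return hourly_rate * hours_running


def break_even_hours(hourly_rate: float, owned_monthly_cost: float) -> float:
    """Hours per month above which the cloud machine costs more than the owned one."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return owned_monthly_cost / hourly_rate


if __name__ == "__main__":
    hourly_rate = 0.25          # hypothetical cloud price per hour
    owned_monthly_cost = 100.0  # hypothetical all-in monthly cost of an owned equivalent

    always_on = pay_as_you_go_cost(hourly_rate, HOURS_PER_MONTH)
    office_hours_only = pay_as_you_go_cost(hourly_rate, 200)

    print(f"Owned, fixed:             {owned_monthly_cost:.2f}")
    print(f"Cloud, always on:         {always_on:.2f}")
    print(f"Cloud, 200 hours a month: {office_hours_only:.2f}")
    print(f"Break-even at {break_even_hours(hourly_rate, owned_monthly_cost):.0f} hours a month")
```

With these assumed numbers, a lifted-and-shifted machine that runs around the clock costs almost twice as much as the owned one. The cloud price only wins if the workload can actually be switched off for most of the month, and that is exactly what an as-is migration does not give you.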
Platform-as-a-Service, or PAAS, does bring a lot of advantages. A platform is software that is required for other software to run. The cloud platforms are either completely (OS) or largely (others, like ‘application servers’ such as tomcat) managed and there is much less that you have to do. They are patched and life cycle management is done on them. They have basic configurations with some ways for you to tweak details. Still some work, but far less than when you would install the platform itself.
There is also an important misleading factor in IAAS that is not in the picture. If you do IAAS in the cloud, you do not end up with just an empty virtual machine (VM); you end up with a VM with an Operating System (OS) on top of it. When setting up the VM, you have to tell the cloud provider which ‘image’ (a file) to put on the virtual machine. Using that image means you install an OS on the VM. You may provide your own image, but the cloud provider also provides a few. This gives the illusion that a running OS is part of IAAS, but it isn’t. This becomes clear when you look at things like licenses (you have to provide your own) and above all maintenance/operations. You are fully responsible for the OS yourself: for its life cycle management, security patching, logging, monitoring, identity and access management, and configuration in general. Basically, the OS image that they provide just hides that you are providing the OS yourself (by copying theirs). And it also makes people forget that the initial install is just a minimal fraction of what it means to use an OS.
A more realistic picture of what is happening might be this one:
This picture shows more of the things you are responsible for when you use IT (and who doesn’t, these days?). Starting from the bottom up:
|Hardware||The actual ‘iron’, the Real Machines (as opposed to the Virtual Machines below). These may be dedicated appliances, optimised for a certain task with specialised hardware and software (networking, storage), or they may be more generic computers, extremely souped-up versions of your PC at home. These contain the bits that make up the heart of IT: the execution of machine logic. E.g. CPUs (processors) and RAM (working memory).|
|Foundational Infra||The hardware is separated from these, the actual management of all the software is what the boxes stand for. It is good to mention here that all kinds of (hyper)converged solutions exist that offer combinations of compute, storage, and networking in one appliance.|
|Directory||Generally overlooked. But in large and complex infrastructure worlds you need a setup that maintains basic information that all the components can access with information about each other. Example: to store identities, groups of identities and credentials, and give these identities and groups access rights (after all, we want this to be secure, right?) we need some way to store and access that information.|
|Networking||The function to set up network addressing, routing, switching, firewalling, in other words everything you need to arrange that systems can actually copy data to each other. Specialised software that runs on specialised hardware, e.g. with sockets for network cables or antennas for wireless communication.|
|Storage||The function that provides ‘raw’ storage. E.g. if your system used a shared network drive, somewhere under that is a storage provider. But also that virtual machine uses this storage. Generally software that runs on specialised hardware (e.g. with lots of disks).|
|Compute (VM)||The function that provides generic Virtual Machines (VMs). The generic hardware mentioned above runs ‘virtualisation software’ or ‘hypervisors’ (a term IBM coined already in the 1960’s, because the Operating System (below) is the ‘supervisor’). A machine (real or virtual) requires an Operating System (like your Windows or macOS at home) to do something.|
|Platforms||Software that can run other software. In reality a theoretically endless stack and sometimes even a web (see below).|
|Operating System (OS)||The lowest platform level. It is deployed on a (virtual or real) machine. In the cloud, it is by definition deployed on a virtual machine. A VM with just an OS and nothing else is the ‘thinnest’ version of PAAS. If you use PAAS (or SAAS of course), the OS is fully managed by the PAAS provider. As these are very complex systems, they are hard to operate well and keep in good order. E.g. to keep such systems secure and running requires some attention. No machine can run without an OS, the software that ‘operates’ the machine.|
|Runtime||A collection of machine logic functions that other systems can use (and thus do not need to provide themselves). .Net, Java, and Objective-C on macOS are computer languages whose execution requires a runtime. Runtimes in this definition have no independent existence (they only run when the application is run), whereas middleware generally runs even without any other system deployed on it.|
|Operations||Also: ‘Run’. An aspect mostly ignored by all, except those that provide it, and all others when they suddenly find out they need it. Because all that IT requires some sort of operation. For instance noticing and solving problems. Because what IT doesn’t have them? So, there is Logging, Monitoring, Event Management, Incident Management, etc. Example: suppose that PAAS service has a problem and it doesn’t run. Somebody should find out and do something about it, and not after the web shop has been down for 10 hours. Also: disaster management (backup, recovery) or ‘continuity’. Requires entire stacks (applications, platforms) itself.|
|Applications||The functionality most people only think of. Either bought or built, it is the machine logic that makes use of everything below it to provide that function.|
|Development & Deployment||Also: ‘Change’. Applications need to be deployed on a platform before they can be used. If you build your own they are developed and deployed. Part of this is for instance testing and promoting from testing to production. Development may include all sorts of source code control and development environments. Requires entire stacks (applications, platforms) itself.|
|Identity & Access||It is essential that logic and data are ‘safe’, that is that confidentiality (only access by those that should have it), integrity (data is only changed if it should change), and availability (you don’t want the emergency services to be unavailable because some server crashed) are guaranteed. Managing identities and access is an important aspect of confidentiality and integrity. Often makes use of the Directory (above) for storage.|
|Content||Always and uniquely yours. Even in SAAS, what you have put as information into the system and get out of it, is why you use it in the first place. This is one aspect where SAAS is often slightly more complicated because getting SAAS systems integrated with all your other systems may involve complex setups by itself.|
A shorthand version of which would be:
Believing that moving to the cloud rids you of much work is only true for SAAS. In all other cases, such a belief is a form of Data Center Myopia and it is likewise based on ignoring a terrible lot of IT that has to do with making that single bit — the application — work. In a way, we have the mathematicians that started IT (and introduced the misleading concept of ‘non-functionals’) to thank for that.
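The division of work described in the text can be written down as a small lookup table. What follows is only my reading of the prose above, restricted to the layers for which the text assigns responsibility explicitly; it is not a reproduction of any of the figures, and the three labels (`YOU`, `PROVIDER`, `SHARED`) are a simplification introduced for this sketch.

```python
from enum import Enum


class Who(Enum):
    YOU = "you"
    PROVIDER = "provider"
    SHARED = "shared"  # the provider runs it, but you configure and maintain much of it


LAYERS = [
    "hardware", "virtualisation", "networking",
    "operating system", "platforms", "applications", "content",
]

RESPONSIBILITY = {
    "on-premises": {layer: Who.YOU for layer in LAYERS},
    "IAAS": {
        "hardware": Who.PROVIDER, "virtualisation": Who.PROVIDER,
        "networking": Who.SHARED,  # you set up segmenting, routing, firewalls
        "operating system": Who.YOU, "platforms": Who.YOU,
        "applications": Who.YOU, "content": Who.YOU,
    },
    "PAAS": {
        "hardware": Who.PROVIDER, "virtualisation": Who.PROVIDER,
        "networking": Who.SHARED,
        "operating system": Who.PROVIDER,
        "platforms": Who.SHARED,  # largely managed, but you still tweak details
        "applications": Who.YOU, "content": Who.YOU,
    },
    "SAAS": {
        "hardware": Who.PROVIDER, "virtualisation": Who.PROVIDER,
        "networking": Who.PROVIDER, "operating system": Who.PROVIDER,
        "platforms": Who.PROVIDER,
        "applications": Who.SHARED,  # a bit of configuration, maybe plugins
        "content": Who.YOU,
    },
}


def still_yours(model: str) -> list[str]:
    """Layers where you still have work to do under the given service model."""
    return [layer for layer in LAYERS if RESPONSIBILITY[model][layer] is not Who.PROVIDER]


if __name__ == "__main__":
    for model in RESPONSIBILITY:
        print(f"{model:12} {', '.join(still_yours(model))}")
```

Operations, Development & Deployment, Identity & Access and the Directory are deliberately left out of the table. They are the things a move to the cloud hardly touches, so listing them would only make each row longer.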
Aside: Antique layers and an important misunderstanding
What really bugs me, by the way, is these three layers (a more detailed explanation of these below):
This comes from a time when life was much simpler and even then it wasn’t true. It provides a simple but false way of understanding. In reality, the following patterns exist too:
In fact, theoretically we have an unlimited number of platforms running inside other platforms. In practice, performance penalties tend to limit the depth level. And not only that, some systems are in fact mixtures of Applications and Platforms, a situation I call Complex Application Stacks. Examples are SAS, Tibco, or many others. They consist of a mixture of platforms to deploy your own ‘code’ in, and applications to manage and use them. Did I already mention reality is much more complicated than the simplistic pictures suggest? So, the reality is more like this:
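The idea of platforms running inside platforms is easy to express as a recursive structure. The sketch below is an illustration only; the `Component` type and the example stack (an OS hosting a Java runtime, hosting a tomcat server, hosting a web application) are assumptions I picked to match the examples used in this article.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    """A piece of software that may itself host other software."""
    name: str
    kind: str  # "platform" or "application"
    hosted: list[Component] = field(default_factory=list)


def depth(component: Component) -> int:
    """Number of layers in the deepest chain that starts at this component."""
    return 1 + max((depth(child) for child in component.hosted), default=0)


def show(component: Component, indent: int = 0) -> None:
    """Print the stack as an indented tree."""
    print("  " * indent + f"{component.name} ({component.kind})")
    for child in component.hosted:
        show(child, indent + 1)


if __name__ == "__main__":
    stack = Component("operating system", "platform", [
        Component("Java runtime", "platform", [
            Component("tomcat server", "platform", [
                Component("web application", "application"),
            ]),
        ]),
        # A hypothetical 'complex application stack': a platform to deploy
        # your own code in, next to an application to manage and use it.
        Component("analytics suite", "platform", [
            Component("your own analysis code", "application"),
            Component("management console", "application"),
        ]),
    ])
    show(stack)
    print("deepest chain:", depth(stack))
```

Nothing in the structure limits the depth. As said above, it is performance, not theory, that keeps real stacks from nesting much further.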
So: Lies, Big Lies
So, that first graphic, popular in many board rooms is misleading. It lies. It lies by ignoring the reality that you remain responsible for much of what the graphic says you don’t. It lies by leaving out all the things moving to the cloud hardly touches. And as such it frames matters in an unrealistic way. And framing works. Take these two graphs, unrelated to this story:
Both display exactly the same thing. But on the left, by leaving a lot out on the y-scale, the growth seems far more dramatic than on the right. The original IAAS-PAAS-SAAS picture does the opposite: by leaving out all the stuff that remains the same, it suggests a simplification that isn’t really there.
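How strong that framing effect is can be put in numbers. The sketch below computes how much taller the last bar looks than the first one, depending on where the y-axis starts. The values 100 and 110 are hypothetical and are not taken from the two graphs.

```python
def apparent_ratio(first: float, last: float, axis_min: float) -> float:
    """Visual height of the last bar relative to the first, for a y-axis starting at axis_min."""
    if axis_min >= min(first, last):
        raise ValueError("axis_min must lie below both values")
    return (last - axis_min) / (first - axis_min)


if __name__ == "__main__":
    first, last = 100.0, 110.0  # hypothetical values: 10% real growth
    for axis_min in (0.0, 95.0):
        ratio = apparent_ratio(first, last, axis_min)
        print(f"y-axis from {axis_min:>4}: last bar looks {ratio:.1f}x the first")
```

Starting the axis at 95 instead of 0 turns a 10% increase into a bar that looks three times as tall. The cloud graphic plays the same trick in the other direction.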
All of this is a form of what many lies in IT have in common: the suggestion that simplification is easy. The reality is that complexity grows constantly in the information revolution. While there is a lot of encapsulation going on, people act as if encapsulation means that complexity is gone. Or rather, we tend to act as if what we can’t see isn’t there. And then we act surprised when things do not work perfectly (or at all).
Addendum: Cloud: Private, Public, Hybrid, Virtual Private…
Public Cloud. Private Cloud. Hybrid Cloud. Virtual Private Cloud. Reverse Cloud. Um… what? Here are some helpful definitions:
|Cloud||Ambiguous. Originally a term for the public internet as opposed to organisation-owned fixed communications (Wide Area Network or WAN). Is now used for both “cloud providers” and for “cloud pattern”. As “cloud providers” it stands for any organisation that offers IT services via the internet. The “cloud pattern” stands for ‘self service’ access to fully automated provisioning of IT-components and services (PAAS and IAAS). Because of the self-service, the underlying hardware capacity must exceed peak demand. Within these limits, there is elasticity (on-demand).|
|Private Cloud||The “cloud pattern” on hardware that you manage yourself. As you will not have unlimited hardware, there is generally a noticeable limit on the elasticity. If you run into that limit, the slow process of adding hardware must be performed before you can exceed the existing limits. Cost is fixed for your organisation (the capacity defines the cost).|
|Public Cloud||The “cloud pattern” on hardware owned by specialised cloud providers (e.g. Microsoft Azure, Amazon AWS). Because of their gigantic size, they have effectively unlimited elasticity for each client. Hardware is not dedicated to you (your IT runs on the same hardware as that used by other organisations), but the cloud provider guarantees a certain performance. Because they want to make a profit, they use pay-as-you-go models. Scale up your IAAS and PAAS, pay more per time unit. Scale down, pay less. If you are unable to scale down (e.g. if your software cannot run that way), Public Cloud is more expensive than owning your own hardware, except for very small organisations.|
|Hybrid Cloud||A — surprise — mix of various forms of cloud, mostly private and public. This will be the reality for most organisations except very small ones.|
|Virtual Private Cloud||Cloud services running on hardware that is owned by a cloud provider but that is dedicated for you. Amazon and Azure call this “dedicated hosts”.|
Addendum: IAAS+ (or ‘Managed OS’ — the useful lowest PAAS level that cloud providers cannot provide)
Repeating what was explained above, there is a confusion stemming from the technicalities of using IAAS in the cloud: people tend to think that IAAS means that you get an OS, e.g. a Windows or Linux machine from the cloud provider. But that is not the case, it only looks that way.
What you do get with IAAS is a virtual machine, more or less a time-share on underlying compute hardware (as well as the underlying networking, storage, etc. that is required for it to run; note again that these generally have to be set up, in the cloud still mostly by you, before you can actually create a virtual machine). Such a machine, just like at home, is useless without an OS. At home this is generally Windows or macOS. But the OS is not really part of the IAAS service you get. What you get is nothing more than an initial (licensed) install because without that OS image written on its storage you cannot use the machine. After that, you are on your own. It is not managed, maintained, monitored, patched, etc. All of that is your own responsibility. Compare that with using an application server (PAAS). That PAAS may be simply a tomcat server. But patches, updates, etc. of that tomcat server are done by the PAAS provider (as well as for anything under it such as the OS). Basically ‘as-a-service’ means that the service provider keeps the service in good order. So what you get with IAAS is a time share on underlying compute hardware, not an OS, even if the time share is initially seeded with an OS image (possibly your own). IAAS might better be called VHAAS (Virtual Hardware as a Service).
There is a good reason the cloud providers do not offer you a managed OS. Operating systems are complex beasts, and there are so many variables that it is impossible to automate their management well enough for it to be useful for all the different uses in the world. For instance, one company may have different security requirements than another. And providing an interface through which all these customers can make their own settings in a way that does not interfere with the provider’s own responsibility for the service is not feasible.
But having a managed OS is exactly what is useful for business application teams in many organisations. Infrastructure departments deliver managed OS for their IT colleagues who then can install and run platforms and applications on them, e.g. those systems that have been purchased from vendors. While those colleagues can install and maintain software, the IT department takes care of security patches, security baselines, updates, logging and monitoring, etc. All those “extras” people tend to forget about. And, in the classic world of non-self-service (the IT department delivers a new OS) this is generally what happens. It is not what you get with IAAS in the cloud.
So, where I work, we have what we call a DevOps-Ready Data Center (DRDC), which is in fact an on-premises PAAS provider. It provides all kinds of platforms as self-service to end users, such as that tomcat server, a database, or an IIS server, but in a managed way. These PAAS are fully maintained (patches, security baselines), all done as Infrastructure as Code (using, among others, Puppet, ServiceNow, Icinga2, ELK, GitLab). And not only are these machines managed, they are fully set up, so there is monitoring, logging, development tooling (on-premises: XLD/XLR), identity & access, continuity, and most importantly: change management (ITIL). So one of the PAAS levels we can actually deliver is just a managed OS. We call this IAAS+, but in fact it is the lowest level of PAAS you can do. In a picture:
The optimal solution for the DevOps teams is of course a ‘fatter’ form of PAAS, say a tomcat application server you can deploy Java applications on, or a Mule runtime you can deploy APIs on. The more that is automated for them, the better it is. But if a ‘fat’ PAAS is not an option, IAAS+ is a good second. The most time-costly aspect of maintaining the platforms is after all the OS. By providing a ‘Managed OS’ instead of a ‘Managed Hardware Device’ (like the cloud providers do with IAAS) you are providing something the cloud providers cannot. And of course, using ‘Outsourced IAAS’ as the foundational infra under ‘Managed OS’ is a logical next step. In other words: the capability of doing ‘Managed OS’ leads to the possibility to do ‘Managed OS’ or ‘IAAS+’ in the cloud after all. Nifty, right? But difficult, though (think alone about all the networking, storage, identity & access, and so forth that needs to be automated before you can do that). The value is great though: it enables the combination of agility and DevOps without either losing control or becoming hugely inefficient maintaining control. Still, if we want both agility and control in this world of incredible amounts of connected machine logic, there is no other option. To fight the complexity of all that machine logic, there is little we can do but add even more machine logic. And yes, there is a limit there somewhere…
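What ‘adding more machine logic’ means for a managed OS can be hinted at with a tiny sketch of the desired-state idea that Infrastructure as Code tooling is generally built around: you declare what a machine should look like, and logic works out what has to change. This is plain Python written for illustration; it is not Puppet code and not how our DRDC is implemented, and the package names and version numbers are hypothetical.

```python
def plan_changes(desired: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Compare desired and actual package versions and list the actions needed."""
    actions = []
    for package, version in sorted(desired.items()):
        current = actual.get(package)
        if current is None:
            actions.append(f"install {package} {version}")
        elif current != version:
            actions.append(f"update {package} {current} -> {version}")
    return actions


def apply_changes(desired: dict[str, str], actual: dict[str, str]) -> dict[str, str]:
    """Simulate applying the plan and return the resulting machine state."""
    result = dict(actual)
    result.update(desired)
    return result


if __name__ == "__main__":
    # Hypothetical security baseline and hypothetical current state of one machine.
    baseline = {"openssl": "3.0.13", "auditd": "3.1.2"}
    machine = {"openssl": "3.0.11", "tomcat": "10.1.18"}

    for action in plan_changes(baseline, machine):
        print(action)

    machine = apply_changes(baseline, machine)
    print("remaining actions:", plan_changes(baseline, machine))
```

Running the plan a second time yields no actions, which is the property that makes it possible to apply the same baseline to every machine, over and over, without a human looking at each one. The tomcat entry is left alone: the baseline only manages what it declares, so the application team keeps control over what they install on top of the managed OS.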
Thanks to Henk Dado and Michel Jongmans for their contributions to this via our discussions, and to MarkThePotato for proofreading.