The Risks Of Big-Bang Deployments And Techniques For Step-wise Deployment

February 17, 2014

If you ever need to persuade management that it is better to deploy a larger change in multiple stages and roll it out to customers gradually, read on.

Deploying many changes at once is risky. We therefore want to deploy them in a way that minimizes the risk of harm to our customers and our companies. A deployment can be done either all at once (also known as big-bang) or gradually. We will argue here for the gradual ("stepwise") approach.

Big-bang or stepwise deployment?

A big-bang deployment seems to be the natural thing to do: the full solution is developed and tested and then replaces the current system at once. However, it has two crucial flaws.

First, it assumes that most defects can be discovered by testing. However, due to differences between the test and production environments, unknown dependencies, and the sheer scale of a typical larger system, there will always be problems that are not discovered until the production deployment, or even until the application has run in production for a while (which applies even to airplanes). The more parts have changed, the more of these production defects will surface at the same time. A gradual deployment makes it possible to discover and handle them one by one.

Second, the more complex the deployment, the higher the chance of human error, i.e. the deployment itself is a likely source of serious defects.

Some of the drawbacks of a big-bang deployment in more detail:

Complexity: A big-bang deployment requires the coordination of many people and "moving parts" that depend on each other, providing a huge opportunity for human error (i.e. there will be mistakes).
A lot of time: Such a deployment takes a lot of time (typically more than planned or expected) and thus means a lot of downtime during which users cannot use the system.
Hard troubleshooting: With a network of interdependent parts that all changed at the same time, perhaps along with the infrastructure (i.e. the connections between them), it is extremely hard to pinpoint the source of a defect. This considerably increases the time to detect and correct defects. It also increases the risk of people stepping on each other's toes and of "panic fixes" that either cause more problems than they remove or are not good enough (such as the rollback that sped up Knight's downfall).
Rollback: It is likely either impossible or as time-consuming and risky as the deployment itself, thus increasing the impact of defects and inviting even more human errors.
Impact: Deploying everything to all users at the same time means that everybody will be affected by any defect, error, or mistake.
Long freeze: Everything needs to be tested together after all development is finished, which takes a lot of time, during which the code is frozen and no fixes or changes can get into production for weeks.

Risk mitigation

The goal of a good deployment plan is to mitigate the risk of the deployment and get it to an acceptable level. There are two aspects to risk: the probability of a defect and the impact of the defect. The following table shows how the possible measures affect them:

Defect probability reduction	Defect impact reduction
testing	stepwise deployment
	gradual migration of users to the new version (e.g. 1 in 1000, or particular subsets)
	rollback mechanism
	=> these also lead to a much shorter time to detect and fix defects
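
A back-of-the-envelope sketch of the two aspects, assuming (my simplification) that the harm done by a defect is proportional to the number of users exposed to it. The user count and the defect probability are made-up numbers, purely for illustration:

```python
def expected_affected_users(defect_probability: float, exposed_users: int) -> float:
    """Risk as probability times impact: expected number of users hit by a defect."""
    return defect_probability * exposed_users


TOTAL_USERS = 100_000      # hypothetical user base
DEFECT_PROBABILITY = 0.2   # hypothetical chance that the release contains a serious defect

# Big-bang: everybody gets the new version at once.
big_bang = expected_affected_users(DEFECT_PROBABILITY, TOTAL_USERS)

# First step of a gradual migration: only 1 in 1000 users gets the new version.
first_step = expected_affected_users(DEFECT_PROBABILITY, TOTAL_USERS // 1000)

print(f"big-bang: {big_bang:.0f} users, first step: {first_step:.0f} users")
```

Testing lowers the first argument; stepwise deployment, gradual migration and a rollback mechanism lower the second.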

Practices for stepwise deployment

Enable stepwise deployment: Use parallel change and other Continuous Delivery techniques so that you can deploy updated components independently of each other, switch new features on and off, and switch which versions of their dependencies the components use. (Parallel change - keeping both the old and the new code and being able to use one or the other - is crucial here. Notice that parallel change also applies to data - you will need to evolve your data schema gradually and keep both the old and the new one side by side for a period of time.)

Enable rollback: The previous measure - stepwise deployment - also makes it easy (or at least easier) to roll back changes, either by switching to a previous version of a dependency or by switching back to the old code.
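
The two practices above can be sketched roughly as follows. This is a minimal illustration, not a recipe: the record fields, the function names and the toggle name "new-greeting" are all hypothetical, and a real system would keep its toggles in configuration rather than in a local set.

```python
def save_customer(record: dict, first_name: str, last_name: str) -> dict:
    """Write both schemas so that old and new code can read the same record."""
    updated = dict(record)
    updated["name"] = f"{first_name} {last_name}"  # old schema, still read by the old code
    updated["first_name"] = first_name             # new schema
    updated["last_name"] = last_name
    return updated


def greeting_old(record: dict) -> str:
    return f"Hello {record['name']}"


def greeting_new(record: dict) -> str:
    return f"Hello {record['first_name']}"


def greeting(record: dict, enabled_toggles: set) -> str:
    """Both implementations stay deployed; the toggle picks one at runtime."""
    if "new-greeting" in enabled_toggles:
        return greeting_new(record)
    return greeting_old(record)


customer = save_customer({}, "Ada", "Lovelace")
toggles = {"new-greeting"}        # switch the new code on
with_new_code = greeting(customer, toggles)
toggles.discard("new-greeting")   # rollback: switch back to the old code
with_old_code = greeting(customer, toggles)
```

Because the record carries both the old and the new fields, switching back needs no data migration; the old code simply reads the field it always read.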

Migrate users gradually to the new version: Expose the new version to only a small subset of users initially and grow that subset until everybody uses it. This can be done e.g. by deploying to only a subset of servers and sending a random or particular subset of users to the new servers, but there are also ways to do it if you have only a single machine. (See e.g. my post Webapp Blue-Green Deployment Without Breaking Sessions/With Fallback With HAProxy.)
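
One possible way to pick the subset, sketched under the assumption that every user has a stable identifier: hash the identifier into one of 1000 buckets and send the lowest buckets to the new version. The names `bucket_of` and `uses_new_version` and the "user-N" identifiers are invented for this example; this is not how the HAProxy setup mentioned above works.

```python
import hashlib

BUCKETS = 1000  # lets us express a rollout as "N in 1000"


def bucket_of(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..999."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def uses_new_version(user_id: str, rollout_per_mille: int) -> bool:
    """True if this user should be sent to the new version."""
    return bucket_of(user_id) < rollout_per_mille


users = [f"user-{n}" for n in range(10_000)]
for step in (1, 10, 100, 1000):
    exposed = sum(uses_new_version(u, step) for u in users)
    print(f"{step:>4} in 1000 -> {exposed} of {len(users)} users on the new version")
```

Hashing instead of rolling a die on every request keeps a given user on the same version between requests, and users who were exposed at a small step stay exposed as the step grows. Setting the step back to 0 sends everybody to the old version again.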

Monitoring: Make sure you are able to monitor the flow of users through the system and to detect any anomalies and errors early, long before angry calls from the business. Tools such as Logstash, Google Analytics (with custom events from JavaScript), and client-side error logging via one of the existing services or a custom solution are invaluable.
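
What "detecting anomalies early" could look like during a gradual migration, as a rough sketch: compare the error rate of the users on the new version with that of the users still on the old one. The class, the version labels "old" and "new" and all thresholds are assumptions made for this example; in practice the counts would come from your logs or monitoring tool rather than from a loop.

```python
from collections import Counter


class VersionErrorMonitor:
    """Counts requests and errors per version and flags a suspicious new version."""

    def __init__(self, max_ratio: float = 2.0, margin: float = 0.01, min_requests: int = 100):
        self.max_ratio = max_ratio        # tolerated multiple of the old error rate
        self.margin = margin              # absolute slack for when the old rate is near zero
        self.min_requests = min_requests  # do not judge the new version on too little traffic
        self.requests = Counter()
        self.errors = Counter()

    def record(self, version: str, ok: bool) -> None:
        self.requests[version] += 1
        if not ok:
            self.errors[version] += 1

    def error_rate(self, version: str) -> float:
        total = self.requests[version]
        return self.errors[version] / total if total else 0.0

    def new_version_unhealthy(self) -> bool:
        if self.requests["new"] < self.min_requests:
            return False
        threshold = self.error_rate("old") * self.max_ratio + self.margin
        return self.error_rate("new") > threshold


monitor = VersionErrorMonitor()
for n in range(1000):
    monitor.record("old", ok=(n % 100 != 0))  # 1% errors on the old version
for n in range(200):
    monitor.record("new", ok=(n % 10 != 0))   # 10% errors on the new version

if monitor.new_version_unhealthy():
    print("new version misbehaves: stop the rollout and switch back")
```

The old version serves as the baseline here, which is only possible because both versions run side by side; after a big-bang deployment there is nothing left to compare against.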

Making the right decision

Henrik Kniberg (Spotify, Lean from the Trenches) describes how the Swedish police decided, under the influence of its CIO and an Oracle/Siebel consultant and against the will of the IT department, to throw away the successful PUST project, which had been implemented in an agile way, and redo it from scratch on Siebel, i.e. a standardized platform, driven by the wishful thinking that this would lower operational and maintenance costs. They also decided against iterative development and in favor of a single big-bang deployment at the end. It was a disaster. Kniberg takes two main lessons from this fiasco:

Never make an important technical decision without involving those who will build the solution. (Hint: "involving" does not mean asking for input and then deciding however you want anyway.)
Work iteratively, in collaboration with the right users: deploy limited pilot versions to them early and improve the product continually based on their feedback.

Main point for me: When management decides between a gradual and a big-bang deployment, it should take the opinion of developers and ops people really seriously, not just as one of many inputs.

You might also enjoy other posts on effective development.

Tags: opinion DevOps