There will be failures – On systems that live through difficulties instead of turning them into a catastrophe

March 17, 2015

Our systems always depend on other systems and services and thus may and will be subject to failures: network glitches, dropped connections, load spikes, deadlocks, slow or crashed subsystems. We will explore how to create robust systems that can sustain blows from their users, interconnecting networks, and supposedly allied systems yet carry on as well as possible and recover quickly, instead of aggravating these difficulties and turning them into an extended outage and potentially substantial financial loss. In systems not designed for robustness, even a minor and transient failure tends to cause a chain reaction of failures, spreading destruction far and wide. Here you will learn how to avoid that with a few crucial yet simple stability patterns and the main antipatterns to be aware of. Based primarily on the book Release It! and Hystrix. (Presented at Iterate winter conference 2015; re-posted from blog.iterate.no.)

Often we speak about building the right thing and building it right. Today: building it to survive.

Motivation

There is a special moment in a developer's life. The moment one afternoon when a sales campaign has started or the streaming of a major sports event kicks off - and you watch your servers light up like torches, one by one. CPU usage is low, network, memory, disk are OK, the database is perfectly healthy - yet no requests are coming through. You are losing credibility, customers, and perhaps hundreds of thousands of kroner - and still have no idea what is wrong. Hours later you finally find out that a firewall has silently dropped long-lived connections to the database, causing a rare error condition that the code handled badly, never releasing the connections - leading to threads blocked forever waiting for a connection.

The way we write software is very susceptible to cascading failures like this - a small spark leads to a total collapse of the system, even though the spark itself has soon died out.

Everything will fail. How to stop the sparks of these failures from spreading and setting ablaze the whole system? How to write software that survives and recovers?

(Anti)patterns

This is my selection of the 5 main patterns and 5 main antipatterns from the Stability section of Michael Nygard's Release It!:

[Image: the stability patterns and antipatterns]

The red boxes are the antipatterns that help to create and spread stability problems. The green ellipses are patterns that help to contain and isolate the problems so that they do not spread.

Antipatterns: How failures spread and multiply

Cascading Failures
- = problems in a downstream system bring down an upstream system
- Talk core: contain a failure to prevent it from spreading, survive it, recover ASAP
- Ex.: webshop & in-stock? status -> inventory WS -> inventory IS -> DB with lock on a popular item => failure propagation
- A system is like a Norwegian wooden town; firewalls have to be intentionally included to prevent fire from spreading and burning it all
Integration Points
- = cascading failures spread through them <> firewalls; I.P. = whenever we call something: DB, WS, cache, ...; may & will fail
- Error responses (2nd best thing after success)
- Slow responses (due to TCP ACK retry, ...)
- Or the call never returns
- Unexpected data: too much (unbounded result set), rubbish; ex.: a DB query that normally returns 10 rows suddenly returns 10M => eternity to transform & crash due to running out of memory
- => be paranoid
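
The "unexpected data" case above can be guarded against by never reading more than a fixed number of rows. A minimal sketch in Python, assuming the result arrives as any iterable of rows (such as a database cursor); `fetch_bounded` and `TooManyRowsError` are hypothetical names made up for this illustration, not part of any library:

```python
from itertools import islice


class TooManyRowsError(Exception):
    """Raised when a result set exceeds the limit the caller can handle."""


def fetch_bounded(rows, max_rows):
    """Read at most max_rows rows from an iterable; refuse larger results.

    Only max_rows + 1 items are ever pulled from the source, so memory use
    stays bounded even if the source would yield millions of rows.
    """
    result = list(islice(rows, max_rows + 1))
    if len(result) > max_rows:
        raise TooManyRowsError(f"more than {max_rows} rows returned")
    return result
```

Whether to raise or to silently truncate is a design choice; raising makes the surprise visible instead of hiding it.
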
Blocked Threads
- = the tool of cascading failures; found close to I.P.; due to resource (f.ex. DB connection) pools / synchronization
- Low-level synchronization => you got it wrong, deadlocks / inconsistency => use higher-level constructs, libraries, languages
Chain Reactions
- = same instances behind a load balancer with the same issue manifested under high load (a leak / timing issue)
- When one fails, load increases and the others are all the more likely to fail
- => realistic stress testing, longevity testing
Slow Responses
- Slow is worse than a failure - consumes resources in caller & callee
- One slow call not a problem but many concurrent ones yes
- No reason to wait longer than user wait time / SLA

Patterns: The protective firewalls

Timeouts
- Timeout is your best friend; always apply when calling something
- (Often "infinity" by default; different timeouts might need to be set, e.g. JDBC: connection, query, ...)
- Consider retry - but delayed, with an increasing interval
- Protects from Blocked Threads, Slow Responses
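
The "retry, but delayed, with an increasing interval" advice above could look roughly like this. It is a sketch under assumptions: `call_with_retry` is a hypothetical helper, the wrapped operation is assumed to enforce the timeout itself and to raise `TimeoutError` when it expires, and doubling the delay is just one possible way to increase the interval.

```python
import time


def call_with_retry(operation, timeout, attempts=3, base_delay=0.5,
                    sleep=time.sleep):
    """Call operation(timeout) up to `attempts` times.

    The wait between attempts doubles each time:
    base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(attempts):
        try:
            return operation(timeout)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so that tests can replace the real delay. Keep the number of attempts small: every retry adds load to a dependency that is already struggling.
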
Circuit Breaker
- = similar to a fuse; wraps an Integration Point and monitors failures and timeouts; if there are too many in a period, it concludes the system is down and starts returning an error immediately to future invocations without calling it; but it lets a request through once in a while to check whether the system has recovered
- Prevents resource exhaustion due to a troubled dependency
- Use to protect yourself from a cascading failure
- Use to protect the callee by cutting off load when it is in trouble
- Consider a fallback solution
- This is the main protection against Cascading Failures, the main firewall around an I.P.
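
To make the mechanics above concrete, here is a deliberately simplified circuit breaker. It is a sketch, not how Hystrix or any other library implements it: the class and its parameters are hypothetical, it counts consecutive failures instead of failures within a time window, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a dependency that is considered down."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before tripping
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, operation, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail immediately without touching the dependency.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("dependency considered down")
            # Waited long enough: fall through and let one trial call pass.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Trip the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            if fallback is not None:
                return fallback()
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would keep one breaker per Integration Point and route every call through `call`, optionally passing a `fallback` that returns cached or default data.
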
Bulkheads
- = watertight compartments in a ship that save it from sinking when there is a hole
- = contain a failure through dedicated, separate resources => important, often applied in IT
- At different granularity: bind a thread to 1 CPU; use a limited thread pool (x exhausting all threads); HW redundancy; cluster sub-group
- Ex.: Airline IS, prevent problems in flight status check from breaking traveller check-in by giving dedicated app servers to each of them
- Ex.: separate request threads for admin requests, root user quota on Linux
- Contain a Chain Reaction, preserve partial functionality
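
At the "limited thread pool" granularity, a bulkhead can be approximated by capping the number of concurrent calls into each dependency. The following is a minimal sketch; `Bulkhead` and `BulkheadFullError` are hypothetical names, and rejecting instead of queueing is an assumption of this example.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slot left."""


class Bulkhead:
    """Caps the number of concurrent calls into one dependency."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Do not queue: if the compartment is full, reject immediately so
        # that callers are not left blocked waiting for a slot.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot")
        try:
            return operation()
        finally:
            self._slots.release()
```

In the airline example, flight status and check-in would each get their own `Bulkhead` instance, so a hanging flight status service can use up only its own slots.
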
Steady State
- = if a process accumulates a resource, another one must automatically recycle it
- Ex.: log files, cache, data in a DB
- Violation => Chain Reaction
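
For the cache example, steady state means the cache must not be able to grow without limit. One possible approach, shown here as a sketch with a hypothetical `BoundedCache` class, is to evict the least recently used entry whenever a size limit is exceeded:

```python
from collections import OrderedDict


class BoundedCache:
    """A cache that evicts its least recently used entry once it is full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently used

    def __len__(self):
        return len(self._entries)
```

The same idea applies to log files (rotation, for instance with the standard library's `logging.handlers.RotatingFileHandler`) and to data in a DB (scheduled purging or archiving).
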
Fail Fast
- = if you know you're going to fail, fail at once to save resources and protect self/the callee from an overload
- Ex.: Elementary user input validation prior to invoking an expensive call, checking dependency availability (are all C.B. open?)
- Ex.: Do not let more than max users from web to app servers (x latency & nobody served)
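
Both examples above can be combined in one small gate in front of an expensive call: validate the input cheaply first, then refuse the request if too many are already in flight. This is only an illustrative sketch; `AdmissionGate`, `RejectedError` and the `quantity` parameter are hypothetical.

```python
import threading


class RejectedError(Exception):
    """Raised when a request is turned away before any real work is done."""


class AdmissionGate:
    """Admits at most max_in_flight requests; the rest fail immediately."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle(self, quantity, expensive_call):
        # Cheap check first: reject obviously invalid input at once.
        if not isinstance(quantity, int) or quantity <= 0:
            raise RejectedError("quantity must be a positive integer")
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                raise RejectedError("server busy, try again later")
            self._in_flight += 1
        try:
            return expensive_call(quantity)
        finally:
            with self._lock:
                self._in_flight -= 1
```

A rejected request costs almost nothing, whereas an admitted request that the system cannot serve in time costs resources and still leaves the user without an answer.
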

Ref: Hystrix

Hystrix is a Java framework by Netflix for resilient distributed communication; uses thread pools (= Bulkheads) with Timeouts and Circuit Breakers (and optional caching and fallback data) + monitoring and instant reconfigurability. It is useful to read about what it does to get a more practical idea of how to apply these patterns.

Bonus topics

Applying the stability patterns is great but not really enough; you want to add good monitoring and notifications => discover/locate problems => help to recover (if it cannot recover automatically)
Graceful degradation: write your system so that it can function without non-core functionality (such as the in-stock check mentioned above)
Test Harness (another pattern from Release It!) - a fake service that can simulate all kinds of problems (accepting connections but never responding, returning rubbish data, ...); implementing the patterns isn't really finished until you test the result
Release It! has more (anti)patterns and covers other areas than stability
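
One of the misbehaviours a test harness should simulate, accepting connections but never responding, takes only a few lines to fake. The sketch below uses hypothetical names (`SilentServer`, `responds_within`) and plain TCP on localhost; a real harness would cover many more failure modes.

```python
import socket
import threading


class SilentServer:
    """Accepts TCP connections on localhost but never sends a byte back."""

    def __init__(self):
        self._listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick one
        self._listener.listen()
        self._listener.settimeout(0.1)  # lets the accept loop notice close()
        self.port = self._listener.getsockname()[1]
        self._connections = []
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._accept_loop, daemon=True)
        self._thread.start()

    def _accept_loop(self):
        while not self._stopped.is_set():
            try:
                connection, _ = self._listener.accept()
            except socket.timeout:
                continue  # nothing to accept yet; check the stop flag again
            self._connections.append(connection)  # keep it open, stay silent

    def close(self):
        self._stopped.set()
        self._thread.join()
        self._listener.close()
        for connection in self._connections:
            connection.close()


def responds_within(port, timeout):
    """Return True if the service on `port` answers within `timeout` seconds."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as client:
        client.sendall(b"GET / HTTP/1.0\r\n\r\n")
        try:
            return len(client.recv(1)) > 0
        except socket.timeout:
            return False
```

Pointing a client at `SilentServer().port` shows quickly whether its timeouts are really in place: a client without them would hang on the read forever.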

Conclusion

Be paranoid about both your callers and callees
Apply timeouts, circuit breaker, steady state, fail fast, ...
Learn what Hystrix does
At least browse through Release It!

Tags: DevOps