Continuous Delivery Digest: Ch.9 Testing Non-Functional Requirements

January 8, 2015

(Cross-posted from blog.iterate.no)

Digest of chapter 9 of the Continuous Delivery bible by Humble and Farley. See also the digest of ch 8: Automated Acceptance Testing.

(NFRs might be better called "cross-functional" requirements, as they too are crucial for functionality)

e.g. security, usability, maintainability, auditability, configurability, but especially capacity, throughput, performance
performance = time to process 1 transaction (tx); throughput = # tx per time period; capacity = max throughput we can handle under a given load while maintaining acceptable response times
NFRs determine the architecture, so we must define them early; everyone (ops, devs, testers, customer) should meet to estimate their impact (increasing any of them has a cost, and they often conflict with each other, e.g. security and performance)
as appropriate, you can create specific stories for NFRs (e.g. capacity; => easier to prioritize explicitly) and/or add them as requirements to feature stories
if poorly analyzed then they constrain thinking, lead to overdesign and inappropriate optimization
only ever optimize based on facts, i.e. realistic measurements (if you don't know it: developers are really terrible at guessing the source of performance problems)
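
To make the three definitions above concrete, here is a minimal Python sketch. The function names (summarize, capacity), the use of the mean as "time to process 1 tx", and the sample numbers are illustrative assumptions, not something taken from the book.

```python
from statistics import mean


def summarize(durations_s, window_s):
    """Performance and throughput for the transactions completed in one measuring window."""
    return {
        "performance_s": mean(durations_s),             # average time to process 1 tx
        "throughput_tps": len(durations_s) / window_s,  # tx per second
    }


def capacity(load_results, max_response_s):
    """Highest measured throughput whose response time is still acceptable.

    load_results: (throughput_tps, response_s) pairs measured at increasing load.
    """
    acceptable = [tps for tps, resp in load_results if resp <= max_response_s]
    return max(acceptable) if acceptable else 0.0


if __name__ == "__main__":
    print(summarize([0.1, 0.2, 0.3], window_s=2.0))
    # Hypothetical load-test steps: the last one is too slow to count.
    print(capacity([(100, 0.2), (200, 0.4), (300, 1.5)], max_response_s=0.5))
```

Note that capacity only makes sense together with an agreed response-time limit; without one, the highest throughput number alone says little.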

A strategy to address capacity problems:

Decide upon an architecture; beware process/network boundaries and I/O in general
Apply stability/capacity patterns and avoid antipatterns - see Release It!
Other than that, avoid premature optimization; prefer clear, simple code; never optimize without a proof it is necessary
Make sure your algorithms and data structures are suitable for your app (O(n) etc.)
Be extremely careful about threading (-> "blocked threads anti-pattern")
Create automated tests asserting the desired capacity; they will also guide you when fixing failures
Only profile to fix issues identified by tests
Use real-world capacity measures whenever possible - measure in your prod system (# users, patterns of behavior, data volumes, ...)
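
An automated test asserting the desired capacity could look roughly like the sketch below. The helper names (measure_throughput, assert_capacity), the min_tps parameter, and the injectable clock are assumptions made for illustration; a real test would drive the deployed application with realistic scenarios rather than a local callable.

```python
import time


def measure_throughput(operation, iterations, clock=time.perf_counter):
    """Run the operation repeatedly and return completed operations per second."""
    start = clock()
    for _ in range(iterations):
        operation()
    elapsed = clock() - start
    return iterations / elapsed if elapsed > 0 else float("inf")


def assert_capacity(operation, iterations, min_tps, clock=time.perf_counter):
    """Fail with a descriptive message if throughput drops below the agreed minimum."""
    tps = measure_throughput(operation, iterations, clock)
    assert tps >= min_tps, f"capacity regression: {tps:.1f} tx/s < {min_tps} tx/s"
    return tps


if __name__ == "__main__":
    # Hypothetical stand-in for one transaction against the system under test.
    def place_order():
        sum(range(1000))

    print(f"{assert_capacity(place_order, iterations=2000, min_tps=100):.0f} tx/s")
```

The clock is a parameter only so that the measuring logic itself can be checked deterministically with a fake clock.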

Measuring Capacity

There are different possible tests, e.g.:

Scalability testing - how do the response time of an individual request and the possible # of concurrent users change as we add more servers, services, or threads?
Longevity t. - see performance changes when running for a long time - detect memory leaks, stability problems
Throughput t. - #tx/messages/page hits per second
Load t. - capacity as a function of load, up to and beyond prod-like volumes; this is the most common
it's vital to use realistic scenarios; on the other hand, technical benchmark-style measurements (# reads/s from DB, ...) can sometimes be useful to guard against specific problems, to optimize specific areas, or to choose a technology
systems do many things so it's important to run different capacity tests in parallel; it's impossible to replicate prod traffic => use traffic analysis, experience, intuition to achieve as close a simulation as possible

How to Define Success or Failure

tip: collect measurements (absolute values, trends) during the testing and present them in a graphical form to gain insight into what happened
too strict limits lead to intermittent failures (e.g. when the network is overloaded by another operation) vs. too relaxed limits, which won't discover a partial drop in capacity =>
1. Aim for stable, reproducible results - isolate the test env as much as possible
2. Tune the pass threshold up once it passes at a minimum acceptable level; back down if it starts failing after a commit due to a well-understood and acceptable reason
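
One possible way to handle the threshold tuning described above is sketched here. Judging on the median of several runs and keeping 10% headroom below it are my assumptions, not the book's prescription; the function names are hypothetical as well.

```python
from statistics import median


def passes(samples_tps, threshold_tps):
    """Judge on the median of several runs so a single noisy run does not fail the build."""
    return median(samples_tps) >= threshold_tps


def ratchet(threshold_tps, samples_tps, headroom=0.10):
    """Propose a raised threshold: the median minus some headroom, never below the current one."""
    candidate = median(samples_tps) * (1 - headroom)
    return max(threshold_tps, candidate)


if __name__ == "__main__":
    runs = [90, 120, 118]  # hypothetical tx/s from three repetitions
    print(passes(runs, threshold_tps=100))
    print(ratchet(100, runs))
```

Backing the threshold down again stays a manual, reviewed decision, since it should only happen for a well-understood and acceptable reason.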

Capacity-Testing Environment

replicates Prod as much as possible; extrapolation from a different environment is highly speculative, unless based on good measurements. "Configuration changes tend to have nonlinear effect on capacity characteristics." p234
an exact replica of Prod is sometimes impossible or not sensible (small project, capacity of little importance, or when prod has 100s of servers) => capacity testing can be done on a subset of prod servers as a part of Canary Releasing, see p263
scaling is rarely linear, even if the app is designed for it; if the test env is a scaled-down prod, do a few scaling runs to measure the size effect
saving money on a downscaled test env is a false economy if capacity is critical; no matter what, it won't be able to find all issues and it will be expensive to fix them later - see the story on p236
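
The "size effect" from a few scaling runs can be summarized as a scaling efficiency, as in this small sketch. The function name and the sample measurements are made up for illustration; an efficiency of 1.0 means perfectly linear scaling relative to the first (baseline) run.

```python
def scaling_efficiency(runs):
    """Compare each run with what linear scaling from the baseline would predict.

    runs: (server_count, throughput_tps) pairs, baseline first.
    Returns {server_count: efficiency}, where 1.0 means perfectly linear.
    """
    base_servers, base_tps = runs[0]
    per_server_tps = base_tps / base_servers
    return {n: tps / (n * per_server_tps) for n, tps in runs}


if __name__ == "__main__":
    # Hypothetical measurements from 1, 2 and 4 servers.
    for servers, eff in scaling_efficiency([(1, 100), (2, 180), (4, 300)]).items():
        print(f"{servers} server(s): {eff:.0%} of linear")
```

A falling efficiency curve is a warning that extrapolating from the small environment to prod size will overestimate capacity.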

Automating Capacity Testing

it's expensive, but if capacity is important, it must be a part of the deployment pipeline
these tests are complex, fragile, easily broken with minor changes
Ideal tests: use real-world scenarios; predefine success threshold; relatively short duration to finish in a reasonable time; robust wrt. change to improve maintainability; composable into larger-scale scenarios so that we can simulate real-world patterns of use; repeatable and runnable sequentially or in parallel => suitable both for load and longevity testing
start with some existing (robust and realistic) acceptance tests and adapt them for capacity testing - add a success threshold and the ability to scale up

Goals:

Create realistic, prod-like load (in form and volume)
Test realistic but pathological real-life loading scenarios, i.e. not just the happy path; tip: identify the most expensive transactions and double/triple their ratio
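
The tip about doubling or tripling the ratio of the most expensive transactions can be expressed as a weighted transaction mix. This is only a sketch: the transaction names, weights, and function names are hypothetical.

```python
import random


def stress_mix(normal_mix, expensive, factor=2):
    """Multiply the weight of expensive transactions, then normalize to probabilities."""
    weights = {tx: w * (factor if tx in expensive else 1) for tx, w in normal_mix.items()}
    total = sum(weights.values())
    return {tx: w / total for tx, w in weights.items()}


def pick_transactions(mix, count, seed=0):
    """Draw a reproducible sequence of transaction names according to the mix."""
    rng = random.Random(seed)
    names = list(mix)
    return rng.choices(names, weights=[mix[n] for n in names], k=count)


if __name__ == "__main__":
    # Hypothetical prod ratio: 8 browse requests for every 2 checkouts.
    mix = stress_mix({"browse": 8, "checkout": 2}, expensive={"checkout"}, factor=3)
    print(mix)
    print(pick_transactions(mix, count=10))
```

A fixed seed keeps the generated scenario repeatable between runs, which helps when comparing results across commits.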

To scale up, you can record the communication generated by acceptance tests, postprocess it to scale up (multiply, insert unique data where necessary), and replay it at high volume

Question: where to record and play back?
1. UI - realistic but impractical for 10,000s of users (and expensive)
2. Service/public API (e.g. HTTP req.)
3. Lower-level API (such as a direct call to the service layer or DB)

Testing via UI

Not suitable for high-volume systems, when too many clients are necessary to generate a high load (partially due to UI client [browser] overhead); also expensive to run many machines
UI condenses a number of actions (clicks, selections) into few interactions with the back-end (e.g. 1 form submission), which has a more stable API. Question to answer: are we interested in the performance of the clients or of the back-end?
"[..] we generally prefer to avoid capacity testing through the UI." - unless the UI itself or the client-server interaction is of concern

Recording Interactions against a Service or Public API

run acceptance tests, record in/outputs (e.g. SOAP XML, HTTP), replace what must vary with placeholders (e.g. ${ORDER_ID}), create test data, merge the two
Recommended compromise: Aim to change as little as possible between instances of a test - less coupling between the test and test data, more flexible, less fragile. Ex.: unique orderId, customerId but same product, quantity.
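
The record, templatize, and merge steps above might be sketched like this in Python, whose string.Template happens to use the same ${NAME} placeholder syntax. The recorded XML payload, the ID formats, and the function name are invented for illustration; only the order and customer IDs vary, the product and quantity stay the same.

```python
from string import Template

# Hypothetical recorded request with the varying parts replaced by placeholders.
RECORDED = Template(
    '<order id="${ORDER_ID}" customer="${CUSTOMER_ID}">'
    '<item product="widget" quantity="1"/></order>'
)


def generate_requests(template, count, first_id=1):
    """Merge the template with generated test data, one unique request per instance."""
    for i in range(first_id, first_id + count):
        yield template.substitute(ORDER_ID=f"ord-{i}", CUSTOMER_ID=f"cust-{i}")


if __name__ == "__main__":
    for request in generate_requests(RECORDED, count=3):
        print(request)
```

substitute raises KeyError if a placeholder has no value, so a forgotten piece of test data fails fast instead of sending a literal ${...} to the server.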

Using Capacity Test Stubs To Develop Tests

In high-performance systems, testing may fail because the tests themselves do not run fast enough. To discover this case, run them initially against a no-op stub of the application.
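
A rough sketch of such a check: measure how fast the test harness can push requests into a no-op stub and compare that ceiling with the load the real test needs to generate. The names and the 2x safety margin are assumptions of mine.

```python
import time


def noop_stub(request):
    """Stands in for the application: accepts a request and does nothing."""
    return None


def harness_ceiling_tps(send, requests, clock=time.perf_counter):
    """Requests per second the harness itself can generate against the given target."""
    start = clock()
    for request in requests:
        send(request)
    elapsed = clock() - start
    return len(requests) / elapsed if elapsed > 0 else float("inf")


def harness_fast_enough(ceiling_tps, target_tps, margin=2.0):
    """The harness should comfortably exceed the load it is supposed to produce."""
    return ceiling_tps >= target_tps * margin


if __name__ == "__main__":
    ceiling = harness_ceiling_tps(noop_stub, [f"req-{i}" for i in range(100_000)])
    print(f"harness ceiling: {ceiling:.0f} req/s")
    print(harness_fast_enough(ceiling, target_tps=5_000))
```

If the ceiling is close to the target, a failing capacity test says more about the test than about the application.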

Adding Capacity Tests to the Deployment Pipeline

beware that warm-up time may be necessary (JIT, ...)
for known hot spots, you can add simple "guard tests" already to the commit stage
typically we run them separately from acceptance tests - they've different environment needs, perhaps are long-running, we want to avoid undesirable interactions between acceptance and capacity tests; acceptance test stage may include a few performance smoke tests
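
Regarding the warm-up time mentioned above, a capacity test can simply run some unmeasured iterations first. A minimal sketch, with hypothetical names and iteration counts:

```python
import time


def timed_runs(operation, warmup, measured, clock=time.perf_counter):
    """Run unmeasured warm-up iterations, then return the timing of each measured one."""
    for _ in range(warmup):
        operation()  # lets JIT compilation, caches, pools etc. settle; timing discarded
    timings = []
    for _ in range(measured):
        start = clock()
        operation()
        timings.append(clock() - start)
    return timings


if __name__ == "__main__":
    # Hypothetical stand-in for one request against the system under test.
    timings = timed_runs(lambda: sum(range(10_000)), warmup=50, measured=200)
    print(f"slowest measured run: {max(timings) * 1000:.3f} ms")
```

How many warm-up iterations are enough depends on the platform; plotting the timings of an unwarmed run is one way to find where they level off.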

Other Benefits of Capacity Tests

Composable, scenario-based tests enable us to simulate complex interactions; together with a prod-like env we can:

reproduce complex prod defects
detect/debug memory leaks
evaluate impact of garbage collection (GC); tune GC
tune app config and 3rd party app (OS, AS, DB, ...) config
simulate worst-day scenarios
evaluate different solutions to a complex problem
simulate integration failures
measure scalability with different hardware configs
load-test communication with external systems even though the tests were originally designed for stubbed interfaces
rehearse rollback
and many more ...
