Continuous Delivery Digest: Ch.9 Testing Non-Functional Requirements
(Cross-posted from blog.iterate.no)
Digest of chapter 9 of the Continuous Delivery bible by Humble and Farley. See also the digest of ch 8: Automated Acceptance Testing.
("cross-functional" might be better as they too are crucial for functionality)
- f.ex. security, usability, maintainability, auditability, configurability, but especially capacity, throughput, and performance
- performance = the time to process one transaction (tx); throughput = the number of tx per a given period; capacity = the maximum throughput we can handle under a given load while maintaining acceptable response times.
- NFRs determine the architecture so we must define them early; all (ops, devs, testers, customer) should meet to estimate their impact (it costs to increase any of them and often they go contrary to each other, e.g. security and performance)
- as appropriate, you can either create specific stories for NFRs (e.g. capacity; => easier to prioritize explicitly) and/or add them as requirements to feature stories
- if poorly analyzed, they constrain thinking and lead to overdesign and inappropriate optimization
- only ever optimize based on facts, i.e. realistic measurements (in case you didn't know: developers are really terrible at guessing the source of performance problems)
A strategy to address capacity problems:
- Decide upon an architecture; beware process/network boundaries and I/O in general
- Apply stability/capacity patterns and avoid antipatterns - see Release It!
- Other than that, avoid premature optimization; prefer clear, simple code; never optimize without a proof it is necessary
- Make sure your algorithms and data structures are suitable for your app (O(n) etc.)
- Be extremely careful about threading (-> "blocked threads anti-pattern")
- Create automated tests asserting the desired capacity; they will also guide you when fixing failures (see the sketch after this list)
- Only profile to fix issues identified by tests
- Use real-world capacity measures whenever possible - measure in your prod system (# users, patterns of behavior, data volumes, ...)
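A minimal sketch of such an automated capacity guard in plain Java (assuming Java 11+ for java.net.http); the endpoint, request counts, and threshold are invented for illustration, not taken from the book:

```java
// Minimal sketch of an automated capacity test asserting a desired throughput.
// Endpoint, counts, and threshold are illustrative assumptions.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CapacityGuardTest {
    static final URI ENDPOINT = URI.create("http://test-env.example.com/orders"); // hypothetical
    static final int WARMUP_REQUESTS = 500;      // let JIT and caches settle before measuring
    static final int MEASURED_REQUESTS = 5_000;
    static final double MIN_TX_PER_SEC = 200.0;  // minimum acceptable level; tune up once stable

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(ENDPOINT).GET().build();

        for (int i = 0; i < WARMUP_REQUESTS; i++) {           // warm-up phase, not measured
            client.send(request, HttpResponse.BodyHandlers.discarding());
        }

        long start = System.nanoTime();
        for (int i = 0; i < MEASURED_REQUESTS; i++) {
            client.send(request, HttpResponse.BodyHandlers.discarding());
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double txPerSec = MEASURED_REQUESTS / seconds;

        System.out.printf("throughput: %.1f tx/s%n", txPerSec);
        if (txPerSec < MIN_TX_PER_SEC) {
            throw new AssertionError("capacity below threshold: " + txPerSec + " tx/s");
        }
    }
}
```

Note the warm-up loop: as the chapter notes later, JIT compilation and caches need time to settle before measurements mean anything.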
Measuring Capacity
There are different possible tests, f.ex.:
- Scalability testing - how do the response time of an individual request and the number of concurrent users supported change as we add more servers, services, or threads?
- Longevity t. - observe how performance changes when the system runs for a long time - detects memory leaks, stability problems
- Throughput t. - #tx/messages/page hits per second
- Load t. - capacity as a function of load, up to and beyond prod-like volumes; this is the most common
- it's vital to use realistic scenarios; in contrast, technical benchmark-style measurements (# reads/s from the DB, ...) can sometimes be useful to guard against specific problems, to optimize specific areas, or to choose a technology
- systems do many things, so it's important to run different capacity tests in parallel (a sketch follows this list); it's impossible to replicate prod traffic => use traffic analysis, experience, and intuition to achieve as close a simulation as possible
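As an illustration of running different capacity tests in parallel, here is a hedged sketch using a plain Java thread pool; the scenario names, user counts, and the runScenario placeholder are assumptions:

```java
// Sketch: running several capacity-test scenarios in parallel to approximate
// mixed prod traffic. Scenario names and counts are invented.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MixedLoadRun {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        // Each Runnable drives one scenario at its own volume.
        List<Runnable> scenarios = List.of(
            () -> runScenario("browse-catalog", 50),   // hypothetical scenario drivers
            () -> runScenario("place-order", 20),
            () -> runScenario("monthly-report", 2));
        scenarios.forEach(pool::submit);
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.MINUTES);
    }

    static void runScenario(String name, int usersSimulated) {
        // ... drive the recorded interactions for this scenario here ...
        System.out.println(name + " finished with " + usersSimulated + " simulated users");
    }
}
```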
How to Define Success or Failure
- tip: collect measurements (absolute values, trends) during the testing and present them in a graphical form to gain insight into what happened (a sketch follows this list)
- limits that are too strict will lead to intermittent failures (f.ex. when the network is overloaded by another operation), while limits that are too relaxed won't discover a partial drop in capacity; therefore:
- Aim for stable, reproducible results - isolate the test env as much as possible
- Tune the pass threshold up once the test passes at a minimum acceptable level; tune it back down if it starts failing after a commit for a well-understood and acceptable reason
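A small sketch of the measurement-collection tip: append each sample to a CSV file so absolute values and trends can be graphed after the run. The file layout and field names are my own illustration:

```java
// Sketch: append one row per measurement so the run can be graphed afterwards.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;

public class MeasurementLog {
    private final Path csv;

    public MeasurementLog(Path csv) throws IOException {
        this.csv = csv;
        if (Files.notExists(csv)) {
            Files.writeString(csv, "timestamp,scenario,responseTimeMs\n"); // header row
        }
    }

    public void record(String scenario, long responseTimeMs) throws IOException {
        String row = Instant.now() + "," + scenario + "," + responseTimeMs + "\n";
        Files.writeString(csv, row, StandardOpenOption.APPEND);
    }
}
```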
Capacity-Testing Environment
- replicates Prod as much as possible; extrapolation from a different environment is highly speculative, unless based on good measurements. "Configuration changes tend to have nonlinear effect on capacity characteristics." p234
- an exact replica of Prod is sometimes impossible or not sensible (a small project, capacity of little importance, or prod having 100s of servers) => capacity testing can be done on a subset of prod servers as a part of Canary Releasing, see p263
- scaling is rarely linear, even if the app is designed for it; if the test env is a scaled-down prod, do a few scaling runs to measure the effect of size
- saving money on a downscaled test env is a false economy if capacity is critical; it won't be able to find all the issues anyway, and it will be expensive to fix them later - see the story on p236
Automating Capacity Testing
- it's expensive, but if capacity is important, it must be a part of the deployment pipeline
- these tests are complex, fragile, easily broken with minor changes
- Ideal tests: use real-world scenarios; predefine success threshold; relatively short duration to finish in a reasonable time; robust wrt. change to improve maintainability; composable into larger-scale scenarios so that we can simulate real-world patterns of use; repeatable and runnable sequentially or in parallel => suitable both for load and longevity testing
- start with some existing (robust and realistic) acceptance tests and adapt them for capacity testing - add a success threshold and the ability to scale up
Goals:
- Create a realistic, prod-like load (in form and volume)
- Test realistic but pathological real-life loading scenarios, i.e. not just the happy path; tip: identify the most expensive transactions and double/triple their ratio (see the sketch below)
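One way to express such a skewed mix (an illustration, not the book's technique) is to pick scenarios by weight, so doubling or tripling the share of an expensive transaction is a one-line change; the names and weights below are invented:

```java
// Sketch: weighted scenario selection, making pathological load mixes easy to express.
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ThreadLocalRandom;

public class WeightedScenarioMix {
    private final NavigableMap<Integer, String> byCumulativeWeight = new TreeMap<>();
    private int total = 0;

    public WeightedScenarioMix add(String scenario, int weight) {
        total += weight;
        byCumulativeWeight.put(total, scenario);
        return this;
    }

    public String next() {
        int r = ThreadLocalRandom.current().nextInt(total);
        return byCumulativeWeight.higherEntry(r).getValue(); // first bucket above r
    }

    public static void main(String[] args) {
        // Suppose the expensive "end-of-day-report" is ~5% of prod traffic;
        // here we triple it to probe a pathological mix.
        WeightedScenarioMix mix = new WeightedScenarioMix()
            .add("browse", 55).add("order", 30).add("end-of-day-report", 15);
        for (int i = 0; i < 5; i++) System.out.println(mix.next());
    }
}
```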
To scale up, you can record the communication generated by acceptance tests, postprocess it to scale it up (multiply it, insert unique data where necessary), and replay it at high volume.
- Question: Where to record and play back:
- UI - realistic but impractical (and expensive) for 10,000s of users
- Service/public API (e.g. HTTP req.)
- Lower-level API (such as a direct call to the service layer or DB)
Testing via UI
- Not suitable for high-volume systems, when too many clients are necessary to generate a high load (partially due to UI client [browser] overhead); also expensive to run many machines
- The UI condenses a number of user actions (clicks, selections) into few interactions with the back-end (e.g. one form submission), which has a more stable API. The question to answer: are we interested in the performance of the clients or of the back-end?
- "[..] we generally prefer to avoid capacity testing through the UI." - unless the UI itself or the client-server interaction are of a concern
Recording Interactions against a Service or Public API
- run acceptance tests, record in/outputs (e.g. SOAP XML, HTTP), replace what must vary with placeholders (e.g. ${ORDER_ID}), create test data, merge the two
- Recommended compromise: aim to change as little as possible between instances of a test - less coupling between the test and the test data makes it more flexible and less fragile. Ex.: unique orderId and customerId but the same product and quantity (sketched below).
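A hedged sketch of this record/postprocess/replay idea (assuming Java 15+ for text blocks); the request template, endpoint, and ID scheme are invented for illustration:

```java
// Sketch: a recorded request body with ${...} placeholders gets unique IDs per
// instance while everything else stays identical between replays.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

public class RecordedRequestReplayer {
    static final String TEMPLATE = """
        {"orderId": "${ORDER_ID}", "customerId": "${CUSTOMER_ID}",
         "product": "SKU-42", "quantity": 3}""";   // same product/quantity every time

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 0; i < 1_000; i++) {
            String body = TEMPLATE
                .replace("${ORDER_ID}", UUID.randomUUID().toString())  // vary only what must be unique
                .replace("${CUSTOMER_ID}", "CUST-" + i);
            HttpRequest req = HttpRequest.newBuilder(URI.create("http://test-env.example.com/orders"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
            client.send(req, HttpResponse.BodyHandlers.discarding());
        }
    }
}
```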
Using Capacity Test Stubs To Develop Tests
In high-performance systems, capacity testing may fail because the tests themselves do not run fast enough. To discover this case, first run them against a no-op stub of the application (a minimal sketch follows).
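For instance, a no-op stub can be as small as the JDK's built-in HTTP server replying immediately with a canned response; the port and path below are illustrative:

```java
// Minimal no-op stub standing in for the application, used to check that the
// test harness itself can generate load fast enough.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class NoOpStub {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            byte[] ok = "OK".getBytes();
            exchange.sendResponseHeaders(200, ok.length);  // do no real work, reply immediately
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(ok);
            }
        });
        server.start();
    }
}
```

If the capacity tests still can't reach the target volume against this stub, the bottleneck is in the tests themselves.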
Adding Capacity Tests to the Deployment Pipeline
- beware that warm-up time may be necessary (JIT, ...)
- for known hot spots, you can add simple "guard tests" already to the commit stage
- typically we run them separately from acceptance tests - they have different environment needs, they may be long-running, and we want to avoid undesirable interactions between acceptance and capacity tests; the acceptance test stage may include a few performance smoke tests (one possible tagging approach is sketched below)
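One possible mechanism for this separation (an assumption on my part, not the book's prescription) is tagging with JUnit 5, letting the pipeline select tests by tag per stage; both Maven and Gradle can include or exclude JUnit 5 tags:

```java
// Sketch: tag capacity tests so the pipeline can run them in a dedicated stage.
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class OrderCapacityTest {

    @Test
    @Tag("capacity")        // run only in the dedicated capacity stage
    void sustainsTargetThroughput() {
        // ... drive load and assert the threshold here ...
    }

    @Test
    @Tag("capacity-smoke")  // a cheap guard test that can also run earlier
    void singleRequestUnderLatencyBudget() {
        // ... one quick measurement against a generous limit ...
    }
}
```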
Other Benefits of Capacity Tests
Composable, scenario-based tests enable us to simulate complex interactions; together with a prod-like env we can:
- reproduce complex prod defects
- detect/debug memory leaks
- evaluate impact of garbage collection (GC); tune GC
- tune app config and 3rd party app (OS, AS, DB, ...) config
- simulate worst-day scenarios
- evaluate different solutions to a complex problem
- simulate integration failures
- measure scalability with different hardware configs
- load-test communication with external systems even though the tests were originally designed for stubbed interfaces
- rehearse rollback
- and many more ...