Beware the performance cost of async_hooks (Node 8)

I was excited about async_hooks finally landing in Node.js 8, as it would enable me to share important troubleshooting information with all the code involved in handling a particular request. However, it turned out to have a terrible impact on our CPU usage (YMMV):



This was quite extreme and is likely related to the way our application works and uses Promises. Do your own testing to measure the actual impact in your app.

However, I am not the only one who has seen a performance hit from async_hooks - see https://github.com/bmeurer/async-hooks-performance-impact, in particular:

Here the results of running the Promise micro benchmarks with and without async_hooks enabled:

Benchmark                        Node 8.9.4   Node 9.4.0
Bluebird-doxbee (regular)           226 ms       189 ms
Bluebird-doxbee (init hook)         383 ms       341 ms
Bluebird-doxbee (all hooks)         440 ms       411 ms
Bluebird-parallel (regular)         924 ms       696 ms
Bluebird-parallel (init hook)      1380 ms      1050 ms
Bluebird-parallel (all hooks)      1488 ms      1220 ms
Wikipedia (regular)                 993 ms       804 ms
Wikipedia (init hook)              2025 ms      1893 ms
Wikipedia (all hooks)              2109 ms      2124 ms


To confirm the impact of async_hooks on our app, I performed three performance tests:

CPU usage without async_hooks (Node 8)


It is difficult to see but the mean CPU usage is perhaps around 60% here.



CPU usage with "no-op" async_hooks (Node 8)


Here the CPU jumped to 100%.



CPU usage with "no-op" async_hooks (Node 11)


The same as above, but using Node 11 for comparison. I recorded it for just a few minutes but the CPU usage is still around 100%:



The code



This is the relevant code:
const asyncHooks = require('async_hooks'); // Node 8.9+
const querystring = require('querystring');
const crypto = require("crypto");

const context = {};

// "No-op" hooks: the actual context propagation is commented out so that we
// measure only the bare overhead of having the hooks enabled.
function createHooks() {
  function init(asyncId, type, triggerId, resource) {
    // if (context[triggerId]) {
    //   context[asyncId] = context[triggerId];
    // }
  }

  function destroy(asyncId) {
    // delete context[asyncId];
  }

  const asyncHook = asyncHooks.createHook({ init, destroy });
  asyncHook.enable();
}

createHooks();
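
For the curious, here is a sketch (my own illustration, not the code we shipped) of what the non-no-op version could look like - i.e. how the commented-out lines above, combined with executionAsyncId(), could be used to attach troubleshooting information to a request and read it from anywhere downstream:

const asyncHooks = require('async_hooks');

const context = {};

asyncHooks.createHook({
  init(asyncId, type, triggerAsyncId) {
    // Propagate the context from the "parent" async operation to the new one.
    if (context[triggerAsyncId]) {
      context[asyncId] = context[triggerAsyncId];
    }
  },
  destroy(asyncId) {
    delete context[asyncId]; // prevent the map from growing forever
  },
}).enable();

// Attach data to the current async execution context, e.g. when a request comes in.
function setContext(data) {
  context[asyncHooks.executionAsyncId()] = data;
}

// Read it back from any code that runs as part of handling that request.
function getContext() {
  return context[asyncHooks.executionAsyncId()];
}

// Hypothetical usage in an Express-style middleware:
// app.use((req, res, next) => { setContext({ requestId: req.headers['x-request-id'] }); next(); });
// ...and later, deep inside the business logic:
// logger.info('processing order', getContext());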

Continue reading →

How good monitoring saved our ass ... again

You know how it goes - suddenly people complain your app does not work, you are getting plenty of timeouts or other errors in your error tracking tool, you find the backend app that is misbehaving and finally "fix" the problem by restarting it. Phew!

But why? What caused the downtime? A glitch in an upstream system? Sudden overload due to a spike in concurrent users? Trolls?

You know that it helps sometimes to zoom out, to get the right perspective. Here the perspective was 7 days:



It was enough to look at this chart with the right zoom to see at once that something happened on October 23rd that caused a significant change in the behavior of the application. A quick search, and indeed: the change in CPU usage corresponds with a deployment. A quick revert to the previous version soon confirmed the culprit. (It would have been even easier if we showed deployments on these charts.)

This is not the first time good monitoring saved us. A while ago we struggled with the application becoming sluggish and had to restart it regularly. A graph of the Node.js event loop lag showed it increasing over time. Once it was on the same dashboard as Node's heap usage, we could see at once that it correlated with increasing memory usage - indicating a memory leak. A few hours of experimenting and heap dump analysis later, the problem was fixed.

So good monitoring is paramount.

Of course the trick is to know what to monitor and to display all relevant metrics in such a way that you can spot important relations. I am still working on improving that...
Continue reading →

Monitoring process memory/CPU usage with top and plotting it with gnuplot


If you want to monitor the memory and CPU usage of a particular Linux process for a few minutes, perhaps during a performance test, you can capture the data with top and plot them with gnuplot. Here is how:
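
Roughly, the idea looks like this (a minimal sketch of my own, assuming Linux top in batch mode with its default column layout - %CPU in column 9, %MEM in column 10 - and a made-up PID; the full post has the actual commands):

# 1. Capture: sample the process once per second and keep only the
#    sample number, %CPU and %MEM columns of our process.
PID=12345   # the process to watch
top -b -d 1 -p "$PID" | awk -v pid="$PID" '$1 == pid { print ++i, $9, $10; fflush() }' > usage.dat

# 2. Plot: column 1 = sample number, 2 = %CPU, 3 = %MEM.
gnuplot -e "set terminal png size 800,400; set output 'usage.png'; set xlabel 'time (s)'; set ylabel '%'; plot 'usage.dat' using 1:2 with lines title 'CPU %', '' using 1:3 with lines title 'MEM %'"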


Continue reading →

Troubleshooting javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure

Re-published from the Telia Tech Blog.

The infamous Java exception javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure is hardly understandable to a mere mortal. What it wants to say is, most likely, something like this:

Sorry, none of the cryptographic protocols/versions and cipher suites is accepted both by the JVM and the server.


For instance, the server may require a higher version of TLS than the (old) JVM supports, or stronger cipher suites than the JVM knows. You will now learn how to find out which is the case.

We will first find out what both the server and the JVM support and compare the two to see where they disagree. Feel free to just skim through the outputs and return to them later, after they have been explained.

What does the server support?



We will use nmap for that (brew install nmap on OSX):


nmap --script ssl-enum-ciphers -p 443 my-server.example.com
Starting Nmap 7.70 ( https://nmap.org ) at 2018-10-05 00:54 CEST
Nmap scan report for my-server.example.com (127.0.0.1)
Host is up (0.031s latency).

PORT    STATE SERVICE
443/tcp open  https
| ssl-enum-ciphers:
|   TLSv1.2:
|     ciphers:
|       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384 (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA (secp256r1) - A
|     compressors:
|       NULL
|     cipher preference: server
|_  least strength: A


Here we see that the server only supports TLS version 1.2 (ssl-enum-ciphers: TLSv1.2:) and the listed ciphers, such as TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA.

What does the JVM have on offer?



Now we will find out what the JVM supports (I did that through Clojure but you could just as well have used Java directly; notice the javax.net.debug property):


$ env -i java -Djavax.net.debug=ssl:handshake:verbose -jar clojure-1.8.0.jar
Clojure 1.8.0
user=> (.connect (.openConnection (java.net.URL. "https://my-server.example.com/ping")))
;; ...
done seeding SecureRandom
Ignoring unavailable cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
Ignoring unavailable cipher suite: TLS_DHE_RSA_WITH_AES_256_CBC_SHA
Ignoring unavailable cipher suite: TLS_ECDH_RSA_WITH_AES_256_CBC_SHA
Ignoring unsupported cipher suite: TLS_DHE_DSS_WITH_AES_128_CBC_SHA256
Ignoring unsupported cipher suite: TLS_DHE_DSS_WITH_AES_256_CBC_SHA256
Ignoring unsupported cipher suite: TLS_DHE_RSA_WITH_AES_128_CBC_SHA256
Ignoring unsupported cipher suite: TLS_ECDH_RSA_WITH_AES_128_CBC_SHA256
Ignoring unsupported cipher suite: TLS_DHE_RSA_WITH_AES_256_CBC_SHA256
Ignoring unsupported cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384
Ignoring unsupported cipher suite: TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA384
Ignoring unsupported cipher suite: TLS_RSA_WITH_AES_256_CBC_SHA256
Ignoring unavailable cipher suite: TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA
Ignoring unsupported cipher suite: TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
Ignoring unsupported cipher suite: TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384
Ignoring unavailable cipher suite: TLS_DHE_DSS_WITH_AES_256_CBC_SHA
Ignoring unsupported cipher suite: TLS_ECDH_RSA_WITH_AES_256_CBC_SHA384
Ignoring unsupported cipher suite: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256
Ignoring unsupported cipher suite: TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA256
Ignoring unavailable cipher suite: TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA
Ignoring unavailable cipher suite: TLS_RSA_WITH_AES_256_CBC_SHA
Ignoring unsupported cipher suite: TLS_RSA_WITH_AES_128_CBC_SHA256
Allow unsafe renegotiation: false
Allow legacy hello messages: true
Is initial handshake: true
Is secure renegotiation: false
main, setSoTimeout(0) called
%% No cached client session
*** ClientHello, TLSv1
RandomCookie: GMT: 1521850374 bytes = { 121, 217, 101, 186, 111, 183, 47, 46, 159, 230, 139, 103, 7, 181, 250, 172, 113, 121, 4, 55, 122, 148, 111, 82, 87, 170, 70, 10 }
Session ID: {}
Cipher Suites: [TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDH_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_DSS_WITH_AES_128_CBC_SHA, TLS_ECDHE_ECDSA_WITH_RC4_128_SHA, TLS_ECDHE_RSA_WITH_RC4_128_SHA, SSL_RSA_WITH_RC4_128_SHA, TLS_ECDH_ECDSA_WITH_RC4_128_SHA, TLS_ECDH_RSA_WITH_RC4_128_SHA, TLS_ECDHE_ECDSA_WITH_3DES_EDE_CBC_SHA, TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA, SSL_RSA_WITH_3DES_EDE_CBC_SHA, TLS_ECDH_ECDSA_WITH_3DES_EDE_CBC_SHA, TLS_ECDH_RSA_WITH_3DES_EDE_CBC_SHA, SSL_DHE_RSA_WITH_3DES_EDE_CBC_SHA, SSL_DHE_DSS_WITH_3DES_EDE_CBC_SHA, SSL_RSA_WITH_RC4_128_MD5, TLS_EMPTY_RENEGOTIATION_INFO_SCSV]
Compression Methods: { 0 }
Extension elliptic_curves, curve names: {secp256r1, sect163k1, sect163r2, secp192r1, secp224r1, sect233k1, sect233r1, sect283k1, sect283r1, secp384r1, sect409k1, sect409r1, secp521r1, sect571k1, sect571r1, secp160k1, secp160r1, secp160r2, sect163r1, secp192k1, sect193r1, sect193r2, secp224k1, sect239k1, secp256k1}
Extension ec_point_formats, formats: [uncompressed]
Extension server_name, server_name: [host_name: my-server.example.com]
***
main, WRITE: TLSv1 Handshake, length = 175
main, READ: TLSv1 Alert, length = 2
main, RECV TLSv1 ALERT: fatal, handshake_failure
main, called closeSocket()
main, handling exception: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure
SSLHandshakeException Received fatal alert: handshake_failure sun.security.ssl.Alerts.getSSLException (Alerts.java:192)


Here we see that the JVM uses TLS version 1 (see *** ClientHello, TLSv1) and supports the listed Cipher Suites, including TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA.

What's wrong?



Here we see that the server and JVM share exactly one cipher suite, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA. But they fail to agree on the TLS version, since the server requires v1.2 while the JVM only offers v1.

The solution



You can either configure the server to support a cipher suite and protocol version that the JVM has, or teach the JVM to use what the server wants. In my case that was resolved by running java with -Dhttps.protocols=TLSv1.2 (alternatively, you could add all of SSLv3,TLSv1,TLSv1.1,TLSv1.2) as recommended by π at StackOverflow.
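
For example (my-app.jar is just a placeholder for whatever you actually run):

java -Dhttps.protocols=TLSv1.2 -jar my-app.jar
# or, offering all the protocol versions mentioned above:
java -Dhttps.protocols=SSLv3,TLSv1,TLSv1.1,TLSv1.2 -jar my-app.jar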

Sources



The troubleshooting technique comes from the article "SSLHandshakeException: Received fatal alert: handshake_failure due to no overlap in cipher suite" by Atlassian. The observation that the server and JVM disagreed on the TLS version comes from my good colleague Neil.
Continue reading →

Experience Report: Hiring for Clojure(Script) is Easy

Published originally at the Telia Engineering blog.

Update Jan 2020: Added "Related resources and experiences".

Our experience shows that hiring people for a Clojure(Script) project is relatively easy (in Oslo, Norway) despite a market where demand exceeds supply. But it is important to use the right channels to reach the right people.

As I have written previously in From JavaScript to Clojure(Script): Writing a webshop, again, we have started a new project based on Clojure and ClojureScript. I hypothesized that hiring people who know, or are willing to use, these languages would be easier than hiring for Java. Others were skeptical, so, in the best tradition of our team, we decided to resolve the dispute with an experiment: if we managed to get a sufficient positive response and (start to) hire one full-time employee within 2 weeks, then we would proceed with this language choice.

We have published the announcement via the international Clojure mailing list, the #clojure-norway Slack channel, the Oslo Clojure Meetup, and via Finn Jobs, a local job-finding site.

The results

We got about 15 local candidates within a very short time. A few with production Clojure(Script) experience, most with hobby experience, a few with none. A few of them were really experienced people. All wanted to join us because of our language choice. All seemed quite passionate.

All but one came from the Oslo Clojure Meetup. The international mailing list yielded one candidate plus a few foreigners willing to move to Norway (if we provided visa sponsorship) and a few people from Norway living outside of Oslo (and not willing to move).

Conclusion

The economies of Scala hypothesis, i.e. that it is easier to hire for a niche language with a passionate user base than for a mainstream language, seems to hold. But it is crucial to use the right channels to reach relevant people.

Related resources and experiences

Paul Graham’s The Python Paradox argues that you could get smarter programmers to work on a Python project than you could to work on a Java project - because people don’t learn Python because it will get them a job; they learn it because they genuinely like to program and aren’t satisfied with the languages they already know.

Worth noting that in our experience, hiring has gotten way easier for us since we became an Elm shop. We really struggled to hire React engineers (who have a zillion positions to choose among - why would they pick ours?), whereas there seem to be a lot more great programmers who want to use Elm than there are companies hiring for Elm positions.

Our Head of Talent said she’d never seen an inbound pipeline as strong as ours, and the #1 reason people cite for wanting to apply is Elm. The "Python Paradox"[0] is real!


Continue reading →

Why we love AWS Beanstalk but are leaving it anyway

Cross-posted from Telia's Tech Blog.

We have had our mission-critical webapp running on AWS Elastic Beanstalk for three years and have been extremely happy with it. However, we have now outgrown it and are moving to manually managed infrastructure and CodeDeploy.

AWS Beanstalk provides you with a lot of bang for the buck and enables you to get up and running in no time:



So if you need a solid, state-of-the-art infrastructure for a web-scale application and you don't have a lot of time and/or skill to build one on AWS on your own, I absolutely recommend Beanstalk.


Continue reading →

Pains with Terraform (perhaps use Sceptre next time?)

Cross-posted from Telia's Tech Blog

We use Amazon Web Services (AWS) heavily and are in the process of migrating towards infrastructure-as-code, i.e. creating a textual description of the desired infrastructure in a Domain-Specific Language and letting the tool create and update the infrastructure.

We are lucky enough to have some of the leading Terraform experts in our organisation, so they lay out the path and we follow it. We are at an initial stage and everything is thus "work in progress" and far from perfect, therefore it is important to judge leniently. Yet I think I have gained enough experience trying to apply Terraform, both now and in the past, to speak about some of its (current?) limitations and disadvantages and to consider alternatives.


Continue reading →

How to patch Travis CI's deployment tool for your needs

Travis CI is a pretty good software-as-a-service Continuous Integration server. It can deploy to many targets, including AWS BeanStalk, S3, and CodeDeploy.

However it might happen that the deploy tool (dpl) has a missing feature or doesn't do exactly what you need. Fortunately it is easy to fix and run a modified version of the tool, and I will show you how to do that.


Continue reading →

Experience: Awesome productivity with ClojureScript's REPL

Re-posted from Telia's tech blog.



What's the deal with ClojureScript? How can you justify picking such a "niche" language? I have recently experienced a "wow" session, demonstrating the productivity gains of ClojureScript and the interactive development it enables thanks to its REPL. I would like to share the experience with you. (If you have never heard about it before - it is a modern, very well designed Lisp that compiles to JavaScript for frontend and backend development. It comes with a REPL that makes it possible to reload code changes and run code in the context of your live application, developing it while it is running.)


Continue reading →

Simulating network timeouts with toxiproxy

Goal: Simulate how a Node.js application reacts to timeouts.

Solution: Use toxiproxy and its timeout "toxic" with the value of 0, i.e. the connection won't close, and data will be delayed until the toxic is removed.

The steps:

1. Start toxiproxy, exposing the port 6666 that we intend to use as localhost:6666:

docker pull shopify/toxiproxy
docker run --name=toxiproxy --rm --expose 6666 -p 6666:6666 -it shopify/toxiproxy

(If I was on Linux and not OSX then I could use --net=host and wouldn't need to expose and/or map the port.)

2. Tell toxiproxy to serve requests at port 6666 and proxy them to an upstream service:

docker exec -it toxiproxy /bin/sh
/ # cd /go/bin/
/go/bin # ./toxiproxy-cli create upstream -l 0.0.0.0:6666 -u google.com:443

3. Modify your code to access the local port 6666 and test that everything works.

Since we want to access Google via HTTPS, we would get a certificate error when accessing it via localhost:6666 (e.g. "SSLHandshakeException: PKIX path building failed: [..] unable to find valid certification path to requested target" in Java or (much better) "(51) SSL: no alternative certificate subject name matches target host name 'localhost'" in curl), so we will add an alias to our local /etc/hosts:

127.0.0.1 proxied.google.com

and use https://proxied.google.com:6666 in our connecting code (instead of the https://google.com:443 we had there before). Verify that it works and the code gets a response as expected.

Note: google.com is likely a bad choice here since it returns 404 unless you specify the header "Host: www.google.com"; only with that header do you get 200 OK back.
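
For example, you could verify the setup with curl (passing the Host header explicitly, as discussed in the note above):

curl -i -H "Host: www.google.com" https://proxied.google.com:6666/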

4. Tell toxiproxy to have an infinite timeout for this service

Continuing our toxiproxy configuration from step 2:

./toxiproxy-cli toxic add -t timeout -a timeout=0 upstream

(Alternatively, e.g. timeout=100; then the connection will be closed after 100 ms.)

5. Trigger your code again. You should get a timeout now.

Tip: You can simulate the service being down via disabling the proxy:

./toxiproxy-cli toggle upstream

Aside: Challenges when proxying through Toxiproxy

The host header

Servers (e.g. google.com, example.com) don't like it when the Host header (derived normally from the URL) differs from what they expect. So you either need to make it possible to access localhost:<toxiproxy port> via the upstream server's hostname by adding it as an alias to /etc/hosts (but how do you then access the actual service?) or you need to override the host header. In curl that is easy with -H "Host: www.google.com" but not so in Java.

In Java (openjdk 11.0.1 2018-10-16) you need to pass -Dsun.net.http.allowRestrictedHeaders=true to the JVM at startup to enable overriding the Host header (the Oracle JVM might allow doing that at runtime) and then:

(import '(java.net URL HttpURLConnection))

(doto ^HttpURLConnection (.openConnection (URL. "https://proxied.google.com:6666/"))
  (.setRequestProperty "Host" "www.google.com")
  (.getInputStream))

SSL certificate issues

As described above, when talking to HTTPS via Toxiproxy, you need to ensure that the hostname you use in your request is covered by the server's certificate, otherwise you will get SSL errors. The solution described above, i.e. adding e.g. proxied.<server name, e.g. google.com> to your /etc/hosts, works provided the certificate is valid also for subdomains, i.e. it is issued for both <server> and *.<server>, which is not always the case.

Alternatively, you can disable certificate validation - trivial in curl with -k but much more typing in Java.




Continue reading →

Demonstration: Applying the Parallel Change technique to change code in small, safe steps

The Parallel Change technique is intended to make it possible to change code in small, safe steps by first adding the new way of doing things (without breaking the old one; "expand"), then switching over to the new way ("migrate"), and finally removing the old way ("contract", i.e. make smaller). Here is an example of applying it in practice to refactor code producing a large JSON document that contains a dictionary of addresses in one place and refers to them by their keys in other places. The goal is to rename one of those keys. (We can't use simple search & replace for reasons.)
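
As a toy illustration of the three steps (my own, with made-up key names - not the code from the post), renaming a key in such a structure could look like this:

// Toy data: a dictionary of addresses plus a reference to one of them by its key.
const address = { street: "Main st. 1", city: "Oslo" };

// 1. Expand: emit the address under both the old and the new key, so that
//    consumers reading either key keep working.
const json = {
  addresses: { "old-key": address, "new-key": address },
  deliveryAddressRef: "old-key",
};

// 2. Migrate: switch every reference over to the new key.
json.deliveryAddressRef = "new-key";

// 3. Contract: once nothing reads the old key any more, stop emitting it.
delete json.addresses["old-key"];

console.log(JSON.stringify(json, null, 2));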


Continue reading →

It Is OK to Require Your Team-mates to Have Particular Domain/Technical Knowledge

Should we write stupid code that is easy to understand for newcomers? It seems like a good thing to do. But it is the wrong thing to optimise for, because it is a rare case: most of the time you will be working with people experienced in the code base. And if there is a new member, you should not just throw her into the water and expect her to learn and understand everything on her own. It is better to optimise for the common case, i.e. people that are up to speed. It is thus OK to expect and require that the developers have certain domain and technical knowledge - and to spend resources to ensure that is the case with new members. Simply put, you should not dumb down your code to match the common knowledge but elevate new team-mates to the baseline that you have defined for your product (based on your domain, the expected level of experience and dedication, etc.).




Continue reading →

Don't add unnecessary checks to your code, pretty please!

Defensive programming suggests that we should add various checks to our code to ensure the presence and proper shape and type of data. But there is one important rule - only add a check if you know that the thing it guards against can really happen. Don't add random checks just to be sure - you would be misleading the next developer.
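
A contrived example of what I mean (mine, not from the post): if user can never actually be missing here, this check only makes the next developer wonder under what circumstances it could be:

function greet(user) {
  if (!user) {                    // misleading: implies callers may pass no user
    return "Hello, stranger!";
  }
  return `Hello, ${user.name}!`;
}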


Continue reading →

2015 in review

The WordPress.com stats helper monkeys prepared a 2015 annual report for this blog.



Here's an excerpt:

The Louvre Museum has 8.5 million visitors per year. This blog was viewed about 200,000 times in 2015. If it were an exhibit at the Louvre Museum, it would take about 9 days for that many people to see it.


Click here to see the complete report.
Continue reading →

A Costly Failure to Design for Performance and Robustness

I have learned that it is costly to not prioritise expressing one's design concerns and ideas early. As a result, we have a shopping cart that is noticeably slow, goes down whenever the backend experiences problems, and is a potential performance bottleneck. Let's have a look at the problem, the actual and my ideal designs, and their pros and cons.

We have added shopping cart functionality to our web shop, using a backend service to provide most of the functionality and to hold the state. The design focus was on simplicity - the front-end is stateless, any change to the cart is sent to the backend, and the current content of the cart is always fetched anew from it, to avoid the complexity of maintaining and syncing state in two places. Even though the backend wasn't designed for the actual front-end needs, we worked around that. The front-end doesn't need to do much work and it is thus a success in this regard.


Continue reading →

Copyright © 2020 Jakub Holý
Powered by Cryogen
Theme by KingMob