jakub holý

building the right thing, building it right, fast

Fixing a mysterious .ebextensions command time out (AWS Elastic Beanstalk)

2015-07-29 08:05:48 [Dev]Ops

Our webshop, nettbutikk.netcom.no, runs on AWS Elastic Beanstalk and we use .ebextensions/ to customize the environment. I have just been trying to get Gor running on our leader production instance to replay some traffic to our staging environment so that we get much richer feedback from it. However, the container_command I used caused the instance to time out and trash the environment, against all reason. The documentation doesn't help, and troubleshooting is hard and time-consuming because of the lack of feedback. Luckily, I have arrived at a solution.



This is the working solution:
files:
  /opt/gor:
    source: "https://s3-eu-west-1.amazonaws.com/elasticbeanstalk-eu-west-1-<our-id>/our_fileserver/gor"
    authentication: S3Access
    mode: "000755"
    owner: root
    group: root
  # Script to start Gor in the background
  # Beware: We need to intercept port 8080 b/c 80 is redirected there via iptables
  /opt/gor-in-background:
    mode: "000755"
    owner: root
    group: root
    content: |
      #!/usr/bin/env bash
      pidof /opt/gor || nohup /opt/gor --input-raw :8080 --output-http 'https://our-staging-server|1' >/dev/null 2>&1 </dev/null &
# Only container_commands can access env variables configured in EB (ENV)
# Start gor, limit it to copy max 1 req / sec
container_commands:
  "Start Gor on Prod leader":
    command: test "$ENV" = "production" && /opt/gor-in-background >/dev/null 2>&1 # ! w/o the redirect it will time-out
    leader_only: true
    ignoreErrors: true # returns an "error" when not in Prod


(The S3 private bucket access is a story of its own, requiring the addition of AWS::CloudFormation::Authentication and a change of the bucket policy.)

The key points are starting gor in the background with & and, the magic ingredient that took me so long to figure out, the redirection of the command's output to /dev/null (the redirection inside the script likely doesn't need to be there with respect to this problem; I have it because I don't want any output to accumulate on the disk).

I do not know whether I need to redirect both stdout and stderr of /opt/gor-in-background, or why I need to do it at all, but without it I got the infamous

[time N+1] INFO Command execution completed on all instances. Summary: [Successful: 1, TimedOut: 1]. 

[time N] WARN The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [i-1e35c2b3].



and the instance continued to time out even when I tried to re-deploy a working version, and it never managed to deliver logs.
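One plausible explanation, which I have not verified against Elastic Beanstalk itself: whatever runs the container command reads its output through a pipe and waits for end-of-file, and a background process that inherits that pipe keeps it open for as long as it lives. The sketch below reproduces that general shell behavior locally. It is only an illustration under that assumption; `sleep 2` is a hypothetical stand-in for the long-running gor process, and `$(...)` stands in for the unknown caller that captures the output.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in: `sleep 2` plays the role of the long-running gor process.

# Case 1: the background job inherits the write end of the pipe that $(...)
# reads from, so the read does not see EOF until sleep exits.
start=$(date +%s)
out=$(sh -c 'sleep 2 &')
blocked_secs=$(( $(date +%s) - start ))

# Case 2: same job, but its stdout/stderr point at /dev/null, so nothing
# holds the pipe open once sh itself has exited.
start=$(date +%s)
out=$(sh -c 'sleep 2 >/dev/null 2>&1 &')
free_secs=$(( $(date +%s) - start ))

echo "without redirect: waited ${blocked_secs}s"
echo "with redirect:    waited ${free_secs}s"
```

The first case should wait for the whole lifetime of the background job, while the second should return almost immediately. If the same mechanism is at work on the instance, it might also explain why the redirect inside the script was not enough: there the `&` applies to the whole `pidof ... || nohup ...` list, so the backgrounded subshell may still hold the script's own stdout even though gor's output goes to /dev/null. That part is a guess on my side.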

Troubleshooting tip: Clone the target env a few times and use those to test changes multiple times and multiple changes in parallel to speed up the process.

Summary



If your (container) command leads to a time out, try to redirect its stdout and/or stderr to /dev/null.

Thanks to João Abrantes for the redirection idea!