Secure REPL access for AWS Fargate services

You are running a Clojure microservice on AWS Fargate in a private VPC subnet and want secure REPL access to it, exposing the port only inside the VPC. Moreover, you would prefer to use SSM Session Manager and SSM Port Forwarding instead of running and exposing SSH. Here is how to do it. (Beware: it is not as trivial as we would like.)

Overview

Our Fargate service will start an nREPL server. We will run an EC2 node as a proxy and ensure that its Security Group can access the REPL port on the Fargate service's Security Group. We will access the proxy via an SSM session, run socat on it to tunnel a proxy port to the service's REPL port, and use SSM port forwarding to expose that port on the local PC so that we can connect with our client of choice, such as Cursive. (Alternatively, we could run sshd on the proxy, access it via ssh, and use its port forwarding. That is considerably simpler but less secure, as we would have to expose port 22 somehow and make it reachable from the internet - or use AWS VPN, which is, by the way, another interesting alternative to the proxy.)

Part 1: REPL server and a proxy

REPL server

(1) Our Fargate service will start an nREPL server on port 55555 and - this is crucial - bind it to 0.0.0.0 (the default is 127.0.0.1, which isn't accessible from outside the container):

(require '[nrepl.server :as nrepl])
(defn -main []
  ;; Bind to 0.0.0.0 so the REPL port is reachable from outside the container
  (nrepl/start-server :port 55555 :bind "0.0.0.0")
  (start-the-main-service))
I had no luck with the Clojure Socket REPL started via the JVM option -Dclojure.server.repl='{:port 55555 :accept clojure.core.server/repl :address "0.0.0.0"}': it fails with the weird Exception in thread "Clojure Server repl" java.net.SocketTimeoutException: Accept timed out at clojure.core.server$start_server$fn__8879$fn__8880.invoke(server.clj:111). Starting a socket REPL manually from the normal REPL worked, though.

(2) We also need to map this port to the host in the Fargate/ECS Task Definition (here using the beloved and hated Terraform):

resource "aws_ecs_task_definition" "task" {
  family                   = "${var.name_prefix}"
  execution_role_arn       = "${aws_iam_role.execution.arn}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "${var.task_definition_cpu}"
  memory                   = "${var.task_definition_memory}"
  task_role_arn            = "${aws_iam_role.task.arn}"

  container_definitions = <<EOF
[{
    "name": "${var.container_name != "" ? var.container_name : var.name_prefix}",
    "image": "${var.task_container_image}",
    ${local.repository_credentials_rendered}
    "essential": true,
    "portMappings": [
        {
            "containerPort": ${var.task_container_port},
            "hostPort": ${var.task_container_port},
            "protocol":"tcp"
        },
        {
            "containerPort": 55555, (1)
            "hostPort": 55555,
            "protocol":"tcp"
        }
    ],
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "${aws_cloudwatch_log_group.main.name}",
            "awslogs-region": "${data.aws_region.current.name}",
            "awslogs-stream-prefix": "container"
        }
    },
    "command": ${jsonencode(var.task_container_command)},
    "environment": ${jsonencode(data.null_data_source.task_environment.*.outputs)}
}]
EOF
}
(1) Map the container port to the same host port

(3) We also need the task’s Security Group (S.G.) to allow access to port 55555 from the proxy S.G. (or from anywhere in the VPC, if you prefer):

resource "aws_security_group_rule" "proxy2service" {
  security_group_id        = "${module.fargate_service.service_sg_id}" (1)
  source_security_group_id = "${var.allow_security_group_id}"          (2)
  description              = "Allow REPL access from the REPL Tunnel Proxy"
  type                     = "ingress"
  protocol                 = "tcp"
  from_port                = 55555
  to_port                  = 55555
}
(1) ID of the task’s S.G.
(2) ID of the proxy’s S.G.

We need not only to open the task’s S.G. to the proxy (ingress) but also to allow the proxy’s S.G. to reach out to it (egress) - see below.

Proxy

We run an EC2 instance in any private VPC subnet. We allow it to connect to the task’s port 55555 by configuring its S.G. as follows:

module "ec2_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "2.17.0"  # v3 requires tf 0.12 but we only have 0.11
  name        = "ssm-tunnel-proxy"
  description = "Security group for the SSM Tunnel Proxy for REPL access"
  vpc_id      = "${var.vpc_id}"
  ingress_cidr_blocks = ["${var.vpc_cidr_block}"]
  egress_rules = ["https-443-tcp", "http-80-tcp"] (1)
  egress_with_cidr_blocks = [
    {
      from_port   = 55555 (2)
      to_port     = 55555
      protocol    = "tcp"
      description = "Clojure REPL"
      cidr_blocks = "${var.vpc_cidr_block}"
    }
  ]
}
(1) Allow access to ports 443 and 80 anywhere so that yum can install packages and the SSM Agent can talk to the SSM VPC endpoint over HTTPS
(2) Allow connecting to port 55555 on any node in this VPC (if its S.G. allows it)

We also need to:

  • Make sure that the node has a recent SSM Agent installed and running. I use Amazon Linux 2, which has it baked in - but I had to upgrade it anyway, since the built-in version (2.3.662) does not support Port Forwarding (2.3.760, for example, does) - see the sketch after this list

  • Preinstall socat

  • Set up IAM to allow SSM

  • Set up VPC Endpoints for SSM (and, it seems, the required EC2 ones)
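
For reference, here is a minimal sketch of preparing the proxy on Amazon Linux 2 (the RPM URL is the one from the AWS documentation for manual installs; adjust for other distributions):

# Check which SSM Agent version is currently installed
yum info amazon-ssm-agent

# Upgrade the agent to a recent release (>= 2.3.760 for Port Forwarding)
sudo yum install -y \
  https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
sudo systemctl enable --now amazon-ssm-agent

# Preinstall socat for the tunnel
sudo yum install -y socat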

Part 2: Secure access via Systems Manager (SSM)

There are (too) many moving parts that need to fall into place before you can use SSM to access your instances:

  1. The instance must have a recent enough SSM Agent installed, enabled, and running - see the Terraform module repl-ssm-tunnel-proxy

  2. The agent must be able to reach the SSM API, e.g. via a VPC Endpoint - the instance’s security group must allow egress to port 443 in the VPC / the endpoint’s S.G., and there is a host of other required IAM permissions (look into /var/log/amazon/ssm/ for troubleshooting) - see the proxy and vpc-endpoints-for-ssm Terraform modules

  3. The EC2 instance’s role must have the AmazonSSMManagedInstanceCore policy attached - see the sketch after this list

  4. Both EC2 and SSM must be allowed to assume the EC2 instance’s role (i.e. the trust policy must list both ec2.amazonaws.com and ssm.amazonaws.com)

  5. And perhaps more…​
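
Points 3 and 4 translate roughly into the following AWS CLI sketch (the role name fargate-repl-proxy is a made-up placeholder; in my setup Terraform and the module mentioned below take care of this):

# 3. Attach the managed policy required for SSM-managed instances
aws iam attach-role-policy \
  --role-name fargate-repl-proxy \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

# 4. Allow both EC2 and SSM to assume the role via its trust policy
aws iam update-assume-role-policy \
  --role-name fargate-repl-proxy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": ["ec2.amazonaws.com", "ssm.amazonaws.com"]},
      "Action": "sts:AssumeRole"
    }]
  }'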

The setup that finally worked for me uses VPC Endpoints for EC2 and SSM and leverages the telia-oss/terraform-aws-ssm-agent-policy Terraform module to get some of the permissions right; refer to the gist linked above.
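
For illustration only - I created mine with Terraform - the equivalent AWS CLI calls would look roughly like this (the IDs are placeholders, and the AWS documentation lists ssm, ssmmessages, and ec2messages as the interface endpoints Session Manager needs):

for service in ssm ssmmessages ec2messages; do
  aws ec2 create-vpc-endpoint \
    --vpc-id vpc-0123456789abcdef0 \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.eu-west-1.${service}" \
    --subnet-ids subnet-0123456789abcdef0 \
    --security-group-ids sg-0123456789abcdef0 \
    --private-dns-enabled
done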

If we could run the SSM Agent inside the service's container, or as a sidecar Docker container, we wouldn’t need the proxy. It is possible but more complicated, since it isn’t supported out of the box: the container behaves as an on-premise node in a hybrid environment, and you have to create and manage SSM activations for it.
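
Such a setup would roughly involve creating an activation and registering the agent with it (a sketch only; SSMServiceRole stands for a role that ssm.amazonaws.com can assume):

# One-off: create an activation (returns an activation code and id)
aws ssm create-activation \
  --iam-role SSMServiceRole \
  --registration-limit 1

# Inside the container: register the agent with the returned code and id
amazon-ssm-agent -register \
  -code "<the activation code>" -id "<the activation id>" -region eu-west-1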

Part 3: Connecting

Prerequisites: a recent aws CLI and the Session Manager Plugin for the AWS CLI installed.
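
To verify both (on macOS the plugin is also available via Homebrew as the session-manager-plugin cask; see the AWS docs for other platforms):

aws --version              # a recent awscli
session-manager-plugin     # prints a success message when correctly installed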

Steps:

  1. Find out the IP of the Fargate task, e.g. using the AWS Console: ECS services → your cluster main → click the service → Tasks → click the Task id → copy the private IP (see also the CLI sketch after these steps)

  2. Find out the EC2 Instance ID of the proxy (also covered by the sketch below)

  3. Establish a tunnel on the proxy from its port 55555 to the Fargate task (the socat command runs inside the shell opened by the SSM session):

    localPC$ aws ssm start-session --target $INSTANCE_ID
    sh$      sudo socat TCP-LISTEN:55555,su=nobody,fork TCP:<IP OF THE FARGATE TASK>:55555
  4. Forward the remote REPL port to the local port 6666:

    aws ssm start-session --target $INSTANCE_ID \
      --document-name AWS-StartPortForwardingSession \
      --parameters '{"portNumber":["55555"],"localPortNumber":["6666"]}'
  5. Start a REPL client e.g. using Leiningen: lein repl :connect localhost:6666
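
Steps 1 and 2 can also be done with the AWS CLI - a sketch, assuming the cluster is called main, the service my-service, and the proxy instance is tagged Name=ssm-tunnel-proxy (all placeholders):

# 1. Private IP of the (first) task of the service
TASK_ARN=$(aws ecs list-tasks --cluster main --service-name my-service \
  --query 'taskArns[0]' --output text)
aws ecs describe-tasks --cluster main --tasks "$TASK_ARN" \
  --query 'tasks[0].containers[0].networkInterfaces[0].privateIpv4Address' \
  --output text

# 2. Instance ID of the proxy
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters 'Name=tag:Name,Values=ssm-tunnel-proxy' \
            'Name=instance-state-name,Values=running' \
  --query 'Reservations[0].Instances[0].InstanceId' --output text)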

Future improvements

I’d like to replace the need to start socat manually with a permanently running proxy (perhaps SOCKS?) that would accept a target IP and establish a tunnel to it automatically. (Perhaps that is what ssm-tunnel does?)


Tags: clojure aws DevOps

