I work at a mostly AWS shop, and while we still have services on raw EC2, nearly all of our new development is on Amazon ECS in Docker. I like Docker because it provides a unified unit of operation (a container) that makes it easy to build shared tooling regardless of language or application. It also lets you reproduce your applications locally in the same environment they run in remotely, and containers start and deploy fast.
However, many services share an ECS node in a cluster. Tools like Chaos Monkey may run around turning whole nodes off, but it’d be nice to have a little less of an impact during working hours while still being able to stress our recovery and alerting.
This is actually pretty easy with a little Docker container we call The Beast. The Beast runs as an ECS scheduled task every 15-30 minutes from 10am to 3pm PST (we have teams on both the east and west coasts) and kills a random container on whatever cluster node it lands on. It doesn’t do a lot of damage, but it does test your fault tolerance.
Here’s The Beast:
    #!/usr/bin/env ruby
    require 'json'
    require 'pp'

    class Hash
      # Return a new hash holding only the given keys, removing them from self.
      def extract_subhash(*extract)
        h2 = self.select { |key, value| extract.include?(key) }
        self.delete_if { |key, value| extract.include?(key) }
        h2
      end
    end

    puts "UNLEASH THE BEAST!"

    ignore_image_regex = ENV["IGNORED_REGEX"]

    # `docker ps` emits one JSON object per line; join them into a JSON array.
    raw = "[#{`docker ps --format '{{json .}}'`.lines.join(',')}]"
    running_services = JSON.parse(raw).map { |val| val.extract_subhash("ID", "Image") }

    puts running_services
    puts "Ignoring regex #{ignore_image_regex}"

    # Drop any container whose image matches the ignore pattern.
    if ignore_image_regex && ignore_image_regex.length > 0
      running_services.delete_if { |value| /#{ignore_image_regex}/ === value["Image"] }
    end

    if !running_services || running_services.length == 0
      puts "No services to kill"
      Process.exit(0)
    end

    puts "Bag of services to kill: "
    pp running_services

    # Pick one victim at random and kill it.
    to_kill = running_services.sample
    puts "Killing #{to_kill.inspect}"
    `docker kill #{to_kill["ID"]}`

    quips = [
      "Dont fear the reaper",
      "BEAST MODE",
      "You been rubby'd",
      "Pager doody"
    ]

    puts quips.sample
The Beast supports a regex of ignored images, so critical images (like the ecs_agent and the beast itself) can be marked as ignored. Because the regex comes from the IGNORED_REGEX environment variable, updating it also lets the beast ignore other services temporarily.
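To see what the ignore regex does and does not protect, here is a minimal sketch of the same filter run against a few made-up `docker ps` rows. The `killable` helper and all of the image names are hypothetical and only there for illustration. The pattern is the one from our task definition.

```ruby
# Hypothetical helper mirroring the delete_if filter in beast.rb.
def killable(services, ignored_regex)
  return services if ignored_regex.nil? || ignored_regex.empty?
  pattern = Regexp.new(ignored_regex)
  # Keep only the containers whose image does NOT match the pattern.
  services.reject { |service| pattern =~ service["Image"] }
end

# Made-up `docker ps` rows for illustration only.
services = [
  { "ID" => "a1", "Image" => "example/ecs_agent:1.0" },
  { "ID" => "b2", "Image" => "example/the-beast:3" },
  { "ID" => "c3", "Image" => "example/web:42" },
  { "ID" => "d4", "Image" => "amazon/amazon-ecs-agent:latest" }
]

survivors = killable(services, ".*ecs_agent.*|.*the-beast.*")
puts survivors.map { |service| service["Image"] }
```

With these rows, the candidates left over are `example/web:42` and `amazon/amazon-ecs-agent:latest`. The pattern is matched against the image name exactly as `docker ps` reports it, so a hyphenated name slips past an underscore pattern. It is worth checking the regex against the output on your own nodes.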
We deploy The Beast with Terraform. The general task definition looks like:
    [
      {
        "name": "the-beast",
        "image": "${image}:${version}",
        "cpu": 10,
        "memory": 50,
        "essential": true,
        "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-group": "${log_group}",
            "awslogs-region": "${region}",
            "awslogs-stream-prefix": "the-beast"
          }
        },
        "environment": [
          {
            "name": "IGNORED_REGEX",
            "value": ".*ecs_agent.*|.*the-beast.*"
          }
        ],
        "mountPoints": [
          {
            "sourceVolume": "docker-socket",
            "containerPath": "/var/run/docker.sock",
            "readOnly": true
          }
        ]
      }
    ]
And the Terraform:
    resource "aws_ecs_task_definition" "beast_rule" {
      family                = "beast-service"
      container_definitions = "${data.template_file.task_definition.rendered}"

      volume {
        name      = "docker-socket"
        host_path = "/var/run/docker.sock"
      }
    }

    data "template_file" "task_definition" {
      template = "${file("${path.module}/files/task-definition.tpl")}"

      vars {
        version   = "${var.beast-service["version"]}"
        region    = "${var.region}"
        image     = "${data.terraform_remote_state.remote_env_state.docker_namespace}/the-beast"
        log_group = "${var.log-group}"
      }
    }

    resource "aws_cloudwatch_event_target" "beast_scheduled_job_target" {
      target_id = "${aws_ecs_task_definition.beast_rule.family}"
      rule      = "${aws_cloudwatch_event_rule.beast_scheduled_job.name}"
      arn       = "${data.aws_ecs_cluster.default_cluster.id}"
      role_arn  = "${data.aws_iam_role.ecs_service_role.arn}"

      ecs_target {
        task_count          = 1
        task_definition_arn = "${aws_ecs_task_definition.beast_rule.arn}"
      }
    }

    resource "aws_cloudwatch_event_rule" "beast_scheduled_job" {
      name                = "${aws_ecs_task_definition.beast_rule.family}"
      description         = "Beast kills a container every 30 minutes from 10AM to 3PM PST Mon-Thu"
      schedule_expression = "cron(0/30 18-23 ? * MON-THU *)"
      is_enabled          = false
    }

    resource "aws_cloudwatch_log_group" "beast_log_group" {
      name = "${var.log-group}"
    }
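CloudWatch Events evaluates cron schedules in UTC, which is why the rule uses hours 18-23 rather than 10-15. As a sanity check, here is a small sketch that lists when that expression fires in Pacific time. The `local_fire_times` helper is hypothetical, and the fixed offsets (-8 for PST, -7 for PDT) are a simplifying assumption instead of a real time zone lookup.

```ruby
# UTC hours and minutes matched by cron(0/30 18-23 ? * MON-THU *)
UTC_HOURS = (18..23).to_a
MINUTES = [0, 30]

# Hypothetical helper: shift every UTC fire time by a fixed offset in hours.
def local_fire_times(utc_offset_hours)
  UTC_HOURS.product(MINUTES).map do |hour, minute|
    format("%02d:%02d", (hour + utc_offset_hours) % 24, minute)
  end
end

pst = local_fire_times(-8) # standard time
pdt = local_fire_times(-7) # daylight time

puts "Runs per day: #{pst.length}"
puts "PST window: #{pst.first} to #{pst.last}"
puts "PDT window: #{pdt.first} to #{pdt.last}"
```

Under those assumptions the last run of the day lands at 15:30 PST, and the whole window slides an hour later during daylight saving time, since the rule itself stays pinned to UTC.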
We log to CloudWatch, so we can also check whether a service was killed by the beast. It’s important to note that you need to mount the Docker socket for the beast to work, since it uses the docker CLI to talk to the host’s Docker daemon. A sample Dockerfile looks like:
    FROM ubuntu:xenial
    RUN apt-get update && apt-get install -y ruby-full docker.io build-essential
    RUN gem install json
    ADD beast.rb /app/beast.rb
    RUN chmod +x /app/beast.rb
    ENTRYPOINT "/app/beast.rb"
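One thing to watch with the socket mount: if it is missing, `docker ps` prints its error to stderr and nothing to stdout, so the script as written parses an empty list and exits with "No services to kill", which looks the same as an empty node. A preflight check along these lines could make that case visible. This is only a sketch; `docker_socket_ready?` is a hypothetical helper and is not part of the beast as shown above.

```ruby
# Hypothetical preflight check; beast.rb as shown does not do this.
# Returns true only if the path exists and is a Unix socket.
def docker_socket_ready?(path = "/var/run/docker.sock")
  File.socket?(path)
end

unless docker_socket_ready?
  warn "Docker socket not found at /var/run/docker.sock; is the volume mounted?"
end
```

In the real script you would probably exit non-zero at that point so the failed task shows up in the logs instead of passing silently.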
It’s bare bones, but it works, and the stupid quips at the end always make me chuckle.