Chaos Monkey for Docker

I work at a mostly AWS shop, and while we still have services on raw EC2, nearly all of our new development runs on Amazon ECS in Docker. I like Docker because it provides a unified unit of operation (a container) that makes it easy to build shared tooling regardless of language or application. It also lets you reproduce your applications locally in the same environment they run in remotely, and containers start and deploy fast.

However, many services share a single ECS node in a cluster. Tools like Chaos Monkey may run around turning whole nodes off, but during working hours it’d be nice to have a smaller impact while still stressing our recovery and our alerting.

This is actually pretty easy with a little Docker container we call The Beast. The Beast runs as an ECS scheduled task every 15-30 minutes from 10am to 3pm PST (we have teams on both the east and west coasts) and kills a random container on whatever cluster node it lands on. It doesn’t do a lot of damage, but it does test your fault tolerance.

Here’s The Beast:

#!/usr/bin/env ruby

require 'json'
require 'pp'

class Hash
  def extract_subhash(*extract)
    h2 = self.select{|key, value| extract.include?(key) }
    self.delete_if {|key, value| extract.include?(key) }
    h2
  end
end

puts "UNLEASH THE BEAST!"

ignore_image_regex = ENV["IGNORED_REGEX"]

raw = "[#{`docker ps --format '{{json .}}'`.lines.join(',')}]"

running_services = JSON.parse(raw).map { |val| val.extract_subhash("ID", "Image")}

puts running_services

puts "Ignoring regex #{ignore_image_regex}"

if ignore_image_regex && ignore_image_regex.length > 0
  running_services.delete_if {|value|
    /#{ignore_image_regex}/ === value["Image"]
  }
end

if !running_services || running_services.length == 0
  puts "No services to kill"

  Process.exit(0)
end

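# Show what is left after the ignore filter has been applied
puts "Bag of services to kill: #{running_services.inspect}"

to_kill = running_services.sample

# inspect returns a string; pp prints the hash itself, so it would show up twice
puts "Killing #{to_kill.inspect}"

`docker kill #{to_kill["ID"]}`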

quips = [
    "Don't fear the reaper",
    "BEAST MODE",
    "You been rubby'd",
    "Pager doody"
]

# Array#sample picks a random element, so no separate PRNG is needed
puts quips.sample

The Beast supports a regex of ignored images, so critical images (like the ecs_agent and The Beast itself) can be marked as ignored. Updating the regex also lets you exempt a service from The Beast temporarily.
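To see how the ignore regex behaves without touching a real node, here is a minimal sketch of the same filtering step on plain data. The `killable` helper and the sample image names are made up for illustration and are not part of beast.rb.

```ruby
# Hypothetical helper: mirrors the IGNORED_REGEX filtering step on plain data.
# Returns the containers that are still eligible to be killed.
def killable(containers, ignored_regex)
  # An unset or empty regex means nothing is ignored
  return containers if ignored_regex.nil? || ignored_regex.empty?

  pattern = Regexp.new(ignored_regex)
  containers.reject { |c| pattern =~ c["Image"] }
end

# Illustrative image names only; check what `docker ps` reports on your nodes
containers = [
  { "ID" => "a1", "Image" => "example/the-beast:1.0" },
  { "ID" => "b2", "Image" => "amazon/amazon-ecs-agent:latest" },
  { "ID" => "c3", "Image" => "example/orders-api:1.4.2" }
]

puts killable(containers, ".*ecs_agent.*|.*the-beast.*").map { |c| c["Image"] }.inspect
puts killable(containers, ".*ecs[-_]agent.*|.*the-beast.*").map { |c| c["Image"] }.inspect
```

The pattern is matched against the Image string exactly as `docker ps` reports it, so in this sketch `ecs_agent` (underscore) does not protect an image named with a hyphen, while the second pattern does. It is worth checking the names on your own nodes before trusting the regex.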

We deploy The Beast with Terraform. The general task definition looks like:

[
  {
    "name": "the-beast",
    "image": "${image}:${version}",
    "cpu": 10,
    "memory": 50,
    "essential": true,
    "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "${log_group}",
          "awslogs-region": "${region}",
          "awslogs-stream-prefix": "the-beast"
        }
    },
    "environment": [
        {
          "name": "IGNORED_REGEX", "value": ".*ecs_agent.*|.*the-beast.*"
        }
    ],
    "mountPoints": [
        { "sourceVolume": "docker-socket", "containerPath": "/var/run/docker.sock", "readOnly": true }
    ]
  }
]

And the Terraform:

resource "aws_ecs_task_definition" "beast_rule" {
  family = "beast-service"
  container_definitions = "${data.template_file.task_definition.rendered}"

  volume {
    name = "docker-socket"
    host_path = "/var/run/docker.sock"
  }
}

data "template_file" "task_definition" {
  template = "${file("${path.module}/files/task-definition.tpl")}"

  vars {
    version = "${var.beast-service["version"]}"
    region = "${var.region}"
    image = "${data.terraform_remote_state.remote_env_state.docker_namespace}/the-beast"
    log_group = "${var.log-group}"
  }
}

resource "aws_cloudwatch_event_target" "beast_scheduled_job_target" {
  target_id = "${aws_ecs_task_definition.beast_rule.family}"
  rule = "${aws_cloudwatch_event_rule.beast_scheduled_job.name}"
  arn = "${data.aws_ecs_cluster.default_cluster.id}"
  role_arn = "${data.aws_iam_role.ecs_service_role.arn}"
  ecs_target {
    task_count = 1
    task_definition_arn = "${aws_ecs_task_definition.beast_rule.arn}"
  }
}

resource "aws_cloudwatch_event_rule" "beast_scheduled_job" {
  name = "${aws_ecs_task_definition.beast_rule.family}"
  description = "Beast kills a container every 30 minutes from 10AM to 3PM PST Mon-Thu"
  schedule_expression = "cron(0/30 18-23 ? * MON-THU *)"
  is_enabled = false
}

resource "aws_cloudwatch_log_group" "beast_log_group" {
  name = "${var.log-group}"
}
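
CloudWatch Events cron expressions are evaluated in UTC, so the `18-23` hour range above only lines up with 10AM PST at a fixed UTC-8 offset. As a rough sketch, this enumerates the firing times implied by the expression at a given offset (`beast_schedule` is a hypothetical helper written for this post, not an AWS API):

```ruby
# Hypothetical helper: lists the daily firing times of
# cron(0/30 18-23 ? * MON-THU *) shifted by a fixed UTC offset.
def beast_schedule(utc_offset_hours)
  (18..23).flat_map do |hour|
    # 0/30 in the minutes field means minute 0 and minute 30
    [0, 30].map do |minute|
      local_hour = (hour + utc_offset_hours) % 24
      format("%02d:%02d", local_hour, minute)
    end
  end
end

pst = beast_schedule(-8) # Pacific Standard Time
pdt = beast_schedule(-7) # Pacific Daylight Time

puts "PST: #{pst.first} to #{pst.last} (#{pst.length} runs)"
puts "PDT: #{pdt.first} to #{pdt.last} (#{pdt.length} runs)"
```

The last run of the day is at half past the final hour, and during daylight saving time the whole window shifts an hour later in local time unless the expression is adjusted.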

We log to CloudWatch, so we can correlate back whether a service was killed by The Beast. It’s important to note that you need to mount the Docker socket for The Beast to work, since it uses the docker CLI to talk to the host’s Docker daemon. A sample Dockerfile looks like:

FROM ubuntu:xenial

RUN apt-get update && apt-get install -y ruby-full docker.io build-essential

RUN gem install json

ADD beast.rb /app/beast.rb

RUN chmod +x /app/beast.rb

ENTRYPOINT ["/app/beast.rb"]

It’s bare bones, but it works, and the stupid quips at the end always make me chuckle.
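
If you want the CloudWatch correlation mentioned earlier to be easier, one option is to have The Beast print a single structured line per kill. This is only a sketch: `kill_event` and its field names are hypothetical and not part of the script above.

```ruby
require 'json'
require 'time'

# Hypothetical helper: builds one JSON log line describing a kill,
# using the same "ID" and "Image" keys that docker ps returns.
def kill_event(container, now = Time.now.utc)
  event = {
    "event"     => "beast_kill",
    "container" => container["ID"],
    "image"     => container["Image"],
    "killed_at" => now.iso8601
  }
  JSON.generate(event)
end

# Illustrative container; in beast.rb this would be the to_kill hash
puts kill_event({ "ID" => "abc123", "Image" => "example/orders-api:1.4.2" })
```

Since the awslogs driver ships stdout to the log group, a line like this can be searched for by image name or container ID when a service restarts unexpectedly.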
