Tagged: fault tolerance

Chaos monkey for docker

I work at a mostly AWS shop, and while we still have services on raw EC2, nearly all of our new development is on Amazon ECS in docker. I like docker because it provides a unified unit of operation (a container) that makes it easy to build shared tooling regardless of language/application. It also lets you reproduce your applications local in the same environment they run remote, as well as starting fast and deploying fast.

However, many services run on a shared ECS node in a cluster, and so while things like Chaos Monkey may run around turning nodes off it’d be nice to have a little less of an impact during working hours while still being able to stress recovery and our alerting.

This is actually pretty easy though with a little docker container we call The Beast. All the beast does is run on a ECS Scheduled … Read more