#20: Chaos engineering
by Tomasz Nurkiewicz
We tend to focus on testing happy paths and expected edge cases. But how do you make sure that your system can survive minor infrastructure and network failures, as well as application bugs? This matters especially in microservice or serverless environments, where there are tons of moving parts. Too many times I’ve seen systems fail miserably because some minor dependency was malfunctioning. For example, say you have a tiny service that displays a small social widget on your website. When that service is down, the rest of the website should still work. But without proper care and testing, you may end up with a global HTTP 503 failure. Code reviews and unit tests are fine, but the ultimate test is… turning off that service in production and making sure the rest actually works. This is called chaos engineering.
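The social widget scenario can be sketched in a few lines of Python. This is only an illustration, not code from the episode: `render_page`, `healthy_widget`, `broken_widget`, `WidgetUnavailable` and `run_experiment` are all hypothetical names, and the "outage" is simulated in-process rather than by stopping a real service.

```python
from typing import Callable, Tuple


class WidgetUnavailable(Exception):
    """Raised when the (hypothetical) social widget service cannot be reached."""


def healthy_widget() -> str:
    # Stand-in for a successful call to the widget service.
    return "<div>42 likes</div>"


def broken_widget() -> str:
    # Stand-in for the outage a chaos experiment causes by turning the service off.
    raise WidgetUnavailable("social widget service is down")


def render_page(fetch_widget: Callable[[], str]) -> Tuple[int, str]:
    """Render the page; a failing widget degrades the page instead of failing it."""
    try:
        widget_html = fetch_widget()
    except WidgetUnavailable:
        widget_html = ""  # drop the widget, keep the rest of the page
    return 200, "<main>Article body</main>" + widget_html


def run_experiment() -> bool:
    """Hypothesis: the page still returns HTTP 200 while the widget is down."""
    status, body = render_page(broken_widget)
    return status == 200 and "Article body" in body
```

Calling `run_experiment()` checks the hypothesis against the injected failure. In a real chaos experiment the same check would run against the live system while the dependency is actually switched off, which is exactly what unit tests like this one cannot prove on their own.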
- Chaos engineering
- Service Level Objectives from SRE book
- Amazon Found Every 100ms of Latency Cost them 1% in Sales
- Netflix’s Chaos Monkey
- Litmus - chaos engineering for Kubernetes
- Chaos Engineering Experiment Automation