Imagine you are preparing to leave early on a Friday evening when a notification on your phone tells you that one of your production services is down and clients can no longer use the system. You guessed it: you need to stay late (at the office or at home) and fix it. How quickly you fix it depends on many factors, such as how well you understand the system, its architecture, its infrastructure, and its networking.
You are not the only one who suffers. If your application is mission-critical, every minute of downtime counts and can have a huge business impact. To make such situations less likely, you can include Chaos Engineering in your project.
What is Chaos Engineering?
Chaos Engineering is the practice of deliberately injecting faults into your system to test its resilience. It helps you understand your system better and answers questions like:
- What if a third-party API starts responding slowly?
- What if your database becomes unreachable?
- What if a whole region goes down?
- What if a downstream Lambda times out?
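To make the first two questions concrete, here is a minimal sketch in Python of a fault-injection experiment. It is an illustration under stated assumptions, not how any particular chaos tool works. Every name in it (`third_party_price_api`, `inject_latency`, `inject_failure`, `get_price`) is hypothetical, and the dependency is a local stand-in rather than a real API. Real experiments usually inject faults at the infrastructure or network level, but the idea is the same: break a dependency on purpose and check that the caller degrades gracefully.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def third_party_price_api(item):
    """Hypothetical stand-in for a call to a real third-party API."""
    return {"item": item, "price": 100, "source": "live"}


def inject_latency(func, delay_seconds):
    """Fault injector: every call to the wrapped function is delayed."""
    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slowed


def inject_failure(func, error):
    """Fault injector: every call to the wrapped function raises `error`."""
    def broken(*args, **kwargs):
        raise error
    return broken


def get_price(api, item, timeout_seconds=0.2):
    """Code under test: call the dependency with a timeout and fall back
    to a degraded response if it is too slow or unreachable."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(api, item)
        return future.result(timeout=timeout_seconds)
    except (FutureTimeout, ConnectionError):
        return {"item": item, "price": None, "source": "fallback"}
    finally:
        # Do not block on a slow call that we have already given up on.
        pool.shutdown(wait=False)


def run_experiments():
    """Run a baseline plus two fault scenarios and record who served the response."""
    results = {}
    results["baseline"] = get_price(third_party_price_api, "book")["source"]

    slow_api = inject_latency(third_party_price_api, delay_seconds=0.5)
    results["slow dependency"] = get_price(slow_api, "book")["source"]

    down_api = inject_failure(third_party_price_api, ConnectionError("unreachable"))
    results["dependency unreachable"] = get_price(down_api, "book")["source"]
    return results


if __name__ == "__main__":
    for scenario, source in run_experiments().items():
        print(f"{scenario}: served from {source}")
```

If `get_price` had no timeout or fallback, the slow-dependency scenario would hang for the full delay and the unreachable scenario would crash the caller. Surfacing that kind of weakness before production does is the point of the experiment.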
Once development is complete, a system typically consists of many parts (business logic, databases, containers, etc.) that work in harmony to enable the business flow. If any one part fails, the chance of an outage increases. You may have read about the Facebook outage on Oct 4, 2021: its services were down for about six hours because of configuration changes on its backbone routers. Many businesses rely on its APIs, so just think of the business impact.
Your application may not be as big as Facebook, Amazon, Flipkart, or Uber, but given the distributed nature of cloud applications, some part of your system is bound to fail sooner or later.
Everything breaks, we plan on it.
– IBM
Chaos Engineering is no longer just a concept. Many major companies have started applying it to their products and projects. Test your system before it tests you and you end up in the news.
Quality Testing vs. Chaos Engineering
But you already have a QA team that ensures no bugs slip into the production environment. That’s true, but the QA team generally checks the business use cases and performs some load testing. It is usually not concerned with the number of services running in the cloud, the storage the system uses, API Gateway concepts, Kubernetes, and so on. The QA team performs usability checks, whereas Chaos Engineering helps you perform reliability checks. Many organizations have a separate SRE (Site Reliability Engineering) team whose job is to keep the system from crashing, and this is the team that typically practices Chaos Engineering. On a small project, developers can play the role of the SRE team.