Pinball players never play the exact same game twice. Just as a pinball machine is unpredictable by nature, so is the work of keeping infrastructure systems consistently reliable.
Chaos engineering defined
Chaos engineering is the use of experimental and potentially destructive failure testing to uncover vulnerabilities and weaknesses within a complex system. It is systematically planned, documented, executed and analyzed as a “test-first” approach in preproduction infrastructure systems.
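To make the "planned, documented, executed and analyzed" cycle concrete, here is a minimal Python sketch of a single experiment. Everything in it is a hypothetical illustration rather than any particular tool's API: the `ChaosExperiment` class, the toy `replicas` dictionary and the `service_is_up` check are all assumptions standing in for a real preproduction system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChaosExperiment:
    """One planned, documented experiment (hypothetical structure)."""
    name: str
    steady_state: Callable[[], bool]    # hypothesis: True while the system is healthy
    inject_failure: Callable[[], None]  # the potentially destructive step
    rollback: Callable[[], None]        # restores the system afterwards
    log: List[str] = field(default_factory=list)

    def run(self) -> bool:
        # Plan: refuse to start unless the system is healthy beforehand.
        if not self.steady_state():
            self.log.append("aborted: system unhealthy before injection")
            return False
        try:
            # Execute: inject the failure, then re-check the hypothesis.
            self.inject_failure()
            self.log.append("failure injected")
            survived = self.steady_state()
        finally:
            # Always undo the damage, even if the check itself raised.
            self.rollback()
            self.log.append("rolled back")
        # Analyze: record whether the hypothesis held.
        self.log.append("hypothesis held" if survived else "weakness found")
        return survived


# Toy preproduction system (assumed): a service backed by two replicas.
replicas = {"a": True, "b": True}


def service_is_up() -> bool:
    return any(replicas.values())


experiment = ChaosExperiment(
    name="lose one replica",
    steady_state=service_is_up,
    inject_failure=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True, b=True),
)
result = experiment.run()
```

The `log` list plays the role of the documentation step: each run leaves a record of what was injected and whether the steady-state hypothesis survived, which is what a team would review afterwards.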
Digital solutions involve many interdependencies because of the diversity of platforms that exist in a given organization. System complexity is becoming exponentially more difficult to plan for, especially as infrastructure is “everywhere.”
“Many organizations approach the concept of chaos engineering with the attitude that the practice is far too risky to execute into production,” says Jim Scheibmeir, Senior Principal Analyst, Gartner. “The reality is that avoiding chaos engineering is equivalent to embracing crisis engineering.”
Despite its name, chaos engineering provides an empirical approach to realize five core benefits:
- Expose technical debt
- Build trust in the systems deployed, but also between the teams that contribute to those systems
- Identify and make testable integrations and potential failure points
- Enable experiment-based learning
- Deliver improved reliability and resilience of systems to reduce downtime
Chaos engineering helps with IT leaders’ priorities
Last year, the top three priorities for CIOs were digital initiatives, revenue/business growth and operational excellence. These priorities are impossible to achieve if infrastructure systems are not adequately reliable.
“So long as organizations continue to prioritize scaling digital initiatives, the web of dependencies and complexities that digital solutions bring cannot be overlooked,” says Scheibmeir. “[Infrastructure and operations] (I&O) teams that are not proactive in ensuring reliability will only find themselves reacting to chaos, which is essentially the same thing as accepting system downtime.”
Make chaos engineering part of regular team operations by exposing staff to the practice, sharing chaos plans with other business units, and collaboratively experimenting and refining it over time.
Chaos engineering solves top DevOps objectives
At present, operational efforts to improve system reliability focus too much on reactive processes that emphasize incident management and service restoration. Instead, the proactive nature of chaos engineering enables organizations to manage and mitigate the risks of system downtime and disruption.
As a result, IT leaders can meet their top DevOps objectives, such as improving agility, improving release quality with fewer defects and improving system reliability. When chaos engineering reduces the frequency of downtime, planned infrastructure initiatives are disrupted less often.
Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.
Get started with chaos engineering
Overcome the fear of chaos engineering in order to achieve the required system reliability. Chaos engineering can be done safely by starting the practice in the preproduction environment, which allows for organizational learning, increased reliability and a greater understanding of complex system dependencies.
Specific recommendations include:
- Sell chaos engineering as an organizational competency by evangelizing the practice as a regular product team activity.
- Focus your teams’ efforts on understanding user journeys, their experiences and downtime drivers, to enhance system reliability and minimize friction.
- Require teams to engage in chaos engineering safely via a “test-first” approach in preproduction environments.
- Urge teams to attack everything by using tools and practices that introduce failure points.
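As one illustration of introducing failure points around a user journey, the Python sketch below wraps a dependency call so that it fails at a configurable rate, then checks how often the journey degrades. The names `with_faults`, `InjectedFault` and `checkout`, the 30% failure rate and the cached-price fallback are all assumptions made for this example, not features of any specific chaos tool.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised by the wrapper in place of a real dependency error."""


def with_faults(call: Callable[[], str], failure_rate: float,
                rng: random.Random) -> Callable[[], str]:
    """Wrap a dependency call so it fails with probability failure_rate."""
    def wrapped() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return call()
    return wrapped


def checkout(price_lookup: Callable[[], str]) -> str:
    """Hypothetical user journey: fall back to a cached price on failure."""
    try:
        return price_lookup()
    except InjectedFault:
        return "cached-price"


# Seeded generator so a preproduction run can be repeated exactly.
rng = random.Random(0)
flaky_lookup = with_faults(lambda: "live-price", failure_rate=0.3, rng=rng)

# Exercise the journey many times and count how often it degraded.
outcomes = [checkout(flaky_lookup) for _ in range(1000)]
degraded = outcomes.count("cached-price")
```

If `checkout` had no fallback, the injected fault would surface as an unhandled exception, which is exactly the kind of weakness the experiment is meant to expose before it reaches production.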