The I&O Leader’s Guide to Chaos Engineering

Complement business agility gained through DevOps and cloud initiatives with chaos engineering.

Pinball players never play the exact same game twice. Just as a pinball machine is unpredictable by nature, so too is the work of keeping infrastructure systems consistently reliable.

Chaos engineering defined

Chaos engineering is the use of experimental and potentially destructive failure testing to uncover vulnerabilities and weaknesses within a complex system. It is systematically planned, documented, executed and analyzed as a “test-first” approach in preproduction infrastructure systems.  

Digital solutions involve many interdependencies because of the diversity of platforms that exist in a given organization. System complexity is becoming exponentially more difficult to plan for, especially as infrastructure is “everywhere.”

“Many organizations approach the concept of chaos engineering with the attitude that the practice is far too risky to execute into production,” says Jim Scheibmeir, Senior Principal Analyst, Gartner. “The reality is that avoiding chaos engineering is equivalent to embracing crisis engineering.”

Despite its name, chaos engineering provides an empirical approach to realize five core benefits:

  • Expose technical debt
  • Build trust both in the deployed systems and between the teams that contribute to them
  • Identify integrations and potential failure points, and make them testable
  • Enable experiment-based learning
  • Deliver improved reliability and resilience of systems to reduce downtime

Chaos engineering helps with IT leaders’ priorities

Last year, the top three priorities for CIOs were digital initiatives, revenue/business growth and operational excellence. These priorities are impossible to achieve if infrastructure systems are not adequately reliable.

“So long as organizations continue to prioritize scaling digital initiatives, the web of dependencies and complexities that digital solutions bring cannot be overlooked,” says Scheibmeir. “(I&O) teams that are not proactive in ensuring reliability will only find themselves reacting to chaos, which is essentially the same thing as accepting system downtime.” 

Make chaos engineering part of regular team operations by exposing staff to the practice, sharing chaos plans with other business units, and collaboratively experimenting and refining it over time. 

Chaos engineering supports top DevOps objectives

At present, operational efforts to improve system reliability focus too much on reactive processes that emphasize incident management and service restoration. Instead, the proactive nature of chaos engineering enables organizations to manage and mitigate the risks of system downtime and disruption. 

As a result, IT leaders can meet their top DevOps objectives, such as improving agility, improving release quality with fewer defects and improving system reliability. When chaos engineering reduces the frequency of downtime, planned infrastructure initiatives are less readily disrupted.

Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.

Get started with chaos engineering

Overcome the fear of chaos engineering in order to achieve the required system reliability. Chaos engineering can be done safely by starting the practice in the preproduction environment, which allows for organizational learning, increased reliability and a greater understanding of complex system dependencies.

Specific recommendations include:

  • Sell chaos engineering as an organizational competency by evangelizing the practice as a regular product team activity.
  • Focus your teams’ efforts on understanding user journeys, their experiences and downtime drivers, to enhance system reliability and minimize friction.
  • Require teams to engage safely in chaos engineering via a “test-first” approach in preproduction environments.
  • Urge teams to attack everything by using tools and practices that introduce failure points.
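
To make the last two recommendations concrete, here is a minimal sketch of what a preproduction failure-injection experiment might look like. It is an illustration under stated assumptions, not a Gartner method or any real tool's API: the `ChaosExperiment` class, the `InjectedFault` exception, the `failure_rate` and `seed` parameters, and the "live price" and "cached price" stand-ins are all hypothetical names chosen for this example.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class InjectedFault(Exception):
    """Hypothetical error used to simulate a dependency outage."""


@dataclass
class ChaosExperiment:
    # Every name in this class is an assumption made for illustration.
    hypothesis: str                  # what the team expects to stay true
    failure_rate: float              # chance (0.0 to 1.0) that a call fails
    environment: str = "preproduction"
    seed: int = 0                    # fixed seed keeps a run repeatable
    results: List[bool] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Guard rail: this sketch refuses to run anywhere but preproduction.
        if self.environment != "preproduction":
            raise ValueError("this sketch only runs in preproduction")
        self._rng = random.Random(self.seed)

    def call(self, dependency: Callable[[], str],
             fallback: Callable[[], str]) -> str:
        """Call the dependency, injecting a fault on some calls.

        Records True when the dependency answered and False when a fault
        was injected and the fallback had to answer instead.
        """
        if self._rng.random() < self.failure_rate:
            self.results.append(False)
            return fallback()
        self.results.append(True)
        return dependency()

    def summary(self) -> Dict[str, object]:
        """Return the documented outcome of the run for later analysis."""
        return {
            "hypothesis": self.hypothesis,
            "calls": len(self.results),
            "faults_injected": self.results.count(False),
        }


experiment = ChaosExperiment(
    hypothesis="Checkout still responds when the price service fails",
    failure_rate=0.3,
)
responses = [
    experiment.call(lambda: "live price", lambda: "cached price")
    for _ in range(100)
]
print(experiment.summary())
```

The sketch keeps the experiment planned (a written hypothesis), bounded (a preproduction guard and a fixed seed) and analyzable (a summary of how many faults were injected). A real practice would inject faults at the infrastructure level rather than inside application code.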

Gartner clients can read more in Innovation Insight for Chaos Engineering by Jim Scheibmeir et al.
