The I&O Leader’s Guide to Chaos Engineering

October 28, 2021

Contributor: Katie Costello

Chaos engineering ensures reliable infrastructure in a digital era full of continuous change.

A major U.S. airline implemented chaos engineering (CE) as an internal practice in October 2018 and discovered one large resilience vulnerability right away. The company now uses CE to foster a culture of learning and a deep understanding of its IT systems, and it avoided a potential future disruption to customer service that could have had devastating outcomes.

As enterprises continue to prioritize scaling digital initiatives, infrastructure systems must be reliable. For this airline, CE was a more dynamic way to test for unexpected failures before they occur in live systems than passive approaches such as disaster recovery or business continuity plans.

Explore the latest: Top Strategic Technology Trends for 2022

Chaos engineering defined

Chaos engineering is the use of experimental and potentially destructive failure testing to uncover vulnerabilities and weaknesses within a complex system. Gartner suggests that organizations start chaos engineering as systematically planned, documented, executed and analyzed “test-first” approaches in pre-production infrastructure systems.

Site reliability engineering (SRE) teams often use CE to proactively prove and improve resilience during fault conditions. It’s on the rise in the Gartner Hype Cycle for Software Engineering, 2021, as maximizing uptime for customers becomes increasingly important in the virtual-first world. 

Read more: Gartner Top 6 Trends Impacting Infrastructure & Operations in 2021

Chaos engineering is actually far from chaotic: it is a disciplined, data-driven approach to running experiments that use chaotic behavior to stress systems and discover their weaknesses (or prove their resilience). CE’s main benefits include:

  1. Exposing technical debt
  2. Building trust in the systems deployed and among the teams that contribute to those systems
  3. Identifying integrations and potential failure points, and making them testable
  4. Enabling experiment-based learning
  5. Delivering improved reliability and resilience of systems to reduce downtime

These benefits, in turn, help improve customer experience, customer satisfaction, customer retention and new customer acquisition. 

“Many organizations approach the concept of CE with the attitude that the practice is far too risky to execute into production,” says Jim Scheibmeir, Director Analyst, Gartner. “The reality is that avoiding CE is equivalent to embracing crisis engineering.”

Why chaos engineering is important

The two largest drivers of chaos engineering are system complexity and rising customer expectations. As systems become richer in features, they become more complex in composition and more critical to business success. Gartner client inquiries on chaos engineering have increased significantly since October 2019.

Many organizations stake their success on test plans that overemphasize software functionality and underemphasize validating the system’s reliability. 

Much like challenging the immune system with a controlled injection of a weakened virus, chaos engineering trains an organization to deal with bugs and system failures. It shifts the focus of testing from whether a system works to how it might fail gracefully, or even remain useful, under various levels of impact. CE can also help identify where product documentation is insufficient or where knowledge of a system is lacking or siloed.
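The idea of a controlled injection can be illustrated with a small fault-injection sketch. It is only a minimal illustration, not a recommended tool or a Gartner method: the names `FlakyDependency`, `fetch_recommendations`, `render_homepage` and `run_experiment`, as well as the failure rate, are hypothetical and chosen for the example. The sketch wraps a dependency so that it fails at a controlled rate, then checks whether the calling code still serves a useful (if degraded) response.

```python
import random


class FlakyDependency:
    """Hypothetical wrapper that makes a callable fail at a controlled rate."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate
        # A seeded generator keeps the experiment repeatable.
        self.rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def fetch_recommendations(user_id):
    """Stand-in for a call to a downstream service (illustrative only)."""
    return [f"item-{user_id}-{n}" for n in range(3)]


DEFAULT_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]


def render_homepage(user_id, recommender):
    """Serve the page even if the recommender fails (graceful degradation)."""
    try:
        items = recommender(user_id)
        degraded = False
    except ConnectionError:
        items = DEFAULT_RECOMMENDATIONS
        degraded = True
    return {"user": user_id, "items": items, "degraded": degraded}


def run_experiment(failure_rate, requests=1000):
    """Inject faults at the given rate and count how the system responded."""
    recommender = FlakyDependency(fetch_recommendations, failure_rate)
    pages = [render_homepage(uid, recommender) for uid in range(requests)]
    served = sum(1 for page in pages if page["items"])
    degraded = sum(1 for page in pages if page["degraded"])
    return {"requests": requests, "served": served, "degraded": degraded}


if __name__ == "__main__":
    print(run_experiment(failure_rate=0.3))
```

In this sketch every request is still served while a share of them fall back to default content, which is the kind of graceful failure such an experiment is meant to confirm. A real experiment would target actual infrastructure in a test environment rather than an in-process stub.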

Chaos engineering solves top DevOps objectives

At present, operational efforts to improve system reliability focus too much on reactive processes that emphasize incident management and service restoration. By contrast, the proactive nature of chaos engineering enables organizations to manage and mitigate the risks of system downtime and disruption. 

As a result, you can meet top DevOps objectives such as improving agility, raising release quality by reducing defects and increasing system reliability. When chaos engineering reduces the frequency of downtime, planned infrastructure initiatives are disrupted less often.

Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.

Obstacles to chaos engineering

Chaos engineering practices are relatively new, but they are an important part of the arsenal of high-performing teams. Because the practice is still in its infancy, there are three core obstacles to adopting CE:

  1. Perception: Within many organizations, the predominant view is that CE is inherently random and done first in production, which makes the practice seem riskier than it is.
  2. Organizational culture: Organizational culture and attitudes toward quality and testing sometimes get in the way of CE adoption. When quality and testing are only viewed as overhead, there tends to be a focus on feature development over application reliability.
  3. Time and budget: As with most new initiatives, simply gaining the time and budget to invest in learning the practice and associated technologies is a challenge. Organizations must reach minimum levels of expertise where value is returned, which does not happen overnight.

Get started with chaos engineering

Infrastructure and operations (I&O) teams that are not proactive in ensuring reliability will only find themselves reacting to chaos, which is essentially the same thing as accepting system downtime.

Gartner suggests a test-first approach to chaos engineering. Execute attack plans within test environments, and apply the learning and value to production systems.
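One way to picture a test-first attack plan is as a small record that states its hypothesis up front and refuses to run outside approved environments. The sketch below is a minimal illustration under stated assumptions: the `AttackPlan` class, its fields, the `ALLOWED_ENVIRONMENTS` set and the error-rate threshold are all hypothetical and are not taken from Gartner or from any specific chaos engineering tool.

```python
from dataclasses import dataclass, field

# Assumption: the environments where attacks are permitted in a test-first practice.
ALLOWED_ENVIRONMENTS = {"test", "staging"}


@dataclass
class AttackPlan:
    """Hypothetical record of a planned, documented chaos experiment."""

    name: str
    hypothesis: str
    environment: str
    max_error_rate: float  # threshold at which the hypothesis is rejected
    findings: list = field(default_factory=list)

    def execute(self, attack):
        """Run the attack and record whether the hypothesis held.

        `attack` is any callable that performs the fault injection and
        returns the observed error rate as a float.
        """
        if self.environment not in ALLOWED_ENVIRONMENTS:
            raise PermissionError(
                f"{self.name}: attacks are not allowed in {self.environment!r}"
            )
        observed = attack()
        held = observed <= self.max_error_rate
        self.findings.append(
            {"observed_error_rate": observed, "hypothesis_held": held}
        )
        return held


if __name__ == "__main__":
    plan = AttackPlan(
        name="cache-outage",
        hypothesis="Checkout error rate stays at or below 5% without the cache",
        environment="staging",
        max_error_rate=0.05,
    )
    # The lambda stands in for a real fault injection and measurement step.
    print(plan.execute(lambda: 0.02), plan.findings)
```

Keeping the hypothesis, the threshold and the findings together in one record makes each experiment easy to document and analyze afterward, and the environment check keeps early experiments out of production.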

Make chaos engineering part of your regular team operations by exposing staff to the practice, sharing chaos plans with other business units, and collaboratively experimenting and refining it over time. 

Overcome the fear of chaos engineering so that you can achieve the system reliability you require. Chaos engineering can be done safely by starting the practice in the pre-production environment, which allows for organizational learning, increased reliability and a greater understanding of complex system dependencies.

This article has been updated from the March 2020 original to reflect new events, conditions or research.
