October 28, 2021
Contributor: Katie Costello
Chaos engineering ensures reliable infrastructure in a digital era full of continuous change.
A major U.S. airline implemented chaos engineering (CE) as an internal practice in October 2018 and discovered one large resilience vulnerability right away. The company now uses CE to foster a culture of learning and a deep understanding of its IT systems, and it successfully avoided a potential future breach of customer service that could have had devastating outcomes.
As enterprises continue to prioritize scaling digital initiatives, infrastructure systems must be reliable. CE served as a more dynamic way for this airline to test for unexpected failures before they occur live, compared to passive approaches such as disaster recovery or business continuity plans.
Explore the latest: Top Strategic Technology Trends for 2022
Chaos engineering is the use of experimental and potentially destructive failure testing to uncover vulnerabilities and weaknesses within a complex system. Gartner suggests that organizations start chaos engineering as systematically planned, documented, executed and analyzed “test-first” approaches in pre-production infrastructure systems.
Site reliability engineering (SRE) teams often use CE to proactively prove and improve resilience during fault conditions. It’s on the rise in the Gartner Hype Cycle for Software Engineering, 2021, as maximizing uptime for customers becomes increasingly important in the virtual-first world.
Read more: Gartner Top 6 Trends Impacting Infrastructure & Operations in 2021
Chaos engineering is actually far from chaotic — it is a disciplined data-driven approach to running experiments that use chaotic behavior to stress systems and discover their weaknesses (or prove their resilience). CE’s main benefits include:
These benefits, in turn, help improve customer experience, customer satisfaction, customer retention and new customer acquisition.
“Many organizations approach the concept of CE with the attitude that the practice is far too risky to execute into production,” says Jim Scheibmeir, Director Analyst, Gartner. “The reality is that avoiding CE is equivalent to embracing crisis engineering.”
The two largest drivers of chaos engineering are complexity in systems and increasing customer expectations. As systems become more rich in features, they become more complex in composition and more critical to business success. Gartner client inquiries on chaos engineering have increased significantly since October 2019.
Many organizations stake their success on test plans that overemphasize software functionality and underemphasize validating the system’s reliability.
Much like attacking the immune system with a controlled injection of a weakened virus, chaos engineering trains an organization to deal with bugs and system failures. It moves the focus of testing a system to how it might gracefully fail or even continue to be useful while under various levels of impact. CE can also help identify where product documentation is less than sufficient or knowledge of a system is lacking or siloed.
At present, operational efforts to improve system reliability focus too much on reactive processes that emphasize incident management and service restoration. By contrast, the proactive nature of chaos engineering enables organizations to manage and mitigate the risks of system downtime and disruption.
As a result, you can meet top DevOps objectives such as improving agility, release quality (with fewer defects) and system reliability. Planned infrastructure initiatives are also less likely to be disrupted once chaos engineering reduces the frequency of downtime.
Gartner anticipates that 40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.
Chaos engineering practices are relatively new, but they are an important part of the arsenal of high-performing teams. Because the practice is still in its infancy, there are three core obstacles to adopting CE:
Infrastructure and operations (I&O) teams that are not proactive in ensuring reliability will only find themselves reacting to chaos, which is essentially the same thing as accepting system downtime.
Gartner suggests a test-first approach to chaos engineering. Execute attack plans within test environments, and apply the learning and value to production systems.
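To make the test-first idea concrete, here is a minimal sketch of what one such experiment could look like in a test environment. It is illustrative only and not a Gartner-prescribed method: the service (`get_price`), its dependency (`fetch_price`), the fault injector (`flaky`) and the cached fallback are all hypothetical names invented for this example. The experiment injects connection failures into the dependency and measures whether the service still answers requests.

```python
import random


def flaky(func, failure_rate, rng):
    """Wrap func so it raises ConnectionError with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    """Hypothetical downstream dependency (assumed for this sketch)."""
    return {"item": item, "price": 10.0, "source": "live"}


def get_price(item, dependency, cache):
    """Hypothetical service under test: falls back to a cached price on failure."""
    try:
        result = dependency(item)
        cache[item] = result["price"]
        return result
    except ConnectionError:
        if item in cache:
            return {"item": item, "price": cache[item], "source": "cache"}
        raise  # no fallback available, so the failure reaches the caller


def run_experiment(failure_rate, warm_cache=True, requests=1000, seed=42):
    """Inject faults at the given rate and return the fraction of requests served."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    dependency = flaky(fetch_price, failure_rate, rng)
    cache = {"widget": 10.0} if warm_cache else {}
    served = 0
    for _ in range(requests):
        try:
            get_price("widget", dependency, cache)
            served += 1
        except ConnectionError:
            pass
    return served / requests


if __name__ == "__main__":
    # Steady-state hypothesis (an assumed target): at least 99% of requests
    # are served even when half of the dependency calls fail.
    rate = run_experiment(failure_rate=0.5)
    print(f"served {rate:.1%} of requests; hypothesis holds: {rate >= 0.99}")
```

Running the same experiment with `warm_cache=False` and `failure_rate=1.0` shows the weakness the fallback hides: with nothing cached, every request fails. Findings like this from a test environment are what the team would then carry over to production systems.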
Make chaos engineering part of your regular team operations by exposing staff to the practice, sharing chaos plans with other business units, and collaboratively experimenting and refining it over time.
Overcome the fear of chaos engineering as a means to achieve required system reliability. Chaos engineering can be done safely by starting the practice in the pre-production environment, which allows for organizational learning, increased reliability and a greater understanding of complex system dependencies.
This article has been updated from the March 2020 original to reflect new events, conditions or research.
Recommended resources for Gartner clients*:
Innovation Insight for Chaos Engineering
*Note that some documents may not be available to all Gartner clients.