Regardless of the type of environment you work in, you’ve likely experienced an outage or loss of service at some point during your workday. Outages are frustrating, but they’re also inevitable. In today’s 24/7 digital world, how an organization responds to downtime is more critical than ever.
Enter IT resilience: infrastructure that can sustain disturbances with foreseen performance impacts and then either restore its structure and capabilities or quickly adapt to new levels of operational requirements. An organization’s people and culture are some of the most vital components in delivering a resilient digital infrastructure.
“I&O leaders planning and delivering resilient digital infrastructure must realize that people are just as important as infrastructure and processes,” says Mark Jaggers, Senior Director Analyst at Gartner. Jaggers identifies four areas I&O leaders must concentrate on to achieve IT resilience at their organization.
Focus on continuous improvement
Today, teams often have “firefighter” or “hero” mentalities that focus on rapidly fixing problems rather than proactively planning to improve the overall environment to reduce outages. The hero is not always the one who fixes the problem. Rather, the real heroes are the ones who prevent a crisis from happening in the first place.
Although there is value in saving a burning building, there is even more value in preventing the entire town from catching fire. One way to do this is to conduct a premortem analysis to predict failures and learn how to respond before there is a problem. In doing so, leaders can reduce the instances of unforeseen outages and identify new ways to prepare and adapt the system moving forward. Improving time to detect (TTD) and time to repair (TTR), and building systems that automate responses to outages, are also important.
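As a rough illustration of how TTD and TTR can be tracked, the sketch below averages both across a set of incidents. The `Incident` record, its field names and the sample timestamps are hypothetical, and the sketch assumes TTR is measured from detection to restoration; some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime    # when the fault began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_duration(durations):
    """Average a collection of timedelta values."""
    durations = list(durations)
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mean_ttd(incidents):
    """Mean time to detect: fault start to detection."""
    return mean_duration(i.detected - i.started for i in incidents)


def mean_ttr(incidents):
    """Mean time to repair: detection to restoration (an assumption)."""
    return mean_duration(i.resolved - i.detected for i in incidents)


# Made-up sample data for demonstration.
incidents = [
    Incident(datetime(2024, 1, 5, 9, 0),
             datetime(2024, 1, 5, 9, 12),
             datetime(2024, 1, 5, 10, 0)),
    Incident(datetime(2024, 2, 11, 22, 30),
             datetime(2024, 2, 11, 22, 38),
             datetime(2024, 2, 11, 23, 30)),
]

print("Mean TTD:", mean_ttd(incidents))  # 0:10:00
print("Mean TTR:", mean_ttr(incidents))  # 0:50:00
```

Watching these two averages over successive quarters gives a simple signal of whether detection and response are actually improving.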
Put site reliability engineering (SRE) principles to work
An SRE team includes people with skills in software development, networking and/or system engineering. This team will spend at least half of its time creating automated fixes, including automated detection of incidents. The goal is to learn from those incidents and then transfer that knowledge to others to drive a more resilient IT ecosystem. “The SRE team is uniquely challenged to not only focus on finding issues in the source code or operations elements of an IT system, but will also need to work with, train and influence other teams,” Jaggers says.
Create a culture of shared responsibility
With any outage comes the blame game, which is counterproductive and doesn’t solve the underlying problem. While humans are usually the first ones blamed for system outages, failure is often due to systemic conditions, reaching across a combination of processes, infrastructure and human factors.
Instead, adopt a blameless approach in which an outage serves as a learning opportunity addressing the gap between the way that things are and the way that they should be. By doing so, you set your organization up to learn more about what went wrong in the past and what to change so it doesn’t happen again. Additionally, carrying out a specific post-incident review, or blameless postmortem, can enable you to understand the many contributing factors to the incident.
Deploy geographically distributed teams
Uptime and 24/7 availability are key expectations for doing business in the digital age, yet IT infrastructures can be precarious. In fact, Gartner predicts that within the next five years, there will be a major internet outage that impacts more than 100 million users for longer than 24 hours. To help prepare for and mitigate the possibility of a major outage, location diversity for your IT infrastructure and operations team is key. That way, you have staff covering your business across different time zones.
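One way to sanity-check time zone coverage is to map each team's working day onto UTC and look for hours that no one covers. The sketch below is a simplified illustration: the team names, whole-hour UTC offsets and 09:00 to 17:00 local working day are all assumptions, and it ignores daylight saving time, weekends and on-call rotations.

```python
def covered_utc_hours(utc_offset, start_local=9, end_local=17):
    """UTC hours covered by a team working start_local to end_local locally."""
    return {(hour - utc_offset) % 24 for hour in range(start_local, end_local)}


def uncovered_hours(team_offsets):
    """Return the sorted UTC hours during which no team is at work."""
    covered = set()
    for offset in team_offsets.values():
        covered |= covered_utc_hours(offset)
    return sorted(set(range(24)) - covered)


# Hypothetical teams with whole-hour UTC offsets.
teams = {"Americas": -5, "Europe": 1, "Asia-Pacific": 8}

print(uncovered_hours(teams))  # [0, 22, 23]
```

With these sample offsets, three UTC hours remain unstaffed, which is the kind of gap an on-call rotation or an additional location would need to close.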
Create a culture that prioritizes resilience over remediation by emphasizing continuous process improvement to maximize continuity of delivery and minimize downtime. By focusing on and prioritizing issues such as increasing uptime, lowering time to detect and automating responses, your organization is on the road to outage prevention.