High-profile application service outages continue to appear in the press. Investing in maturing IT management processes greatly reduces the risk of outages and increases customer satisfaction.
What do Continental Airlines, the London Stock Exchange, Hershey Foods, eBay, AT&T, MCI WorldCom, Microsoft, the New York Stock Exchange and the Federal Aviation Administration have in common? They all experienced high-profile downtime and received enormous press coverage for all the wrong reasons. For most, the outages resulted in lost revenue. For all, the downtime tarnished their company image and reputation.
Hershey Foods experienced an enterprise resource planning system rollout debacle in 1999, which prevented it from shipping products during the critical Halloween season. The cost of this mishap: a 19 percent drop in net income for 3Q99. In January 2001, Microsoft suffered a three-day outage of many of its Web sites because of a Domain Name System (DNS) configuration error, a common cause of failure for many enterprise Web sites. On 8 June 2001, a 90-minute system failure at the New York Stock Exchange showed, once again, that even the most sophisticated IT environments are vulnerable to failures, which are usually caused by human or process error.
What conclusions can be drawn about downtime as we move toward collaborative commerce, with expanded integration of business processes across enterprise boundaries? First, enterprises built on shaky foundations will incur it. Second, now that downtime is public information, it will tarnish a company's image and reputation.
Gartner research shows that an average of 80 percent of mission-critical application service downtime is directly caused by people or process failures. The other 20 percent is caused by technology failure, environmental failure or a disaster. The complexity of today's IT infrastructure and applications makes high-availability systems management enormously difficult (see "Making Smart Investments to Reduce Unplanned Downtime," TG-07-4033).
When enterprises invest in achieving higher levels of application service availability, they tend to focus on increasing redundancy in the environment. Although redundancy is critical to providing high levels of availability, it cannot and should not be the only line of defense. Enterprises must also mitigate downtime risks caused by people and process failures, which requires strong IT operations and applications development processes. IT operational processes are vital to application service availability, but are often overlooked — especially in distributed application environments — because of architecture/infrastructure complexity, immature processes and tools, and a lack of commitment to providing the needed IT resources.
Applications requiring high levels of availability must be managed with operational disciplines — also known as network and systems management (NSM) disciplines — to avoid unnecessary and potentially devastating outages. The following are proactive operations management disciplines, which have direct and high returns from an application availability perspective:
Availability management — Collecting and correlating performance and other system, network and application parameters to predict and, thus, avoid potential downtime. This discipline involves using automated tools to avoid problems (e.g., automatically increasing file space when it reaches a threshold) and job scheduling to reduce operator error and improve the availability of batch applications and data.
Problem management — Identifying, quickly resolving and preventing problems through root cause analysis. Problem management involves identifying and classifying problems, determining escalation procedures and documenting all the information surrounding the characteristics and resolution of each problem. All problems should be assigned a severity level according to the business risk and potential impact. To ensure that problems have a minimal impact on the enterprise, they must be prioritized, monitored and assessed for their likely frequency of recurrence.
Change management — Improving quality of service (e.g., experiencing less downtime) through better planning, testing, coordination and scheduling of application and IT infrastructure changes. Change is the most common cause of people and process failures. Enterprises that have established strong change management practices typically have the highest levels of availability. When a change causes a problem, enterprises must have rollback procedures to minimize the overall outage. Furthermore, changes that cause extended outages may require an enterprise to invoke its business continuity plan.
Configuration management — Understanding the relationships among IT infrastructure, application and business process components. This discipline underpins problem, change and availability management. Without an understanding of an application's or business process's IT configuration, the impact of a change or problem (e.g., an outage) cannot be readily determined, nor can its priority for resolution.
Desktop management — Often a component of change management, desktop management seeks to ensure the availability and accuracy of desktop software configurations required for user access to critical application services.
Performance management — The trending of end-to-end response time and network, system and application component performance parameters to predict short-term future performance degradation (e.g., where performance parameters are outside of the baseline). This discipline assists in quicker problem diagnosis, thus reducing downtime.
Capacity planning — Extending performance management into predicting future IT resource needs. Capacity planning uses historical trends and information on new or changing workloads to help the IS organization avoid shortages and meet its service-level objectives.
Storage management — Using proactive processes to protect the availability of data by avoiding catastrophic incidents or recovering from a data outage as quickly as possible.
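To make the flavor of these disciplines concrete, the threshold watching of availability management and the trend extrapolation of capacity planning can be sketched together in a few lines of Python. This is an illustrative sketch only, not part of the research: the function names, the 85 percent threshold and the sample usage figures are hypothetical, and a real implementation would pull its history from monitoring and performance management tools rather than hard-coded lists.

```python
def forecast(history, horizon):
    """Fit a least-squares linear trend to equally spaced observations
    (e.g., monthly storage usage) and extrapolate `horizon` periods
    beyond the last observation."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(range(n), history)) / var_x
    intercept = mean_y - slope * mean_x
    # Project the fitted line forward for `horizon` future periods.
    return [intercept + slope * (n - 1 + k) for k in range(1, horizon + 1)]

def months_until_threshold(history, capacity, threshold=0.85, horizon=24):
    """Return the first future month in which projected usage crosses
    threshold * capacity, or None if it stays below within the horizon.
    This is the point at which an availability-management alert (or an
    automated expansion) would fire."""
    limit = threshold * capacity
    for month, projected in enumerate(forecast(history, horizon), start=1):
        if projected >= limit:
            return month
    return None

# Hypothetical example: monthly usage of 400, 420, 440, 460 GB on a
# 1,000 GB volume grows 20 GB/month, crossing the 850 GB alert line
# in month 20 of the projection.
# months_until_threshold([400, 420, 440, 460], 1000)  # → 20
```

The point of the sketch is the interdependence Gartner describes: capacity planning consumes the same historical trend data that performance management collects, and its forecasts feed the thresholds that availability management automates against.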
Enterprises cannot offer consistently high levels of availability without maturing IT management processes. By investing in these processes, enterprises can mitigate their exposure to the majority of application service downtime risks involving people and process error.
This month’s Spotlight focuses on storage management, desktop management and performance management — three critical management processes affecting application service availability.
“Delivering the Right Amount of Data Availability,” T-13-9757, by Mark Nicolett, explores proactive management processes related to keeping storage and data available and protected.
“Desktop Software Configuration: Key to Desktop Availability,” T-13-9394, by Ronni Colville and Donna Scott, offers insight into best practices for achieving higher levels of desktop availability.
“Performance Management: A Framework for Success,” COM-13-9854, by Milind Govekar, offers insight into best practices for performance management.
Next, we tackle the issue of problem management. Enterprises must measure their performance to know how well they are delivering service.
“Service Desk Metrics: Time for a Change,” SPA-13-5979, by Kris Brittain and Steve Cain, explores how service desk quality metrics are changing with the increased use of Web-based self-help tools.
“24/7 Is a Management Thing,” SPA-13-2616, by Andy Kyte, explores the trade-offs between availability and cost, focusing specifically on problem management for extended applications used by external customers, suppliers and business partners.
“The High Cost of Achieving Higher Levels of Availability,” SPA-13-9852, by Donna Scott, takes a look at why costs increase exponentially as application service levels rise, and at the levels of availability that most enterprises consider “good enough.”