The Value of Integrating Availability and Performance Management and Service Desk Tools

With business managers demanding greater IT availability and improved customer service quality, IT operations are revisiting availability and performance management (A&PM) and service desk integration as a way to increase IT management efficiencies, reduce the time it takes to identify and resolve IT outages, and reduce the impact IT issues have on the business. Integration also improves reporting and trend analysis by tying service desk tickets back to the real reason for the outage. It ensures that a ticket is created for each outage (whether an end user calls the service desk or not) and that the outage time is properly recorded in the ticket (the service desk doesn't always know exactly when the outage first occurred). With the appropriate expectations, resources and planning, the value gained by integrating event and problem management can be greater than the value of the tools used independently.

The ability to pass A&PM events from data center operations to service desks has been available for more than two decades; however, IT organizations that have successfully automated the capability are in the minority, with most continuing to rely on the manual logging of issues detected by the A&PM tools. Today's business demands quality of service, application availability managed to service-level agreements (SLAs) and agility, all of which rely on higher degrees of IT operations efficiency. The ability to rapidly understand, track and remediate an IT event or fault is a critical step in aligning IT management with the needs of the business. Increased IT management process maturity, IT operations accountability, the adoption of best practices and the demand for greater IT operations efficiencies are all drivers that increase the chances of meeting a set of integration objectives. Integration objectives include:

Reducing downtime and increasing IT service quality and end-user satisfaction
Optimizing resources and increasing efficiencies (i.e., more issues solved at Level 1 support)
Removing delays and gaps in the fault-to-resolution process, and reducing the mean time to repair (MTTR)
Proactively informing the end users of issues (before the end users report the issue)
Automatically escalating issues in line with SLAs
Improving the accuracy of the fault-to-resolution reporting (i.e., providing specific downtime and MTTR statistics)
Improving trend and analysis reporting

Without integrating A&PM and service desk tools, the ability to manage outages remains end-user-driven, leaving the service desk unaware of an issue until an end user reports it. The end user is then questioned about the symptoms, with IT operations trying to recover the situation in isolation. Even when communication among the IT operations teams occurs (face-to-face, via telephone or via cubicle shouting), recovery time and the ability to track and manage the fault are seriously diminished.

In a manual event management process, events are reported by e-mail or phone call, and an incident ticket is opened and updated by the service desk with continual updates from the data center operations team. This requires a high degree of cross-function organizational collaboration to ensure each issue is managed effectively. A manual process requires far greater communication than an automated procedure, because end users and the service desk need regular updates. Typically, a manual process is prone to longer recovery times, due to lapses in communication and human error.

When an outage occurs, there are four timelines: fault, data center operations, service desk and business. Figure 1 shows a typical fault-to-resolution process with limited A&PM and service desk integration. The fault is noticed first by the business. The end users report an issue to the service desk and this opens an incident ticket. The fault, something that has not occurred before, is managed through the problem management system. IT operations personnel work to identify, isolate and fix the fault, providing the service desk with regular updates. Once the fault is resolved, data center operations pass this information to the service desk, which then closes the incident and informs the users. This fault-to-resolution process is triggered by the business end user, driven by human interaction, and impacted by communication "lag."

An automated procedure requires less manual intervention. The objective is to minimize the fault timeline (fault to resolution) to reduce business downtime. To do this, the data center operations and service desk timelines need to be optimized through automated, bidirectional communication between A&PM tools (represented by the data center operations timeline) and the service desk (represented by the service desk timeline; see Figure 2).

An A&PM-detected issue automatically opens an incident ticket, and the service desk staff is immediately aware of the issue, allowing them to inform the end users and provide them with IT service interruption details. In addition, integration provides a mechanism for proactive (e.g., outbound alert) or reactive (end-user-reported issue) communication, and it affords an immediate parent/child incident structure, allowing the quick and effective grouping of incident records. This provides a greater depth of impact information, as well as a comprehensive ability to communicate updates and fixes.

With the A&PM and service desk tools synchronized, incidents are tracked and automatically updated as IT operations step through the resolution process. When the fix is complete, the incident ticket is closed and the end users are informed. Integration and automation ensure communication is quicker, eliminating gaps associated with human and manual intervention (e.g., error analysis and fix prioritization negotiation), which reduces IT downtime and, therefore, the overall MTTR.

A&PM and service desk integration provides value that spans across IT operations into the business.

Value to IT Data Center Operations:

Optimizes data center operations effectiveness and increases overall IT operations accountability to the business (reduced downtime)
Increased operational efficiency through proactive communication and reporting (decreasing MTTR and allowing IT operations to address issues with less interruption)
Collaboration between data center operations and the service desk, ensuring incidents are owned, controlled and managed to resolution directed by service levels, without being constantly pressured to fix them by end users

Value to the Service Desk:

Proactive understanding of A&PM incidents impacting the business
IT issues identified, logged and owned before an end user reports them (allowing the service desk to proactively update the business)
Service desk becomes an integrated, integral part of the IT operations fault-to-resolution process
Provides the service desk with an understanding of data center operations activities
Ability to append an automatically created ticket to a parent record, which is already, for example, in the hands of Level 3 support
Allows increased accuracy when measuring incident times and service-level adherence

Value to the Business:

Real-time awareness of issues and of the action plan to address them, due to improved incident communication
Increased quality of service (information, timeliness, detail, expectations)
Reduced business downtime (MTTR) and reduced loss of productivity and revenue
Increased end-user satisfaction
Provides trend analysis data on how IT is supporting the business over time
Better communication with IT operations

Integration between A&PM and the service desk must be reliable, measurable (escalation and outage reports) and able to support IT operations processes (e.g., in support of agreed-on SLAs). To meet these objectives, the integration must provide:

The ability for the A&PM tools to create, open, update and close incident tickets — both automatically and manually
Reliable, dependable integration (e.g., with data buffering)
Integration between tools' application programming interfaces, preferably developed and supported by each tool's vendor (data passed via e-mail is not integration, it is communication)
Event or incident ticket status changes synchronized between both tools (e.g., a fixed fault closes the ticket)
Reports on incident tickets, containing all ticket details, made available to data center operations so that it can understand the status of each incident
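
As an illustration of the buffering and status synchronization requirements above, the following is a minimal sketch only. The classes InMemoryServiceDesk and BufferedBridge, and the status values "raised" and "cleared", are hypothetical names introduced for this example; they do not represent any vendor's API. The sketch assumes that an event clearing in the A&PM tool should close its ticket, and that undelivered status changes are held in arrival order until the service desk is reachable again.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class InMemoryServiceDesk:
    """Hypothetical stand-in for a service desk API (not a real product interface)."""
    available: bool = True
    tickets: dict = field(default_factory=dict)

    def apply(self, event_id: str, status: str) -> None:
        if not self.available:
            raise ConnectionError("service desk unreachable")
        # Assumption: an event that clears in the A&PM tool closes its ticket.
        self.tickets[event_id] = "closed" if status == "cleared" else "open"


class BufferedBridge:
    """Forwards A&PM status changes, buffering them while the desk is down."""

    def __init__(self, desk: InMemoryServiceDesk) -> None:
        self.desk = desk
        self.pending = deque()

    def send(self, event_id: str, status: str) -> None:
        self.pending.append((event_id, status))
        self.flush()

    def flush(self) -> None:
        # Deliver in arrival order; stop at the first failure and retry later.
        while self.pending:
            event_id, status = self.pending[0]
            try:
                self.desk.apply(event_id, status)
            except ConnectionError:
                return
            self.pending.popleft()
```

In this sketch, a status change is removed from the buffer only after the service desk has accepted it, so an outage of the service desk itself delays ticket updates rather than losing them.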

This level of integration can be classified as basic. There are A&PM tools that provide more-advanced integration with service desk tools. An advanced level of integration includes:

Automatic assignment of an owner (user ID) to an incident ticket ID number
Events passed to the service desk with the detail to automatically identify the resources and skills to be assigned to the incident
The incident ticket ID is associated with the A&PM event (bidirectional event to incident reporting)

The type of data passed between the tools will vary as it's defined by each organization's needs (e.g., in support of specific IT operations processes, the fault-to-resolution procedures and SLAs). However, in most situations, there is a core set of data that should be passed between the tools that includes:

Time the event occurred (not the time the event was passed)
Time the incident ticket was created, updated and closed
The event text (preferably usable context)
Event criticality (defined within the event tool)
Event source (element data)
Event owner (person, organization or the event tool that passed the event)
Incident ticket ID number
Service-desk user ID
Configuration item affected
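
The core data set above could be represented as a single record passed between the tools. The sketch below is one possible shape, not a standard schema: the class name IncidentEvent and all field names are assumptions made for illustration. It also shows why the event occurrence time matters, because outage duration is measured from the fault rather than from ticket creation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentEvent:
    # Field names are illustrative; real tools define their own schemas.
    occurred_at: datetime            # when the fault occurred, not when it was passed
    ticket_created_at: datetime
    ticket_closed_at: Optional[datetime]
    text: str                        # event text, preferably with usable context
    criticality: str                 # as defined within the event tool
    source: str                      # element that raised the event
    owner: str                       # person, organization or forwarding tool
    ticket_id: str
    service_desk_user_id: str
    configuration_item: str
    ticket_updated_at: Optional[datetime] = None

    def outage_duration(self) -> Optional[timedelta]:
        """Outage time measured from the fault, not from ticket creation."""
        if self.ticket_closed_at is None:
            return None  # ticket still open, so no duration yet
        return self.ticket_closed_at - self.occurred_at
```

If the ticket were created five minutes after the fault, a duration based on ticket creation would understate the outage by those five minutes; using occurred_at avoids that gap.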

Basic event data passed to the service desk provides a way to log, track and help resolve element issues. However, basic event data may not be enough to help Level 1 support identify the real issue or assign the ticket to the appropriate skilled resources.

Based on client inquiries, we see very few IT organizations combining A&PM with business service management (BSM) and service desk tools; however, value increases significantly when context (e.g., service impact information) is attached to the incident. Therefore, we recommend passing A&PM events to the service desk once a BSM tool has associated the impact of an event with an IT service. This allows the support staff to focus on the issues with the greatest impact on the business. BSM information passed to the event tool can include:

The application, service or services impacted
The level of severity based on business impact
The criticality of a single event on multiple services (one event may impact multiple services with different criticality levels)
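
To illustrate how service impact information might drive prioritization, the sketch below ranks events by the highest criticality among the services each one impacts. The four-level scale in SEVERITY_RANK and the function names are assumptions for this example; an actual BSM tool would supply its own severity model.

```python
# Hypothetical criticality scale; a higher number means greater business impact.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def business_severity(impacted_services: dict) -> str:
    """Return the highest criticality among the services one event impacts.

    impacted_services maps a service name to its criticality label.
    """
    if not impacted_services:
        return "low"  # assumption: no known service impact ranks lowest
    return max(impacted_services.values(), key=SEVERITY_RANK.__getitem__)


def prioritize(events: list) -> list:
    """Order (event_id, impacted_services) pairs by business impact, highest first."""
    return sorted(
        events,
        key=lambda item: SEVERITY_RANK[business_severity(item[1])],
        reverse=True,
    )
```

Taking the maximum is only one possible policy for an event that impacts several services with different criticality levels; an organization could equally weight by the number of services or users affected.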

Integration between A&PM and the service desk is accomplished in a number of ways, including an out-of-the-box program offered by the A&PM tools or a basic integration script. However, there is an alternative option using run book automation (RBA) tools. We continue to see a rise in the popularity of RBA tools as a means to integrate A&PM and service desk tools to automate and orchestrate the fault-to-resolution process. RBA tools provide a range of integration capabilities beyond basic event passing. For example, when an event occurs, it triggers an RBA process to automatically pass the event between different A&PM tools (e.g., event correlation analysis [ECA], performance and BSM), check for possible causes, open an incident ticket on the service desk, automatically start a recovery procedure and close the ticket once the situation has been resolved (see Figure 3).

An RBA tool managing a fault-to-resolution process removes lag from the recovery process, thus decreasing the MTTR and increasing IT service availability with full reporting of the end-to-end recovery process. This ensures that recovery is managed with minimal human intervention, thereby reducing the risk of human error.
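
The orchestrated sequence described above can be sketched as a single function that calls each step in order and records an audit trail. This is a simplified illustration, not how any particular RBA product works: the function name run_fault_to_resolution and its four callable parameters are hypothetical, and each would be backed by a real tool integration in practice.

```python
from typing import Callable, List


def run_fault_to_resolution(
    event: dict,
    correlate: Callable[[dict], str],
    open_ticket: Callable[[dict, str], str],
    recover: Callable[[dict], bool],
    close_ticket: Callable[[str], None],
) -> List[str]:
    """Run the steps in order and return an audit trail for reporting."""
    trail = ["event received"]
    cause = correlate(event)                  # e.g., ECA, performance and BSM checks
    trail.append(f"probable cause: {cause}")
    ticket_id = open_ticket(event, cause)
    trail.append(f"ticket opened: {ticket_id}")
    if recover(event):
        close_ticket(ticket_id)
        trail.append(f"ticket closed: {ticket_id}")
    else:
        # Leave the ticket open so that support staff can take over.
        trail.append(f"recovery failed, ticket left open: {ticket_id}")
    return trail
```

The returned trail stands in for the end-to-end reporting of the recovery process; when automated recovery fails, the ticket stays open rather than being closed on an unresolved fault.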

Effective A&PM and service desk integration requires event data to be consolidated, filtered and correlated before it is passed to the service desk. Event consolidation (e.g., bringing together server, network, database and application event data) ensures that when it's filtered and correlated, the risk of multiple tickets being created from one event is reduced (e.g., a server error being reported at the same time as the actual error [a network issue], resulting in two separate tickets).
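
As a rough illustration of correlation before ticket creation, the sketch below groups consolidated events that share a root element and occur close together in time, so that one ticket is opened per group. The event keys ('time', 'source', 'root') and the five-minute default window are assumptions; in particular, how the 'root' element is determined is left to an upstream correlation step that is not shown.

```python
from datetime import datetime, timedelta


def correlate_events(events: list, window: timedelta = timedelta(minutes=5)) -> list:
    """Group events that share a root element and occur within `window`.

    Each event is a dict with 'time' (datetime), 'source' and 'root' keys.
    'root' is the element that an upstream correlation step blames; how it
    is derived is out of scope and is an assumption of this sketch.
    Returns one group (a list of events) per ticket to be opened.
    """
    groups = []
    for event in sorted(events, key=lambda e: e["time"]):
        for group in groups:
            same_root = group[0]["root"] == event["root"]
            if same_root and event["time"] - group[0]["time"] <= window:
                group.append(event)
                break
        else:
            # No matching group found, so this event starts a new one.
            groups.append([event])
    return groups
```

With this grouping, a server error caused by a network issue falls into the same group as the network event, and only one ticket is created for the pair.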

Even though A&PM and service desk integration has measurable benefits, there remain a number of challenges, both technical and organizational:

Cultural differences between data center operations and service desk staff, which require organizations to overcome cooperation barriers and motivate collaboration between the teams.
Limited understanding of the fault-to-resolution process, especially between the different support teams, requires IT organizations to ensure each team understands and is responsible for its contribution in supporting the process.
Too much A&PM data being passed with too much confusing technical detail requires data center operations to apply the appropriate event filtering, correlation and service impact association to allow the service desk team to more effectively manage incidents in line with business priorities and service levels.
Inadequate, poorly developed integration between A&PM and service desk tools (especially if the tools are from different vendors) can contribute to making the fault-to-resolution process slower and less reliable than a manual one. Ensure that integration between A&PM and the service desk is proven and meets your specific event process requirements.

© 2010 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner is a registered trademark of Gartner, Inc. or its affiliates. Reproduction and distribution of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Although Gartner's research may discuss legal issues related to the information technology business, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The opinions expressed herein are subject to change without notice.