
Overview

A large telecommunications company achieves significant, measurable cost savings and increased IT operations management efficiencies through the adoption of a new, emerging availability and performance management (APM) solution using behavior learning technology. IT operations stakeholders should use this research to optimize these tools' value, for faster returns on investment (ROI) and to reduce tool maintenance costs.
- Behavior learning technologies have matured, allowing IT organizations to move from a reactive state focused on mean time to repair to a proactive state where outage avoidance becomes a more realistic objective.
- Behavior learning tools augment and enhance existing APM tools.
- Behavior learning tools enable organizations to effectively transition to service-based monitoring to improve IT business alignment, versus a traditional (and less-relevant) IT element uptime focus.
- Consider using behavior learning products if your IT organization is looking to move from a reactive, event-driven IT operations state to a proactive one where events are detected before they impact IT services.
- Focus on a specific set of applications when implementing behavior learning software, proving its value; then use the value to justify broadening its use.
- Test behavior learning tools in a lab environment before purchase, using real production data to prove their abilities.
- Don't assume that IT management behavior learning tools will allow you to eliminate or replace existing APM tools; APM products are typically needed to provide the data from which learning behavior can be ascertained.
- Be aware that even if you're not ready to manage end-to-end IT services, implementing behavior learning tools can help you optimize static element performance monitoring investments.


What You Need to Know

The adoption of IT operations management behavior learning tools (which have less than 5% market penetration) continues to grow, allowing IT organizations to proactively identify and isolate faults and performance issues in the IT infrastructure. One of the world's largest telecommunications companies successfully deployed an IT operations management-related behavior learning technology across eight of its U.S. data centers, where it consolidated, correlated and logged performance data from a number of existing APM tools, allowing the IT organization to identify and avoid IT issues before they impacted the company's mission-critical IT services.
We recommend that mature IT organizations evaluate these tools to move to a more predictive environment for critical IT services. However, even organizations not yet ready to manage at the IT service level can gain value from these tools to get faster ROI and to lower the costs to maintain their element performance management tools.



Case Study

IT organizations have been striving to find an IT management product that allows IT operations to identify and deal with IT issues before they impact the business. For years, management vendors have tried to design products to meet this need, but most have failed to deliver an offering that meets the budgetary and technology demands of most enterprises. The most recent iteration of behavior learning tools appears to have found the right product architecture and approach to succeed. At Gartner, we classify these statistical-analysis-based products as IT operations management "behavior learning" tools because they learn application performance behavior over time, much like the science of artificial intelligence.
These tools consolidate performance data from a wide range of APM sources, quickly establishing a "normal" behavior pattern or profile, which is then automatically analyzed and compared with newly gathered data to detect subtle anomalies in real time. Mainly due to market skepticism, the current behavior learning tools have taken a long time to make their presence felt.
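The baseline-and-compare idea described above can be illustrated with a minimal sketch. The research does not describe the statistical methods the commercial products use, so the z-score test below is a simple stand-in, and the function names, sample values and the limit of three standard deviations are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def learn_baseline(samples):
    """Summarize historical samples as a 'normal' profile: (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, z_limit=3.0):
    """Flag a new sample that deviates from the learned profile by more than z_limit deviations."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history means any change at all is a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_limit

# Hypothetical response times in milliseconds gathered from an existing APM source.
history = [120, 118, 125, 122, 119, 121, 124, 120]
baseline = learn_baseline(history)

# Compare newly gathered samples against the learned profile.
flags = [is_anomalous(v, baseline) for v in (123, 160)]
print(flags)
```

A production tool would presumably learn separate profiles per metric and per time of day, and correlate deviations across many metrics; this sketch covers only a single metric with a single static profile.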
One of the largest U.S. telecommunications companies, with more than 60 million customers, was trying to solve the age-old IT management problem of identifying, isolating and solving IT performance problems before they impacted the business. The telco needed to move beyond trying to keep up with solving issues as they occurred, because this approach was not only creating IT service problems, but was also requiring significant personnel expenditures to remediate issues.

The telco's IT organization was already at a high state of IT maturity, with documented processes, an established and well-run organizational structure, and a set of IT operations management tools providing an insight into the availability state of key IT infrastructure elements. The telco had achieved a consistently measurable 99.9% availability metric; however, it needed to understand the quality (for example, performance) of the IT services being provided to end users. The need was to find and implement a product that could consolidate, normalize and analyze all APM data in a holistic way, and transform the IT support organization to one with a proactive management footing. In addition, the telco wanted to leverage its existing IT operations management tools technology and skills investments.
The telco wanted to prove an ROI focused on two principal areas:
- Saving time and resources, in addition to improving consistency, by automating the process for setting performance thresholds. The old process involved application support members opening tickets with various tools teams to adjust threshold settings manually.
- Improving application performance for mission-critical applications, especially those supporting revenue-generating services. The early notification and quick resolution of performance problems would result in more effective use of these applications.
The telco also had a number of expectations that included improving application performance for mission-critical and revenue-generating applications through advanced performance analysis.

The telco decided to look to IT management technology that augmented, enhanced and leveraged its existing fault and performance products from a range of companies, including BMC Software, CA, HP, Microsoft, VMware and Oracle, while consolidating the data from all these sources to establish a holistic overall understanding of IT performance. A traditional approach would have been to choose a fault management product and use it as a consolidation point, passing to it all event data where it would be filtered, correlated and reported. This creates what is commonly called a "manager of managers." Even though this type of product is common for fault management, it has not proved effective for consolidating and correlating data from a wide range of performance management products (including servers, networks and applications).
The need to get a holistic view across a wide range of fault and performance data required the company to think differently and consider nontraditional alternatives. In addition, it wasn't good enough for an IT operations management product to consolidate disparate fault and performance data; it also had to analyze the data and identify issues before they resulted in an IT service issue. The telco did consider developing its own solution, but its initial effort took a long time to produce, and focused only on server metrics. Something better was needed.

A new type of performance analysis tool was beginning to make an appearance in the IT operations management market, one that could simultaneously analyze hundreds of performance metrics in real time to automatically learn the overall health of system components and of the resultant end-to-end services. IT operations management "behavior learning products," of which there remain a small number, were beginning to gain a level of market acceptance as products that could not only self-learn normal performance behaviors, but could also detect subtle changes in performance behaviors, enabling IT operations to address a minor issue before it cascaded into a more serious condition.
After a preliminary investigation of the market, the telco chose to evaluate a behavior learning tool from a company called Netuitive and compare it against another behavior learning tool from a much larger IT operations management portfolio software vendor. The tools were each deployed in a lab environment with real-time performance data collected from a live point-of-sale system in production. The Netuitive product was chosen because the telco felt it was easier to use, easier to deploy (the Netuitive offering was deployed in a few days by the telco's staff, while the competing company required its engineer to be on-site for two weeks), and was more accurate in its root cause assessment. In one case, Netuitive forecast a degradation 46 minutes in advance and correctly identified the underlying root cause, whereas the competing product pointed to a symptomatic issue only after the problem occurred.
Once the hardware needed for Netuitive was deployed, the full implementation took six weeks, requiring the full-time involvement of three IT resources. The initial objectives were met within six months. This included product evaluation and testing, product installation, purchasing, hardware installation, product training and meeting the initial project ROI. The Netuitive solution has provided the telco with the following results:
- The behavior learning product is analyzing application performance for 202 mission-critical applications and 127 Tier 2 applications, spanning 6,221 servers. It is also analyzing 415 customer experience synthetic transactions, 1,300 middleware instances, and 700 databases across eight data centers.
- On target to eliminate 3,480 hours of service degradation (performance-related issues) or interruptions previously experienced each year, representing $18 million a year in business savings.
- A 26% reduction in time spent by Level 2 and Level 3 engineers for root cause analysis.
- Reduced effort associated with setting APM tool rules and thresholds, setting key performance indicators (KPIs), system administration and troubleshooting tasks, equating to a saving of 28 full-time equivalents (FTEs) out of 80 total.
- Highly accurate behavior profiles are now established on new applications within two weeks of the application being installed.
- Increased visibility. Originally, mission-critical IT issues were analyzed by up to 10 people. The use of the behavior learning product has resulted in issues being identified more easily (due to the cross-domain holistic views of problems) and solved with fewer personnel, in less time and with reduced effort.
- Documented examples in which Netuitive forecast performance issues in time to avoid service degradations, demonstrating predictive management capabilities that were used to show the business how the product improves IT service availability.
- The telco estimates a full ROI for its Netuitive software within three months of initial implementation.
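Two of the figures reported above lend themselves to a quick back-of-the-envelope check. The sketch below derives the implied business cost of one hour of degradation and the share of FTEs saved. It assumes the $18 million is spread evenly over the 3,480 hours, which the research does not state, and the variable names are illustrative.

```python
# Figures taken from the reported results.
degradation_hours_per_year = 3_480
business_savings_per_year = 18_000_000  # U.S. dollars
ftes_saved = 28
ftes_total = 80

# Assumption: savings are attributed evenly to each avoided hour.
implied_cost_per_hour = business_savings_per_year / degradation_hours_per_year

# Share of the affected staff effort that was freed up.
fte_reduction = ftes_saved / ftes_total

print(f"Implied cost per degraded hour: ${implied_cost_per_hour:,.0f}")
print(f"FTE reduction: {fte_reduction:.0%}")
```

On these figures, each avoided hour of degradation is worth roughly $5,200 to the business, and the FTE saving amounts to 35% of the 80 FTEs.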
The telco's IT organization created and now provides a self-paced online training course for its behavior learning software. More than 150 people have attended the class. This course not only helps the IT organization learn the product, but it also helps drive awareness of the product and its value.
The telco's plans for Netuitive include:
- Anticipated 90% automation of thresholding in the next 12 months as departments continue to gain confidence in Netuitive
- 75% less labor to be required for modeling 192 services for end-to-end business service management projects, as compared with conventional rule-based tools (represents savings of 17,496 person hours)
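The labor figure in the second plan item implies a baseline that is not stated directly. As a rough sketch, assuming the 17,496 person hours represent exactly the 75% saving across all 192 services, the conventional effort and the per-service figures can be derived as follows (the variable names are illustrative):

```python
# Figures taken from the telco's stated plans.
hours_saved = 17_496
saving_fraction = 0.75
services = 192

# Implied effort with conventional rule-based tools (assumption: the saving is exactly 75%).
conventional_hours = hours_saved / saving_fraction
remaining_hours = conventional_hours - hours_saved

# Per-service view of the same numbers.
conventional_per_service = conventional_hours / services
remaining_per_service = remaining_hours / services

print(f"Conventional effort: {conventional_hours:,.0f} hours "
      f"({conventional_per_service} per service)")
print(f"Expected effort:     {remaining_hours:,.0f} hours "
      f"({remaining_per_service} per service)")
```

Under that assumption, modeling a service drops from about 121.5 person hours with conventional tools to about 30.4 person hours.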

The telco needed a product that reduced APM complexity while providing increased measurable operational efficiencies tied directly to the services provided to the business. To achieve this, the telco clearly defined the goals and objectives tied to a specific set of applications, thereby allowing the value to be understood more easily. This approach allowed the value to be proved incrementally in a shorter time period, helping drive additional investment and broadened scope. However, learning technologies require IT operations to adjust their understanding of how information is collected, analyzed and used. These technologies do not replace existing performance management products (for example, the data can aid in understanding capacity use, but these are not capacity planning tools).
The telco also had effective cross-management technology domain collaboration, and understood how to implement and leverage its behavior learning product. In addition, it started with a smaller scope, focusing on specific applications, enabling it to learn critical success factors. Moreover, the team planned for sufficient time in the project for the tool to learn behavior; thus, they did not expect immediate results. Although it is possible for this tool to learn in a predeployment mode, it would have to mirror production deployment to be effective.

The inability of traditional APM tools to provide a holistic view of performance across the IT infrastructure drove the telco to seek new, alternative products. The product chosen did not replace the existing APM tools and, therefore, did not reduce existing license costs. It did, however, integrate and analyze existing APM data from multiple sources, which enhanced existing monitoring tool investments and automated their ongoing maintenance. The use of a new APM tool moved the telco from a reactive element state focused on availability to a predictive state with a focus on IT service performance. The telco also derived value from the behavior learning product by initially focusing it on key applications rather than on every IT element. Furthermore, the telco developed training to ensure that personnel knew how to use the product, understood what the product could do and set expectations accordingly.
 © 2009 Gartner, Inc. and/or its Affiliates. All Rights Reserved. Reproduction and distribution of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Although Gartner's research may discuss legal issues related to the information technology business, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The opinions expressed herein are subject to change without notice.