The Big Mind Shift: Capacity Management for Virtual and Cloud Infrastructures
The traditional approach to capacity management is not adequate to meet operational challenges posed by virtual infrastructure and internal clouds. IT organizations are moving from highly static physical environments to highly dynamic and resource-sharing infrastructures. To go through this transformation, enterprises need to rethink their capacity allocation strategy and implement modern capacity tools. In this guidance document, Research Director Alessandro Perilli details the steps required to modernize capacity management so that it meets the needs of today's increasingly virtualized, dynamic, and service-oriented data center.
Table of Contents
- Summary of Findings
- Guidance Context
- The Gartner Approach
- The Guidance Framework
- Phase 1: Develop the Architecture
- Phase 2: Select the Solution
- Step 1: Develop an RFI
- Step 2: Develop an RFP
- Step 3: Send an RFQ
- Step 4: Build a POC and Select Provider
- Phase 3: Implement and Manage
- Risks and Pitfalls
- Revision History
Bottom Line: The traditional approach to capacity management is inadequate to meet operational challenges posed by dynamic environments such as virtual infrastructure and internal clouds. Service-oriented infrastructure delivery is not possible without accounting for the technical and non-technical constraints that affect virtual machine (VM) mobility and placement. The IT organization needs to rethink its capacity allocation strategy and focus on application performance awareness and deep integration with the management stack. The adopted solution needs to offer: a holistic view of the infrastructure resources, simulation and forecasting for capacity consumption, proactive capacity allocation, and the capability to define technical, business, and compliance rules for workload placement.
Context: The steady transition to a dynamic, service-oriented data center is causing enterprises to re-evaluate traditional operations and management methods. IT organizations increasingly leverage x86 hypervisors to cut IT capital and operational expenses. The shift toward hardware-infrastructure-as-a-service (HIaaS) internal clouds further emphasizes the need to offer and guarantee service-level performance. However, the increasing scale and complexity of virtual infrastructures, along with the introduction of self-service provisioning, represent a challenge. Enterprises need to keep control over business service performance and contain VM sprawl.
Take-Aways: IT organizations that want to implement or modernize capacity management to meet the challenges of virtual infrastructures and internal clouds should consider the following key points:
- A successful implementation requires a new, service-oriented approach to the capacity management problem (see Pre-Work Phase: Step 1).
- The project is at high risk of failing without executive sponsorship and a strong interdisciplinary team that includes business, technical, and legal experts from different departments within the company (see Pre-Work Phase: Step 2 and Step 4).
- The key features of a capacity management solution for virtual infrastructures and internal clouds are (see Phase 2: Step 2):
- Application awareness
- Deep integration with the management stack
- Holistic view of the infrastructure resources
- Simulation and forecasting for capacity consumption
- Proactive capacity allocation
- The capability to define technical, business, and compliance rules for workload placement
- The most critical aspects of the implementation are the definition of technical, business, and compliance constraints and the design of the integration map (see Phase 1: Step 3 and Step 4).
- A cyclic reassessment of the workload placement rules and the integration map is a key task to guarantee the health and accuracy of the capacity management solution (see Phase 3: Step 4).
Conclusion: In the effort to deliver IT as a service, enterprises face dramatic changes. The IT organization moves from highly static physical environments, where it must optimize isolated servers, to highly dynamic and resource-sharing infrastructures, where it has to enforce SLAs for multi-tier business applications. This transformation poses a significant operational challenge. Rethinking the capacity management problem is the first step to meet this challenge. Then organizations need to select and implement modern capacity tools that feature application performance awareness, a holistic view of the infrastructure resources, deep integration with the management stack, and the capability to define technical, business, and compliance rules for workload placement.
The IT organization is in a transition phase. It is either moving from a physical data center to a virtualized infrastructure or transforming its virtual infrastructure into an internal cloud that offers HIaaS. The transition aims to reduce operational costs and increase efficiency, but the new technologies introduce new challenges. For example:
- The automation of workload provisioning in virtual and cloud infrastructures leads to VM sprawl.
- The introduction of a self-service portal makes it harder to forecast capacity demand.
- The hosting of multiple tenants in a resource-sharing environment adds complexity to chargeback activities.
- The dynamism of the infrastructure makes manual resource allocation too slow, error-prone, and inefficient.
- The scale of the infrastructure hinders the capability to pinpoint performance degradation causes.
In this scenario, the IT organization recognizes the need for a new strategy, more robust management tools, and implementation guidance to control resource allocation and application performance.
What are the steps required to reshape traditional capacity strategies and practices for modern virtual and internal cloud infrastructures?
This guidance document applies to IT organizations that are rearchitecting their virtual infrastructures from a service-oriented perspective or that are planning to build internal clouds.
Common characteristics among organizations looking to modernize their capacity management approach include:
- The desire to offer and guarantee service-level performance
- The desire to forecast the infrastructure growth and improve the IT procurement strategy
- The desire to contain resource over-provisioning caused by self-service portals
- The desire to proactively allocate resources to limit damages caused by system failures
- The desire to address unplanned demand in internal clouds with more confidence
The Gartner approach to building a capacity management strategy requires:
- A holistic and service-oriented approach to capacity management that includes both technical and non-technical metrics
- A focus on workload placement optimization
- A dynamic definition of capacity utilization levels related to application-level performance
- Proactive capacity allocation based on event monitoring and real-time analytics
In the pre-work "Step 1: Rethink the Capacity Management Problem" section of this guidance document, Gartner details the mind shift required to manage capacity in highly dynamic and resource-sharing infrastructures where the IT organization needs to guarantee SLAs for multi-tier business applications.
This guidance document outlines a capacity management adoption strategy for virtual infrastructures and HIaaS internal clouds.
The path to implementation follows a series of four phases, each building on the previous:
- Pre-work: In this preliminary phase, the IT organization revisits its approach to capacity management and builds an interdisciplinary team with business and technical competencies to lead the project. The team scopes the project and determines the use cases for capacity management. Once the business objectives have been determined, the IT organization builds a proper business case to win executive sponsorship and proceed to the next phase.
- Develop the architecture: In this phase, the interdisciplinary team starts an assessment to identify corporate management tools, assets (e.g., physical and virtual machines, OSs, middleware, applications, and business services), and available infrastructure capacity. Once the building blocks have been identified, the team defines all technical, business, and compliance constraints that correlate capacity with business services and that will influence resource allocation and workload placement. Finally, the team reviews the available management tools and all potential integration paths with the capacity management solution.
- Select the solution: In this phase, the IT organization develops a request for information (RFI) to assess the market landscape, retrieve basic information about capacity management vendors, and determine which ones offer a service-oriented approach to capacity management. Then, the interdisciplinary team develops a request for proposal (RFP) that details product requirements (e.g., architectural approach and features). Once RFPs are reviewed, the team sends a request for quotation (RFQ) that focuses on pricing and licensing models. Finally, once a shortlist of vendors has been established, the team selects the desired candidate and works with it to build a proof-of-concept (POC) implementation. During this last step, the team extensively tests the capacity tool's features and integration capabilities.
- Implement and manage: In this last phase, the team moves from the POC to a production implementation. The IT organization prepares the environment for deployment by solving system, network, and security issues. The team sizes the solution's tiers to monitor the entire virtual infrastructure. If needed, the team also finalizes the integration with additional components of the management stack. In the last step before the solution goes live, the team completes the design of all workload placement rules that will impact capacity modeling. After this, the capacity management tool is put into an operational state, and the IT organization focuses on managing tasks and operational framework adjustments.
The phases and key steps associated with each stage are shown in Figure 1.
Source: Gartner (June 2011)
This guidance document assumes that the IT organization already manages an x86 server virtualization infrastructure.
If the IT organization has not yet deployed a virtual infrastructure, or if it is in the process of re-assessing its technology choices, it should review the following Gartner documents:
- "Server Virtualization Hypervisors"
- "Server Virtualization Hypervisors: Bricks, Mortar, and a Little Duct Tape"
- "Virtual Infrastructure Management: Is the Solution Complete?"
- "Virtualization Platforms"
If the IT organization has already deployed a virtual infrastructure and it is planning to build an internal cloud that offers HIaaS, it should review the following Gartner document:
- "Stuck Between Stations: From Traditional Data Center to Internal Cloud"
The following steps describe the pre-work Gartner recommends in preparation for the implementation of a capacity management solution for virtual infrastructures and internal clouds:
- Step 1: Rethink the capacity management problem
- Step 2: Build an interdisciplinary team
- Step 3: Determine scope and use cases
- Step 4: Build a business case
Regardless of the size of the company, the IT organization always needs some form of capacity management. Operators can use traditional tools designed for the physical world or new solutions developed for virtualized environments. When none of these is available, staff members simply rely on their analysis skills and experience to optimize resource allocation. In all these cases, IT organizations should assess and reconsider their approach to capacity management. The change of perspective is radical; it affects both IT and end users, and it usually requires a significant evangelization effort within the company. Gartner recommends rethinking capacity management in these areas:
- Approach the virtual infrastructure with a holistic view: Workloads depend on compute, storage, and network capacity altogether. Without a holistic view of these components, the IT organization tends to adopt multiple specialized tools that never converge. Virtualization managers only have partial visibility of the resources available for their VMs and remain blind to availability and performance at the storage and network layers. Conversely, storage and network managers are usually completely unaware of the unique characteristics and constraints that exist in a virtual infrastructure. The outcome is an inaccurate analysis: Too often, a capacity tool developed for virtual infrastructures focuses only on rightsizing the virtualization hosts and VM compute capacity.
- Give optimal workload placement a higher priority than VM density: Reducing resource wasting is a fundamental promise of capacity management, but too often, mostly in the early stages of a consolidation project, IT organizations pursue this goal by cramming as many VMs as possible onto the same virtualization host. Without the right criteria, this approach usually leads to a placement where workloads are not complementary, and the outcome is poor quality of service (QoS). In this scenario, multiple VMs usually overload a specific resource at a specific point in time or over a certain period. For example, a host may show a low CPU-utilization level while the storage channel suffers a performance bottleneck. With appropriate capacity planning, the IT organization would have distributed the input/output (I/O)-intensive workloads across multiple hosts, complementing them with CPU-intensive or memory-intensive workloads.
- Define utilization levels according to application performance: Finding the right combination of workloads is just a part of the problem. IT organizations still have to reserve a certain amount of unallocated capacity (i.e., white space) to address workload spikes and unplanned demand. Understanding what the optimal utilization level is in a virtual infrastructure remains one of the most challenging tasks. Companies usually rely on the experience of their technical staff and the best practices provided by vendors, which may conflict heavily. This leads to an overly conservative approach. For example, many virtualization vendors recommend reserving 15% to 20% of total capacity as headroom for compute resources (e.g., CPU or memory), but customers rarely surpass a 50% utilization level. In the research note "Q&A: The Changing Dynamics of Virtual Machine Density" (see the "Recommended Reading" section of this guidance document), Gartner observed that, in large enterprises, the average CPU utilization is 40% to 50%. Multiple vendors suggest or have suggested similar utilization rates to their customers. Sometimes, this level can be even lower, when IT organizations simply reproduce the physical servers' characteristics and usage inside VMs. In these cases, the virtual infrastructure's utilization level can be as low as 7%, and sometimes even below that.
- Application performance awareness is critical to understanding the optimal utilization level in a virtual infrastructure. Some capacity management tools offer native engines for application performance monitoring, or integration with third-party solutions (see Phase 1: Step 4). Sometimes, these engines are sophisticated enough to offer self-learning capabilities. Thanks to this feature, the IT organization can understand the workloads' normal behavior without guessing. The capacity tool uses the information obtained by these engines to continually redefine the ideal utilization thresholds.
- Capacity management should be proactive, not reactive: In highly static physical environments, many IT organizations allocate new capacity every month or every quarter. Rare, exceptional events trigger the procurement of additional resources in a reactive fashion. However, this approach is simply too slow to meet the requirements of a dynamic environment such as a virtual infrastructure or an internal cloud. Operation managers must be able to predict demand and simulate consumption upfront, as accurately as possible.
- Some capacity tools forecast resource utilization with linear regression,1 but this approach is not perfect (a minimal forecasting sketch appears after this list). Event awareness is fundamental for producing better predictions. Sophisticated capacity management tools can integrate with real-time analytics engines and event management tools to understand what happens at any given moment in the infrastructure and to anticipate how specific issues may affect the demand.
- The simulation of capacity consumption, provided by WHAT-IF engines that come with some capacity tools, is equally important. It offers the opportunity to evaluate different scenarios (e.g., the launch of a new product or the physical fault of a rack cabinet) and plan capacity allocation more proactively.
- Look at capacity from a service-oriented perspective: Rethinking capacity management in terms of delivering a business service significantly changes the way a company operates. In resource allocation and workload placement, for example, the IT organization no longer focuses on a specific VM but on all the VMs that constitute a specific business service. Similarly, maximizing the utilization level of a single infrastructure resource (e.g., virtual CPU [vCPU]) is no longer a priority if it does not positively affect a key performance indicator (e.g., transaction time) associated with a business service. For example, consider a database service that is composed of several VMs. If the vCPU utilization does not affect the transaction time, then the IT organization should focus elsewhere, such as reducing application latency through better workload placement.
- Capacity tools must be able to allocate resources and place workloads according to this new approach, accounting not only for technical constraints, but also for complex business and compliance constraints. While including these non-technical metrics is the most important step in rearchitecting capacity management practices, it remains one of the most challenging aspects. For this reason, as a primary commitment, the capacity management strategy should be the result of a joint effort from different parts of the company, as described in the next step of this guidance document.
- Concurrently, the IT organization has to re-educate the company on how to define its capacity requirements. The service-oriented approach, in fact, implies that end users request resources in terms of key performance indicators (KPIs) or service levels rather than in terms of gigahertz of CPU power and gigabytes of memory. Examples of business-level capacity requirements include:
- Application latency
- Batch processing updates per day
- Business intelligence queries per day
- Concurrent users per second
- Order entries per day
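To make the "proactive, not reactive" recommendation above concrete, the following minimal sketch (in Python) fits a linear trend to historical utilization samples and estimates when a cluster would cross a chosen threshold. The utilization figures and the 80% ceiling are hypothetical assumptions for illustration only; real capacity management tools refine this kind of naive extrapolation with event awareness and WHAT-IF simulation.

```python
# Illustrative sketch only: fit a linear trend to weekly CPU-utilization samples
# and estimate when utilization would cross an assumed 80% ceiling.
weeks = list(range(12))                      # observation index (weeks 0..11)
cpu_util = [38, 40, 41, 43, 44, 47, 48, 50,  # hypothetical average cluster CPU utilization (%)
            52, 53, 56, 58]

n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(cpu_util) / n

# Ordinary least-squares slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, cpu_util)) / sum(
    (x - mean_x) ** 2 for x in weeks)
intercept = mean_y - slope * mean_x

CEILING = 80.0                               # assumed utilization ceiling (%)
weeks_to_ceiling = (CEILING - intercept) / slope

print(f"Observed trend: {slope:.2f} percentage points per week")
print(f"Projected to reach {CEILING:.0f}% utilization around week {weeks_to_ceiling:.0f}")
```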
For the success of the entire project, the company must identify experts from different areas of the organization who possess specific technical skills and experience at different levels of the computing stack, and build an interdisciplinary team accordingly. The team has to include representatives from the application, platform, infrastructure, and security teams. Expertise in application and platform performance, enterprise management, virtualization, storage, networking, security policy, and regulatory constraints is required. This technical know-how must be merged with knowledge of the business processes, transactions, and KPIs adopted within the company. Such a broad range of talent is instrumental in accomplishing a number of steps detailed in Phase 1 and Phase 3 of this guidance framework. Specifically, this team is mandatory to:
- Identify enterprise management tools
- Identify assets and capacity
- Define technical and non-technical constraints
- Design the integration map
- Design workload placement rules
Without such expertise, the team would lack the completeness of vision needed to implement an effective capacity management platform.
Within the team, IT and the business must agree on service-level objectives, workload priorities, changes in the operational framework, roles and responsibilities assignments, and the expectations that the IT organization will set for end users.
IT organizations adopt capacity management in order to:
- Consolidate workloads
- Identify resource wasting and maximize utilization
- Offer and guarantee service-level performance
- Forecast the infrastructure growth and improve the IT procurement strategy
- Contain resource over-provisioning caused by self-service portals
- Proactively allocate resources ahead of system failures
- Address the unplanned demand in internal clouds with more confidence
- Leverage the resources offered by public and external clouds
Depending on what business goals the company has defined, a capacity tool needs to have different capabilities (see Phase 2: Step 2).
At the same time, capacity management tools can be used in a number of use cases, each one presenting unique challenges, including:
- Data center or server consolidation
- Internal cloud computing
- Server-hosted virtual desktop (SHVD) hosting
- High-performance computing (HPC)
Today, almost no vendor is able to offer all capabilities or address all these use cases with a single capacity management tool. Most solutions on the market focus on server consolidation and optimization scenarios where the virtual infrastructure hosts typical enterprise-class workloads (e.g., Web servers, database servers, e-mail servers, and application servers). For capacity vendors, SHVDs and HPC represent niche environments and thus the support is limited. Eventually, capacity vendors will develop comprehensive solutions that the IT organization can use in every scenario, but currently, a use case such as an SHVD environment requires dedicated capacity tools because of its unique constraints (see Phase 1: Step 3). Most vendors are focusing on internal clouds support because capacity monitoring and management is mandatory when offering IT as a service.
The IT organization needs to carefully assess which use case is appropriate for its capacity management practice and what functionalities have the highest priority in order to achieve the company's business goals. This decision will be critical when approaching the market and selecting the right capacity management solution (see Phase 2).
The IT organization needs sponsorship at the executive level to obtain the cooperation of all departments in designing and implementing a solution. Therefore, building a strong business case is a mandatory step.
Adopting or rearchitecting capacity management requires a significant joint effort across the company; it is an expensive process and requires dedicated resources to administer the tool. It is worth the effort though:
- Hardware rightsizing and effective workload placement allow the company to optimize application performance and offer SLAs.
- Sophisticated capacity modeling allows the company to address workload peaks and unplanned demand with confidence and to offer self-service provisioning.
- Higher utilization leads to better return on investment.
- Accurate forecasting of resource utilization leads to wiser hardware upgrades.
Building its business case, the IT organization should clearly define the adoption process, including the implementation prerequisites, the evaluation criteria for product selection, and the impact on the operational framework. This guidance document offers the appropriate support for this task in the next three phases.
The IT organization should start with an assessment of the current capacity management strategy. Any issues identified in the current process will support the business case. For example, if the company has heavily invested in its hardware infrastructure but struggles to understand the root cause of poor application performance, that may be a good indicator that the existing capacity management strategy is not effective enough. A sound capacity management strategy will help identify bottlenecks in infrastructure capacity and recommend changes, which might improve a business process.
The IT organization may also want to look at its virtual infrastructure from an optimization perspective. Determining that workloads are not consuming the available resources in the most efficient way helps to justify the need for new instrumentation. However, this is a challenging path because it is not easy to determine the optimization level of a virtualized data center.
A common approach to address this challenge consists of measuring optimization in terms of workload density at the virtualization host level, looking at VMs per host, VMs per CPU, and VMs per core ratios as key metrics. Another approach focuses on efficiency from the cost perspective and relies on the cost-per-VM metric. An alternative way to measure optimization is by looking at the average utilization level of virtual infrastructure resources. However, such methodologies are misleading (see Pre-Work Phase: Step 1). Many factors contribute to the optimization level of a virtual infrastructure, such as the use of resource oversubscription techniques or workload behavior over the workday. More than that, each part of the infrastructure (e.g., the compute, storage, or network layer) may present completely different utilization levels.
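For illustration only, the following sketch computes the simple density and utilization figures mentioned above from a hypothetical host inventory; as the text notes, these raw numbers say nothing about whether constraints are satisfied or whether the placement is actually optimized.

```python
# Illustrative only: compute the simple density/utilization metrics discussed above
# from a hypothetical inventory. High averages can hide badly placed workloads.
hosts = [
    # (host name, physical cores, running VMs, average CPU utilization %)
    ("host-01", 16, 22, 35.0),
    ("host-02", 16, 18, 61.0),
    ("host-03", 24, 30, 28.0),
]

total_cores = sum(cores for _, cores, _, _ in hosts)
total_vms = sum(vms for _, _, vms, _ in hosts)

print(f"VMs per host: {total_vms / len(hosts):.1f}")
print(f"VMs per core: {total_vms / total_cores:.2f}")
print(f"Average CPU utilization: {sum(u for *_, u in hosts) / len(hosts):.1f}%")
```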
To understand the optimization level of its virtual infrastructure, the IT organization may be tempted to rely on those capacity management vendors that developed more sophisticated "efficiency indexes." These aggregated scores consider multiple aspects of the environment during an assessment. While more useful than single metrics, these indexes usually do not account for the capability to satisfy technical and non-technical constraints, and they do not relate to the company's service-level definitions. In addition, they are not comparable to each other, and no industry benchmark exists to define a standard. Finally, the IT organization cannot calculate the score before a POC implementation, thus making product comparisons complex and time consuming.
Although the lack of a standardized methodology makes calculating overall virtual infrastructure optimization extremely complex, Gartner recommends against blindly relying on efficiency indexes. Rather, IT organizations should focus on the optimization of those aspects (e.g., application performance) that help achieve specific business goals.
Gartner defines "optimized" as a virtual infrastructure where the workload placement satisfies all of an organization's technical, business, and compliance constraints and the capacity is allocated to avoid resource wasting (i.e., rightsized). Because a fundamental characteristic of virtual infrastructures is the workload mobility, the optimization level must be considered a dynamic value.
Once the IT organization has reconsidered its approach to capacity management, formed an interdisciplinary team, identified the scope of its project, and built a business case, it can begin to focus on the architecture. The approach described in this guidance document identifies the capacity management platform as the central brain that organizes the activity across the entire data center (see Pre-Work Phase: Step 1). This phase of the guidance framework helps the IT organization assess the infrastructure and redesign how its components interact with the capacity tool. The work done during this phase is also fundamental for selecting the appropriate product during the next phase.
Gartner recommends structuring this first phase as a four-step process:
- Step 1: Identify enterprise management tools
- Step 2: Identify assets and capacity
- Step 3: Define technical and non-technical constraints
- Step 4: Design the integration map
During this step, the IT organization needs to assess the components of its management stack. This activity is needed for two reasons: It speeds up the discovery of additional assets (e.g., physical and virtual machines, guest OSs, middleware, applications, and business services) and serves as the basis for a subsequent review of integration opportunities (see Step 4).
The list of management tools found during the assessment may include:
- Application performance managers
- Asset discovery systems
- Chargeback tools
- Configuration managers
- Event management systems
- Infrastructure managers (e.g., virtualization, storage, and network managers)
- Infrastructure orchestrators
- Infrastructure performance monitors
- Physical to virtual/virtual to virtual (P2V/V2V) migration tools
- Provisioning systems
- Request management systems (e.g., self-service provisioning portals, service catalogs, and ticketing systems)
- Reporting tools
During this step, the team may identify capacity constraints within various parts of the infrastructure management software such as an affinity rule that is defined in the Distributed Resource Scheduler (DRS) control panel of VMware vSphere.2 Later in this phase, such rules will help the IT organization define all constraints and policies that the company needs for modeling capacity (see Step 3). More importantly, identifying these rules early will help the team resolve potential conflicts while designing the workload placement rules (see Phase 3: Step 2).
To redesign a capacity management solution, IT organizations must determine what business services the infrastructure is hosting and what business goals the enterprise is trying to achieve. To accomplish this task, the IT organization needs to identify available IT assets, discover deployed applications, and map them to their owners and lines of business. If the scope of the project is not workload consolidation, the activity may be limited to the virtual infrastructure; otherwise, it has to include physical assets.
To go through this process, the IT organization has different paths. It can:
- Rely on a pre-existing configuration management database (CMDB)
- Leverage pre-existing IT asset discovery tools
- Assess the IT inventory from infrastructure management tools
- Use a native asset discovery engine that comes with some capacity planning tools
- Perform a manual assessment
The recommended option for the IT organization is to leverage an existing CMDB. This approach guarantees consistency, accuracy, and visibility at all levels of the stack. In addition, the CMDB probably already stores service interconnectivity and dependency maps. However, this approach assumes that there are no inventory conflicts caused by concurrent CMDBs, that the database is healthy and up to date, and that there is a strong governance plan in place. While the company may consider alternatives for small projects, capacity management in large-scale virtual infrastructures and internal clouds must rely on a robust CMDB.
Sometimes, leveraging a CMDB is not possible: the organization may not yet have invested in one, or the company may have acquired a smaller company without a centralized IT model. In these scenarios, the IT organization may need to discover physical assets and applications by using alternative solutions, such as IT asset discovery tools. The company may also prefer to postpone the update of the CMDB until after a consolidation project, where multiple workloads are migrated into the virtual infrastructure through a P2V migration process.
If the asset identification is limited to the virtualized data center, the IT organization may rely on the integration between the capacity tool and the virtual infrastructure managers to determine the infrastructure assets. Some vendors focus only on out-of-the-box integration with one or two leading virtualization platforms. While this approach greatly accelerates the implementation of a capacity management tool, it has significant shortcomings. The representation of the environment is usually accurate only from a compute perspective, which leaves much to be desired in terms of visibility at the storage and network layers. In addition, virtualization managers do not offer application discovery unless they gain additional functionality through integration with application performance management (APM) tools. Thus, this method often results in incomplete data and can hinder capacity management.
If the scope of the project is limited and the company needs capacity planning only for workload consolidation, the IT organization can use asset discovery engines that come with some capacity tools. Some of these engines are agent-based while others adopt an agentless approach; both have implications that are described in Phase 2 of this guidance framework. While relying on these engines implies less management work than a manual assessment, there is no guarantee that the discovery process will recognize all assets. Accordingly, the IT organization should consider this approach only when there are no better alternatives.
While providing the maximum level of accuracy, a manual assessment remains the most complex and expensive approach to asset and application discovery. The IT organization should consider this path only when a CMDB is not available and technical issues prevent the use of other automated discovery solutions. At the end of a manual assessment, the interdisciplinary team should review and confirm the findings (see Pre-Work Phase: Step 2).
Regardless of the approach used for asset discovery, the interdisciplinary team will help to identify application owners faster, to evaluate the life span of temporary projects, and to determine the relationship between different VMs in multi-tier platforms. The information gathered during this step will be instrumental when defining technical and non-technical constraints.
At the end of this step, the interdisciplinary team should have identified the following building blocks:
- Assets (including physical and virtual machines, storage and networking equipment, OSs, middleware, and applications)
- Business services
- Capacity (at compute, storage, and network layers)
- Management tools
Now, the IT organization has to focus on the relationship between capacity and business services.
Offering an IT service implies addressing a number of different needs in different domains. At the lowest level, a service is the sum of multiple resources, technologies, and workloads. A CRM platform may be the sum of VMs that host Web services, integration middleware, database and analytics services, and all the virtual and physical resources needed to run these tiers. These components relate to one another through multiple technical rules, such as:
- Availability of special hardware (e.g., server-side graphics processing units [GPUs])
- Encryption levels
- Hypervisor and OS compatibility
- Load balancing relationships
- Network connectivity and network interface card (NIC) affinity
- Network security zones
- Resiliency and recovery levels
- Storage connectivity and media
- Support for resource oversubscription (e.g., memory overcommitment)
- Support for virtual hardware hot-plugging
- Support for VM live migration
At a higher level, a service exists to accomplish business objectives. A company may want to differentiate its CRM offering through tiered performance, availability, and support service levels to attract new classes of prospects. The difference between the Silver and Gold service levels, for example, may boil down to a different amount of capacity assigned to different VM instances that form the same service. This approach guarantees that a specific application performance level (e.g., Bronze) addresses only up to a certain amount of demand (a hypothetical sketch of such a tier-to-capacity mapping follows the list below). Business constraints might include a great number of items, such as:
- Application response time
- Backup and maintenance windows
- Business groups and application owners
- Deployment environment (e.g., production, test, and development)
- Disaster recovery time objectives
- Geographic location of service components
- Personnel availability and knowledge
- Power consumption
- Software licensing restrictions
- Time zones
- Uninterruptible power supply (UPS) power backup policy
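The tiered-capacity idea above can be expressed as a simple policy table. The sketch below is purely hypothetical (the tier names, figures, and the capacity_for helper are invented for illustration) and shows how a Gold/Silver/Bronze classification might translate into per-VM allocations and headroom targets.

```python
# Hypothetical sketch: map service tiers to per-VM resource allocations and
# headroom targets. All names and numbers are invented for illustration.
SERVICE_TIERS = {
    "Gold":   {"vcpu": 8, "memory_gb": 32, "headroom_pct": 30},
    "Silver": {"vcpu": 4, "memory_gb": 16, "headroom_pct": 20},
    "Bronze": {"vcpu": 2, "memory_gb": 8,  "headroom_pct": 10},
}

def capacity_for(tier: str, vm_count: int) -> dict:
    """Aggregate capacity to reserve for a service delivered at a given tier."""
    spec = SERVICE_TIERS[tier]
    return {
        "vcpu": spec["vcpu"] * vm_count,
        "memory_gb": spec["memory_gb"] * vm_count,
        "headroom_pct": spec["headroom_pct"],
    }

print(capacity_for("Silver", vm_count=3))
# {'vcpu': 12, 'memory_gb': 48, 'headroom_pct': 20}
```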
At even higher levels, a service may need to comply with regulatory constraints. A company trying to do business in specific markets and/or geographies may have to satisfy requirements that heavily affect the design of its services. These kinds of constraints include:
- Customer segmentation in multi-tenant environments
- Data access restrictions (e.g., for embargoed countries)
- Privacy rules
- Service components and data segmentation
- Regulatory compliance (e.g., Payment Card Industry Data Security Standard [PCI DSS], Sarbanes-Oxley Act [SOX], and Federal Information Processing Standards [FIPS])
IT organizations interested in additional research on compliance constraints should review the overview "Audit and Attestation in Virtual Environments."
Some of these constraints gain more or less relevance depending on the context in which the IT organization is managing infrastructure resources. SHVD environments and HPC environments are good examples of special use cases for capacity management:
- In SHVD environments, application performance is critical because it directly affects the user experience. In these environments, the capacity modeling engine needs to account for constraints such as end-to-end transaction times, QoS settings that affect network and storage I/O channels, and differences between storage media (e.g., rotating disk drives vs. solid-state flash memory drives). Going forward, capacity tools will need to account for even more constraints, such as the prerequisites for high-performance remote desktop protocols (e.g., Citrix Independent Computing Architecture/High Definition User Experience [ICA/HDX], Microsoft RemoteFX, and VMware PC-over-IP [PCoIP]).
- For example, capacity tools will have to account for the presence of server-side GPUs when coordinating the provisioning of virtual desktops for specific power users. Both Microsoft and VMware now include support for these specialized display cards. However, the limited expansion room in most rack-mounted servers will limit the number of installable units. Accordingly, the number of server-side GPUs would not be sufficient to support all the company's virtual desktops. Capacity tools will have to prioritize virtual-desktop placements on GPU-powered servers according to multiple factors: the applications category, the expected level of multi-media activity, the need for special client-side configurations (e.g., multiple monitors), and so on.
- Similarly, capacity tools will have to consider how these new technologies impact ordinary operations inside the virtual infrastructure. Microsoft, for example, supports the live migration of a RemoteFX-enabled virtual desktop only when the same GPU is available on both the source and the target physical host. Moreover, Microsoft supports RemoteFX only on Intel and Advanced Micro Devices (AMD) CPUs that feature support for nested memory paging technologies (i.e., Intel Extended Page Tables [EPT] and AMD Rapid Virtualization Indexing [RVI]).3
- HPC environments require the analysis of specific constraints as well. Amazon Web Services (AWS) has demonstrated how cloud HIaaS platforms can be used to host HPC workloads (e.g., for scientific data crunching or movie scenes rendering) with the introduction of Cluster Compute and Cluster GPU instances for Elastic Compute Cloud (EC2) in July 2010. In HPC environments such as this, capacity solutions for virtual infrastructures must take into account additional metrics and constraints, which allow for the creation of I/O-intensive or CPU-intensive logical zones for workload placement, for example.
- Though the presence of all these constraints seems to limit the ability to model capacity and arrange the workloads across the infrastructure, it actually helps the capacity analysis engine find the best possible combinations. This combinatorial problem4 can become quite complex even for a few candidate workloads, target hosts (and already allocated workloads), and constraints, to the point of requiring a specialized engine to identify the best allocation. Few capacity management vendors, such as BMC Software and CiRBA, provide these capabilities. CiRBA calculated that approximately 30 billion different placement combinations exist for moving 15 workloads onto five servers.3
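To illustrate the combinatorial nature of constraint-aware placement, the sketch below (with hypothetical hosts, workloads, and constraints) enumerates every assignment of four workloads to two hosts and keeps only those that satisfy simple capacity and anti-affinity rules. The same brute-force approach applied to 15 workloads and five hosts would already face roughly 5^15 (about 30 billion) combinations, which is why dedicated analysis engines are required.

```python
# Hypothetical sketch: constraint-aware workload placement as a combinatorial search.
# Brute force is feasible here only because the example is tiny.
from itertools import product

hosts = {"host-a": 16, "host-b": 16}                         # available vCPUs per host
workloads = {"web-1": 4, "web-2": 4, "db-1": 8, "db-2": 8}   # vCPU demand per workload
anti_affinity = [("db-1", "db-2")]                           # pairs that must not share a host

def satisfies_constraints(placement: dict) -> bool:
    # Capacity constraint: the demand placed on each host must fit its capacity.
    for host, capacity in hosts.items():
        demand = sum(cpu for wl, cpu in workloads.items() if placement[wl] == host)
        if demand > capacity:
            return False
    # Anti-affinity constraint: listed pairs must land on different hosts.
    return all(placement[a] != placement[b] for a, b in anti_affinity)

names = list(workloads)
candidates = [dict(zip(names, combo)) for combo in product(hosts, repeat=len(names))]
feasible = [p for p in candidates if satisfies_constraints(p)]

print(f"{len(candidates)} possible placements, {len(feasible)} satisfy all constraints")
```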
As part of this step, Gartner recommends that the IT organization identify and map the technical and non-technical constraints affecting those services that will be hosted on the virtual infrastructure. The interdisciplinary team formed during the Pre-Work Phase is critical for identifying and reviewing existing and future constraints that will affect capacity modeling. This process provides critical information for selecting the right product (see Phase 2) and for designing workload placement rules (see Phase 3: Step 2).
Gartner also recommends that the interdisciplinary team assess how different business units and departments within the company expect to leverage the virtual infrastructure in the future. A growing confidence in the technology, an executive push for adoption, broader use case support, the availability of additional hardware resources, and the guarantee of specific service levels are all drivers that may attract new end users. This implies the onboarding of new business services with a multitude of new constraints. The company has to review the deployment road maps and forecast the adoption trends for its infrastructure in order to understand how sophisticated its capacity management needs will be over time.
A fundamental task in the adoption of a capacity tool is designing its integration with the other monitoring and management solutions controlled by the IT organization. In fact, while some products in this space offer basic capabilities to collect data from target systems and report utilization levels through native engines, the most sophisticated solutions offer a number of I/O interfaces for mainstream enterprise-class management platforms (e.g., BMC Capacity Management, CA Capacity Management, and CiRBA Data Center Intelligence).
Deep integration with the existing management stack is the key to fully exploiting the potential of the capacity management tier. To speed up the integration process, Gartner recommends tracking down and identifying all the company's tools that may interconnect with the capacity solution. The interdisciplinary team should review these tools from an interoperability perspective and assess which interfaces and protocols can be used for integration.
The IT organization should pursue integration for three reasons: to collect the data, to model the data, and to execute the capacity plan (see Figure 1). Designing the integration map at this time will also facilitate the definition of RFP requirements later (see Phase 2: Step 2).
First, the IT organization has to establish an integration map to collect data about the existing physical and virtual infrastructure. Such integration should connect the capacity management tool with:
- Infrastructure management tools: The most basic integration provided by a capacity tool for virtual infrastructure involves the virtualization management layer. From there, the capacity tool acquires the inventory for its analysis and basic infrastructure performance trends. Some products do not offer any additional integration, instead relying solely on the hypervisor manager's capability to identify IT assets and available capacity; any blind spot affects the capability to produce an efficient optimization plan. For example, if there is an IBM SAN Volume Controller (SVC) in the data path, storage that exists beyond the SVC is abstracted from the hypervisor. As a result, the hypervisor only reports storage capacity and performance data that it can collect between itself and the storage virtualization appliance. A blind spot would exist between the SVC and the physical storage devices that are connected to the SVC.
- Better capacity management products extend this integration to other storage and/or network management solutions, thereby providing a more comprehensive view of resource allocation and utilization trends across the virtualized data center.
- Service and configuration management tools: At a higher level of integration, capacity management solutions may offer integration with service management and/or configuration management tools. This approach leverages the organization's CMDB to retrieve a holistic representation of the infrastructure, thus reducing the complexity of data collection. The IT organization must be aware that, in such a scenario, the capacity tool can produce accurate capacity plans only as long as the CMDB is updated and free of inconsistencies. If the CMDB is not healthy and/or not managed through a sound governance policy, the IT organization has to consider sanitization or an alternative integration approach.
- Infrastructure and application performance tools: IT asset discovery or acquisition from third-party data sources is just a first step. A capacity management solution needs to measure and track the resource utilization level over time.
- Few capacity tools come with native infrastructure performance monitoring capabilities. Where there is no internal engine, they rely on other management solutions or on dedicated performance tools. Regardless of the data source, this approach is not enough. A capacity analysis based solely on infrastructure performance monitoring in fact assumes that the software layer, including the OSs and the application services on top, is already optimized and behaving as expected. Real-world experience suggests that, too often, this is not the case. Approaching capacity allocation with application performance awareness helps organizations make smarter decisions. For example, the level of performance required for a particularly mission-critical business platform such as SAP may lead the IT organization to provision and place its workloads on the fastest storage and networking resources available, such as solid-state drives (SSDs) and Fibre Channel (FC) links. Accordingly, a capacity tool is more likely to produce an accurate recommendation plan when it analyzes both infrastructure and workload performance. Thus, integration with APM products is highly desirable.
After IT asset identification and performance data collection, the capacity tool has to model capacity or simulate different allocation plans with its WHAT-IF analysis engine. Simulation and modeling accuracy depends on how much information the tool can collect. Whenever possible, the IT organization should aim to integrate the capacity tool with other management tools, including:
- Self-service provisioning portals: A key external source for data modeling is the self-service provisioning portal, in which end users select the workloads offered through a service catalog and request capacity to meet their business objectives. Depending on the level of standardization and the sophistication of the infrastructure, capacity requirements can be defined as resource units (e.g., four vCPUs or 2GB RAM) or as KPIs (e.g., number of concurrent users per second, order entries per day, and business intelligence queries per day). Either should be an acceptable input for capacity modeling (a hypothetical sketch of such a KPI-to-resource translation follows this list). The demand here does not relate to a single VM but rather to the entire virtualized infrastructure that delivers the requested service. However, the capacity tool needs to break down the compute, storage, and network requirements according to the baselines defined by the service catalog that comes as part of the self-service provisioning stack.
- Event management systems: IT organizations should strive to also integrate event management systems with the capacity tool. The capacity tool must be able to predict and simulate resources consumption upfront and as accurately as possible. Although some capacity tools offer trend modeling based on linear regression,1 event awareness guarantees a greater level of accuracy. Such integration helps the organization to understand what happens at any given moment in the infrastructure and to anticipate the impact of specific issues. For example, the awareness of an imminent hardware failure at the storage layer would trigger a WHAT-IF simulation to evaluate how the IT organization could leverage unused capacity to reallocate the workloads inside a faulty array.
- Request management tools: The integration of request management tools with capacity tools can allow the IT organization to solve a significant capacity management problem in a resource-sharing environment: how to assign the right priority to concurrent requests. A first-come, first-served approach can work only as long as the entire infrastructure is provisioning undifferentiated workloads for the same department or organization, which rarely happens in large-scale environments. When multiple departments leverage the virtual infrastructure for a number of different workloads, the IT staff must enforce priority management. This can happen through the definition of business constraints within the capacity management tool or through integration with request management tools, from which the constraints are derived.
- Chargeback tools: The capacity tool can obtain additional information for data modeling by integrating with chargeback tools. From there, the capacity management solution may retrieve specific budget constraints that further affect the capacity plan.
- Infrastructure orchestrators: Finally, integration with an infrastructure orchestrator may provide additional technical constraints, because this component coordinates the entire management stack and is where the IT organization may want to define specific limitations that influence operations such as service provisioning, disaster recovery plans, or system maintenance procedures.
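The sketch below is a purely hypothetical illustration of the KPI-to-resource breakdown referenced in the self-service portal item above; the tier names, per-unit baselines, and the resources_for helper are invented, and a real service catalog would carry storage and network baselines as well.

```python
# Hypothetical sketch: translate a KPI-based capacity request into compute resource
# units using per-unit baselines from a service catalog. All figures are invented.
import math

# Baseline: resources required per 100 concurrent users, for each tier of the service.
CATALOG_BASELINES = {
    "web":      {"vcpu_per_100_users": 2, "memory_gb_per_100_users": 4},
    "app":      {"vcpu_per_100_users": 4, "memory_gb_per_100_users": 8},
    "database": {"vcpu_per_100_users": 2, "memory_gb_per_100_users": 16},
}

def resources_for(concurrent_users: int) -> dict:
    """Break a KPI request (concurrent users) down into per-tier resource units."""
    units = math.ceil(concurrent_users / 100)
    return {
        tier: {
            "vcpu": base["vcpu_per_100_users"] * units,
            "memory_gb": base["memory_gb_per_100_users"] * units,
        }
        for tier, base in CATALOG_BASELINES.items()
    }

# An end user requests capacity for 450 concurrent users through the portal.
print(resources_for(450))
```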
Both the vendor and the IT organization can further influence capacity modeling through constraints defined as placement rules and in product libraries called "catalogs" that come as part of the capacity management tool. These libraries, not to be confused with the "service catalog" component of a self-service provisioning stack, are usually defined as:
- Hardware and virtualization platform catalogs
- OS and applications catalogs
Capacity management catalogs include characteristics and configuration details about most popular physical servers, virtualization platforms, and enterprise applications (see Phase 2: Step 2).
Once the capacity solution has generated its actionable recommendation list, capacity managers have to execute it. Whenever possible, the IT organization should aim for full or partial capacity-tool automation through integration with other management tools, including:
- P2V migration tools: Some capacity planning products can integrate with P2V migration tools to execute a consolidation plan (e.g., Novell PlateSpin Recon and Migrate and 5nine Migrator and P2V Planner). Over time, when this class of solutions evolves to offer physical-to-cloud (P2C), virtual-to-cloud (V2C), and cloud-to-cloud (C2C) migration capabilities, capacity tools will be able to leverage them for complex workload placements such as "cloud bursting" and "cloud hopping."
- Infrastructure and configuration management tools: For capacity management tools, the first level of integration offered for executing the capacity plan is with the virtualization management layer: the product leverages available APIs to perform virtual hardware reconfiguration or VM live migrations. This form of integration is the easiest to deploy because vendors offer out-of-the-box connectors with the mainstream virtualization platforms, but it has limitations. Current solutions, in fact, do not leave room for integration with intermediary tiers such as request management tools, approval workflows, or orchestration frameworks.
- Infrastructure orchestrators: Although a viable implementation, the IT organization should consider the point-to-point integration between a capacity tool and other components of the management framework as a sub-optimal solution. The integration with an infrastructure orchestrator represents a more challenging yet more flexible and scalable approach. Through the orchestrator, a capacity tool can coordinate the entire management stack to reallocate resources and rearrange the workloads. Additionally, thanks to the orchestrator, a capacity tool can integrate with additional tiers, such as an authorization workflow or a provisioning system.
- Authorization workflows: The integration with an authorization workflow regulates the execution of infrastructure changes and assigns priorities to concurrent requests for capacity. It introduces a higher degree of transparency into the capacity management process, facilitates the delegation of authority, reduces the gap between the capacity plan generation and the actual fulfillment, and ultimately produces non-repudiation.
- Provisioning systems: The integration with a provisioning system allows for reserving capacity ahead of predictable, ordinary, or exceptional events, such as a client-facing demo or a major product launch. Coordinated by the orchestrator and executed by the provisioning system, the capacity tool is able to book resources at the compute, storage, network, and security layers, which becomes a mandatory capability in internal clouds.
- Ticketing systems: The integration with a ticketing system allows for the initiation of a trouble ticket as soon as the capacity tool identifies capacity shortage. This allows the IT organization to anticipate an increase in support calls and helps the IT procurement process.
- Reporting, show-back, and chargeback tools: The capacity tool can also integrate with reporting, show-back, and chargeback tools. At the most basic level, this integration can be helpful to demonstrate a resource-wasting trend, to support future initiatives about consolidation or optimization, to justify a budget reallocation when the performance degradation depends on resource under-provisioning, or to increase chargeback costs due to inefficient demand from the business. The integration with chargeback tools can go beyond that: Capacity modeling and capacity-booking information can help the company project budget costs and select the most appropriate capacity allocation strategy, accordingly.
The diagram in Figure 2 summarizes the integration opportunities that the IT organization should assess and pursue to automate capacity plan execution and reduce its operational costs.
Integration with all these external products poses significant implementation challenges, which are described in the "Risks and Pitfalls" section of this guidance document, but integration is a fundamental milestone in the process for a virtualization-ready capacity management discipline.
Source: Gartner (June 2011)
The pre-work activities and Phase 1 of this guidance document armed the IT organization with new tools to evaluate the market offering and to select the best solution for the company. The Pre-Work Phase recommended approaching the capacity management discipline from a service-oriented perspective and determining the use cases that the company wants to address. Phase 1 helped in identifying technical and non-technical constraints to model capacity, as well as redesigning the data center architecture to integrate the capacity management tier. The decisions made during these phases will help the IT organization to understand what capabilities should be included in the RFP.
Phase 2 of this guidance framework does not intend to replace a proper product assessment. Rather, it defines the basis for that by providing guidance on how to look at the vendors' positioning, portfolios, and road maps.
Gartner recommends structuring this phase as a four-step process:
- Step 1: Develop an RFI
- Step 2: Develop an RFP
- Step 3: Send an RFQ
- Step 4: Build a POC and select provider
The IT organization needs to develop an RFI to understand the market landscape, gather basic information from vendors, and determine which ones are in line with the approach to capacity management described at the beginning of this guidance framework (see Pre-Work Phase: Step 1).
Because the RFI targets a broad range of potential suppliers that may or may not adopt the same terminology, the IT organization risks focusing its review efforts on products that have some capacity management capabilities but that are not developed for this specific purpose. Understanding the classes of solutions is fundamental for IT organizations to interpret a vendor's positioning and marketing jargon and to determine which vendors should be included in the RFI process.
The term "capacity management" is often misused to indicate a broad range of products with different capabilities, focus, and maturity levels. Gartner recognizes three classes of capacity solutions:
- Capacity reporting tools primarily focus on the identification of available and used capacity and on reporting about resource wasting. These products may come as stand-alone solutions, but vendors often develop them as modules of comprehensive infrastructure monitoring solutions. This class of products relies on manually defined utilization and performance thresholds to compute their capacity analysis.
- Capacity planning tools are more sophisticated because they often offer a WHAT-IF analysis engine to simulate resource consumption scenarios and comprehensive software and hardware catalogs to model capacity (see Step 2). This class of tools is able to recommend to which virtualization hosts the IT organization should move specific workloads, but the capacity plan is generated only on demand. Vendors design these solutions primarily to assist in workload consolidation projects. In fact, some of them feature tight integration with P2V migration tools, usually provided by the same vendor.
- Capacity management tools offer the broadest and most sophisticated range of features. They are able to constantly track available and used capacity, peak and average resource utilization levels, application performance, and event occurrences. This class of solutions also features analysis engines for advanced WHAT-IF simulations and utilization forecasting. Capacity management products allow the IT organization to model capacity according to technical, business, and compliance constraints (see Phase 3: Step 2). Additionally, these solutions are able to recommend infrastructure reconfiguration to maximize resource utilization (e.g., increase resources on server 1 by 256GB of memory or replace server 1 with five servers from this specific independent hardware vendor [IHV]).
- Finally, yet importantly, capacity management products feature integration with a broad range of other management tools (see Phase 1: Step 4) to accomplish a number of tasks:
- Identify assets
- Track capacity utilization
- Measure application performance
- Simulate capacity utilization
- Model the capacity plan (i.e., resource allocation and workload placement)
- Automate the manipulation of infrastructure to execute the capacity plan
- Report utilization trends and resource wasting
The classes of capacity management solutions for virtual infrastructures, and their key capabilities, are listed in Figure 3.
Source: Gartner (June 2011)
Note that not all products fit one specific category. Advanced capacity management features, for example, may appear in different classes of solutions, such as in life cycle management products (e.g., ManageIQ EVM Suite).
Once the IT organization has collected RFI proposals and has performed an initial selection following the guidelines offered in the previous step, it can focus on a more detailed analysis of the vendors' offerings through an RFP. The company should use the RFP process to obtain the details necessary to make a decision. To probe both the vendor's ability to meet application requirements and the vendor's viability, the RFP should consist of exhaustive and detailed questions.
The IT organization should develop its RFP while focusing on three core areas:
When evaluating capacity solutions, the technical architecture is a key aspect to consider. The IT organization should focus its attention on four aspects: the deployment model, the data collection mechanism, the scalability limits, and the fault tolerance.
Most capacity solutions adopt an on-premises deployment model. In a typical enterprise configuration, the data collection agents (if any), the data collector engine, the capacity database (CDB), and the analysis engine are installed as dedicated tiers, all within the corporate perimeter, as shown in Figure 4.
Source: Gartner (June 2011)
Some tools in the capacity planning category present a hybrid architecture where only some components are deployed locally: the data collection agents (if any) and the data collector engine. The latter acts as a proxy, uploading information to the capacity vendor's infrastructure, where the CDB and the analysis engines reside. In this scenario, the IT organization operates the capacity tool through a management console that can be installed on-site or that comes as a Web-based application, as shown in Figure 5.
Source: Gartner (June 2011)
This hybrid architecture may represent a challenge in terms of security and compliance for some customers. Before adopting a solution that leverages this approach, the IT organization needs to assess several details, including:
- What kind of data the agents collect
- If the collected data is anonymized and/or encrypted
- If the collected data is transferred from the corporate environment to the vendor's infrastructure through a secure channel
- What is the data retention policy enforced by the vendor inside its infrastructure
- If the vendor is collecting data from multiple customers on a resource-sharing infrastructure
- If and how the collected data can be retrieved and deleted from the vendor's infrastructure
- Where the vendor's infrastructure is geographically located
- What is the vendor's SLA for data and service availability
- How the vendor deals with data compliance issues
- What is the vendor's responsibility for lost or stolen data
After reviewing the deployment model, the IT organization should pay close attention to the data collection mechanism used by capacity tools. The analysis engine requires information about asset resources and configuration, as well as infrastructure and application performance metrics, to model capacity. Most products offer native capabilities to retrieve such information through an agent-based or an agentless approach. The most sophisticated capacity management products also offer alternate paths for collecting data through integration with third-party monitoring and management tools (see Phase 1: Step 4).
If the IT organization decides to use the native data collection mechanism, it should evaluate the trade-offs for both agent-based and agentless approaches:
- Agentless solutions collect data through standard management interfaces such as Windows Management Instrumentation (WMI) without the need for additional software components installed on targeted systems. This limits the management overhead, but it also limits the number of OSs that the capacity tool can monitor. Additionally, the management interfaces used for data collection may be unavailable on certain target systems as a result of OS hardening procedures (see the "Risks and Pitfalls" section of this guidance document, and the collection sketch after this comparison).
- Agent-based solutions imply that the IT organization deploys and updates agents across the entire infrastructure. These software components allow for monitoring a wider range of OSs compared with the agentless approach, but they also affect resource usage on targeted systems. More importantly, the agents extend the attack surface of monitored systems and potentially weaken their OS. In general, security is not a primary focus for capacity management vendors, which may be slow in providing agent updates when new vulnerabilities arise.
For additional research on agentless versus agent-based enterprise management, review "Data Center Systems Management: A Layered Approach."
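To make the agentless trade-off concrete, the following minimal Python sketch queries a remote Windows system over WMI using the third-party wmi package; the host name, account, and password are placeholders. If OS hardening has disabled the WMI service, or a firewall filters the RPC ports, the query simply fails and the host cannot be monitored without an agent.

```python
# Minimal agentless collection sketch using WMI (requires the third-party
# "wmi" and "pywin32" packages on a Windows collection host).
# Host name and credentials below are placeholders, not real values.
import wmi

def collect_basic_metrics(host, user, password):
    """Query memory and CPU figures from a remote Windows system over WMI."""
    try:
        conn = wmi.WMI(computer=host, user=user, password=password)
    except wmi.x_wmi as exc:
        # Typical failure modes: RPC ports filtered by a firewall, WMI service
        # disabled by OS hardening, or wrong credentials.
        print(f"{host}: not reachable over WMI ({exc})")
        return None

    metrics = {}
    for os_info in conn.Win32_OperatingSystem():
        metrics["free_memory_kb"] = int(os_info.FreePhysicalMemory)
        metrics["total_memory_kb"] = int(os_info.TotalVisibleMemorySize)
    for cpu in conn.Win32_Processor():
        metrics.setdefault("cpu_load_percent", []).append(int(cpu.LoadPercentage or 0))
    return metrics

if __name__ == "__main__":
    print(collect_basic_metrics("branch-web-01.example.com", "CORP\\svc_capacity", "********"))
```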
Scalability limitations are another important aspect to consider when reviewing a solution's architecture. If the IT organization plans to manage capacity for a large-scale infrastructure or an internal cloud, the maximum number of VMs and hosts supported per single instance is a fundamental metric to evaluate. Some capacity solutions currently available on the market have a scalability limit set as low as 500 concurrent VMs, while others reach up to 15,000 concurrent VMs. To scale, some vendors (e.g., BMC) have engineered data-warehousing techniques specifically designed to deal with the high volume of collected data, while other vendors (e.g., Lanamark) offer proprietary storage layers as a replacement for standard database systems.
For capacity management, scaling out is extremely important. In many cases, these products can scale out by adding more data collectors and analysis engine instances that can deal with tens of thousands of monitored workloads in large-scale environments. Additionally, the capability to deploy multiple data collectors may be critical for performing capacity management in geographically dispersed environments where workloads are deployed across multiple branch offices and connected through poor or unreliable WAN links.
A last important architectural aspect to consider is the fault tolerance of the capacity tool. While this point may be irrelevant for capacity planning solutions, it becomes more significant when capacity management has become a key part of the architecture (see Phase 1: Step 4). An unavailable capacity engine slows down the provisioning of new services and may affect the capability to charge back. Additionally, when used in cloud infrastructures, it may represent a single point of failure that affects other layers of the stack, such as the self-service provisioning portal or the orchestration framework. The IT organization should verify that the capacity tool comes with native high-availability mechanisms or that it supports clustering mechanisms offered by virtualization platforms (e.g., VMware High Availability [HA]).
Once the IT organization has reviewed architectural alternatives, it is fundamental to focus on the key features to look for in capacity management solutions.
Capacity management is not a new discipline. IT organizations have had to address resource allocation challenges since the mainframe era. Despite that, products designed for physical environments do not fit well in the world of virtual infrastructures. To be virtualization-ready, a capacity tool has to take into account the unique abstraction capabilities introduced by hypervisors and the exceptional mobility properties of VMs.
Enterprises should review the following list of capabilities that Gartner recognizes as highly desirable and identify those that are critical for achieving business goals:
- Virtualization workload mobility awareness: Workload mobility is a unique capability of virtual infrastructures that capacity management solutions need to support. All modern virtualization management tools allow for the movement of one or more VMs from one physical host to another without service interruption. The IT organization needs this feature for a variety of reasons, from system maintenance activities, such as software patching or hardware upgrades, to disaster recovery. However, technical constraints such as processor compatibility affect the capability to perform live migrations. For example, not all virtualization platforms support the live migration of a VM from an Intel CPU to an AMD CPU. Sometimes, they do not even support the live migration from an old Intel CPU to a newer generation. When supported, vendors may restrict these migrations to a limited set of guest OSs.
- The capacity management tool should know what platform supports what kind of migration for what guest OS.
- Virtualization platforms also leverage workload mobility awareness as a basic form of capacity management. VMware, for example, operates VM migrations to execute optimization guidelines defined by the DRS. The IT organization can even influence the DRS behavior through manually defined affinity rules5 that determine whether two VMs should be placed on the same host.
- The capacity management tool should be able to understand these constraints and consider them during capacity modeling. However, the IT organization should avoid multiple workload placement rules at different levels of the computing stack because they generate conflicts and increase management complexity. Phase 3: Step 2 of this guidance framework provides recommendations on how to resolve such conflicts. A minimal placement validation sketch follows this item.
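As an illustration of how mobility constraints and placement rules interact, the sketch below, with hypothetical host names and rule sets not tied to any specific product, validates a proposed live migration against CPU-compatibility, affinity, and anti-affinity constraints before it would be recommended.

```python
# Illustrative placement validation: checks a proposed live migration against
# CPU-compatibility and affinity/anti-affinity constraints. Names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    cpu_family: str              # e.g., "intel-sandybridge", "amd-opteron"
    vms: set = field(default_factory=set)

def can_migrate(vm, source, target, affinity_pairs, anti_affinity_pairs):
    """Return (allowed, reason) for moving `vm` from `source` to `target`."""
    # Technical constraint: live migration across CPU families is not supported.
    if source.cpu_family != target.cpu_family:
        return False, "CPU family mismatch prevents live migration"
    # Affinity: VMs that must stay together have to end up on the same host.
    for a, b in affinity_pairs:
        partner = b if vm == a else a if vm == b else None
        if partner and partner not in target.vms:
            return False, f"affinity rule keeps {vm} with {partner}"
    # Anti-affinity: VMs that must stay apart cannot share the target host.
    for a, b in anti_affinity_pairs:
        partner = b if vm == a else a if vm == b else None
        if partner and partner in target.vms:
            return False, f"anti-affinity rule separates {vm} from {partner}"
    return True, "placement allowed"

host_a = Host("esx-01", "intel-sandybridge", {"web-01", "db-01"})
host_b = Host("esx-02", "intel-sandybridge", {"db-02"})
print(can_migrate("db-01", host_a, host_b,
                  affinity_pairs=[],
                  anti_affinity_pairs=[("db-01", "db-02")]))
```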
- Support for technical and non-technical constraints: This guidance framework describes a number of technical, business, and compliance constraints that the IT organization should define to model resource allocation and workload placements. Current virtualization management platforms (e.g., Citrix XenCenter, Microsoft System Center Virtual Machine Manager, and VMware vCenter) offer capabilities that are too limited (e.g., resource pools or affinity rules) to define most of the constraints listed in Phase 1: Step 3. For example, before VMware vSphere 4.1, it was impossible to define affinity rules to keep VMs on a specific host. More importantly, the definition and modification of constraints using these tools requires direct access and administrative control over the virtual infrastructure, thereby posing security risks and introducing unnecessary management overhead for IT managers. Finally, the capacity management capabilities that virtualization platforms offer primarily focus on compute resources, which leaves the IT organization with limited or no visibility at the storage and network layers.
- The capacity management tool should allow for the definition of complex, multi-dimensional placement rules according to the technical, business, and compliance constraints inherent to each service that the infrastructure is hosting.
- Support for resource over-subscription: Modern virtual infrastructures offer advanced resource over-subscription to increase workload density beyond physical limitations. Techniques such as memory ballooning and content-based page sharing, available in leading virtualization platforms, are examples of memory over-subscription, while thin provisioning, available in many storage arrays, is an example of storage over-subscription. However, not every ISV supports these technologies for its OS and applications. For example, Microsoft supports Dynamic Memory for SQL Server6 but not for other mission-critical back-end services such as Exchange and SharePoint.
- Once a key resource such as memory is abstracted and over-subscribed, allocating it becomes a much more complex challenge.
- The capacity management tool should know which OSs and applications support resource over-subscription. Additionally, because over-subscription happens in real time, the tool should be able to verify available capacity continuously.
- Resource over-subscription has proved to be extremely efficient in specific environments, such as SHVD infrastructures or development and test environments. If the IT organization plans to use capacity management for this use case, then this capability should be a top requirement (see Pre-Work Phase: Step 3, and the over-subscription sketch after this item).
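The sketch below illustrates one way a capacity engine might account for over-subscription support; the application support matrix and all figures are assumptions made purely for illustration, not vendor statements.

```python
# Hypothetical over-subscription awareness: memory reclaimed through ballooning
# or page sharing only counts as headroom for workloads whose ISV supports it.
OVERSUBSCRIPTION_SUPPORT = {          # illustrative, not an authoritative matrix
    "sql-server": True,
    "exchange": False,
    "sharepoint": False,
}

def effective_memory_demand(vms):
    """Sum memory demand, applying a reclaimable discount only where supported."""
    total = 0.0
    for vm in vms:
        supported = OVERSUBSCRIPTION_SUPPORT.get(vm["app"], False)
        reclaimable = vm["reclaimable_gb"] if supported else 0.0
        total += vm["configured_gb"] - reclaimable
    return total

vms = [
    {"name": "sql-01", "app": "sql-server", "configured_gb": 32, "reclaimable_gb": 8},
    {"name": "exch-01", "app": "exchange", "configured_gb": 32, "reclaimable_gb": 8},
]
host_physical_gb = 56
demand = effective_memory_demand(vms)
print(f"effective demand: {demand} GB on {host_physical_gb} GB host "
      f"({'fits' if demand <= host_physical_gb else 'over-committed'})")
```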
- Support for virtual hardware hot-plugging: Capacity management in the virtual world should also take into account advanced resource manipulation, which is unavailable in physical environments. A key example is offered by the introduction of memory and CPU hot-plugging techniques. Modern hypervisors allow the IT organization to add, and in some cases remove, virtual hardware assigned to each VM without halting it. The capability to leverage this feature ultimately depends on the guest OS and its applications.
- The capacity management tool should know which OSs and applications support hardware hot-plugging and coordinate such activity according to the information provided by the APM tool (see Phase 1: Step 4).
- Support for hypervisor-level QoS: Capacity management should also consider other forms of resource manipulation, such as QoS for virtual infrastructures. While the traditional approach to QoS usually controls traffic through network-level devices (e.g., routers, switches, or even firewalls), this approach allows for the enforcement of access priorities to the storage and network I/O channels at the hypervisor level.
- Current implementations allow the IT organization to define such priorities on a protocol basis, VM basis, or virtual switch basis. For example, the Storage I/O Control (SIOC) technology introduced in VMware vSphere 4.1 allows defining "shares" and "limits" for VMs running on a cluster of ESX hosts that access a shared storage pool.7 Similarly, Citrix XenServer 5.6 Feature Pack 1 allows for a KB/s limit to be defined for specific VMs.8
- These capabilities unlock a range of untapped opportunities in terms of event-driven capacity allocation. The capacity management tool should be able to coordinate hypervisor-level QoS according to the information provided by the APM tool (see Phase 1: Step 4).
- Capacity reservation: In virtual infrastructures, multiple factors affect the available capacity, including:
- Hardware offlining and decommissioning
- Hardware failures
- Hypervisor-level QoS
- Manual and automated software and hardware reconfigurations
- Manual and automated VM migrations
- Resources over-subscription
- Software failures (e.g., guest OS or hypervisors crashes)
- Virtual hardware hot-plugging
- VM decommissioning (e.g., end of life)
- VM temporary offlining (e.g., for patching or reconfiguration)
- This leads to a highly dynamic environment compared with physical environments, where the IT organization reallocates resources at a slow pace and the reallocation does not necessarily affect all workloads (e.g., hardware reconfiguration). The continuous change of capacity availability in highly dynamic environments is challenging from a management perspective. Without pervasive automation, a gap exists between the analysis phase, when the capacity plan is generated, and the fulfillment phase, when the IT organization executes recommended configuration changes and workload placements. Even when these operations are automated through an orchestrator, the resource allocation may not be immediate if the architecture includes an approval workflow system. If the capacity plan addresses resource waste and suggests decommissioning one or more VMs or reducing assigned virtual resources, then there is no impact. However, if the capacity plan requires a workload re-arrangement, then any change in available capacity could prevent the relocation. In this scenario, the capability to reserve capacity upfront is fundamental.
- IT organizations also desire capacity reservation in order to meet demand ahead of predictable, ordinary, or exceptional events, such as a client-facing demo or a major product launch. Ahead of such events, the IT organization needs to be able to book capacity for a specific period. The capacity management tool should be able to coordinate the reservation of capacity through the integration with provisioning systems, configuration managers, infrastructure management platforms, or orchestrators (see Phase 1: Step 4, and the reservation sketch after this item).
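A minimal sketch of a capacity reservation ledger is shown below; the class, its fields, and the cluster figures are hypothetical, and the availability check is simplified to the booking's boundary days.

```python
# Illustrative capacity reservation ledger: bookings reduce the headroom that
# the placement engine may hand out during the overlapping period.
# Class name, fields, and figures are hypothetical.
from datetime import date

class ReservationLedger:
    def __init__(self, cluster_capacity_ghz):
        self.capacity = cluster_capacity_ghz
        self.bookings = []                     # (start, end, reserved_ghz, reason)

    def reserved_on(self, day):
        return sum(r for s, e, r, _ in self.bookings if s <= day <= e)

    def reserve(self, start, end, ghz, reason):
        # For brevity, only the booking's boundary days are checked here.
        busiest = max((self.reserved_on(d) for d in (start, end)), default=0)
        if busiest + ghz > self.capacity:
            raise ValueError(f"cannot reserve {ghz} GHz: insufficient headroom")
        self.bookings.append((start, end, ghz, reason))

ledger = ReservationLedger(cluster_capacity_ghz=400)
ledger.reserve(date(2011, 9, 1), date(2011, 9, 7), 120, "product launch front end")
print("reserved on launch day:", ledger.reserved_on(date(2011, 9, 3)), "GHz")
```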
- WHAT-IF analysis: Before resource allocation and workload placement, the IT organization can use a capacity management tool to simulate different infrastructure configurations or events and to understand how each scenario affects capacity availability. This feature is particularly useful to:
- Compare different software and hardware configurations
- Identify maximum consolidation ratios
- Prioritize investments for data center expansion
- Verify the capability to support the business during predictable and non-predictable events
- Determine the impact of non-technical constraints (e.g., regulatory compliance mandates) on capacity modeling
- For example, a company designing its business continuity plan may want to calculate which services can be provided, and for what audience size, should a major fault occur, such as a rack failure. The capability is also invaluable in a scenario where a company is preparing to launch a new product or to announce a major acquisition. The IT organization may be forecasting an unprecedented demand for its front-end Web servers and, accordingly, it may want to know if and how the current capacity can accommodate the expected demand.
- The capacity management tool should feature a WHAT-IF analysis engine able to simulate complex scenarios in which the IT organization can evaluate how variables are associated with infrastructure metrics (e.g., number of servers or amount of memory) and with business KPIs (e.g., number of concurrent customers or transaction time). A minimal simulation sketch follows this item.
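As a trivial illustration of a WHAT-IF scenario, the sketch below simulates the loss of one rack and reports whether the surviving hosts can absorb the displaced memory demand; the topology and the numbers are invented.

```python
# Toy WHAT-IF simulation: can the surviving hosts absorb the workloads of a
# failed rack? Hosts, racks, and figures are invented for illustration.
hosts = [
    {"name": "esx-01", "rack": "A", "mem_gb": 256, "used_gb": 140},
    {"name": "esx-02", "rack": "A", "mem_gb": 256, "used_gb": 120},
    {"name": "esx-03", "rack": "B", "mem_gb": 256, "used_gb": 110},
    {"name": "esx-04", "rack": "B", "mem_gb": 256, "used_gb": 100},
]

def simulate_rack_failure(hosts, failed_rack):
    """Compare displaced memory demand with free memory on surviving hosts."""
    displaced = sum(h["used_gb"] for h in hosts if h["rack"] == failed_rack)
    survivors = [h for h in hosts if h["rack"] != failed_rack]
    free = sum(h["mem_gb"] - h["used_gb"] for h in survivors)
    return {"displaced_gb": displaced, "free_gb": free, "feasible": displaced <= free}

print(simulate_rack_failure(hosts, "A"))
```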
- Hardware and software catalog-based capacity modeling: Workload placement is influenced not just by technical, business, and compliance constraints (see Phase 1: Step 3), but also by the characteristics of physical hosts, virtualization platforms, guest OSs, middleware layers, and applications. The capacity tool already mines these details for the existing infrastructure from other management components or from its own IT asset discovery engine and uses them to model capacity as accurately as possible (see Phase 1: Step 4). However, the IT organization may want to simulate scenarios where some components of the infrastructure are replaced with others that the company does not physically own (e.g., a new model of server or an alternative virtualization platform). In this case, the staff would have to feed the WHAT-IF analysis engine extended details that may be hard to obtain. To offload the task of defining simulated systems from the IT organization's shoulders, and to drastically reduce errors, some capacity tools come with pre-populated and routinely updated libraries called "catalogs":
- Hardware catalogs include hundreds of server models available from the leading original equipment manufacturers. For each one, the capacity vendor maintains the configuration list, which includes key information such as the type and number of CPU sockets, the maximum memory configuration, and the supported NICs. Some of these libraries include more than 2,500 components.
- Hardware catalogs also help generate bill of materials listings for procurement.
- Software catalogs include the most common enterprise applications and the major virtualization platforms on the market. For the latter, capacity management vendors detail fundamental metrics such as the average workload density claimed by the technology provider, the list of supported guest OSs, the list of virtual hardware components, and the scalability limits.
- Software catalogs allow the IT organization to uncover previously overlooked information, such as the benefits that can be obtained by upgrading to a newer version of the virtualization platform.
- While these catalogs are extremely useful, the IT organization must always assess what their data sources are. Independent organizations (e.g., Standard Performance Evaluation Corporation [SPEC]) provide standard benchmarks for physical servers and, more recently, for virtualization platforms,9 but capacity vendors may have yet to include the newest benchmarks in their catalogs.
- Catalog-based capacity modeling also helps companies that adopt multiple virtualization engines to compare how each one performs in consolidation projects. The capacity management tool should be able to simulate and model capacity based on information provided by hardware and software catalogs (a catalog-driven sizing sketch follows this item). The capacity management vendor should provide a mechanism to keep the catalogs up to date without waiting for the deployment of new major releases. Finally, the IT organization should be able to modify and extend the catalogs if necessary.
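The sketch below suggests how a hardware catalog entry might feed a consolidation estimate for hardware the company does not yet own; the catalog entries, the density assumption, and the sizing rule are placeholders, not benchmark data.

```python
# Illustrative catalog-driven estimate: project how many VMs a candidate server
# model could host. Entries and density assumptions are placeholders only.
HARDWARE_CATALOG = {
    "vendor-x-2u": {"sockets": 2, "cores_per_socket": 8, "max_memory_gb": 384},
    "vendor-y-4u": {"sockets": 4, "cores_per_socket": 10, "max_memory_gb": 1024},
}

def estimate_vm_density(model, avg_vm_vcpu, avg_vm_mem_gb, vcpu_per_core=4):
    """Estimate VM density as the tighter of the CPU and memory bounds."""
    spec = HARDWARE_CATALOG[model]
    vcpu_capacity = spec["sockets"] * spec["cores_per_socket"] * vcpu_per_core
    by_cpu = vcpu_capacity // avg_vm_vcpu
    by_memory = spec["max_memory_gb"] // avg_vm_mem_gb
    return min(by_cpu, by_memory)

for model in HARDWARE_CATALOG:
    print(model, "->", estimate_vm_density(model, avg_vm_vcpu=2, avg_vm_mem_gb=8), "VMs")
```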
- Support for heterogeneous environments: The IT organization may want to use the same capacity management solution to allocate resources and workloads in a heterogeneous environment. In the simplest scenario, a capacity tool that supports two or more virtualization platforms can assist the IT organization in V2V migrations across different hypervisors, for example, because the company has decided to change its technology provider.
- In more complex scenarios, the company may adopt multiple hypervisors for different architectures (e.g., x86 and reduced instruction set computer [RISC]-based) and for multiple use cases (e.g., one for virtual servers in production, one for SHVDs, and one for development and testing virtual labs). Here, adopting the same capacity management tool would avoid having to replicate a large number of constraints for the different virtualization platforms. Additionally, it would help IT organizations manage the migration from the testing to the production environment. Similarly, the IT organization may use the same capacity tool in environments where two or more hypervisors are used side by side as foundations of the same infrastructure. In such niche scenarios, the support of a capacity management solution is critical to validate cross-platform workload placements while complying with all the constraints defined in Phase 1: Step 3.
- The capacity management tool should support major hypervisors, including Citrix XenServer, Microsoft Hyper-V, and VMware ESX, and offer the same feature-set in terms of workload placement rules definition and management integration. This class of solutions should also support RISC-based hardware (e.g., IBM POWER or Oracle SPARC), Unix-based virtualization platforms, and Unix guest OSs.
- Support for third-party management tools: One of the most important capabilities that a capacity management tool should offer is a wide integration with multiple classes of management tools. Gartner recommends reviewing Phase 1: Step 4 to assess and prioritize the integration capabilities necessary to accomplish the company's business goals.
Because cloud HIaaS platforms are built on top of machine virtualization, the value of all the features described in the previous sections remains. However, in internal clouds, the role of capacity management becomes critical, and the priority of specific features is turned upside down. The IT organization should look for the following specific capabilities:
- Integration with self-service provisioning portals: The presence of a self-service provisioning portal in the cloud computing stack introduces a high level of flexibility in how end users design VMs to host specific workloads. The IT organization can limit the capability to scale up and over-provision resources by introducing a service catalog and heavily standardizing the configuration of VM templates (e.g., limiting the choice to three kinds of VMs: small, medium, and large). Yet, end users usually retain a significant scale-out freedom when the number of VMs they can request is unlimited. If the IT organization does not implement show-back and chargeback tools as dissuasion mechanisms, the end user freedom leads to VM sprawl and the uncontrolled waste of physical resources.
- The capacity management tool should be able to integrate with the self-service provisioning portal to prioritize concurrent requests and verify resource availability as a mandatory compliance check before authorizing new workload provisioning (a minimal gating sketch follows this item).
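A minimal sketch of such a pre-provisioning gate follows; the template sizes, tenant quota, and free-capacity figures are invented, and a real integration would query the capacity tool's API rather than in-memory dictionaries.

```python
# Illustrative pre-provisioning gate between a self-service portal and the
# capacity tier. Template sizes, quotas, and free capacity are invented.
TEMPLATES = {"small": {"vcpu": 1, "mem_gb": 2},
             "medium": {"vcpu": 2, "mem_gb": 8},
             "large": {"vcpu": 4, "mem_gb": 16}}

def authorize_request(template, tenant_quota_left, cluster_free):
    """Authorize a VM request only if both tenant quota and cluster headroom allow it."""
    size = TEMPLATES[template]
    if size["vcpu"] > tenant_quota_left["vcpu"] or size["mem_gb"] > tenant_quota_left["mem_gb"]:
        return False, "tenant quota exhausted"
    if size["vcpu"] > cluster_free["vcpu"] or size["mem_gb"] > cluster_free["mem_gb"]:
        return False, "insufficient cluster capacity, request queued"
    return True, "provisioning authorized"

print(authorize_request("medium",
                        tenant_quota_left={"vcpu": 6, "mem_gb": 12},
                        cluster_free={"vcpu": 40, "mem_gb": 96}))
```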
- Integration with cloud orchestrators: The IT organization should not consider capacity management tools as just containment tools. They are the brains behind the cloud platform, shaping its dynamic geometry to comply with multiple constraints (see Phase 1: Step 3) and to guarantee specific application performance levels while maximizing resource utilization. The only way to manage capacity in real time, and at the scale that a cloud computing platform implies, is through automation.
- The capacity management tool should be able to integrate with the cloud orchestrator and automate the execution of resource reallocation and workload placements.
- Integration with chargeback tools: The capacity management tool should also integrate with chargeback tools. Through such integration, the IT organization can apply different pricing models to its IT-as-a-service offering and charge the end users accordingly. At the same time, the financial department can use the capacity utilization trend to develop the budget forecast.
- Whitespace management: As in virtual infrastructures, the capability to address workload peaks and unplanned demand strongly relies on the capability to pre-allocate capacity. The difference is that, in internal clouds, the IT organization should consider unplanned demand a common occurrence rather than a rare event.
- The capacity management tool should be able to integrate with provisioning systems, configuration managers, and infrastructure management platforms to coordinate a timely capacity booking process. At the same time, the capacity tool's WHAT-IF analysis engine should be able to simulate capacity reservation scenarios.
- Ultimately, a capacity management tool should be able to determine when white space is reduced to the point that a system failure would impact required service levels (a minimal white space check follows this item).
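The following sketch estimates whether the current white space still tolerates the failure of the most loaded host; the cluster figures are hypothetical, and only memory is considered for brevity.

```python
# Whitespace check (illustrative): does the cluster keep enough unallocated
# capacity to absorb the failure of its most loaded host? Figures are hypothetical.
hosts = [{"name": "esx-01", "mem_gb": 512, "allocated_gb": 430},
         {"name": "esx-02", "mem_gb": 512, "allocated_gb": 380},
         {"name": "esx-03", "mem_gb": 256, "allocated_gb": 120}]

total_free = sum(h["mem_gb"] - h["allocated_gb"] for h in hosts)
largest = max(hosts, key=lambda h: h["allocated_gb"])
# Free space left on the survivors if the most loaded host disappears.
headroom_after_failure = total_free - (largest["mem_gb"] - largest["allocated_gb"])

if headroom_after_failure >= largest["allocated_gb"]:
    print("white space still tolerates the loss of", largest["name"])
else:
    shortfall = largest["allocated_gb"] - headroom_after_failure
    print(f"white space too thin: {shortfall} GB short if {largest['name']} fails")
```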
- Support for cloud-specific constraints: Finally, yet importantly, a capacity tool that supports internal clouds should also allow the IT organization to define cloud-specific constraints and to model capacity according to peculiar characteristics such as multi-tenancy and tiered SLAs. Cloud-specific constraints should be supported on top of technical, business, and compliance constraints such as those defined in Phase 1: Step 3.
Those IT organizations that are currently designing or building their internal clouds should look for these additional capabilities while remaining aware that this is a nascent market and that no solution today presents all the features described in this guidance document. While a number of market players already offer some of them, others are in development or under evaluation. Customers have to assess vendor road maps to understand which products will be able to support their clouds and then push vendors to deliver the missing features.
After the RFP has been completed, the interdisciplinary team can send it to potential vendors. In reviewing each response, the team should pay specific attention to the vendor's ability to meet application requirements and business objectives. Any vendor that cannot meet requirements or objectives, or that gives incomplete details, should be removed from consideration. It is also very important to verify which vendor can offer assistance during the integration process (see Phase 3: Step 1).
Ideally, after the review process, the team should have narrowed the list of potential vendors to two or three candidates.
After the RFP is completed, the IT organization should send an RFQ. The RFQ should ask for specific service terms, price, mediation, volume discounts, and any specific application requirements. The company should pay special attention to the licensing model in relation to the solution's scalability limits and its deployment model (see Phase 2: Step 2). There may be a significant pricing difference between a per-agent (or per-monitored-VM) license and a per-analysis-engine license in a large-scale virtual or cloud infrastructure, as the simple comparison below illustrates.
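As a hedged illustration of why the licensing model matters at scale, the arithmetic below compares a per-monitored-VM license with a per-analysis-engine license for a hypothetical 10,000-VM estate; every price and limit is invented.

```python
# Invented figures, purely to show how the licensing model dominates cost at scale.
monitored_vms = 10_000
per_vm_license = 25            # currency units per monitored VM per year (hypothetical)
per_engine_license = 60_000    # per analysis-engine instance per year (hypothetical)
vms_per_engine = 5_000         # scalability limit of one engine (hypothetical)

per_vm_cost = monitored_vms * per_vm_license
engines_needed = -(-monitored_vms // vms_per_engine)   # ceiling division
per_engine_cost = engines_needed * per_engine_license

print(f"per-VM model:     {per_vm_cost:,} / year")
print(f"per-engine model: {per_engine_cost:,} / year ({engines_needed} engines)")
```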
Additionally, the IT organization should verify whether the quotation includes training courses and integration help. Gartner recommends hiring dedicated personnel for capacity management and investing in training (see Phase 3: Step 3), so this is an important aspect to consider.
Finally, if the capacity tool comes with hardware and software catalogs and it includes an updating mechanism (see Phase 2: Step 2), the IT organization should verify if the quotation includes catalog updates.
After the IT organization reviews quotations, it can finally proceed to selecting a vendor and preparing for a POC.
Implementing a capacity management system is a significant investment for any company. Vendors often over-promise on their technology and underestimate the acquisition and implementation time and cost. In order to establish confidence in a vendor's claims, the IT organization should conduct a POC. Skipping this step could be ruinous to the overall success of the project. Every organization has a unique environment: the POC needs to address the complexities that capacity management implies in relation to this uniqueness.
During this step, the IT organization should:
- Identify consolidation or optimization targets
- Deploy and integrate the solution
- Design workload placement rules
- Test integration and capabilities
During the POC, the IT organization needs to limit the scale of the implementation and focus on a limited subset of workloads. While testing capacity planning capabilities is easy to accomplish, verifying core capabilities such as capacity modeling can be a significant challenge. The analysis engine, in fact, is accurate as long as it controls all infrastructure resources and workloads. When the IT organization restricts capacity management to a limited number of VMs, the tool produces unreliable capacity plans, ignoring the total capacity available and the impact of its prescriptions on unmonitored workloads.
Ideally, the IT organization should conduct the POC in a dedicated virtual infrastructure that the capacity tool can fully control. Such an environment should host a variety of business services and reproduce all use cases that the company identified during the Pre-Work Phase of this guidance framework. The team should select the workloads for the POC in a way that tests how multi-level constraints affect capacity modeling (see Phase 1: Step 3). If a broad scale POC is not practical, the IT organization should focus on a single business unit that has historically had capacity challenges or poor performance occurrences.
After the IT organization has identified the consolidation or optimization targets, it has to deploy the solution. Even when the vendor offers all-in-one installation setups (e.g., to encourage adoption in small environments), the team should opt for a multi-tier implementation. This approach allows the architecture flexibility to be verified, the scalability of individual tiers to be tested, and the implementation complexities in large environments to be anticipated.
The POC is also fundamental for testing agent-based and agentless data collection approaches (see Phase 3: Step 1) whenever the IT organization is leveraging the native asset discovery and monitoring engine that comes with some capacity tools.
The IT organization also needs to integrate the capacity solution with other components of the management stack, according to the map designed in Phase 1: Step 4. Through the RFP, the company already identified which vendors support the desired integration (see Phase 2: Step 2). Now the interdisciplinary team needs to verify that integration.
The team should pay particular attention to the integration of input data sources, which provide information about the IT assets, including infrastructure and application performance. These sources, in fact, may be unable to provide enough details about the target systems to be useful for capacity modeling. Sometimes, the third-party monitoring tool collected the right information, but its exporting interface strips out fundamental details. When an alternate data collection mechanism is not available, the IT organization may have to work with the capacity vendor to retrieve necessary information directly from the third-party monitoring database. Such workarounds imply a significant technical challenge because data normalization may be needed in order to make the captured information useful.
The IT organization should implement a first group of workload placement rules to address a variety of technical, business, and compliance constraints (see Phase 1: Step 3). During the POC, the team should not invest a significant amount of time designing most of the rules, but rather test the out-of-the-box rules offered with most capacity management tools. However, if a particular rule is important to the organization and not included out of the box, it should be tested at this time because some products may lack the extensibility features to support such a custom rule.
During the POC, the IT organization should extensively test the capacity management tool in different scenarios. The testing phase should also focus on the integration with other components of the management stack (see Phase 1: Step 4). Virtualization can be used to clone enterprise management systems and redeploy them in the POC environment as VMs. After this, the team can use a number of tools and methods to test integration and capabilities. For example, the market offers sophisticated load generators that can record, edit, and replay on-demand application-level network sessions or that can simulate I/O activity inside VMs through agents.10
The IT team should plan the testing activity to verify all capabilities and integration paths. Table 1 offers an example of how to organize the tests.
Source: Gartner (June 2011)
During the testing phase, if the capacity tool is not yet integrated with an APM tool, the IT organization must continuously monitor the workloads manually to recognize how capacity management is affecting business services and related KPIs.
If the capacity management tool meets all requirements and business objectives during the POC, the company can move forward, award the contract to the vendor, and finalize the implementation. If the tool does not meet expectations, the IT organization should go back to the previous step and select another competitor for a new POC.
The last phase of this framework focuses on the actual implementation of the capacity solution and the inclusion of the capacity management discipline in the IT organization's operational framework.
Gartner recommends structuring this phase as a four-step process:
- Step 1: Deploy and integrate
- Step 2: Design workload placement rules
- Step 3: Go live
- Step 4: Re-assess and optimize
During this step, the IT organization moves from a POC to a production-ready implementation. The team needs to accomplish three tasks:
- Prepare the environment for deployment: The IT organization needs to solve all network and security issues that may prevent the communication between the targeted systems (e.g., physical hosts and VMs) and the data collector (see Figure 4).
- From a networking perspective, this communication may happen across WAN links, for example, when capacity planning is used to consolidate physical workloads deployed across several branch offices. If the WAN links are slow or unreliable, the networking team may need to enforce QoS, thereby guaranteeing enough bandwidth for the communication to and from the data collector. If this approach does not address the issue, the IT organization needs to deploy dedicated data collectors at each branch office that is targeted for consolidation.
- From a security perspective, the capacity tool may need to monitor systems deployed in restricted security zones, such as a demilitarized zone (DMZ). In this scenario, it is possible that one or more firewalls prevent the communication between the target systems and the data collector. The security team needs to modify the firewalls rule base to permit the traffic flow.
- Additionally, the team should assess the target systems if the capacity tool adopts an agentless data collection approach. In companies with a federated IT governance model, different business units own target systems. In this scenario, the IT staff must obtain administrative passwords for target systems (e.g., physical and virtual machines) before initiating the data collection.
- Implement tiers: The IT organization proceeds with the installation of the CDB, the analysis engine and management console, and the data collectors. The IT organization must also install data collection agents if the capacity tool does not rely on third-party monitoring or management tools to retrieve data from targeted systems (see Phase 1: Step 4).
- An often-overlooked aspect is the correct sizing of the CDB. The IT organization should project the growth of this tier according to how many VMs the capacity tool monitors and how long capacity data is retained in the CDB. A capacity management tool should provide flexible forecasting capabilities that can take into account long interval periods and apply different forecasting methods to support seasonal analysis and predictions. For example, the capacity forecast engine should be able to identify seasonal trends (e.g., periodic capacity utilization behaviors) based on weekly, monthly, or even yearly data. However, note that business metrics, such as transaction times and transaction volumes, are more useful than infrastructure metrics over extended periods. The operations team needs to seek sizing guidelines from the vendor, and the storage team needs to verify and guarantee data storage availability according to the predicted growth of the CDB (a rough sizing sketch follows this item).
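The rough sizing sketch below projects CDB growth from the number of monitored VMs, the sampling interval, and the retention period; the bytes-per-sample constant is an assumption and must be replaced with the vendor's own sizing guideline.

```python
# Rough CDB growth projection (illustrative). The bytes-per-sample constant is
# an assumption; always replace it with the capacity vendor's sizing guideline.
def projected_cdb_gb(monitored_vms, metrics_per_vm, sample_interval_min,
                     retention_days, bytes_per_sample=200):
    samples_per_day = (24 * 60) / sample_interval_min
    total_samples = monitored_vms * metrics_per_vm * samples_per_day * retention_days
    return total_samples * bytes_per_sample / (1024 ** 3)

# Example: 5,000 VMs, 20 metrics each, 5-minute samples, two years of retention
# (long enough to expose yearly seasonal trends).
print(f"{projected_cdb_gb(5000, 20, 5, 730):.0f} GB")
```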
- Integrate with the management stack: The IT organization proceeds with the integration of the capacity management tool with the other components of the management stack, according to the map at Phase 1: Step 4.
- The team already tested as many integration paths as possible during the POC (see Phase 2: Step 4). Any missing integration test should be performed now, before the implementation goes live, by using the same tools and approach described in the POC step.
At this point of the implementation, the interdisciplinary team becomes critical to design and review the placement rules that affect capacity modeling. During the POC, the team already tested some workload placement rules provided out of the box by the capacity vendor (see Phase 2: Step 4). However, the team now needs to extend the rule base to include custom rules and to account for all technical, business, and compliance constraints that were defined in Phase 1.
The translation of these constraints into placement rules may be an underestimated challenge because of the number and complexity of conditions in place. Additionally, an early assessment of the management tools in use (see Phase 1: Step 1) may have uncovered pre-existing capacity constraints defined at different levels of the infrastructure, such as an affinity rule defined in the DRS control panel of VMware vSphere.3 The IT organization needs to resolve all conflicts before going live; it must disable all placement rules and dynamic resource management mechanisms that the capacity management tool cannot control directly or that could be operated independently.
Once the rules have been defined, the technical and business members of the team must review and approve them. Because the accuracy of capacity modeling largely depends on these rules, the IT organization must invest a significant effort on this task.
After the final approval, the implementation can go live.
Once the capacity tool is in place, its modeling engine continuously elaborates information from third-party management tools or through native agent-based or agentless data collection. However, the capacity plan is only useful as long as the IT organization routinely reviews it and executes its prescriptions. This process can be fully automated if the capacity tool integrates with other layers of the management stack through an orchestration framework (see Phase 1: Step 4). If not, the operations team needs to review the reports generated by the capacity tool on a daily or weekly basis and make infrastructure changes accordingly.
Concurrently, the organization must adjust its operational framework. This requires a number of tactical and strategic actions, including:
- Investing in training or hiring dedicated personnel: Capacity management remains a complex discipline. Even with a deep integration between the capacity tool and the rest of the management stack (e.g., the infrastructure orchestrator), the discipline requires a significant effort to assess new business services constraints and translate them into workload placement rules. In large-scale virtual infrastructures and internal clouds, this task may require dedicated resources.
- The company needs to invest in training, especially when these resources have significant experience in managing capacity for physical systems, because they may need to reconsider capacity management from a service-oriented perspective (see Phase 1: Step 1).
- Training is also fundamental to master the reporting capabilities offered by capacity tools and to support the business case built for capacity management (see Pre-Work Phase: Step 4).
- Re-engineering VM templates: It is fundamental that the capacity tool retains full visibility over the virtual infrastructure inventory and its total capacity. For this reason, where the solution collects utilization metrics directly through agents, the gold images and templates available through the virtual infrastructure library or the internal cloud service catalog must be re-engineered to include such agents.
- IT organizations must determine whether they can deploy and clone the agent as part of a template or if they have to install the agent after a cloned image — cleaned with Microsoft System Preparation (Sysprep) — is brought online because the two approaches require entirely different provisioning workflows.
- Turning capacity management into a compliance verification tier: Beyond inventory visibility, the capacity tool must also retain its capability to manage capacity over time. This can be a significant challenge if the hosts and VMs' provisioning life cycle does not involve capacity management. The IT organization should authorize and execute provisioning and maintenance operations, such as the offlining of virtualization hosts, only after the impact of these operations has been assessed in a WHAT-IF analysis.
- Including capacity management recommendations in the IT procurement strategy: The adoption of a capacity management solution affects the company's hardware and software purchasing decisions. When evaluating new management solutions, for example, the IT organization should assess their capability to integrate with the capacity tool. When evaluating new hardware purchases, the IT organization should instead review the capacity plan recommendations. The most sophisticated capacity solutions include WHAT-IF analysis engines, as well as hardware and software catalogs (see Phase 2: Step 2). These tools represent valuable guidance for the IT procurement strategy.
Follow-up activities should ensure that the capacity management practice is effectively maintaining or increasing the optimization level of the virtual infrastructure or internal cloud. As stated in Pre-Work Phase: Step 4, a commonly accepted way of calculating the optimization level does not exist: The IT organization has to track any progress through the metrics selected during the implementation process.
The IT organization can also measure the capacity management effectiveness in indirect ways. For example, a high number of complaints, captured by the help desk system, about poor application performance may reveal that the capacity tool is not collecting enough information or the right information or that the integration with the APM tool is not working as expected.
In addition, as new management products (e.g., new versions or added products) introduce new opportunities to integrate with the capacity management tier, the IT organization should re-assess its integration map (see Phase 1: Step 4) on a regular basis.
Finally, it is critical that the IT organization constantly reviews the constraints defined during Phase 1 and the workload placement rules that derive from them (see Phase 3: Step 2). The business is inherently dynamic, and the number of services hosted on virtual and cloud infrastructures grows as the company matures its implementation. The IT organization must be ready to reconfigure its placement strategy to comply with new policies, constraints, and KPIs. Such a review may need to occur as frequently as every six months to ensure that capacity forecasting and real-time capacity monitoring remain accurate.
This section of the guidance framework summarizes a number of critical risks and pitfalls, at the strategic and technical levels, that are described throughout this guidance document:
- Market consolidation: If the IT organization decides to adopt the capacity management solution provided by a startup vendor, it should consider two scenarios: the vendor is acquired by a larger player, or the vendor goes out of business.
- In the former scenario, the acquisition may significantly change the capacity tool's licensing model, the support agreements (e.g., for integration with third-party management tools), or the technology road map. Additionally, an acquisition may lead to the exodus of the product's development team, thus affecting quality and release schedule. In the latter scenario, a premature disappearance of the startup would require a major investment to replace the adopted capacity management tool.
- Capacity management is a mandatory component in large-scale virtual infrastructure and internal clouds. Large players (e.g., Cisco Systems, Citrix, Oracle, and VMware) in both markets have already demonstrated aggressive acquisition strategies to the point that the acquisition of a capacity management startup is a likely occurrence.
- Vendor lock-in: The deep integration designed and implemented in Phase 1 demonstrates how sticky and mission-critical the capacity management layer is. With 10 or more integration points with other components of the management stack, the adoption of a capacity tool implies a significant vendor lock-in. Because a product replacement would be a costly and time-consuming effort, the IT organization needs to choose its technology provider carefully.
- Lack of executive sponsorship: The IT organization must work to ensure that it has an executive sponsorship that pushes for capacity management adoption across business units within the company. Without proper sponsorship, the effort to build an interdisciplinary team, to identify all technical and non-technical constraints, and to design an accurate integration map is at great risk of failing (see Pre-Work Phase and Phase 1).
- Lack of training: The IT organization must avoid overlooking the importance of training: the skills in virtualization management are not enough to master the complexity of capacity management in a virtual infrastructure or internal cloud. Hiring experienced personnel does not necessarily imply that they understand the paradigm shift of managing capacity from a service-oriented perspective (see Pre-Work Phase: Step 1).
- Target systems inaccessibility: IT organizations may decide to leverage the native monitoring engine that comes with some capacity tools. If this engine adopts an agentless approach for data collection, the IT staff must obtain administrative passwords for target systems (e.g., physical and virtual machines) before initiating the data collection.
- Additionally, target systems must be running all required services. Sometimes, the security team has used hardening procedures on specific target systems (e.g., in the DMZ) to reduce the OS attack surface. Such a process may have disabled the services necessary for agentless data collection.
- In organizations with a federated IT governance model, where different business units own target systems, verifying these prerequisites may be a non-trivial challenge. To support the operation, some capacity solutions come with a scanning tool that indicates what systems are reachable but not properly assessable.
- Accessibility issues may also depend on the security and network topology. Data collection agents may need to communicate with capacity data collectors over specific ports and cross one or more internal firewalls (e.g., to access target systems in DMZs). Similarly, capacity products that use an agentless approach may need access to dynamic ports on target systems. For example, a capacity planning tool leveraging WMI to assess Windows physical servers will try to communicate over port TCP 135 and additional ports above TCP 1024. Both scenarios require specific filtering on a firewall's rule base.
- Additional network issues may arise when target systems are geographically widespread. Deployed across a number of branch offices, data collection agents may need to communicate over slow or unreliable WAN links. In this scenario, the networking team may need to enforce QoS in order to guarantee enough bandwidth for the communication to and from the data collector. If this approach does not address the issue, the IT organization needs to deploy dedicated data collectors at each branch office that is targeted for consolidation and schedule data consolidation and transmission at specific intervals.
- Insufficient data for capacity modeling: The capacity tool may retrieve target system data from third-party monitoring and management tools rather than using its own native engine. Data exported from these products may not include sufficient information to be useful for capacity modeling. Sometimes, the third-party tool collects the right information, but its exporting interface strips out fundamental details. When an alternative data collection mechanism is not available, the IT organization may have to work with the capacity vendor to retrieve necessary information directly from the third-party monitoring database. Such a work-around implies a significant technical challenge because data normalization may be needed in order to make captured information useful.
- Underestimated CDB growth: The IT organization must dedicate special attention to the growth of the CDB. Rightsizing this tier may be challenging: The staff can easily underestimate the amount of information aggregated over time while the project moves forward from the POC to the production stage and while more organizations within the company host their business services in a virtual infrastructure or the internal cloud. The storage team needs to verify and guarantee data storage availability according to the predicted growth of the CDB.
- Inaccurate capacity utilization forecasts: A common mistake in the adoption of capacity management solutions is to look at the capacity plan too early. Capacity allocation strategy depends on many factors, including the analysis of utilization trends. When these trends are calculated using linear regression analysis, the capacity tool needs a minimum number of data points over a desired interval. Although some vendors recommend waiting just a single day to obtain accurate recommendations, others, such as VMware, default their forecast engines to 12 data points, so IT organizations must wait a minimum of about two weeks (for a daily interval) before having useful results. Even though this limited time frame may work well during the POC, in production environments, the interval period should be much longer. For example, the capacity forecast engine could identify seasonal trends (e.g., periodic capacity utilization spikes) only after retaining a minimum of two years' data. IT organizations should always forecast utilization trends multiple times and evaluate the accuracy of the results according to data age and periods (a minimal forecasting sketch follows this item).
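A minimal sketch of the underlying mechanic follows: a least-squares trend fitted on only a dozen daily samples extrapolates far less reliably than one fitted on a longer history, which is why the minimum number of data points and repeated forecast runs matter; all data here is synthetic.

```python
# Synthetic example: a least-squares trend fitted on only a handful of daily
# utilization samples extrapolates poorly compared with a longer history.
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(60)
utilization = 40 + 0.3 * days + rng.normal(0, 4, size=days.size)  # % of capacity

def forecast(history_days, horizon_day):
    """Fit a linear trend on the first `history_days` samples and extrapolate."""
    slope, intercept = np.polyfit(days[:history_days], utilization[:history_days], 1)
    return slope * horizon_day + intercept

true_value = 40 + 0.3 * 90
print(f"forecast for day 90 from 12 points: {forecast(12, 90):5.1f}%")
print(f"forecast for day 90 from 60 points: {forecast(60, 90):5.1f}%")
print(f"underlying trend at day 90:         {true_value:5.1f}%")
```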
- Incomplete implementation: It is important to remember that capacity simulation and modeling is accurate only as long as the capacity tool monitors all resources and workloads in a virtual infrastructure. Managing capacity for just a subset of VMs inside a large virtualized data center produces unreliable recommendations because the tool ignores the total capacity available and the impact of its prescriptions on unmonitored workloads. The effects of this partial blindness worsen when the infrastructure features a self-service provisioning portal and an orchestrator for automated workload provisioning.
- Although Gartner recognizes that most companies have already deployed a virtual infrastructure, capacity management should be fully implemented in a dedicated environment that can be independently optimized. Gartner also recommends including capacity management in the design of new internal cloud infrastructures from Day 1.
In the effort to deliver IT as a service, enterprises face dramatic changes. The IT organization moves from highly static physical environments to highly dynamic and resource-sharing infrastructures. This transformation poses a significant operational challenge. Rethinking the capacity management problem is the first step to meet this challenge. Then companies need to invest in modern capacity tools that feature application performance awareness, a holistic view of the infrastructure resources, deep integration with the management stack, and the capability to define technical, business, and compliance rules for workload placement.
- First edition.
2 "Server Capacity Defrag." (Registration required.)
- AMD: Advanced Micro Devices
- APM: application performance management
- AWS: Amazon Web Services
- C2C: cloud to cloud
- CMDB: configuration management database
- DRS: Distributed Resource Scheduler
- EC2: Elastic Compute Cloud
- EPT: Extended Page Tables
- FIPS: Federal Information Processing Standards
- GPU: graphics processing unit
- HDX: High Definition User Experience
- HIaaS: hardware infrastructure as a service
- ICA: Independent Computing Architecture
- IHV: independent hardware vendor
- KPI: key performance indicator
- MSP: managed services provider
- NIC: network interface card
- P2C: physical to cloud
- P2V: physical to virtual
- PCI DSS: Payment Card Industry Data Security Standard
- POC: proof of concept
- QoS: quality of service
- RFI: request for information
- RFP: request for proposal
- RFQ: request for quotation
- RISC: reduced instruction set computer
- RVI: Rapid Virtualization Indexing
- SHVD: server-hosted virtual desktop
- SIOC: Storage I/O Control
- SPEC: Standard Performance Evaluation Corporation
- SVC: SAN Volume Controller
- UPS: uninterruptible power supply
- V2C: virtual to cloud
- V2V: virtual to virtual
- WMI: Windows Management Instrumentation